I've had good success on similar hardware (a 5070 plus more RAM) with GLM-4.7-Flash, using llama.cpp's --cpu-moe flag: I can get up to 150k context with it at around 20 tok/sec. I've also found it a lot better for agentic use than GPT-OSS. It seems to put in a much more in-depth reasoning effort, so while it spends more tokens, that seems worth it for the end result.
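
For anyone who wants to try the same setup, here's a rough sketch of the launch command, built in Python so it's easy to tweak. This isn't my exact invocation: the model filename is a hypothetical placeholder, and the -ngl 99 value (offload every layer to the GPU) is an assumption. --cpu-moe keeps the MoE expert weights in system RAM, and -c sets the context size.

```python
import shlex


def build_llama_server_cmd(model_path, ctx_size=150_000, gpu_layers=99):
    """Build an argv list for llama-server with MoE experts kept on the CPU.

    model_path and gpu_layers are illustrative assumptions, not values
    taken from the original post.
    """
    return [
        "llama-server",
        "-m", model_path,          # path to the GGUF file (hypothetical name below)
        "--cpu-moe",               # keep MoE expert weights in system RAM
        "-ngl", str(gpu_layers),   # offload the remaining layers to the GPU
        "-c", str(ctx_size),       # context window in tokens
    ]


cmd = build_llama_server_cmd("GLM-4.7-Flash-Q4_K_M.gguf")
print(shlex.join(cmd))
```

If the model still doesn't fit in VRAM, lowering ctx_size is the first thing I'd try. llama.cpp also has --n-cpu-moe N to keep only the first N layers' experts on the CPU, which is worth a look if you have VRAM to spare.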