I’ve been looking into self-hosting LLMs, and it seems a $10k GPU is kind of a requirement to run a decently sized model at a reasonable tokens/s rate. There’s CPU and SSD offloading, but I’d imagine that would be frustratingly slow to use. I even find cloud-based AI like GH Copilot annoyingly slow. Even so, GH Copilot is only like $20 a month per user, and I’d be curious what the actual cost per user is once you account for the hardware and electricity.
What we have now is clearly an experimental first generation of the tech, but the industry is building out data centers as though it’s always going to require massive GPUs/NPUs with wicked quantities of VRAM to run these things. If it really will take huge data centers full of expensive hardware, where each user prompt eats minutes of compute time on a $10k GPU, then it can’t possibly be profitable to charge a nominal monthly fee for it. But maybe there are optimizations I’m unaware of.
Either way, if the tech does evolve and it becomes a lot cheaper to host these things, will all these new data centers still be needed? On the other hand, if the hardware requirements don’t drop by an order of magnitude, will it even be cost effective to offer LLMs as a service? If not, I don’t imagine the new data centers will be needed either.
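To make that concrete, here’s the kind of back-of-envelope math I have in mind; every number below (GPU price, power draw, aggregate throughput, per-user usage) is a guess on my part, not a measured figure:

```python
# Back-of-envelope: what does batched LLM serving cost per user per month?
# Every input here is an assumption, not a measured number.

gpu_price_usd = 30_000           # assumed price of a datacenter GPU plus its share of the host
gpu_lifetime_years = 4           # assumed depreciation window
power_draw_kw = 1.0              # assumed GPU + host power under load
electricity_usd_per_kwh = 0.12   # assumed electricity rate
batched_throughput_tps = 2_000   # assumed aggregate tokens/sec across all concurrent users
tokens_per_user_per_month = 2_000_000  # assumed fairly heavy individual usage

hours_per_month = 730
hardware_usd_per_month = gpu_price_usd / (gpu_lifetime_years * 12)
power_usd_per_month = power_draw_kw * hours_per_month * electricity_usd_per_kwh

tokens_per_month = batched_throughput_tps * hours_per_month * 3600
cost_per_million_tokens = (hardware_usd_per_month + power_usd_per_month) / (tokens_per_month / 1e6)
cost_per_user = cost_per_million_tokens * tokens_per_user_per_month / 1e6

print(f"hardware ${hardware_usd_per_month:.0f}/mo + power ${power_usd_per_month:.0f}/mo")
print(f"~${cost_per_million_tokens:.2f} per million tokens, ~${cost_per_user:.2f} per heavy user")
```

The aggregate throughput number is the big unknown to me, since batching many users onto one GPU seems to be what makes or breaks the economics.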
Can run decent size models with one of these: https://store.minisforum.com/products/minisforum-ms-s1-max-mini-pc
For $1k more you can have the same thing from Nvidia in their DGX Spark. You can use a high-speed fabric to connect two of ’em and run 405B-parameter models, or so they claim.
Point being, that’s some pretty big models in the $3-4k range, and massive models for less than $10k. The Nvidia one supports ComfyUI, so I assume it supports CUDA.
It ain’t cheap and AI has soooo many negatives, but… it does have some positives and local LLMs mitigate some of the minuses, so I hope this helps!
This is not true. I have a single 3090 + 128GB CPU RAM (which wasn’t so expensive that long ago), and I can run GLM 4.6 350B at 6 tokens/sec, with measurably reasonable quantization quality. I can run sparser models like Stepfun 3.5, GLM Air or Minimax 2.1 much faster, and these are all better than the cheapest API models. I can batch Kimi Linear, Seed-OSS, Qwen3, and all sorts of models without any offloading for tons of speed.
…It’s not trivial to set up though. It’s definitely not turnkey. That’s the issue.
You can’t just do “ollama run” and expect good performance, as the local LLM scene is finicky and highly experimental. You have to compile forks and PRs, and learn about sampling and chat formatting, perplexity and KL divergence, quantization, MoEs, and benchmarking. Everything is moving too fast, and is too performance sensitive, to make it that easy, unfortunately.
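For anyone wondering what the KL divergence part means in practice, here’s a minimal sketch of the idea: compare the quantized model’s next-token distribution against the full-precision model’s, position by position. The logits below are random placeholders; in a real check you’d dump them from your inference stack.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl_divergence(ref_logits, quant_logits):
    """Mean KL(P_ref || P_quant) over token positions; lower means the
    quantized model tracks the full-precision model more closely."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

# Placeholder data: (num_positions, vocab_size) logits from both models.
rng = np.random.default_rng(0)
ref = rng.normal(size=(8, 32000))
quant = ref + rng.normal(scale=0.05, size=ref.shape)  # stand-in for quantization noise
print(f"mean KL: {mean_kl_divergence(ref, quant):.5f}")
```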
EDIT:
And if I were trying to get local LLMs set up today, for a lot of usage, I’d probably buy an AI Max 395 motherboard instead of a GPU. They aren’t horrendously priced, and they don’t slurp power like a 3090. 96GB of VRAM is the perfect size for all those ~250B MoEs.
But if you go AMD, take all the finickiness of an Nvidia setup and multiply it by 10. You’d better know your way around pip and Linux; if you don’t get it exactly right, performance will be horrendous, and many setups just won’t work at all.
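If you do go that route, the first sanity check I’d run is whether your ROCm build of PyTorch actually sees the GPU. A rough sketch, assuming you installed a ROCm wheel (on ROCm builds the torch.cuda API names are reused):

```python
# Quick sanity check that a ROCm build of PyTorch sees the GPU.
import torch

print("torch:", torch.__version__)
print("hip runtime:", getattr(torch.version, "hip", None))  # None on CUDA/CPU-only builds
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
else:
    print("no GPU visible; expect CPU-speed inference until this is fixed")
```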
Appreciate all the info! I did find this calculator the other day, and it’s pretty clear the RTX 4060 in my server isn’t going to do much, though its NVMe may help:
https://apxml.com/tools/vram-calculator
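From what I can tell, the rough rule of thumb behind calculators like that is just parameter count times bits per weight, plus KV cache. Here’s my back-of-envelope understanding of it (the layer/head numbers are made up for illustration, and it ignores activation overhead):

```python
def estimate_weights_gb(params_billion, bits_per_weight):
    """Rough weight footprint: params * bits / 8, ignoring per-tensor overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def estimate_kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * bytes per element."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Example: a hypothetical 24B dense model at ~4.5 bits/weight (typical for Q4-ish
# quants) with a 16k context. The layer and head counts below are invented.
weights = estimate_weights_gb(24, 4.5)
kv = estimate_kv_cache_gb(layers=48, kv_heads=8, head_dim=128, context_len=16384)
print(f"weights ~{weights:.1f} GB + KV cache ~{kv:.1f} GB")
```

Which at least lines up with why it wasn’t optimistic about my 4060 on its own.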
I’m also not sure whether anything under 10 tokens per second will be usable, though I’ve never really tried it.
I’d be hesitant to buy something just for AI that doesn’t also have RT cores, because I do a lot of Blender rendering. RDNA 5 is supposed to have more competitive ray tracing along with NPU cores, so I guess my ideal would be an SoC with a ton of RAM. Maybe by the time RDNA 5 releases, the RAM situation will have blown over and we’ll have much better options for AMD SoCs with strong compute capabilities that aren’t just a one-trick pony for rasterization or AI.
That calculator is total nonsense. Don’t trust anything like that; at best, it’s obsolete the week after it’s posted.
I’d be hesitant to buy something just for AI that doesn’t also have RT cores, because I do a lot of Blender rendering. RDNA 5 is supposed to have more competitive ray tracing
Yeah, that’s a huge caveat. AMD Blender might be better than you think, though, and you can use your RTX 4060 on a Strix Halo motherboard just fine. The CPU itself is incredible for any kind of workstation workload.
along with NPU cores, so I guess my ideal would be an SoC with a ton of RAM
So far, NPUs have been useless. Don’t buy any of that marketing.
I’m also not sure whether anything under 10 tokens per second will be usable, though I’ve never really tried it.
That’s still 5 words/second. That’s not a bad reading speed.
Whether it’s enough? That depends. GLM 350B without thinking is smarter than most models with thinking, so I end up with better answers faster.
But anyway, I get more like 20 tokens a second with models that aren’t squeezed into my rig within an inch of their life. If you buy an HEDT/server CPU with more RAM channels, it’s even faster.
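The reason more RAM channels help: offloaded decoding is mostly memory-bandwidth-bound, because every generated token has to stream the active weights through memory once. A rough sketch of that ceiling (the bandwidth and model numbers below are illustrative assumptions):

```python
def decode_tps_upper_bound(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Memory-bound ceiling: tokens/sec <= bandwidth / bytes read per generated token."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Dual-channel DDR5 desktop (~90 GB/s) vs. a many-channel server board (~500 GB/s),
# for an MoE with ~30B active parameters at ~4.5 bits/weight. All numbers assumed.
for name, bw in [("dual-channel DDR5", 90), ("12-channel server", 500)]:
    print(f"{name}: <= {decode_tps_upper_bound(30, 4.5, bw):.1f} tok/s")
```

Real throughput lands below that ceiling once compute and overhead get involved, but it’s a decent way to see why memory channels matter more than core count here.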
If you want to look into the bleeding edge, start with https://github.com/ikawrakow/ik_llama.cpp/
And all the models on huggingface with the ik tag: https://huggingface.co/models?other=ik_llama.cpp&sort=modified
You’ll see instructions for running big models on a 4060 + RAM.
If you’re trying to batch-process documents quickly (so no CPU offloading), look at exl3s instead: https://huggingface.co/models?num_parameters=min%3A12B%2Cmax%3A32B&sort=modified&search=exl3
And run them with this: https://github.com/theroyallab/tabbyAPI
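Here’s roughly what the batch side looks like once tabbyAPI is up; it speaks an OpenAI-compatible API, but the base URL, API key, and model name below are placeholders you’d swap for whatever your config uses:

```python
# Minimal sketch: batch-summarize documents against a local OpenAI-compatible
# server (tabbyAPI here; llama.cpp's server works the same way). The base URL,
# API key, and model name are assumptions to be matched to your own config.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:5000/v1", api_key="placeholder-key")

async def summarize(doc: str) -> str:
    resp = await client.chat.completions.create(
        model="local-model",  # placeholder; tabbyAPI serves whatever model you loaded
        messages=[{"role": "user", "content": f"Summarize in two sentences:\n\n{doc}"}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main():
    docs = ["first document text...", "second document text..."]
    # Firing the requests concurrently lets the server batch them on the GPU.
    summaries = await asyncio.gather(*(summarize(d) for d in docs))
    for s in summaries:
        print(s)

asyncio.run(main())
```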
Ah, a lot of good info! Thanks, I’ll look into all of that!