Local inference isn’t really the issue. Relatively low-power hardware can already do passable tokens per second on medium to large models (40B to 270B). Of course it won’t compare to an AWS Bedrock instance, but it is passable.
The reason you won’t get local AI systems - at least not completely - is the restrictive nature of the best models. Most actually good models are not open source. At best you’ll get a locally runnable GGUF (typically a quantized build) rather than fully open weights, meaning re-training potential is lost. Not to mention that most of the good, usable solutions are complex interconnected systems, so you’re not just talking to an LLM but to a series of models chained together.
But that doesn’t mean that local inference (not hyperlocal, i.e. “always on your device”, but local to your LAN) is impossible or hard. I have a £400 node running 3-4B models at lightning speed, at sub-100W (really sub-60W) power usage. For around £1500-2000 you can get a node that gets similar performance with 32-40B models. For about £4000, you can get a node that does the same with 120B models. Mind you, I’m talking about lightning-fast performance here, not passable.
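For a rough sense of why those price tiers scale the way they do, here is a back-of-the-envelope sketch of how much memory the weights alone need at each model size. The 4-bit and 16-bit figures are my assumption for illustration (a common quantized size versus half precision), not what any particular node above actually runs, and the function name is made up. It ignores KV cache and runtime overhead, so real requirements are higher.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width.
    KV cache, activations and runtime overhead are not included.
    """
    return params_billions * bits_per_weight / 8


# Model sizes roughly matching the three tiers mentioned above.
for size in (4, 40, 120):
    q4 = weight_memory_gb(size, 4)
    fp16 = weight_memory_gb(size, 16)
    print(f"{size}B: ~{q4:.0f} GB at 4-bit, ~{fp16:.0f} GB at 16-bit")
```

So a 120B model needs on the order of 60 GB for the weights even at 4-bit, which is roughly why that tier needs a much bigger box than the 3-4B one.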
I hope analog hardware or some other trick will help us in the future to make at least local inference fast and low power.