• 26 Posts
  • 237 Comments
Joined 2 years ago
cake
Cake day: September 6th, 2023

help-circle
  • I think you’re missing the point or not understanding.

    Let me see if I can clarify

    What you’re talking about is just running a model on consumer hardware with a GUI

    The article talks about running models on consumer hardware. I am making the point that this is not a new concept. The GUI is optional but, as I mentioned, llama.cpp and other open source tools provide an OpenAI-compatible api just like the product described in the article.

    We’ve been running models for a decade like that.

    No. LLMs, as we know them, aren’t that old, were a harder to run and required some coding knowledge and environment setup until 3ish years ago, give or take when these more polished tools started coming out.

    Llama is just a simplified framework for end users using LLMs.

    Ollama matches that description. Llama is a model family from Facebook. Llama.cpp, which is what I was talking about, is an inference and quantization tool suite made for efficient deployment on a variety of hardware including consumer hardware.

    The article is essentially describing a map reduce system over a number of machines for model workloads, meaning it’s batching the token work, distributing it up amongst a cluster, then combining the results into a coherent response.

    Map reduce, in very simplified terms, means spreading out compute work to highly pararelized compute workers. This is, conceptually, how all LLMs are run at scale. You can’t map reduce or parallelize LLMs any more than they already are. The article doent imply map reduce other than taking about using multiple computers.

    They aren’t talking about just running models as you’re describing.

    They don’t talk about how the models are run in the article. But I know a tiny bit about how they’re run. LLMs require very simple and consistent math computations on extremely large matrixes of numbers. The bottleneck is almost always data transfer, not compute. Basically, every LLM deployment tool is already tries to use as much parallelism as possible while reducing data transfer as much as possible.

    The article talks about gpt-oss120, so were aren’t talking about novel approaches to how the data is laid out or how the models are used. We’re talking about tranformer models and how they’re huge and require a lot of data transfer. So, the preference is try to keep your model on the fastest-transfer part of your machine. On consumer hardware, which was the key point of the article, you are best off keeping your model in your GPU’s memory. If you can’t, you’ll run into bottlenecks with PCIe, RAM and network transfer speed. But consumers don’t have GPUs with 63+ GB of VRAM, which is how big GPT-OSS 120b is, so they MUST contend with these speed bottlenecks. This article doesn’t address that. That’s what I’m talking about.


  • This is basically meaningless. You can already run gpt-OSS 120 across consumer grade machines. In fact, I’ve done it with open source software with a proper open source licence, offline, at my house. It’s called llama.cpp and it is one of the most popular projects on GitHub. It’s the basis of ollama which Facebook coopted and is the engine for LMStudio, a popular LLM app.

    The only thing you need is around 64 gigs of free RAM and you can serve gpt-oss120 as an OpenAI-like api endpoint. VRAM is preferred but llama.cpp can run in system RAM or on top of multiple different GPU addressing technologies. It has a built-in server which allows it to pool resources from multiple machines…

    I bet you could even do it over a series of high-ram phones in a network.

    So I ask is this novel or is it an advertisement packaged as a press release?











  • Maybe I do turn on too many things…

    Edit: To be clear, I edit thousands of raw photos per year and do so in bursts of hundreds. I kind of know what I want so it’s wholly possible that I used the wrong plugins. I know that was something I struggled with when I picked it up. There are 5 ways to do the same thing, the devs had a preference, the docs didn’t tell me, but it wasn’t clear what I was “supposed” to use by just using the application. Now… I could have probably gone and read change logs and release notes but that wasn’t the way I was thinking at the time…