Pumpkin Escobar

Pumpkin Escobar@lemmy.world · edit-2 6 days ago

First, a caveat/warning: you’ll need a beefy GPU to run larger models, though there are some smaller models that perform pretty well.

Adding a medium amount of extra information for you or anyone else that might want to get into running models locally

Tools

Ollama - great app for downloading/managing/running models locally
OpenWebUI - A web app that provides a UI like the ChatGPT web app, but can use local models
continue.dev - A VS Code extension that can use Ollama to give you a GitHub Copilot-like AI assistant running against a local model (it can also connect to Anthropic Claude, etc…)
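
If you'd rather script against Ollama than go through a UI, here's a minimal sketch of calling its local HTTP API from Python. It assumes Ollama is running on its default port (11434) and that you've already pulled a model; the `llama3.1:8b` tag and the helper names `build_generate_request` and `ask` are just illustrative choices on my part.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False the reply is a single JSON object whose
    # "response" field holds the full completion.
    return body["response"]


# Example usage (needs a running Ollama server and a pulled model):
# print(ask("llama3.1:8b", "Explain quantization in one sentence."))
```

The first call can take a while because the model has to be loaded into memory, which is why the timeout is generous.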

Models

If you look at https://ollama.com/library?sort=featured you can browse the available models.

Model size is measured by parameter count. Generally, higher-parameter models are better (“smarter”, more accurate), but it’s very challenging/slow to run anything over 25b parameters on consumer GPUs. I tend to find 8-13b parameter models are a sort of sweet spot. The 1-4b parameter models are meant more for really low-power devices; they’ll give you OK results for simple requests and summarizing, but they’re not going to wow you.

If you look at the ‘tags’ for the models listed below, you’ll see things like 8b-instruct-q8_0 or 8b-instruct-q4_0. The q part refers to quantization, or shrinking/compressing a model, and the number after it is roughly how many bits are kept per weight, so smaller numbers mean more aggressive compression. Note the size of each tag and how the size shrinks as the quantization gets more aggressive. You can roughly think of this size number as “how much video RAM do I need to run this model”.

For me, I try to aim for q8 models, or fp16 if they fit in my GPU. I wouldn’t try to use anything below q4 quantization; there seems to be a lot of quality loss below q4. Models can run partially or even fully on a CPU, but that’s much slower. Ollama doesn’t yet support the NPUs found in new laptops/processors, but work is happening there.
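
To put rough numbers on the “size ≈ video RAM” rule of thumb, here's a back-of-the-envelope sketch. The bits-per-weight values are approximations (the q4_0 and q8_0 figures include a small per-block scale), and the `overhead_gb` allowance for context and runtime buffers is a guess of mine, not something Ollama reports.

```python
# Approximate bits stored per weight for a few common formats.
# Treat these as rough figures, not exact file sizes.
BITS_PER_WEIGHT = {"fp16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def estimate_weights_gb(params_billions: float, quant: str) -> float:
    """Rough size of the weights in GB: parameters x bits per weight / 8."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def fits_in_vram(params_billions: float, quant: str, vram_gb: float,
                 overhead_gb: float = 1.5) -> bool:
    """Check whether weights plus an assumed overhead fit in the given VRAM."""
    return estimate_weights_gb(params_billions, quant) + overhead_gb <= vram_gb


# Print the estimate for an 8b model at each quantization level
# and whether it would fit on a 12 GB card under these assumptions.
for quant in ("fp16", "q8_0", "q4_0"):
    size = estimate_weights_gb(8, quant)
    print(f"8b {quant}: ~{size:.1f} GB, fits in 12 GB: {fits_in_vram(8, quant, 12)}")
```

Under these assumptions an 8b model needs about 16 GB at fp16, 8.5 GB at q8_0 and 4.5 GB at q4_0, which lines up reasonably well with the tag sizes shown in the Ollama library.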

Llama 3.1 - The 8b instruct model is pretty good, decent speed and good quality. This is a good “default” model to use
Llama 3.2 - This model was just released yesterday. I’m only seeing the 1b and 3b models right now. They’ve changed the 8b model to 11b, and I’m assuming the 11b model is going to be my new go-to when it’s available.
Deepseek Coder v2 - A great coding assistant model
Command-r - This is a more niche model, mainly useful for RAG. It’s only available as a 35b parameter model, so it’s not all that feasible to run locally
Mistral small - A really good model, in the ballpark of Llama. I haven’t had quite as much luck with this as with Llama, but it is good. I just saw that a new version was released 8 days ago, so I’ll need to check it out again

Pumpkin Escobar@lemmy.world · 6 days ago

It’s a good thing that real open source models are getting good enough to compete with or exceed OpenAI.

Pumpkin Escobar@lemmy.world · 11 days ago

I’ll preface by saying I think LLMs are useful and in the next couple years there will be some interesting new uses and existing ones getting streamlined…

But they’re just next word predictors. The best you could say about intelligence is that they have an impressive ability to encode knowledge in a pretty efficient way (the storage density, not the execution of the LLM), but there’s no logic or reasoning in their execution or interaction with them. It’s one of the reasons they’re so terrible at math.

Pumpkin Escobar@lemmy.world · 16 days ago

Coming from C#, then TypeScript and Next.js, Rye feels very intuitive and like a nice bridge / gateway drug into Python.

Pumpkin Escobar@lemmy.world · 1 month ago

Shoot your shot, player.

Don’t go crazy or over the top, don’t overdo it, but just say it. If they’re a good friend they won’t be scared away. If they’re like you that way you’ll both be happier.

Don’t overthink it, ask them if they’d ever like to hang out or do something more like a date.

Ballsy, direct, badass. That can be you.

Dating is awkward, but life gets a lot better once you get more comfortable with it. Everyone is a dating idiot until they’re not. There’s a good chance your friend is still in the idiot stage, and maybe he’ll be over the moon that you helped him push through it.

Pumpkin Escobar@lemmy.world · 1 month ago

Just throw an “I’m Kamala Harris and I approve this message” on the end of that, 1 commercial, in the bag. Next.

Pumpkin Escobar@lemmy.world · 1 month ago

Really love Arch and the AUR. I’ve been tempted to get Nix set up for the rare cases when there’s no AUR package or the AUR package is unmaintained. I figure if there’s no package in the AUR or nixpkgs, it’s probably not worth running.

Pumpkin Escobar@lemmy.world · 2 months ago

btop reports some GPU, network and disk information that I don’t think shows up in htop, so it feels a bit more comprehensive. Both are fine, but I too use btop; it’s nice.

Random trivia: I think btop has been rewritten like 3-5 times now? It’s sort of an inside joke to the point that someone suggested another rewrite from C++ to Rust ( https://github.com/aristocratos/btop/issues/5 ). I guess the guy just likes writing system monitoring console apps.

Pumpkin Escobar@lemmy.world · 2 months ago

Easiest shorting money I ever made.

Pumpkin Escobar@lemmy.world · 2 months ago

It’s not uncommon on sensitive stories like this for the government to loop in journalists ahead of time so they can pull together background and research, with an agreed-upon embargo until some point in the future.

This wasn’t the US government telling the newspaper they couldn’t report on a story they had uncovered from their own investigation.

Pumpkin Escobar@lemmy.world · 2 months ago

There’s quantization, which basically compresses the model by using a smaller data type for each weight. It reduces memory requirements by half or even more.
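
As a rough illustration of the idea, here's a toy version of symmetric “absmax” quantization to 8-bit integers in plain Python. Real model quantizers work block by block and are more sophisticated; the function names here are my own and not from any library.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
# Each restored value is within half a quantization step of the original.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, worst_error <= scale / 2 + 1e-12)
```

Each weight now needs 1 byte instead of 2 (fp16) or 4 (fp32), at the cost of a small rounding error per weight.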

There’s also airllm, which loads a part of the model into RAM, runs those calculations, unloads that part, loads the next part, etc… It’s a nice option, but the performance of all that loading/unloading is never going to be great, especially on a huge model like Llama 405b.
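
The load/compute/unload loop looks roughly like the toy sketch below. To be clear, this is not airllm's API or implementation: the “layers” are tiny dicts pickled to disk, and `save_layers` and `run_streaming` are hypothetical helpers that only show why memory use stays low while disk traffic goes up.

```python
import os
import pickle
import tempfile


def save_layers(layers, directory):
    """Write each toy layer to its own file and return the file paths in order."""
    paths = []
    for i, layer in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(layer, f)
        paths.append(path)
    return paths


def run_streaming(paths, values):
    """Apply the layers one at a time, keeping only one in memory at once."""
    for path in paths:
        with open(path, "rb") as f:
            layer = pickle.load(f)  # load just this layer from disk
        values = [layer["scale"] * v + layer["bias"] for v in values]
        del layer  # drop it before loading the next one
    return values


with tempfile.TemporaryDirectory() as tmp:
    toy_layers = [{"scale": 2, "bias": 1}, {"scale": 3, "bias": 0}]
    layer_paths = save_layers(toy_layers, tmp)
    print(run_streaming(layer_paths, [1, 2]))  # prints [9, 15]
```

Every forward pass re-reads every layer from disk, which is where the slowdown comes from on a real model.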

Then there are some neat projects to distribute models across multiple computers like exo and petals. They’re more targeted at a p2p-style random collection of computers. I’ve run petals in a small cluster and it works reasonably well.

Pumpkin Escobar@lemmy.world · 3 months ago

Is this the new “Simpsons already did it”?

Cunk already did it…

(3:40 if you want to get right to it) https://www.youtube.com/watch?v=UoSUx1xyj1E

Pumpkin Escobar@lemmy.world · 3 months ago

MAWP - Archer

Pumpkin Escobar@lemmy.world · edit-2 3 months ago

Yale Z-Wave locks work well, last a long time between battery replacements, and can run off of rechargeables. You can add them to Home Assistant, and they work with the Siri and Alexa integrations there.

Had some Schlage locks that ran through batteries way too fast.

Pumpkin Escobar@lemmy.world · edit-2 3 months ago

When they’re not recording your desktop in an unencrypted database for AI, boot-looping your computer with bad patches or showing ads in your start menu, they’re disabling your account for calling family to see if they’re still alive. Damn.

Pumpkin Escobar@lemmy.world · 3 months ago

Taking Ollama for instance, either the whole model runs in VRAM and compute is done on the GPU, or it runs in system RAM and compute is done on the CPU. Running models on CPU is horribly slow. You won’t want to do it for large models.

LM Studio and others allow you to run part of the model on GPU and part on CPU, splitting memory requirements, but it’s still pretty slow.

Even the smaller 7B parameter models run pretty slowly on CPU, and the huge models are orders of magnitude slower.

So technically more system RAM will let you run some larger models, but you will quickly figure out you just don’t want to do it.

Pumpkin Escobar@lemmy.world · 3 months ago

Boeing made $76B in revenue in 2023. This is slightly more than 1 day’s revenue for them ($210M / day) or a bit more than 10 days’ profit for them ($21M / day). They will keep doing what they’re doing, but increase their spending on a PR campaign to improve their public image.

Pumpkin Escobar@lemmy.world · 3 months ago

Also, on the few points where others are talking about needing other players: there’s a group finder, and I’d say most people running those raids in group-finder groups don’t talk at all, so you can just pretend they’re NPCs if you want.