this post was submitted on 02 Apr 2025
47 points (88.5% liked)

Consumer GPUs to run LLMs (lemmy.dbzer0.com)
submitted 2 days ago* (last edited 2 days ago) by [email protected] to c/[email protected]
 

Not sure if this is the right place, if not please let me know.

GPU prices in the US have been a horrific bloodbath with the scalpers recently. So for this discussion, let's keep it to MSRP and the lucky people who actually managed to afford those insane MSRPs + managed to actually find the GPU they wanted.

Which GPU are you using to run which LLMs? How is the performance of the models you've selected? On average, what size of LLM (7B, 14B, 20-24B, etc.) can you run smoothly on your GPU?

What GPU do you recommend for a decent amount of VRAM vs. price (MSRP)? If you're using a top-of-the-line RX 7900 XTX/4090/5090 with 24+ GB of VRAM, comment below with some performance estimates too.

My use case: code assistants for Terraform plus general shell and YAML, plain chat, and some image generation. And being able to still pay rent after spending all my savings on a GPU with a pathetic amount of VRAM (LOOKING AT BOTH OF YOU, BUT ESPECIALLY YOU NVIDIA, YOU JERK). I'd prefer GPUs under $600 if possible, but I also want to run models like Mistral Small, so I suppose I have no choice but to spend a huge sum of money.

Thanks


You can probably tell that I'm not very happy with the current PC consumer market but I decided to post in case we find any gems in the wild.

[–] [email protected] 4 points 1 day ago (1 children)

I got it working with my 6800 XT. I'm running DeepSeek-R1 14B (somewhere around there) and DeepSeek-Coder-V2. I have a blog post with the instructions:

https://gotosocial.michaeldileo.org/@mdileo/statuses/01JQA4M4Q33PMCADH9M2AWQSS8

[–] [email protected] 1 points 1 day ago (1 children)

Thank you. Are 14B models the biggest you can run comfortably?

[–] [email protected] 2 points 1 day ago (1 children)

The coder model only comes in that one size. The ones bigger than that are 20 GB+, and my GPU has 16 GB. I've only tried two models, but it looks like sizes balloon after that, so 14B may be the biggest I can run.

[–] [email protected] 1 points 1 day ago (1 children)

Do you have any recommendations for running the Mistral Small model? I'm very interested in it, alongside CodeLlama, Oobabooga, and others.

[–] [email protected] 0 points 1 day ago (1 children)

I haven't tried those, so not really, but with Open WebUI you can download and run anything; just make sure it fits in your VRAM so it doesn't run on the CPU. The DeepSeek one is decent. I find I like ChatGPT-4o better, but it's still good.
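As a rough sanity check on "make sure it fits in your VRAM", the weight footprint is easy to estimate by hand. A back-of-envelope sketch (the parameter counts and the ~1.5 GB runtime overhead are assumptions, not exact Ollama numbers):

```python
# Rough check of whether a model's weights fit in VRAM, so the runtime
# doesn't silently offload layers to the CPU. Back-of-envelope only:
# bytes ~= parameters * (bits per weight / 8), plus some runtime overhead.

def weights_gb(params_billions: float, quant_bits: float) -> float:
    """Approximate weight footprint in GiB."""
    return params_billions * 1e9 * quant_bits / 8 / 1024**3

def fits(params_billions: float, quant_bits: float, vram_gb: float,
         overhead_gb: float = 1.5) -> bool:
    """True if weights + assumed overhead fit in the given VRAM."""
    return weights_gb(params_billions, quant_bits) + overhead_gb <= vram_gb

# A 14B model at 4-bit is ~6.5 GiB and fits a 16 GB card easily;
# a 33B model at 4-bit (~15.4 GiB) plus overhead does not.
print(round(weights_gb(14, 4), 1), fits(14, 4, 16))  # 6.5 True
print(round(weights_gb(33, 4), 1), fits(33, 4, 16))  # 15.4 False
```

This matches the commenter's experience above: 20 GB+ models simply don't fit a 16 GB card.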

[–] [email protected] 1 points 1 day ago (1 children)

In general, how much VRAM do I need for 14B and 24B models?

[–] [email protected] 2 points 1 day ago (1 children)

It really depends on how you quantize the model, and the K/V cache as well. This is a useful calculator: https://smcleod.net/vram-estimator/. I can comfortably fit most 32B models quantized to 4-bit (usually Q4_K_M or IQ4_XS) on my 3090's 24 GB of VRAM with a reasonable context size. If you need a much larger context window to input large documents, you'd have to go smaller on model size (14B, 27B, etc.), get a multi-GPU setup, or use something with unified memory and a lot of RAM (like the Mac Minis others are mentioning).
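To get a feel for what that calculator is doing, here's a minimal sketch of the two dominant terms: quantized weights plus K/V cache. The layer/head numbers are illustrative for a 32B-class model with grouped-query attention, not any specific checkpoint:

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Quantized weight footprint in GiB (Q4_K_M averages roughly 4.5 bits/weight)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1024**3

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, cache_bits: int = 16) -> float:
    """K/V cache footprint in GiB: 2 tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * cache_bits / 8 / 1024**3

# Illustrative 32B model: 64 layers, 8 KV heads (GQA), head_dim 128.
w = weights_gb(32, 4.5)                     # ~16.8 GiB of weights
kv = kv_cache_gb(64, 8, 128, ctx_len=8192)  # 2.0 GiB at fp16 with 8k context
print(round(w, 1), round(kv, 1), round(w + kv, 1))  # 16.8 2.0 18.8 -> fits in 24 GB
```

Note how the weights dominate at short contexts, but the cache term grows linearly with context length; quantizing the cache to 8-bit halves it, which is why people quantize the K/V cache too.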

[–] [email protected] 1 points 1 day ago

Oh, and I typically get 16-20 tok/s running a 32B model on Ollama through Open WebUI. Also, I've experienced issues with 4-bit quantization of the K/V cache on some models, so just FYI.