A 10 Gbps network is MUCH slower than even the smallest, oldest PCIe slot you have. So cramming the GPUs into any old slot that'll fit is a much better option than distributing them over multiple PCs.
I agree with the idea of not using a 10 Gbps network for GPU work. Just one small nitpick: PCIe Gen 1 in an x1 slot runs at only 2.5 GT/s, which after 8b/10b encoding works out to about 2 Gbit/s of usable bandwidth, making it roughly 5x slower than a 10 Gbps line-rate network.
I sincerely hope OP is not running modern AI work on a mobo with only Gen 1...
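To put rough numbers on it, here's a back-of-the-envelope comparison (a sketch only; real-world throughput is lower once protocol overhead and latency are factored in):

```python
# Approximate usable bandwidth per PCIe lane, in Gbit/s, after encoding overhead.
# Gen1/Gen2 use 8b/10b encoding (20% overhead), Gen3+ use 128b/130b (~1.5%).
PCIE_LANE_GBPS = {
    1: 2.5 * 8 / 10,      # ~2.0 Gbit/s per lane
    2: 5.0 * 8 / 10,      # ~4.0 Gbit/s per lane
    3: 8.0 * 128 / 130,   # ~7.9 Gbit/s per lane
    4: 16.0 * 128 / 130,  # ~15.8 Gbit/s per lane
}

def pcie_gbps(gen: int, lanes: int) -> float:
    """Approximate usable bandwidth of a PCIe link in Gbit/s."""
    return PCIE_LANE_GBPS[gen] * lanes

if __name__ == "__main__":
    network = 10.0  # 10 GbE line rate in Gbit/s
    for gen, lanes in [(1, 1), (3, 4), (3, 16), (4, 16)]:
        bw = pcie_gbps(gen, lanes)
        print(f"PCIe Gen{gen} x{lanes}: ~{bw:.0f} Gbit/s "
              f"({bw / network:.1f}x a 10 GbE link)")
```

So a Gen 1 x1 slot really is the one case where 10 GbE wins; anything from a Gen 3 x4 slot upward leaves it far behind.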
Thanks for the comment. I don't want to use a networked distributed cluster for AI if I can help it. I'm looking at other options; maybe I'll find something.
Your point is valid. Originally I was looking for deals on cheap CPU + motherboard combos that would offer a lot of PCIe lanes without costing much, but I couldn't find anything good for EPYC. I'm now looking at used Supermicro motherboards and maybe I can get something I like. I don't want to do networking for this project either, but it was the only idea I could think of a few hours ago.
There are several solutions:
https://github.com/b4rtaz/distributed-llama
https://github.com/exo-explore/exo
https://github.com/kalavai-net/kalavai-client
I haven't tried any of them and haven't looked for 6 months, so maybe something better has arrived.
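For what it's worth, these tools generally expose an OpenAI/ChatGPT-style HTTP API once the cluster is up (exo advertises a ChatGPT-compatible endpoint, for instance). A rough client sketch follows; the port and model name are assumptions, so check each project's README for the real values:

```python
# Sketch: querying a distributed inference cluster through an
# OpenAI-compatible HTTP endpoint. URL and model id are assumptions.
import requests

API_URL = "http://localhost:52415/v1/chat/completions"  # assumed exo endpoint

resp = requests.post(
    API_URL,
    json={
        "model": "llama-3.2-3b",  # hypothetical model id
        "messages": [{"role": "user", "content": "Hello from the cluster"}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```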
Thank you for the links. I will go through them
I've tried Exo and it worked fairly well for me. Combined my 7900 XTX, GTX 1070, and M2 MacBook Pro.
+1 on exo, worked for me across the 7900xtx, 6800xt, and 1070ti
consumer motherboards don’t have that many PCIe slots
The number of PCIe slots isn't the most limiting factor when it comes to consumer motherboards; it's the number of PCIe lanes your CPU provides and the motherboard actually routes to those slots.
It's difficult to find non-server hardware that can do something like this, because you need a significant number of PCIe lanes to feed several GPUs at full speed. Using an M.2 SSD? Even more difficult, since each NVMe drive takes another four lanes.
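To make that concrete, here's a rough lane-budget check (the lane counts are approximate and vary by CPU generation and board layout; treat them as illustrative only):

```python
# Back-of-the-envelope PCIe lane budget. Lane counts are approximate.
def lanes_needed(gpus: int, lanes_per_gpu: int = 8, nvme_drives: int = 1) -> int:
    """Lanes required for the GPUs plus x4 per NVMe drive."""
    return gpus * lanes_per_gpu + nvme_drives * 4

platforms = {
    "typical consumer desktop": 24,
    "Threadripper (HEDT)": 64,
    "EPYC (server)": 128,
}

need = lanes_needed(gpus=4)  # 4 GPUs at x8 plus one NVMe drive = 36 lanes
for name, have in platforms.items():
    verdict = "fits" if have >= need else "doesn't fit"
    print(f"{name}: {have} CPU lanes -> {need} needed, {verdict}")
```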
Your 1-GPU-per-machine plan is a decent approach. Using a Kubernetes cluster with device plugins is likely the best way to accomplish what you want here. It would involve setting up your cluster and installing the GPU drivers and device plugin on each node, which exposes the device to Kubernetes. Then, when you create your Ollama container, ensure the GPUs are exposed to it by requesting them in the container's resources (the runtime's prestart hook takes care of injecting the devices). A rough sketch is at the end of this comment.
The issue with doing this is that 10GbE is very slow compared to a GPU on PCIe. You're networking all these GPUs together to do some cool stuff, but then you're severely bottlenecking yourself with your network. All in all, it's not a very good plan.
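For illustration, a minimal sketch of that GPU-request step using the official Python Kubernetes client. It assumes the NVIDIA device plugin is installed on each node and exposes the nvidia.com/gpu resource (AMD's plugin exposes amd.com/gpu instead); the pod name and namespace are just placeholders:

```python
# Sketch: scheduling an Ollama pod that requests one GPU via the device
# plugin resource. Assumes a working kubeconfig and a running device plugin.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="ollama-gpu"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="ollama",
                image="ollama/ollama:latest",
                ports=[client.V1ContainerPort(container_port=11434)],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # one GPU per pod/node
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```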
I agree with your assessment. I was indeed going to run k8s, just hadn't figured out what you told me. Thanks for that.
And yes, I realised that 10GbE is just not enough for this stuff. But another commenter told me to look for used Threadripper and EPYC boards (which are extremely expensive for me), which gave me the idea to look for older Intel CPU + motherboard combos. Maybe I'll have some luck there. I was going to use Talos in a VM with all the GPUs passed through to it.
Basically no GPU needs a full PCIe x16 slot to run at full speed. There are motherboards out there which will give you 3 or 4 slots of PCIe x8 electrical (x16 physical). I would look into those.
Edit: If you are willing to buy a board that supports AMD Epyc processors, you can get boards with basically as many PCIe slots as you could ever hope for. But that is almost certainly overkill for this task.
Aren't Epyc boards really expensive? I was going to buy 3-4 used computers and stuff a GPU in each.
Are there motherboards on the used market that can run the E5-2600 v4 series CPUs and have multiple PCIe x16 slots? The only ones I found were super expensive/esoteric.
Prior-gen Epyc boards show up on eBay from time to time, often as CPU+mobo bundles from Chinese datacenters that are upgrading to latest gen. These can be had for a deal, if they're still available, and would provide PCIe lanes for days.
Yeah, adding to your post, Threadripper also has lots of PCIe lanes. Here is one that has 4 x16 slots. Note that I'm not endorsing that specific listing; I did very minimal research on it and am just using it as an example.
Edit: Marauding_gibberish, if you need/want AM5: X670E motherboards have a good number of PCIe lanes and can be bought used now (X870E is the newest AM5 generation with lots of lanes as well, but both pale compared to what you can get with Epyc or Threadripper).
Thanks for the tip on x670, I'll take a look
I see. I must be doing something wrong because the only ones I found were over $1000 on eBay. Do you have any tips/favoured listings?
All I did for that one was search "Threadripper" and look at the pictures for ones with 4x x16 slots that were not hella expensive. There are technically filters for that, but I don't trust people to list their things correctly.
For which chipsets, etc. to look for, check out this page. If you click on Learn More next to AM5, for example, it tells you how many PCIe lanes are on each chipset type, which can give you some initial search criteria. (That's what made me point out X670E, as it has the most lanes but isn't the newest gen, so you can find used boards.)
Hey, I built a micro-ATX Epyc machine for work that has tons of PCIe slots. Pretty sure it was an ASRock (or ASRock Rack). I can find the details tomorrow if you'd like. Just let me know!
E: well, it looks like I remembered wrong and it was ATX, not micro. I think it's the ASRock Rack ROMED8-2T, which has 7 PCIe 4.0 x16 slots (I needed a lot). Unfortunately I don't think it's sold anymore, other than at really high prices on eBay.
Thank you, and that highlights the problem - I don't see any affordable options (around $200 or so for a motherboard + CPU combo) with a lot of PCIe lanes, other than Frankenstein boards from AliExpress. Which isn't going to be a thing for much longer with tariffs, so I'm looking elsewhere.
Yes, I inadvertently emphasized your challenge :-/
Maybe you want something like a Beowulf Cluster?
Never heard of it. What is it about?
It's a way to do distributed parallel computing using consumer-grade hardware. I don't actually know a ton about them, so you'd be better served by looking up information about them.
You're entering the realm of enterprise AI horizontal scaling which is $$$$
I'm not going to do anything enterprise. I'm not sure why people keep framing it that way when I didn't even mention it.
I plan to use 4 GPUs with 16-24GB VRAM each to run smaller 24B models.
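For rough sizing of that plan (weights only; KV cache, activations, and runtime overhead add a few more GB depending on context length):

```python
# Back-of-the-envelope VRAM needed just for the weights of a 24B model.
# Real usage is higher once KV cache and runtime overhead are included.
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * bits_per_weight / 8  # 1e9 params -> GB (decimal)

for bits in (16, 8, 4):
    print(f"24B @ {bits}-bit: ~{weight_vram_gb(24, bits):.0f} GB of weights")
# -> ~48 GB, ~24 GB, ~12 GB
```

So a 24B model at 4-bit quantization fits comfortably on a single 16-24 GB card, which is part of why a fast interconnect matters less for this size class.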
I didn't say you were, I said you were asking about a topic that enters that area.
I see. Thanks
well that looks like small enterprise scale
If you want to use supercomputer software, set up the SLURM scheduler on those machines. There are many tutorials on how to do distributed GPU computing with SLURM (a rough job-submission sketch follows the links below). I have it on my todo list.
https://github.com/SchedMD/slurm
https://slurm.schedmd.com/
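A minimal sketch of what submitting a multi-node GPU job could look like; the resource flags are generic examples and run_inference.py is a hypothetical placeholder for whatever workload you actually distribute:

```python
# Sketch: generate and submit a simple multi-node GPU job to SLURM.
# Flags and script contents are illustrative; adjust for your cluster.
import subprocess
from pathlib import Path

job = """#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

srun python run_inference.py
"""

script = Path("gpu_job.sbatch")
script.write_text(job)
subprocess.run(["sbatch", str(script)], check=True)
```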
Ignorant here. Would mining rigs work for this?
I think yes
I know nothing technical to help you. But this guy’s YouTube video goes over random shit about using different computers. I believe he uses thunderbolt 4 to connect the systems, though. Plenty of other material on YouTube, as well.
Sure, that works fine for inference with tensor parallelism. USB4 / Thunderbolt 4/5 is a better bet than Ethernet (40 Gbit+ and already there; see distributed-llama). It's trash for training / fine-tuning, though, which needs much higher inter-GPU bandwidth, or better yet a single GPU with more VRAM.