this post was submitted on 09 Apr 2025
38 points (80.6% liked)

Selfhosted


The problem is simple: consumer motherboards don't have that many PCIe slots, and consumer CPUs don't have enough lanes to run 3+ GPUs at full PCIe gen 3 or gen 4 speeds.

My idea was to buy 3-4 cheap computers, slot a GPU into each, and run them in tandem. I imagine this will require some sort of agent running on each node, with the nodes connected over a 10 GbE network. I can get a 10 GbE network running for this project.

Does Ollama or any other local AI project support this? Getting a server motherboard with CPU is going to get expensive very quickly, but this would be a great alternative.

Thanks

[–] [email protected] 33 points 1 week ago (2 children)

A 10 Gbps network is MUCH slower than even the smallest, oldest PCIe slot you have. So cramming the GPUs into any old slot that'll fit is a much better option than distributing them over multiple PCs.

[–] [email protected] 12 points 1 week ago (1 children)

I agree with the idea of not using a 10 Gbps network for GPU work. Just one small nitpick: a PCIe Gen 1 x1 slot runs at 2.5 GT/s, which after 8b/10b encoding translates to about 2 Gbit/s, making it roughly 5x slower than a 10 Gbps line-rate network.

I sincerely hope OP is not running modern AI work on a mobo with only Gen 1...
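To put rough numbers on the comparison, here's a quick back-of-the-envelope sketch (per-lane transfer rates and encoding efficiencies are the nominal spec values; real-world throughput will be lower):

```python
# Rough effective bandwidth of a PCIe link per generation vs. 10 GbE.
# Gen 1/2 use 8b/10b encoding (80% efficient); Gen 3/4 use 128b/130b (~98.5%).
PCIE_GENS = {
    1: (2.5, 8 / 10),     # GT/s per lane, encoding efficiency
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
}

def pcie_gbit_per_s(gen: int, lanes: int) -> float:
    """Effective data rate of a PCIe link in Gbit/s."""
    gt_per_s, efficiency = PCIE_GENS[gen]
    return gt_per_s * efficiency * lanes

if __name__ == "__main__":
    print(f"PCIe Gen1 x1 : {pcie_gbit_per_s(1, 1):6.1f} Gbit/s")   # ~2 Gbit/s
    print(f"PCIe Gen3 x4 : {pcie_gbit_per_s(3, 4):6.1f} Gbit/s")   # ~31.5 Gbit/s
    print(f"PCIe Gen4 x16: {pcie_gbit_per_s(4, 16):6.1f} Gbit/s")  # ~252 Gbit/s
    print( "10 GbE       :   10.0 Gbit/s (line rate)")
```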

[–] [email protected] 3 points 1 week ago

Thanks for the comment. I don't want to use a networked distributed cluster for AI if I can help it. I'm looking at other options, and maybe I'll find something.

[–] [email protected] 2 points 1 week ago

Your point is valid. Originally I was looking for deals on cheap CPU + motherboard combos that would offer a lot of PCIe lanes, but I couldn't find anything good for EPYC. I'm now looking at used Supermicro motherboards and maybe I'll find something I like. I didn't want to do networking for this project either, but it was the only idea I could think of a few hours ago.

[–] [email protected] 13 points 1 week ago (1 children)

There are several solutions:

https://github.com/b4rtaz/distributed-llama

https://github.com/exo-explore/exo

https://github.com/kalavai-net/kalavai-client

https://petals.dev/

I haven't tried any of them and haven't looked in 6 months, so maybe something better has arrived.
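As a taste of what the distributed approach looks like in practice, Petals exposes a transformers-style Python API where the model's layers are served by remote peers. A minimal sketch based on its documented usage (the model name is just an example; check the project's README for what's currently supported):

```python
# Minimal Petals sketch: the model's transformer blocks are served by other
# peers in the swarm, so only the embeddings/LM head run locally.
# pip install petals transformers
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

MODEL = "meta-llama/Llama-2-70b-chat-hf"  # example model id; availability varies

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL)

inputs = tokenizer("A self-hosted GPU cluster is", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0]))
```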

[–] [email protected] 4 points 1 week ago (1 children)

Thank you for the links. I will go through them

[–] [email protected] 2 points 1 week ago (1 children)

I've tried Exo and it worked fairly well for me. Combined my 7900 XTX, GTX 1070, and M2 MacBook Pro.

[–] [email protected] 2 points 1 week ago

+1 on exo, worked for me across the 7900xtx, 6800xt, and 1070ti
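For reference, once an exo cluster is up you talk to it through a ChatGPT-compatible HTTP endpoint on whichever node you query, so any OpenAI-style client works. A minimal sketch (the port and model id below are assumptions; check exo's README for your version's defaults):

```python
# Query an exo node via its ChatGPT-compatible endpoint using plain HTTP.
# Assumed: exo is running locally and listening on port 52415
# (verify the default port and available model ids for your version).
import json
import urllib.request

payload = {
    "model": "llama-3.2-3b",  # example model id; depends on what your cluster serves
    "messages": [{"role": "user", "content": "Say hello from the cluster."}],
}
req = urllib.request.Request(
    "http://localhost:52415/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```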

[–] [email protected] 9 points 1 week ago (1 children)

consumer motherboards don’t have that many PCIe slots

The number of PCIe slots isn't the most limiting factor when it comes to consumer motherboards. It's the number of PCIe lanes your CPU supports and the motherboard gives you access to.

It's difficult to find non-server hardware that can do something like this, because you need a significant number of PCIe lanes from the CPU to run several GPUs at full speed. Using an M.2 SSD? Even more difficult.

Your one-GPU-per-machine idea is a decent approach. Using a Kubernetes cluster with device plugins is likely the best way to accomplish what you want here. It would involve setting up your cluster and installing the GPU drivers on each node, which then exposes the device to the system. Then, when you create your Ollama container, make sure the prestart hook exposes your GPU to the container for use.
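For what it's worth, here's a minimal sketch of that setup using the official Kubernetes Python client. It assumes the NVIDIA device plugin DaemonSet is already installed on each node (that's what advertises `nvidia.com/gpu` as a schedulable resource); the image tag and names are placeholders:

```python
# Sketch: schedule an Ollama pod that requests one GPU via the device plugin.
# Assumes the NVIDIA device plugin is already running on each node, which is
# what makes the "nvidia.com/gpu" resource visible to the scheduler.
# pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="ollama-gpu-0"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="ollama",
                image="ollama/ollama:latest",            # example image tag
                ports=[client.V1ContainerPort(container_port=11434)],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},       # one GPU per pod/node
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```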

The issue with doing this is that 10 GbE is very slow compared to a GPU on PCIe. You're networking all these GPUs to do some cool stuff, but then you're severely bottlenecking yourself with your network. All in all, it's not a very good plan.

[–] [email protected] 2 points 1 week ago

I agree with your assessment. I was indeed going to run k8s, just hadn't figured out what you told me. Thanks for that.

And yes, I realised that 10 GbE is just not enough for this stuff. But another commenter told me to look for used Threadripper and EPYC boards (which are extremely expensive for me), which gave me the idea to look for older Intel CPU + motherboard combos. Maybe I'll have some luck there. I was going to use Talos in a VM with all the GPUs passed through to it.

[–] [email protected] 9 points 1 week ago* (last edited 1 week ago) (1 children)

Basically no GPU needs a full PCIe x16 slot to run at full speed. There are motherboards out there which will give you 3 or 4 slots of PCIe x8 electrical (x16 physical). I would look into those.

Edit: If you are willing to buy a board that supports AMD Epyc processors, you can get boards with basically as many PCIe slots as you could ever hope for. But that is almost certainly overkill for this task.

[–] [email protected] 2 points 1 week ago (5 children)

Aren't Epyc boards really expensive? I was going to buy 3-4 used computers and stuff a GPU in each.

Are there motherboards on the used market that can run the E5-2600 v4 series CPUs and have multiple PCIe x16 slots? The only ones I found were super expensive/esoteric.

[–] [email protected] 4 points 1 week ago* (last edited 1 week ago) (1 children)

Prior-gen Epyc boards show up on eBay from time to time, often as CPU+mobo bundles from Chinese datacenters that are upgrading to latest gen. These can be had for a deal, if they're still available, and would provide PCIe lanes for days.

[–] [email protected] 3 points 1 week ago* (last edited 1 week ago) (2 children)

Yeah, adding to your post, Threadripper also has lots of PCIe lanes. Here is one that has 4 x16 slots. And, note, I am not endorsing that specific listing. I did very minimal research on that listing, just using it as an example.

Edit: Marauding_gibberish, if you need/want AM5: X670E motherboards have a good number of PCIe lanes and can be bought used now (X870E is the newest-gen AM5 chipset with lots of lanes as well, but both pale compared to what you can get with Epyc or Threadripper).

[–] [email protected] 2 points 1 week ago

Thanks for the tip on x670, I'll take a look

[–] [email protected] 2 points 1 week ago (1 children)

I see. I must be doing something wrong because the only ones I found were over $1000 on eBay. Do you have any tips/favoured listings?

[–] [email protected] 1 points 1 week ago* (last edited 1 week ago)

All I did for that one was search "Threadripper" and look at the pictures for ones with 4x x16 slots that were not hella expensive. There are technically filters for that, but I don't trust people to list their things correctly.

For which chipsets etc. to look for, check out this page. If you click on "Learn More" next to AM5, for example, it tells you how many PCIe lanes are on each chipset type, which can give you some initial search criteria. (That is what made me point out X670E, as it has the most lanes but is not the newest gen, so you can find used versions.)

[–] [email protected] 3 points 1 week ago* (last edited 1 week ago) (1 children)

Hey, I built a micro-ATX EPYC system for work that has tons of PCIe slots. Pretty sure it was an ASRock (or ASRock Rack). I can find the details tomorrow if you'd like. Just let me know!

E: well, it looks like I remembered wrong and it was ATX, not micro. I think it is the ASRock Rack ROMED8-2T, and it has 7 PCIe 4.0 x16 slots (I needed a lot). Unfortunately I don't think it's sold anymore, other than at really high prices on eBay.

[–] [email protected] 2 points 1 week ago (1 children)

Thank you, and that highlights the problem: I don't see any affordable options (around $200 or so for a motherboard + CPU combo) with a lot of PCIe lanes, other than Frankenstein boards from AliExpress. Which isn't going to be a thing for much longer with tariffs, so I'm looking elsewhere.

[–] [email protected] 4 points 1 week ago

Yes, I inadvertently emphasized your challenge :-/

[–] [email protected] 6 points 1 week ago (2 children)
[–] [email protected] 2 points 1 week ago

That looks interesting. Might have to check it out.

[–] [email protected] 1 points 1 week ago

Thank you, I'll take a look

[–] [email protected] 5 points 1 week ago (1 children)

Maybe you want something like a Beowulf Cluster?

[–] [email protected] 2 points 1 week ago (2 children)

Never heard of it. What is it about?

[–] [email protected] 3 points 1 week ago (1 children)

It's a way to do distributed parallel computing using consumer-grade hardware. I don't actually know a ton about them, so you'd be better served by looking up information about them.

https://en.wikipedia.org/wiki/Beowulf_cluster
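If you want a feel for what running something on a Beowulf-style cluster looks like, the classic entry point is MPI: many small machines, one program, one process per node. A minimal sketch with mpi4py (the hostfile and process count are placeholders you'd supply to mpirun):

```python
# Classic Beowulf-style "hello world": every process reports its rank.
# Run across nodes with something like:
#   mpirun -np 4 --hostfile hosts python hello_mpi.py
# pip install mpi4py  (requires an MPI implementation such as Open MPI)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()            # this process's id within the job
size = comm.Get_size()            # total number of processes
name = MPI.Get_processor_name()   # hostname of the node running this rank

print(f"Hello from rank {rank} of {size} on {name}")
```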

[–] [email protected] 5 points 1 week ago (2 children)

You're entering the realm of enterprise AI horizontal scaling, which is $$$$.

[–] [email protected] 2 points 1 week ago (4 children)

I'm not going to do anything enterprise. I'm not sure why people keep assuming that when I didn't even mention it.

I plan to use 4 GPUs with 16-24GB VRAM each to run smaller 24B models.
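For rough sizing, here's the back-of-the-envelope math for a 24B model (the bit-widths are the usual quantization options; KV cache and runtime overhead come on top and depend on context length):

```python
# Back-of-the-envelope VRAM estimate for a 24B-parameter model.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"24B @ {bits:2d}-bit weights: ~{weight_gb(24, bits):5.1f} GB"
          " (+ KV cache and runtime overhead)")

# 16-bit: ~48 GB -> has to be split across several cards
#  8-bit: ~24 GB -> borderline on a single 24 GB card
#  4-bit: ~12 GB -> fits comfortably on one 16-24 GB GPU
```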

[–] [email protected] 6 points 1 week ago (1 children)

I didn't say you were, I said you were asking about a topic that enters that area.

[–] [email protected] 2 points 1 week ago

I see. Thanks

[–] [email protected] 4 points 1 week ago (1 children)

Well, that looks like small enterprise scale.

[–] [email protected] 1 points 1 week ago (1 children)

If you consider 4 B580s as enterprise, sure I guess

[–] [email protected] 4 points 1 week ago* (last edited 1 week ago) (5 children)

If you want to use supercomputer software, set up the SLURM scheduler on those machines. There are many tutorials on how to do distributed GPU computing with SLURM. I have it on my todo list.
https://github.com/SchedMD/slurm
https://slurm.schedmd.com/
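As a rough sketch of how a SLURM job maps onto distributed GPU work: SLURM launches one task per GPU and exposes rank/size through environment variables, which a framework like torch.distributed can consume. The script below is a minimal, assumption-heavy example (the hostnames and the srun invocation are placeholders):

```python
# Minimal sketch of a SLURM-launched distributed job: each task reads its
# rank/world size from SLURM's environment and joins an NCCL process group.
# Launched with something like:
#   srun --nodes=4 --ntasks-per-node=1 --gpus-per-task=1 python train.py
import os
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

# MASTER_ADDR/MASTER_PORT must point at one of the nodes; in a batch script
# this is often derived from `scontrol show hostnames $SLURM_JOB_NODELIST`.
os.environ.setdefault("MASTER_ADDR", "node01")  # placeholder hostname
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group("nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)

print(f"rank {rank}/{world_size} using GPU {local_rank} on {os.uname().nodename}")
dist.destroy_process_group()
```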

[–] [email protected] 4 points 1 week ago (1 children)

Ignorant here. Would mining rigs work for this?

[–] [email protected] 2 points 1 week ago

I think yes

[–] [email protected] 1 points 1 week ago (1 children)

I know nothing technical to help you, but this guy's YouTube video goes over random shit about using different computers. I believe he uses Thunderbolt 4 to connect the systems, though. Plenty of other material on YouTube as well.

[–] [email protected] 1 points 1 week ago

Sure, it works fine for inference with tensor parallelism. USB4 / Thunderbolt 4/5 is a better bet than Ethernet (40 Gbit+ and already there; see distributed-llama). It's trash for training / fine-tuning, though, which needs much higher inter-GPU bandwidth, or better yet a single GPU with more VRAM.
