this post was submitted on 14 Jun 2024

Can a System Handle Brown/Blackouts on only the GPU? (infosec.pub)

submitted 4 months ago by [email protected] to c/[email protected]

I am planning to build a multipurpose home server. It will be a NAS, virtualization host, and have the typical selfhosted services. I want all of these services to have high uptime and be protected from power surges/blackouts, so I will put my server on a UPS.

I also want to run an LLM server on this machine, so I plan to add one or more GPUs and pass them through to a VM. I do not care about high uptime on the LLM server. However, this of course means that I will need a more powerful UPS, which I do not have the space for.

My plan is to get a second power supply to power only the GPUs. I do not want to put this PSU on the UPS. I will turn on the second PSU via an Add2PSU.

In the event of a blackout, this means that the base system will get full power and the GPUs will get power via the PCIe slot, but they will lose the power from the dedicated power plug.

Obviously this will slow down or kill the LLM server, but will this have an effect on the rest of the system?

[–] [email protected] 17 points 4 months ago (3 children)

The amount of absolutely wrong answers in here is astounding.

NO. PCIE is not plug and play. Moreover, having a dead PCIE device that was previously accepting information, and then suddenly stops, is almost guaranteed to cause a kernel panic on any OS because of an overflowing bus of tons of data that can't just sit there waiting. It's a house of cards at that point. It's also going to possibly harm the physical machine when the power comes back on due to a sudden influx of power from an outside PSU powering up a device not meant for such things.

Why wouldn't you instead think of maybe NOT running an insane workload on such a machine with insanely power hungry GPUs, and maybe go for an AMD APU instead? Then you'll get all the things you want.

[–] [email protected] 16 points 4 months ago* (last edited 4 months ago) (2 children)

PCIe is absolutely plug and play. Cards have been PnP since the ISA era. You probably meant hot-plug, but it's hot-pluggable too: https://lwn.net/Articles/767885/

Any buffered data will sit in the buffer, and eventually be dropped. Any data sent to the buffer while the buffer is full will be dropped. I'm not intimately familiar with communicating with GPUs, but I imagine the only buffers are in the GPU driver (which would either handle the removal or crash) or in the application (which would probably not handle the removal and just crash). Buffering is not really where I would expect to see a problem.

That said, a GPU disappearing unexpectedly will probably crash your program, if not your whole OS. Physical damage is unlikely, though I definitely wouldn't recommend connecting two PSUs to one system due to the potential for unexpected... well, potential. Inrush current wouldn't really be my concern, since it would be pulling from the external PSU which should have plenty of capacity (and over-current protection too, I would hope). And it's mostly a concern for AC systems, rarely for DC.

[–] [email protected] 1 points 4 months ago (2 children)

What's wrong with 2 PSUs if both of them are connected to the same ground? I thought multiple PSUs were common in the server space too.

[–] [email protected] 3 points 4 months ago

Server PSUs are designed to be identical and work in parallel (though depending on platform, they can be configured as primary/hot spare, too). I'd be concerned about potential difference in power, especially with two non-matching PSUs. It would probably be fine, but not probably enough for me to trust my stuff to it. They're just not designed or tested to operate like that, so they may behave unexpectedly.

[–] [email protected] -3 points 4 months ago (3 children)

You are mistaking "plug and play" for "hot swap/plug CAPABLE". The spec allows for specifically designed hardware to come and go, like ExpressCard, Thunderbolt, or USB4 lane-assigned devices, for example. That's a feature built for a specific type of hardware to tolerate things like accepting current, or having a carrier chip at least communicating with the PCIE bridge that designates its current status. Almost all of these types of devices are not only designed for this, they are powered by the hardware they are plugged into, allowing that power to be negotiated and controlled by the bridge.

NOT like a giant GPU that requires its own power supply, current, and ground.

But hey, you read it on the Internet and seem to think it's possible. Go ahead and try it out with your hardware and see what happens.

[–] [email protected] 4 points 4 months ago (1 children)

Dude.... you're the one that said PCIE isn't plug and play, which is incorrect. Plug and play simply means not having to manually assign IRQ/DMA/etc before using the peripheral, instead being handled automatically by the system/OS, as well as having peripherals identify themselves allowing the OS to automatically assign drivers. PCIE is fully plug-and-play compatible via ACPI, and hot swapping is supported by the protocol, if the peripheral also supports it.

[–] [email protected] -5 points 4 months ago (1 children)

Again...it is not. You can't just go and unplug or swap anything anywhere in a PCIE slot. The protocol supports it, but it is not by any definition live swappable by default.

My speedometer says 200, but my car does not go that fast.

An egg isn't an omelet.

The statement "humans can fly" is technically true, but not without a plane.

A device that supports hot swap into a compatible and specifically configured slot could be though.

I can keep going forever with this.

[–] [email protected] 1 points 4 months ago (1 children)

Are you slow? Nobody is arguing that you can hot swap a GPU. That's not what people are correcting you on.

YOU claimed that PCIE is not PLUG AND PLAY

NO. PCIE is not plug and play.

That was your comment. It was wrong. You were wrong.

[–] [email protected] -2 points 4 months ago

And it still is not.

[–] [email protected] 1 points 4 months ago (1 children)

Right, it requires device support. And most GPUs won't support it. But it's by no means impossible.

I've got some junk hardware at work, I'll try next time I'm in and let you know.

[–] [email protected] -3 points 4 months ago

You have multiple accounts, and are sadly so consumed with Internet points, you used both of them to downvote when you're wrong. You're pathetic. Get a hobby. Maybe learning about hardware!

[–] [email protected] 5 points 4 months ago

I do something similar to OP. However, running LLMs is what finally convinced me to switch over to Kubernetes for these exact reasons: I needed the ability to have GPUs running on separate nodes that I could then toggle on or off. Power concerns here are real; the only real solution is to separate your storage and your compute nodes.

What OP is suggesting is not only not going to work, and will probably damage the motherboard and GPUs, but I would assume it is also a pretty large fire hazard. One GPU takes in an insane amount of power; two GPUs is not something to sneeze at. It's worth the investment of getting a very good power supply and not cheaping out on any components.

[–] [email protected] 1 points 4 months ago

You're forgetting that the card would still be receiving its 75W of power from the PCIe bus. This is what powers cards that don't have extra power connectors.
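To put the 75W figure in perspective, here is a minimal arithmetic sketch. It only illustrates how much of a card's rated draw the slot alone could cover. The 300 W card is a hypothetical example, not a card mentioned in this thread, and the function name is made up for illustration.

```python
# Rough power-budget sketch: how much of a GPU's rated draw is still
# available when only the PCIe slot is supplying power.
PCIE_SLOT_LIMIT_W = 75.0  # slot power figure quoted in the thread


def slot_only_fraction(board_power_w: float,
                       slot_limit_w: float = PCIE_SLOT_LIMIT_W) -> float:
    """Return the fraction of rated board power the slot alone can cover."""
    if board_power_w <= 0:
        raise ValueError("board_power_w must be positive")
    return min(1.0, slot_limit_w / board_power_w)


# Hypothetical cards: one with no extra connector, one rated at 300 W.
for name, watts in [("75 W card", 75.0), ("300 W card", 300.0)]:
    print(f"{name}: slot covers {slot_only_fraction(watts):.0%} of rated power")
```

Whether a given card degrades gracefully or simply stops working at that fraction depends on the card and its firmware, which this sketch does not model.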

[–] [email protected] 11 points 4 months ago (1 children)

This is a terrible idea, no really.

Any system that shares power and grounds (i.e. on the same bus), keep on the same power supply/domain.

Even, if!!!!, it doesn't fry your computer when one power system goes off but the other stays on - the system will absolutely not be stable, and will behave in unexpected ways.

DO NOT DO THIS.

[–] [email protected] 3 points 4 months ago

Their computer:

[GIF]

[–] [email protected] 10 points 4 months ago (1 children)

I think the safe option would be to use a smart UPS and Network UPS Tools to shut down the LLM virtual machine when it's running on battery. I do something similar with my NAS as it's running on an older Dell R510, so when the UPS goes onto battery it'll safely shut down that whole machine to extend how long my networking gear will stay powered.
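A minimal sketch of that idea, assuming the VM is managed by libvirt (the thread does not say which hypervisor OP uses) and that NUT's `upsc` client is installed. The UPS name `myups@localhost` and the VM name `llm-vm` are placeholders. NUT itself can also trigger commands on power events through its own monitoring daemon, so treat this as an illustration of the logic rather than the recommended setup.

```python
import subprocess


def on_battery(ups_status: str) -> bool:
    """NUT reports space-separated flags in ups.status; 'OB' means on battery."""
    return "OB" in ups_status.split()


def read_ups_status(ups_name: str) -> str:
    # `upsc <ups> ups.status` prints just the value of that one variable.
    result = subprocess.run(
        ["upsc", ups_name, "ups.status"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def shutdown_vm_if_on_battery(ups_name: str, vm_name: str) -> bool:
    """Ask libvirt to shut the VM down cleanly when the UPS is on battery."""
    if not on_battery(read_ups_status(ups_name)):
        return False
    # Assumption: the LLM VM is a libvirt domain, so `virsh shutdown` applies.
    subprocess.run(["virsh", "shutdown", vm_name], check=True)
    return True
```

Calling `shutdown_vm_if_on_battery("myups@localhost", "llm-vm")` from a timer or cron job would cover the simple case; the status check is kept as a pure function so it can be tested without a UPS attached.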

[–] [email protected] 1 points 4 months ago (1 children)

I've wanted to implement something like that with my 1920R UPS for my rack but haven't found the time to commit to antiquated hardware.

Was enough of a hassle dealing with the expired SSL certs on the management card, let alone getting software running on one of my machines to communicate with the UPS.

All things considered my two servers chilling chew around 60w on average, not taking into account my POE cameras or other devices. The UPS should run for over a day without getting close to draining its batteries (have a half populated ebm too).

[–] [email protected] 2 points 4 months ago (1 children)

wanted to implement something like that with my 1920R UPS for my rack but haven't found the time to commit to antiquated hardware.

Was enough of a hassle dealing with the expired SSL certs on the management card, let alone getting software running on one of my machines to communicate with the UPS.

Honestly you should just bypass Dell's management software and use NUT. It supports your UPS's management card if you enable SNMP, or you can bypass it altogether and just run off of the USB/serial.

All things considered my two servers chilling chew around 60w on average, not taking into account my POE cameras or other devices. The UPS should run for over a day without getting close to draining its batteries (have a half populated ebm too).

I'm pretty surprised I can run my whole network for an hour off of my 1500 VA UPS with three switches and a handful of POE devices. I'm still thinking about replacing it with a rack mount unit so I can lock it inside my rack as I've been having issues with unauthorized people messing with it.

[–] [email protected] 1 points 4 months ago

Yeah, NUT is the package I've been looking at, and it looks decently integrated into NixOS. It's just that getting around to configuring it is another time sink.

My 1920R and an unused 15A Dell rackmount (h967n maybe) came with my rack, I've got no reason to have two UPS running nor do I want to replace the batteries in another UPS or have a 15A socket installed in the house just yet. But man it's tempting to piggy-back it off the 1920R for shits and giggles.

Waiting on some parts to arrive from AliExpress - once they arrive I'll be able to decommission one of my servers and have all my services running off one board.

[–] [email protected] 7 points 4 months ago

What you want are two servers, one for each purpose. What you are proposing is very janky and will compromise the reliability of your services.

[–] [email protected] 7 points 4 months ago

You could accomplish what you're trying by putting the GPU in a second computer. Further, most UPSes have a data interface, so you could have the GPU computer plugged into the UPS too, but have it receive the signal when power is out, so it can save its work and shut down quickly, preserving power in the UPS batteries. The only concern there would be whether the max current output of the UPS can power both computers for a short time during an outage.
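The capacity concern at the end can be checked with simple arithmetic. The sketch below is only an illustration: the 900 W rating, the 150 W and 600 W loads, and the 80% headroom factor are all made-up numbers, and the function name is hypothetical. Real figures should come from the UPS datasheet and a wall-power meter.

```python
from typing import Iterable


def ups_can_carry(load_watts: Iterable[float],
                  ups_watt_rating: float,
                  headroom: float = 0.8) -> bool:
    """True if the combined load stays within a fraction of the UPS watt rating.

    `headroom` is an assumed safety margin, not a value from any standard.
    """
    return sum(load_watts) <= ups_watt_rating * headroom


# Hypothetical numbers: base server alone, then base server plus GPU machine.
print(ups_can_carry([150.0], ups_watt_rating=900.0))
print(ups_can_carry([150.0, 600.0], ups_watt_rating=900.0))
```

If the second check fails, the GPU machine would have to stay off the UPS, or the UPS would need a higher rating, which is the space problem OP started with.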

[–] [email protected] 6 points 4 months ago

It looks like regular PSUs are isolated from the mains ground with a transformer. That means that two PSUs’ DC grounds will not be connected. That will likely cause problems for you, as they’ll have to back flow current in places that do NOT expect back flow current to account for the voltage differences between the two ground potentials. Hence it might damage the GPU which is going to be the mediator between these two PSUs - and maybe the mobo if everything goes to shit.

Now I am not saying this will be safe, but you may avoid that issue by tying the grounds of the two PSUs together. You still have the issue where if, say, PSU1’s 12V voltage plane meets PSU2’s 12V voltage plane and they’re inevitably not the same exact voltage, you’ll have back flowing current again which is bad because again nothing is designed for that situation. Kind of like if you pair lithium batteries in parallel that aren’t matched, the higher voltage one will back charge the other and they’ll explode.
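To see why a small mismatch between two 12V rails matters, here is an idealized Ohm's-law sketch. It treats each PSU as a perfect voltage source joined by a single resistance, which real supplies are not (they regulate, and many cannot sink current at all), so the result is only an order-of-magnitude illustration. The voltages and the 10 milliohm path resistance are assumed values.

```python
def equalizing_current_amps(v_a: float, v_b: float,
                            path_resistance_ohms: float) -> float:
    """Ohm's law: current driven by the voltage difference across a shared path."""
    if path_resistance_ohms <= 0:
        raise ValueError("path_resistance_ohms must be positive")
    return (v_a - v_b) / path_resistance_ohms


# Assumed example: two rails that are both within the usual 5% tolerance
# of 12 V, connected through 10 milliohms of copper.
current = equalizing_current_amps(12.10, 11.95, 0.010)
print(f"{current:.1f} A would try to flow between the rails")
```

Because the connecting path has such low resistance, even a fraction of a volt of mismatch corresponds to a large current in this simplified model.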

[–] [email protected] 4 points 4 months ago

It sounds like a fire hazard. Do not mix power supplies, especially if only one is plugged into a UPS.

[–] [email protected] 2 points 4 months ago

The proper way of doing this is to have two separate systems in a cluster such as Proxmox. The system with GPUs runs certain workloads and the non-GPU system runs other workloads.

Each system can be connected (or not) to a UPS, shut down during a power outage, and then boot back up when power is back.

Don't try hot-plugging a GPU; it will never be reliable.

Run a Proxmox cluster or Kubernetes cluster. It is designed for this type of application but will add a fair amount of complexity.

[–] [email protected] 2 points 4 months ago

I imagine this would be up to the application. What you’re describing would be seen by the OS as the device becoming unavailable. That won’t really affect the OS. But, it could cause problems with the drivers and/or applications that are expecting the device to be available. The effect could range from “hm, the GPU isn’t responding, oh well” to a kernel panic.
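On Linux, an application that wants to land on the "oh well" end of that range can at least check whether the kernel still lists the device before submitting work. This is a minimal sketch assuming Linux sysfs; the address `0000:01:00.0` is a placeholder. A card that has lost its auxiliary power may still be listed while being unusable, so a positive result is not proof that the GPU works.

```python
from pathlib import Path


def pci_device_present(address: str,
                       sysfs_root: Path = Path("/sys/bus/pci/devices")) -> bool:
    """Return True if Linux currently lists a PCI device at this address."""
    return (sysfs_root / address).exists()


# Placeholder address; the real one can be found with `lspci -D`.
if not pci_device_present("0000:01:00.0"):
    print("GPU not listed by the kernel; skipping GPU work")
```

The `sysfs_root` parameter exists only so the check can be exercised against a temporary directory in tests.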

[–] [email protected] 1 points 4 months ago

Most UPS systems of quality will come with software capabilities. You can leverage this and just use a daemon to check the charge status every minute or so. If it's ever off AC or reporting charge levels lowering, you can toss the system into a low power profile. This might accomplish what you're trying to do.
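As a sketch of that daemon's decision logic, assuming the UPS is exposed through NUT: `upsc <ups>` prints one `name: value` pair per line, including `ups.status` and `battery.charge`. The profile names and the 50% threshold below are invented for illustration, and actually applying a profile (for example capping GPU power) is left out.

```python
from typing import Dict


def parse_upsc(output: str) -> Dict[str, str]:
    """Parse `upsc <ups>` output, which is one 'name: value' pair per line."""
    values: Dict[str, str] = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            values[name.strip()] = value.strip()
    return values


def choose_profile(values: Dict[str, str], low_charge_pct: float = 50.0) -> str:
    """Pick a (hypothetical) power profile from UPS status and battery charge."""
    status = values.get("ups.status", "").split()
    charge = float(values.get("battery.charge", "100"))
    if "OB" not in status:  # 'OB' is NUT's on-battery flag
        return "normal"
    return "shutdown" if charge < low_charge_pct else "low-power"


sample = "battery.charge: 42\nups.status: OB\n"
print(choose_profile(parse_upsc(sample)))
```

A loop that runs `upsc`, feeds the text to `parse_upsc`, and sleeps for 60 seconds between checks would match the "every minute or so" cadence described above.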

[–] [email protected] 1 points 4 months ago (1 children)

Nope. I actually did that unintentionally on a PC I built. I only used one power wire when the GPU needed 2 so it couldn't use all the power it needed when running 100%. My understanding was PCI doesn't support disconnecting devices so the system expects all components it starts up with to be available all the time. Lose one and the system goes down.

[–] [email protected] 6 points 4 months ago* (last edited 4 months ago) (3 children)

PCIe absolutely does support disconnecting devices. It is a hot swap bus; that's how ExpressCard works. But it doesn't mean that the board/UEFI implements it correctly.

[–] [email protected] 3 points 4 months ago

In other words: OP either needs to get a Thunderbolt dock or straight up have 2 computers. The latter should not even consume that much more power if the PC gets shut down in the evening and woken up using wakeonlan in the morning.
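For the wake-up half, a Wake-on-LAN magic packet is simple enough to build by hand: six `0xFF` bytes followed by the target MAC address repeated 16 times, usually sent as a UDP broadcast. The sketch below assumes the GPU machine has Wake-on-LAN enabled in its firmware and NIC settings; the broadcast address and port 9 are common defaults, not values from this thread.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 bytes")
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast on the local network."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))
```

The always-on server could call `send_wol` with the GPU machine's MAC address from a morning timer, and the evening shutdown can be an ordinary scheduled shutdown on the GPU machine.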

[–] [email protected] 1 points 4 months ago

Oh nice! I knew hot swapping was supported on many other devices but not PCIe itself. Feels wrong to rip a card out while the system is powered up.

[–] [email protected] 1 points 4 months ago

Also, some GPUs support running without the external power connectors, or without all of them. My old GTX 1080 ran for about 3 months off of just the PCIe slot's power because I forgot to plug them in. Newer GPUs are FAR more power hungry though, and not all newer cards support that. Plus I've never tried yoinking the power cables while it's on. That can't be good.

[–] [email protected] 1 points 4 months ago* (last edited 4 months ago)

Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I've seen in this thread:

Fewer Letters	More Letters
NAS	Network-Attached Storage
PCIe	Peripheral Component Interconnect Express
PSU	Power Supply Unit
PoE	Power over Ethernet
SSL	Secure Sockets Layer, for transparent encryption

5 acronyms in this thread; the most compressed thread commented on today has 4 acronyms.

[Thread #806 for this sub, first seen 15th Jun 2024, 14:15]