this post was submitted on 06 Feb 2024
45 points (92.5% liked)


I run Arch Linux on a Ryzen 7 3700X with 32 GB of RAM and a Gigabyte motherboard with an up-to-date BIOS.

A few weeks ago my computer started crashing (the screen would freeze) soon after login or even at boot, about 50% of the time. I was lazy, so when it crashed I just force-rebooted it with the power button. Then the crashes became more common, until my system wouldn't boot at all.

So I reinstalled, and I had some trouble generating dracut bundles because some zstd compression was corrupted. After booting the freshly installed OS, it would crash again right before the login prompt should show up. Switching kernels (from hardened to zen) fixed the problem. Then I installed basic apps (browsers, office, crypto stuff, Steam, etc.). I rebooted, and when I typed the password for my encrypted root it was rejected as wrong (I'm sure I typed it correctly).

I have no idea wtf went wrong with my system. I have almost the same setup on my laptop (hardened kernel, btrfs, LUKS-encrypted drives, systemd-boot, etc.) and it works great. And I never experienced any crashes on a live USB on my PC.

I ran some random test (it's PassMark MemTest86 v9.3 Pro) on my Medicat USB. Right now it's 92% finished with 1070 errors. This just can't be good :(

Now I will play with some BIOS settings (like disabling XMP), reflash another BIOS version, maybe swap the SSD... I will also try other distros, but I can't daily drive them. Arch gives me a ton of flexibility and I don't want to lose it. Maybe NixOS or Gentoo, but Gentoo doesn't have systemd (I want to use Mullvad as my VPN and their app requires it).

Do you maybe know what could be wrong and how to fix it? Thank you for reading this post and thank you very much for answering.

I don't know if this is an Arch bug or if something is wrong with my system. If this is not the right community to ask, please direct me to the right one (just please not Reddit).

Edit: I ran memtest again without one RAM stick and it gave me no errors! Thank you for all the help and suggestions :)

Edit: I also tried only the faulty RAM stick and the PC wouldn't even boot.

Edit: Booting the PC with only the faulty RAM stick corrupted my BIOS... I guess I will have to reflash the BIOS anyway.

all 20 comments
[–] [email protected] 15 points 9 months ago

DO NOT BOOT WITH KNOWN FAULTY RAM.

Sorry for shouting but that will lead to corruption and data loss. You really should wipe your system and start from scratch as the corruption won't just fix itself. I would restore important files from a backup and then destroy everything else.

[–] [email protected] 11 points 9 months ago (1 children)

To anyone else reading this, there's something you should know:

Memory errors don't always mean the memory itself (hardware RAM stick) is bad. It can also be a power issue (bad PSU, incorrect voltage set in the UEFI), compatibility, defective memory controller (CPU or motherboard), and more.

OP almost certainly has a bad stick, but it's worthwhile for anyone building a PC to run a slew of stress tests and diagnostics before using it for anything that matters.

[–] [email protected] 1 points 9 months ago (1 children)

Interesting! What would be an approach for testing the sticks? Those usb images with some tools, for example?

[–] [email protected] 8 points 9 months ago (1 children)

I ran some random test (it's PassMark MemTest86 v9.3 Pro) on my Medicat USB. Right now it's 92% finished with 1070 errors. This just can't be good :(

Not familiar with Medicat. Are you saying MemTest86 gave you 1070 errors? Then one of the RAM modules is faulty. Or is this about the hard disk and bad blocks?

Gentoo doesn't have systemd (I want to use Mullvad as my VPN and their app requires it).

If I recall correctly, it is technically possible to run Mullvad and OpenVPN manually without systemd, for example on an SBC (Pi 4 etc.) acting as your LAN router, and feed it to your devices, but yeah, this is a bit cumbersome.

[–] [email protected] 5 points 9 months ago (1 children)

Medicat is like Ventoy (a USB stick that can hold multiple ISO files).

Now I disabled XMP (it makes the RAM faster) and ran the test again, and there are still errors. I noticed that all errors give the same message: expected "address", actual "wrong address", where the wrong address is the same as the expected address except for 1 byte. For example, expected is FFFFFFF7 and actual is FFFDFFF7. And this error is always on CPU core 6.

I have 2x 16 GB of RAM, so I will run the test again with only one stick and then with the other one.

[–] [email protected] 4 points 9 months ago* (last edited 9 months ago) (1 children)

Are you overclocking / doing any other unusual hardware thing? That one-bit flip is clearly pointing to broken memory, but it could be either the system RAM or the CPU cache memory (which would be real sad, but if it's always on core 6 sounds pretty likely).
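
For the one expected/actual pair quoted above, XOR-ing the two values shows which bits differ. A minimal Python sketch (the helper name `flipped_bits` is made up for illustration, and other error lines from the same run may show a different pattern):

```python
def flipped_bits(expected: int, actual: int) -> list[int]:
    """Return the positions (0 = least significant) of bits that differ."""
    diff = expected ^ actual
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# The single example pair from the memtest output quoted in this thread.
expected = 0xFFFFFFF7
actual = 0xFFFDFFF7

bits = flipped_bits(expected, actual)
print(f"xor mask: {expected ^ actual:#010x}")
print(f"differing bits: {bits}")
print(f"bytes affected (0 = lowest): {sorted({b // 8 for b in bits})}")
```

For this pair the mask is 0x00020000, so exactly one bit (bit 17, in the third byte from the right) differs. Feeding in more expected/actual pairs from the error list would show whether it is always the same bit.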

[–] [email protected] 2 points 9 months ago (2 children)

XMP is some kind of overclocking, but I disabled it.

It's not only one bit flip but at least two (in a single byte); I figured that out from the addresses in the errors.

I was also scared that it's the CPU, because it was the most expensive part when I built the PC. Thankfully I think it's not: I'm now running memtest again without one RAM stick and getting no errors.

[–] [email protected] 4 points 9 months ago (1 children)

Was it just one bad stick? When you've identified one slot that's acting up, shuffle the remaining sticks around to rule out the socket going bad.

[–] [email protected] 3 points 9 months ago (1 children)

Yep, just one stick. Now everything works like it should!

[–] [email protected] 1 points 9 months ago

Nice!! I love a happy ending.

[–] [email protected] 2 points 9 months ago (1 children)

Hey, that's wonderful! Good to hear. Yeah I would just throw away the memory and do a certain amount of double-checking of what's on your disk, as some of it may have been corrupted during the time the broken memory was in there. But yeah if you can run and do stuff without errors after taking out the bad stick then that sounds like progress.
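
One way to do that double-checking, assuming you have a backup copy taken before the RAM went bad, is to hash each backup file and compare it with its counterpart on the live disk. A rough Python sketch (the names `sha256_of` and `find_mismatches` and the two directory arguments are hypothetical, not from any particular tool):

```python
import hashlib
import sys
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(backup_root: Path, live_root: Path) -> list[str]:
    """Relative paths that are missing from live_root or differ in content."""
    mismatches = []
    for backup_file in sorted(backup_root.rglob("*")):
        if not backup_file.is_file():
            continue
        rel = backup_file.relative_to(backup_root)
        live_file = live_root / rel
        if not live_file.is_file() or sha256_of(backup_file) != sha256_of(live_file):
            mismatches.append(rel.as_posix())
    return mismatches


# Usage: python3 compare_trees.py /path/to/backup /path/to/live
if __name__ == "__main__" and len(sys.argv) == 3:
    for rel in find_mismatches(Path(sys.argv[1]), Path(sys.argv[2])):
        print(rel)
```

Run it only after the bad stick is out, otherwise the hashes themselves can't be trusted. Files you legitimately edited since the backup will also show up, so treat the output as a list to review rather than a list of confirmed corruption.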

[–] [email protected] 2 points 9 months ago

Thank you for your help!

[–] [email protected] 8 points 9 months ago

Sounds like serious hardware problems (bad memory sounds highly likely if that's what memtest is telling you). Replace the faulty hardware before changing out any software, and before the badness "spreads"; you may already have corrupted a certain amount of the data / installed software on your disk by writing back data after the bad memory corrupted it, if you've been running on the broken hardware for that long.

[–] [email protected] 4 points 9 months ago

Gentoo does have systemd, actually—package sys-apps/systemd—and there are optional sections in the install documents that explain how to go about using it as your primary init. It's an officially supported configuration, just not the default.

(But yeah, as for the main problem, sounds like hardware—RAM, your primary hard disk, or the disk controller on the mobo. Start with The Bleeding Obvious and make sure all cables are solid in their sockets and all the RAM is properly inserted.)

[–] [email protected] 3 points 9 months ago

I’d do some compute tests on a USB live system. Something like y-cruncher, for instance. Could be all kinds of things. The power supply could be flaky, or the GPU, CPU, motherboard, or storage. The best suggestion is process of elimination.

Try using it with a GUI and see if you can crash it, if you don’t have alternatives, for example. Check the storage’s SMART health. Check dmesg. SSH in when a crash happens, if possible.
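
For the dmesg part, a small filter can help pick hardware-looking messages out of a long kernel log. This is only a sketch: the keyword list is my own guess at useful strings, not an exhaustive or official set, and `suspicious_lines` is a made-up name.

```python
import re
import sys

# Substrings that often appear in kernel messages about hardware trouble.
# Assumption for illustration only; extend or trim to taste.
PATTERN = re.compile(
    r"hardware error|machine check|\bmce\b|\bedac\b|i/o error|segfault",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Return the lines that match any of the hardware-related keywords."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


# Usage: save the kernel log to a file first, then
#   python3 scan_kernel_log.py kernel.log
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for line in suspicious_lines(fh):
            print(line)
```

Since `dmesg` only covers the current boot, the log from the boot that crashed is usually more interesting. If the journal is stored persistently, `journalctl -k -b -1` should print the kernel messages from the previous boot, and that output can be saved to a file and scanned the same way.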

[–] [email protected] 2 points 9 months ago* (last edited 9 months ago)

FWIW I've also had memory issues with XMP.

Turns out that ASUS firmware is omega pepega and decided to go against AMD's specifications even for XMP profiles.

CLDO VDDP was stuck at the same voltage as SOC. Per AMD it has to be up to VSOC - 0.1V

So, after manually setting that, and other VDDP and VDDG voltages, it magically started working perfectly.

So do check voltages anyway even if you found a bad stick. Mine endured through the crappy firmware thanks to it being Samsung B-die.

Also check this for more info in general (I recommend this even if you won't OC, just the memtest alone is a huge section)

https://github.com/integralfx/MemTestHelper/blob/oc-guide/DDR4%20OC%20Guide.md

I tested with OCCT to find even more errors, so either do that in a mini windows environment or do one of the Linux tests to check memory some more. Memtest86+ isn't enough.