First is the random NVMe disconnect issue that I’ve been experiencing since I bought the board from Newegg in November 2024. It happens randomly with Samsung, Crucial, Kingston, and Sabrent drives, usually after only a month or so of uptime. Originally I assumed it was down to the Samsung 990 Pros (I posted about this back in December/January), since they shipped with known firmware issues from the factory, but even after updating the firmware, they still just disappear. It’s also not specific to either port; devices randomly drop off both. Always fun when it’s the block device holding /boot/efi, /boot, and /.
After almost nine months on the platform, swapping SSDs around and trying the different manufacturers mentioned above, the only thing that works is to power the host down and back on; it then sees the randomly disconnected device again and is happy for another month or so. The devices and their associated file systems all check out fine, and everything returns to normal operation. The inconsistency of this issue is maddening at times, but I’ve grown to live with it, as it seems specific to my use case.
I’ve set up a serial port and cable to my x86 lab box (Asus WS W680-ACE IPMI, i9-14900K, 192 GB RAM; the original Samsungs I pulled from the ASRock/Ampere board all run in it without a single issue, no disconnects at all) to capture debugging output and dump a copy of /var/log/messages so I can gather some info on the issue. The device just disconnects for no apparent reason, and all I get is a smartd daemon entry indicating that it can’t scan the block device. The last time it disconnected, I didn’t even have anything running on that file system, which just mounts as /var/lib/libvirt/images for KVM instances. The host has basically been an Ollama server for the VS Code and IDEA IDEs.
Jul 31 11:48:29 aragorn smartd[1625]: Device: /dev/nvme1, failed to read NVMe SMART/Health Information
Jul 31 12:16:14 aragorn smartd[1625]: Device: /dev/nvme1, open() of NVMe device failed: Resource temporarily unavailable
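For anyone wanting to replicate the serial capture, the console setup is roughly the following GRUB fragment (a sketch: the ttyS0 device and 115200 baud rate are my assumptions; match them to your actual cable and BMC settings, and regenerate grub.cfg afterwards with grub2-mkconfig on RHEL):

```shell
# /etc/default/grub -- sketch: mirror kernel output to a serial console
# so NVMe-related kernel messages survive even if the root device drops.
GRUB_TERMINAL="serial console"
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1"
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
```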
I would suspect a hardware issue, but the devices always check out fine and run without issue in any other board I put them in. They also come back after a cold boot, never a warm boot.
I’ve gotten to the point where I just init 0 the host, then power it back on through IPMI, and it runs without issue for another month or so. I assumed temps were causing the problem, since the NVMe drives sit under RTX 4000 GPUs, so as part of testing I configured smartd.conf to log whenever the temperature exceeds 45 °C, which is below the rated operating range of the current drives (Crucial T500s). At most, even when running VMs, they hit around 39 °C.
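For reference, the smartd temperature logging looks roughly like this; smartd’s -W directive takes change,info,crit thresholds in °C. The 45 °C info threshold is what I described above; the 2-degree change logging and 50 °C critical value here are illustrative, not from my actual config:

```
# /etc/smartd.conf -- sketch: log temp changes >= 2 degrees,
# warn at 45 C (below the T500's rated range), critical at 50 C
/dev/nvme1 -a -W 2,45,50
```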
The case has plenty of airflow; it’s a 4U case with three front 120 mm Noctua NF-F12 Industrial fans as intake that blow air over the entire mainboard and all components. I have a Noctua CPU cooler as well, and everything seems happy temperature-wise.
I’m at a loss as to why this happens randomly, with any manufacturer’s NVMe drive, in either NVMe port. It’s just so random, and that drives me up a wall.
The second issue: whenever I init 0 or shut down the host, the IPMI reboots as well, so I have to wait until the IPMI has booted again before I can power the host back on.
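The monthly cold-boot dance (init 0 on the host, wait out the IPMI reboot, power back on) can be scripted from another machine with ipmitool. This is a sketch with made-up BMC hostname and username defaults; the DRYRUN mode just prints the commands instead of running them, and the 180-second wait is a guess at how long the BMC takes to come back:

```shell
#!/bin/sh
# Sketch: power-cycle a host via its BMC after a clean shutdown.
# BMC_HOST and BMC_USER below are placeholder values, not real ones.
BMC_HOST="${BMC_HOST:-bmc.example.lan}"
BMC_USER="${BMC_USER:-admin}"

ipmi() {
    # Wrap ipmitool so DRYRUN=1 just prints the command (handy for testing).
    if [ "${DRYRUN:-0}" = "1" ]; then
        echo "ipmitool -I lanplus -H $BMC_HOST -U $BMC_USER $*"
    else
        ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" "$@"
    fi
}

coldboot() {
    # Assumes the OS has already halted (init 0); force the chassis off,
    # wait for the BMC's own reboot to finish, then power back on.
    ipmi chassis power off
    sleep "${BMC_WAIT:-180}"
    ipmi chassis power on
}
```

Run it as `DRYRUN=1 coldboot` first to confirm the commands before pointing it at a live BMC.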
I’m running the latest “beta” firmware and RHEL 9.6, though I’m tempted to either go to 10 or switch back to a Debian-based distro with a newer kernel. I typically keep the home lab in parity with what I have at work, and in the government space it’s all RHEL, so I run RHEL for my sandboxes.