First is the random NVMe disconnect issue that I’ve been experiencing since I bought the board from Newegg in November 2024. It happens randomly with Samsung, Crucial, Kingston, and Sabrent drives, usually after only a month or so of uptime. Originally I assumed it was down to the Samsung 990 Pros (I posted about this back in December/January), since they shipped with known firmware issues from the factory, but even after updating the firmware, they still just disappear. It’s also not specific to either port; devices randomly drop off both. Always fun when it’s the block device holding /boot/efi, /boot, and /.
After almost nine months on the platform, swapping SSDs around and trying the different manufacturers mentioned above, the only thing that works is to power the host down and back on; it then sees the randomly disconnected device again and is happy for another month or so. The devices and their associated file systems all check out fine, and everything returns to normal operation. The inconsistency of this issue is maddening at times, but I’ve grown to live with it, as it seems specific to my use case.
I’ve set up a serial port and cable to my x86 lab box (Asus WS W680-ACE IPMI, i9-14900K, 192 GB RAM; the original Samsungs I pulled from the ASRock/Ampere board all run in it without a single issue, no disconnects at all) to capture debugging output and dump a copy of /var/log/messages so I can gather some info on the issue. The device just disconnects for no apparent reason, and all I get is a smartd daemon entry indicating that it can’t scan the block device. The last time it disconnected, I didn’t even have anything running on that file system, which just mounts as /var/lib/libvirt/images for KVM instances. The host has basically been an Ollama server for the VS Code and IDEA IDEs.
Jul 31 11:48:29 aragorn smartd[1625]: Device: /dev/nvme1, failed to read NVMe SMART/Health Information
Jul 31 12:16:14 aragorn smartd[1625]: Device: /dev/nvme1, open() of NVMe device failed: Resource temporarily unavailable
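For anyone wanting to replicate the serial capture, the console setup is roughly the following GRUB fragment (a sketch: the ttyS0 device and 115200 baud rate are my assumptions; match them to your actual cable and BMC settings, and regenerate grub.cfg afterwards with grub2-mkconfig on RHEL):

```shell
# /etc/default/grub -- sketch: mirror kernel output to a serial console
# so NVMe-related kernel messages survive even if the root device drops.
GRUB_TERMINAL="serial console"
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1"
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
```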
I would suspect a hardware issue, but the devices always check out fine and run without issue in any other board I put them in. They also come back after a cold boot, never a warm boot.
I’ve gotten to the point where I just init 0 the host, then power it back on through IPMI, and it runs without issue for another month or so. I assumed temps were causing the problem, since the NVMe drives sit under RTX 4000 GPUs, so as part of testing I configured smartd.conf to log whenever the temperature exceeds 45 °C, which is below the rated operating range of the current drives (Crucial T500s). At most, even when running VMs, they hit around 39 °C.
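For reference, the smartd temperature logging looks roughly like this; smartd’s -W directive takes change,info,crit thresholds in °C. The 45 °C info threshold is what I described above; the 2-degree change logging and 50 °C critical value here are illustrative, not from my actual config:

```
# /etc/smartd.conf -- sketch: log temp changes >= 2 degrees,
# warn at 45 C (below the T500's rated range), critical at 50 C
/dev/nvme1 -a -W 2,45,50
```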
The case has plenty of airflow; it’s a 4U case with three front 120 mm Noctua NF-F12 Industrial fans as intake that blow air over the entire mainboard and all components. I have a Noctua CPU cooler as well, and everything seems happy temperature-wise.
I’m at a loss as to why this happens randomly, with any manufacturer’s NVMe drive, in either NVMe port. It’s just so random, and that drives me up a wall.
The second issue: whenever I init 0 or shut down the host, the IPMI reboots as well, so I have to wait until the IPMI has booted again before I can power the host back on.
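The monthly cold-boot dance (init 0 on the host, wait out the IPMI reboot, power back on) can be scripted from another machine with ipmitool. This is a sketch with made-up BMC hostname and username defaults; the DRYRUN mode just prints the commands instead of running them, and the 180-second wait is a guess at how long the BMC takes to come back:

```shell
#!/bin/sh
# Sketch: power-cycle a host via its BMC after a clean shutdown.
# BMC_HOST and BMC_USER below are placeholder values, not real ones.
BMC_HOST="${BMC_HOST:-bmc.example.lan}"
BMC_USER="${BMC_USER:-admin}"

ipmi() {
    # Wrap ipmitool so DRYRUN=1 just prints the command (handy for testing).
    if [ "${DRYRUN:-0}" = "1" ]; then
        echo "ipmitool -I lanplus -H $BMC_HOST -U $BMC_USER $*"
    else
        ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" "$@"
    fi
}

coldboot() {
    # Assumes the OS has already halted (init 0); force the chassis off,
    # wait for the BMC's own reboot to finish, then power back on.
    ipmi chassis power off
    sleep "${BMC_WAIT:-180}"
    ipmi chassis power on
}
```

Run it as `DRYRUN=1 coldboot` first to confirm the commands before pointing it at a live BMC.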
I’m running the latest “beta” firmware and RHEL 9.6, though I’m tempted to either go to 10 or switch back to a Debian-based distro with a newer kernel. I typically keep the home lab in parity with what I have at work, and in the government space it’s all RHEL, so I run RHEL for my sandboxes.