I wonder if anyone else has noticed this or had similar issues with Samsung 990 Pros in their Ampere builds. I have had two Samsung 990 Pros mysteriously disappear from the OS. Rebooting does not bring them back; the disk returns only by powering down the host and starting it up again.
I seem to get about a week or so with the 4TB one, and then it decides to play hide and seek: lspci shows nothing, and after a reboot into the firmware, only the 1TB 990 Pro, which contains the OS, is listed. This got so annoying that I removed it and bought a Crucial T700 to replace it. I put the 4TB Samsung 990 Pro in another PC in my basement (13th gen i5 booting from a Sabrent 500GB NVMe) running FreeBSD 14 to try to reproduce the issue. A script there randomly writes and deletes files (not aggressively), using both dd with a fixed size and count and /dev/urandom with random sizes, and I am also running bhyve to simulate what I was doing on my Ampere host. It’s been stable as a rock for a few weeks without issue.
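For anyone who wants to try a similar soak test, here is a minimal sketch of that kind of write-and-delete churn in Python. It is not the actual script from the post. The function name `churn`, its parameters, and the file naming are made up for illustration, and it uses `os.urandom` in place of dd reading from /dev/urandom.

```python
import os
import random
import tempfile


def churn(target_dir, iterations=100, max_bytes=1 << 20, delete_prob=0.4, seed=None):
    """Randomly write and delete files under target_dir.

    Returns the list of paths that still exist when the run ends.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    live = []  # files written so far and not yet deleted
    for i in range(iterations):
        if live and rng.random() < delete_prob:
            # Delete one previously written file, chosen at random.
            victim = live.pop(rng.randrange(len(live)))
            os.remove(victim)
        else:
            # Write a file of random size filled with random bytes.
            path = os.path.join(target_dir, f"churn_{i:06d}.bin")
            size = rng.randint(1, max_bytes)
            with open(path, "wb") as fh:
                fh.write(os.urandom(size))
                fh.flush()
                os.fsync(fh.fileno())  # push the data to the device, not just the page cache
            live.append(path)
    return live


if __name__ == "__main__":
    # Demo against a throwaway directory; point target_dir at a mount on the
    # drive under test for a real run.
    with tempfile.TemporaryDirectory() as scratch:
        remaining = churn(scratch, iterations=50, max_bytes=64 * 1024, seed=1)
        print(f"{len(remaining)} files left in {scratch}")
```

Running it in a loop with a sleep between rounds keeps the load light, which matches the "not aggressively" approach described above.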
This morning, I woke up to a nonfunctioning host with my Ampere ASRock board. When I rebooted (not powering off) using OpenBMC, it attempted to PXE boot off my NAS, which hosts a TFTP server and exported directories for the shim, PXE boot menu, and kickstart through httpd. So I rebooted it into the firmware/BIOS, and my 1TB Samsung 990 Pro was not showing up, just the Crucial T700 that mounts as /var/lib/libvirt/images. The 1TB 990 Pro holds /boot, /boot/efi, /, /home, /var, and /tmp; everything else is the myriad of Red Hat virtual mounts and the standard /proc and /sys virtual file systems.
I then decided to power it off and on again and entered the firmware/BIOS, and just like the 4TB 990 Pro, the drive decided to show up to the party again. I checked the firmware version on both 990 Pros, and they are at the latest, according to what Samsung has on their site. So, at this point, either I had the bad luck to get two problematic 990 Pros, or there is some unknown compatibility issue. I figured I’d check here first to see if anyone else has experienced a similar issue with Samsung 990 Pros in their Ampere builds.
Oddly enough, I had configured a watchdog service, and it didn’t seem to trigger when the proverbial carpet was pulled out from under the OS. I would have expected that if the storage device disappeared from the running OS, the watchdog would have rebooted the host, and I would have woken up to a console showing my PXE boot menu (which has no timeout on purpose).
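One way to make a vanished drive visible to a health check is to test for it explicitly. Below is a minimal sketch, assuming a Linux host where each NVMe controller exposes its serial number at /sys/class/nvme/<controller>/serial. The function names and the idea of passing the expected serials as arguments are illustrative assumptions, not part of any watchdog package. How the result gets wired into a watchdog, for example through a custom test hook if the daemon in use offers one, is left open.

```python
import os
import sys


def present_nvme_serials(sysfs_root="/sys/class/nvme"):
    """Return the set of serial numbers of NVMe controllers currently visible."""
    serials = set()
    try:
        entries = os.listdir(sysfs_root)
    except OSError:
        return serials  # no NVMe class directory at all: nothing is visible
    for name in entries:
        try:
            with open(os.path.join(sysfs_root, name, "serial")) as fh:
                # sysfs pads the value with spaces and a newline
                serials.add(fh.read().strip())
        except OSError:
            continue  # controller went away mid-scan or has no serial file
    return serials


def missing_drives(expected, sysfs_root="/sys/class/nvme"):
    """Return the expected serials that are not currently visible, sorted."""
    return sorted(set(expected) - present_nvme_serials(sysfs_root))


def main(expected, sysfs_root="/sys/class/nvme"):
    """Return 0 if every expected serial is present, 1 otherwise."""
    missing = missing_drives(expected, sysfs_root)
    if missing:
        print("missing NVMe serials: " + ", ".join(missing), file=sys.stderr)
        return 1
    return 0
```

To use it as a standalone check, call `sys.exit(main(sys.argv[1:]))` with the expected serial numbers on the command line. One caveat: if the drive that disappears is the one holding the OS, a check stored on that drive may not be able to start at all, so it would likely need to run from memory or from another device.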
On a side note, I haven’t had the Crucial T700 in long enough to see if it has any issues. Samsung has been my primary go-to for NVMe and SATA SSDs for years, with Crucial as my second choice, and I’ve never had any issues with either manufacturer in the past decade and a half. I’ll buy others like Sabrent or Kingston if I need something quick and cheaper when flipping older office PCs or for my wife’s and kids’ PCs.