Am I going crazy? Probably, but maybe it's the Samsung 990 Pros

I wonder if anyone else has noticed this or had similar issues with Samsung 990 Pros in their Ampere builds. I have had two Samsung 990 Pros mysteriously disappear from the OS. Rebooting does not bring them back; a missing drive returns only after I power the host down and start it up again.

I seem to get about a week or so out of the 4TB one before it decides to play hide-and-seek (lspci shows nothing, and when I reboot into the firmware, it lists only the 1TB 990 Pro, which holds the OS). This got so annoying that I removed it and bought a Crucial T700 to replace it. I put the 4TB Samsung 990 Pro in another PC in my basement (a 13th-gen i5 running FreeBSD 14, booting from a 500GB Sabrent NVMe). To try to reproduce the issue, I ran a script that randomly writes and deletes files (not aggressively), using both dd with a fixed count and /dev/urandom with random sizes, and I also ran bhyve to simulate what I was doing on my Ampere host. It has been rock-solid for a few weeks.
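The stress script itself isn’t shown in this thread, so here is a rough Python sketch of the same “randomly write and delete files” idea. It is an illustration under stated assumptions, not the script actually used: the function name `churn`, its parameters, and the 30% delete probability are hypothetical, and `os.urandom()` stands in for reading /dev/urandom with dd.

```python
import os
import random
import tempfile


def churn(workdir, iterations=100, max_bytes=1 << 20, keep_max=10, seed=None):
    """Randomly write and delete files in workdir (hypothetical helper).

    Each iteration either writes a new file of random size filled with
    os.urandom() bytes (roughly `dd if=/dev/urandom`), or deletes a randomly
    chosen existing file. Deletion is forced once keep_max files exist, so
    the directory never grows without bound.
    """
    rng = random.Random(seed)  # seeds only the write/delete choices and sizes
    live = []
    stats = {"written": 0, "deleted": 0, "bytes": 0}
    for i in range(iterations):
        if len(live) >= keep_max or (live and rng.random() < 0.3):
            victim = live.pop(rng.randrange(len(live)))
            os.remove(victim)
            stats["deleted"] += 1
        else:
            size = rng.randint(1, max_bytes)
            path = os.path.join(workdir, f"churn_{i:06d}.bin")
            with open(path, "wb") as f:
                f.write(os.urandom(size))
                f.flush()
                os.fsync(f.fileno())  # push the data through to the device
            live.append(path)
            stats["written"] += 1
            stats["bytes"] += size
    stats["remaining"] = len(live)
    return stats


if __name__ == "__main__":
    # Demo against a throwaway directory rather than a real mount point.
    with tempfile.TemporaryDirectory() as scratch:
        print(churn(scratch, iterations=200, max_bytes=64 * 1024, seed=42))
```

To keep the load gentle, you would call `churn()` repeatedly with a `time.sleep()` between rounds and point `workdir` at a directory on the drive under test.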

This morning, I woke up to a nonfunctioning host on my Ampere ASRock board. When I rebooted it (without powering off) through OpenBMC, it attempted to PXE boot from my NAS, which hosts a TFTP server and exports directories for the shim, the PXE boot menu, and kickstart through httpd. So I rebooted into the firmware/BIOS, and my 1TB Samsung 990 Pro was not showing up, only the Crucial T700 that is mounted as /var/lib/libvirt/images. The 1TB 990 Pro holds /boot, /boot/efi, /, /home, /var, /tmp, and the myriad of Red Hat virtual mounts, along with the mount points for the standard /proc and /sys virtual file systems.

I then decided to power it off and on again, entered the firmware/BIOS, and, just like the 4TB 990 Pro, it showed up to the party again. I checked the firmware version on both 990 Pros, and both are on the latest release listed on Samsung’s site. So, at this point, either I was unlucky and got two problematic 990 Pros, or there is some unknown compatibility issue. I figured I’d check here first to see if anyone else has experienced a similar issue with Samsung 990 Pros in their Ampere builds.

Oddly enough, I had configured a watchdog service, but it didn’t seem to trigger when the proverbial carpet was pulled out from under the OS. I would have expected that if the storage device disappeared from the running OS, the watchdog would reboot the host, and I would wake up to a console showing my PXE boot menu (which has no timeout, on purpose).
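One plausible explanation for the watchdog not firing (an assumption, not something confirmed in this thread) is that a watchdog daemon already resident in memory can keep feeding the hardware timer after the root disk is gone, unless one of its health checks actually touches storage. Below is a minimal Python sketch of such a check; the function name `storage_alive()` and the probe directory are hypothetical. It writes, fsyncs, and reads back a small file, and it uses a timeout because I/O to a vanished device may hang rather than return an error.

```python
import os
import tempfile
import threading


def storage_alive(directory: str, timeout_s: float = 10.0) -> bool:
    """Return True if a small write + fsync + read-back in `directory` finishes in time."""
    result = {"ok": False}

    def probe() -> None:
        try:
            fd, path = tempfile.mkstemp(prefix=".alive_", dir=directory)
            try:
                payload = os.urandom(64)
                os.write(fd, payload)
                os.fsync(fd)  # the step that must actually reach the device
                os.lseek(fd, 0, os.SEEK_SET)
                # The read-back is likely served from the page cache; it only
                # confirms the file is still usable, not that the media is.
                result["ok"] = os.read(fd, 64) == payload
            finally:
                os.close(fd)
                os.remove(path)
        except OSError:
            result["ok"] = False

    # Run the probe in a daemon thread so a hung I/O call cannot block the caller.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return result["ok"] and not worker.is_alive()
```

A wrapper could exit non-zero whenever this returns False, so that a supervising watchdog (for example, a test program hooked in through the `test-binary` option in watchdog.conf, if your watchdog package provides it) reboots the host.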

On a side note, I haven’t had the Crucial T700 long enough to see whether it has any issues. Samsung has been my go-to for NVMe and SATA SSDs for years, with Crucial as my second choice, as I’ve never had any issues with either manufacturer in the past decade and a half. I’ll buy other brands, like Sabrent or Kingston, when I need something quick and cheaper for flipping older office PCs or for my wife’s and kids’ PCs.


For those who felt the need to read the above post, I apologize for the verbosity; I tend to echo my thoughts in real time, with “-vvvv” (all the v’s). This post is also indicative of my lack of humor.


I don’t have a 990 Pro, but I am using a 980 Pro.
From my vague memory, I think the 980/990 Pro had a firmware problem.
I searched the internet and found this article; it is probably similar to your situation.

Samsung 980/990 Pro problem
https://www.reddit.com/r/buildapc/comments/x82mwe/samsung_ssd_smart_0e_issue/?rdt=52947

Thanks for your reply. I had come across similar information and have been tracking the problems since Samsung launched this model, as I typically run Samsung drives. I made sure both were updated to the latest firmware before running them in my Ampere setup.

However, my question was more about a possible compatibility issue, since the 990 Pros on the latest firmware are working perfectly on an Intel-chipset motherboard running FreeBSD, for longer and under heavier load.

I was looking to see if anyone else had a similar setup with an Ampere CPU, an ASRock board, and a 990 Pro. The 980 Pro and the 990 Pro use different in-house Samsung controllers, so a compatibility issue could show up if the firmware-level drivers have trouble properly initializing the device.

I’m using 2x 990 PRO 1TB on the ALTRAD8UD-1L2T system with Ubuntu Server 24.04, and it’s just fine.

Update: I lost one drive weeks ago, but that’s to be expected, since the 990 PRO is a consumer-grade drive that isn’t designed for 24x7 operation.

2025-02-24T05:06:53.371833+00:00 voyager kernel: nvme nvme2: Identify namespace failed (-5)
2025-02-24T10:21:56.092854+00:00 voyager kernel: message repeated 21 times: [ nvme nvme2: Identify namespace failed (-5)]
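For anyone who wants to check their own logs for the same symptom, here is a small Python sketch that tallies “Identify namespace failed” messages per NVMe controller and expands rsyslog’s “message repeated N times” lines. The regexes assume the exact message format shown above, and the function name is hypothetical. The (-5) is a negated errno, which on Linux is EIO (I/O error).

```python
import re
from collections import Counter

# Matches e.g. "nvme nvme2: Identify namespace failed (-5)"
FAIL_RE = re.compile(r"nvme (nvme\d+): Identify namespace failed \((-?\d+)\)")
# Matches rsyslog's de-duplication marker: "message repeated 21 times: [ ... ]"
REPEAT_RE = re.compile(r"message repeated (\d+) times: \[(.*)\]")


def count_identify_failures(lines):
    """Return a Counter of Identify-namespace failures keyed by controller name."""
    counts = Counter()
    for line in lines:
        times, body = 1, line
        repeat = REPEAT_RE.search(line)
        if repeat:
            # A "repeated N times" line stands for N extra copies of the message.
            times, body = int(repeat.group(1)), repeat.group(2)
        match = FAIL_RE.search(body)
        if match:
            counts[match.group(1)] += times
    return counts
```

On an Ubuntu host, something like `with open("/var/log/syslog") as log: print(count_identify_failures(log))` would run it over the system log; that path is an assumption and varies by distribution. For the two lines above, it reports 22 failures for nvme2 (the original message plus 21 repeats).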

Both of my drives are 2TB Samsung 990 Pros with heatsinks. So far, no issues; the workstation has been running 24/7 for 9 months with this setup. The last time I updated the firmware was when I installed them in May 2024.