NVIDIA GPU on Ampere with kernel 6.11 or 6.12

Hi, is anyone using an NVIDIA GPU on Ampere with kernel 6.11 or 6.12, without issues?

Not me. I have been getting kernel oopses ever since Fedora 40 upgraded to kernel 6.11. I recently upgraded to Fedora 41, which now comes with kernel 6.12 and NVIDIA drivers 565.77, but the problem is still there.

For now I am still running the 6.10 kernel from Fedora 40, which keeps working fine.

Here’s an example backtrace from kernel 6.12:

Unable to handle kernel paging request at virtual address ffff8000a3616bcc
Mem abort info:
  ESR = 0x0000000096000021
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x21: alignment fault
Data abort info:
  ISV = 0, ISS = 0x00000021, ISS2 = 0x00000000
  CM = 0, WnR = 0, TnD = 0, TagAccess = 0
  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000081a2edd3000
[ffff8000a3616bcc] pgd=100008000033d003, p4d=100008000033d003, pud=100008000033e003, pmd=100008004c092003, pte=00683000028baf13
Internal error: Oops: 0000000096000021 [#1] SMP
Modules linked in: nvidia_uvm(OE) uinput snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core pppoe pppox ppp_generic slhc 8021q garp mrp stp llc cfg80211 rfkill nft_chain_nat xt_MASQUERADE nf_nat xt_helper xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT ipt_REJECT nf_reject_ipv6 nf_reject_ipv4 xt_set xt_multiport nft_compat nf_tables ip_set_hash_ip ip_set_hash_net ip_set binfmt_misc hid_logitech_hidpp cdc_ether usbnet joydev mii snd_seq_midi snd_seq_midi_event snd_usb_audio snd_usbmidi_lib snd_ump snd_rawmidi mc ftdi_sio usblp xfs nvidia_drm(OE) nvidia_modeset(OE) snd_hda_codec_hdmi dm_cache_smq dm_cache dm_persistent_data dm_bio_prison vfat fat snd_hda_intel raid456 snd_intel_dspcfg snd_hda_codec async_raid6_recov async_memcpy async_pq async_xor snd_hda_core nvidia(OE) async_tx snd_hwdep snd_seq snd_seq_device snd_pcm ses acpi_ipmi enclosure video snd_timer drm_ttm_helper ipmi_ssif arm_spe_pmu ttm snd ast igb soundcore ixgbe ipmi_devintf i2c_algo_bit mdio ipmi_msghandler
 acpiphp_ampere_altra arm_cmn arm_dmc620_pmu arm_dsu_pmu cppc_cpufreq acpi_tad loop dm_multipath nfsd auth_rpcgss nfs_acl lockd grace nfs_localio sunrpc nfnetlink zram hid_logitech_dj onboard_usb_dev crct10dif_ce polyval_ce mpt3sas nvme polyval_generic ghash_ce sbsa_gwdt nvme_core raid_class scsi_transport_sas nvme_auth xgene_hwmon scsi_dh_rdac scsi_dh_emc scsi_dh_alua fuse
CPU: 28 UID: 1000 PID: 18520 Comm: chrome_crashpad Tainted: G           OE      6.12.4-200.fc41.aarch64 #1
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name:  ALTRAD8UD-1L2T/ALTRAD8UD-1L2T, BIOS 2.06 04/17/2024
pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __memcpy+0x168/0x240
lr : nvidia_vma_access+0x17c/0x200 [nvidia]
sp : ffff8000ea5d38e0
x29: ffff8000ea5d38e0 x28: 0000ffff51dfa980 x27: 0000020040000000
x26: 0000000000000980 x25: 0000000000000000 x24: 0000000000000980
x23: ffff0801af61d000 x22: 0000000000000000 x21: ffff8000a3616980
x20: 000000000000028c x19: 000000000000028c x18: 0000000000000000
x17: 0000000000000000 x16: ffffc776353c52c0 x15: ffff800080000000
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
x5 : ffff0801af61d28c x4 : ffff8000a3616c0c x3 : ffff0801af61d200
x2 : fffffffffffffffc x1 : ffff8000a3616bc0 x0 : ffff0801af61d000
Call trace:
 __memcpy+0x168/0x240
 __access_remote_vm+0x2e0/0x420
 access_remote_vm+0x18/0x30
 mem_rw+0x248/0x320
 mem_read+0x1c/0x30
 vfs_read+0xcc/0x330
 __arm64_sys_pread64+0xb8/0xf0
 invoke_syscall+0x6c/0x100
 el0_svc_common.constprop.0+0x48/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x38/0x148
 el0t_64_sync_handler+0x120/0x138
 el0t_64_sync+0x194/0x198
Code: a984346c a9c4342c f1010042 54fffee8 (a97c3c8e)
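
For anyone who wants to check the numbers in the trace above, here is a minimal Python sketch that splits the ESR value into the fields the kernel printed and decodes the faulting instruction word (the one in parentheses on the Code: line). The helper names are made up for illustration, and the LDP decoding only handles the 64-bit signed-offset form, which is my reading of this particular word. It suggests the fault is an LDP from x4 - 64, an address that is not 8-byte aligned. That would be consistent with the memcpy source being mapped as Device memory, where unaligned accesses fault, but that last part is an assumption, not something the log proves.

```python
# Values copied from the oops above.
ESR = 0x96000021
FAULT_ADDR = 0xFFFF8000A3616BCC
X4 = 0xFFFF8000A3616C0C
INSN = 0xA97C3C8E  # the word in parentheses on the "Code:" line


def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR_ELx value into the fields shown in the oops."""
    iss = esr & 0x1FFFFFF              # ISS is bits [24:0]
    return {
        "EC": (esr >> 26) & 0x3F,      # exception class, bits [31:26]
        "IL": (esr >> 25) & 0x1,       # instruction length bit
        "WnR": (iss >> 6) & 0x1,       # 1 = write, 0 = read (data aborts)
        "FSC": iss & 0x3F,             # fault status code, ISS[5:0]
    }


def decode_ldp64_offset(insn: int) -> dict:
    """Decode LDP Xt, Xt2, [Xn, #imm] (64-bit, signed-offset form only)."""
    if insn & 0xFFC00000 != 0xA9400000:
        raise ValueError("not a 64-bit signed-offset LDP")
    imm7 = (insn >> 15) & 0x7F
    if imm7 & 0x40:                    # sign-extend the 7-bit immediate
        imm7 -= 0x80
    return {
        "rt": insn & 0x1F,
        "rn": (insn >> 5) & 0x1F,
        "rt2": (insn >> 10) & 0x1F,
        "offset": imm7 * 8,            # immediate is scaled by 8 bytes
    }


esr_fields = decode_esr(ESR)
ldp = decode_ldp64_offset(INSN)

print({k: hex(v) for k, v in esr_fields.items()})
# {'EC': '0x25', 'IL': '0x1', 'WnR': '0x0', 'FSC': '0x21'}
print(f"ldp x{ldp['rt']}, x{ldp['rt2']}, [x{ldp['rn']}, #{ldp['offset']}]")
# ldp x14, x15, [x4, #-64]
print(hex(X4 + ldp["offset"]), "misaligned by", FAULT_ADDR % 8)
# 0xffff8000a3616bcc misaligned by 4
```

The computed address matches the "virtual address" in the first line of the oops, so the read at that unaligned source address inside __memcpy is the access that faulted.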

After such a kernel oops, the system becomes unstable. Some USB devices hang, lsusb hangs, and the system will not reboot without a hard reset.

GPU is a Quadro T1000, using the open source drivers.

By the look of it, I wouldn’t be surprised if this is the pcie bug finally affecting nvidia. :disappointed:

This was going through my mind as well. Perhaps we’ve just been lucky that nvidia drivers have never triggered the pcie bug before. That’s why I am posting here, and not on the NVidia or Fedora forums.

Perhaps it is now time to try to get the pcie patches into the main kernel. For that to happen, it needs to be possible to enable or disable the patch at runtime rather than at compile time, either automatically or through a kernel parameter.

It would be informative if others on this forum could share their experience re: NVidia on linux kernels 6.11 or 6.12. On my system, I have successfully updated to kernel 6.12.5 with nvidia driver 565.77. I have not experienced any kernel errors/oopses or panics. I have tested both the dual-licensed (“open”) kernel module and the proprietary-licensed kernel module. I have tested both X11 and wayland.

I have an RTX 3060 and currently use Ultramarine Linux, which is based upon Fedora. I am currently on the version 40 release.

Using chromium would trigger the kernel oops. The last oops happened while closing chromium.

It would be nice to hear from other people on the forum because I cannot reproduce this with chromium on my Ultramarine 40 install, and don’t currently have disk space to put a vanilla Fedora install on my system.

Did you try to downgrade the NVIDIA driver?

No, in fact I was forced to upgrade the NVIDIA driver for the old Fedora 40 6.10 kernel, because Fedora 41 (with rpmfusion) only supports 565.77.
But NVIDIA driver versions were never a problem. The problem was always kernel versions 6.11 and 6.12.

I replaced my T1000 with an RTX 4060, hoping the newer architecture would help, but still the same problem.

Again, I could trigger the problem by starting and closing chromium. In other cases, where I did not try to force the issue, I would not see a kernel oops, but the system would just hang after a while, in a strange way. The serial console would still present a login prompt, but would not respond to a valid login.

Not seeing anything after a forced reboot, I wondered whether this was related to NVIDIA at all. But removing the card, leaving only the Aspeed VGA display would make the system run rock solid for days. I would notice if it did not, because this system is my fiber internet router, NAS, PVR and workstation at the same time.

Now, back at kernel 6.10 everything works fine again.

Perhaps this problem has to do with a conflict with any of my many other PCIe cards:

$ lspci | grep -v Ampere
0000:01:00.0 USB controller: ASMedia Technology Inc. ASM2142/ASM3142 USB 3.1 Host Controller
0001:01:00.0 VGA compatible controller: NVIDIA Corporation AD107 [GeForce RTX 4060] (rev a1)
0001:01:00.1 Audio device: NVIDIA Corporation AD107 High Definition Audio Controller (rev a1)
0002:03:00.0 USB controller: ASMedia Technology Inc. ASM3042 USB 3.2 Gen 1 xHCI Controller
0002:04:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
0003:01:00.0 PCI bridge: ASRock Incorporation Device 1150 (rev 04)
0003:02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
0003:03:00.0 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
0003:03:00.1 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
0003:05:00.0 Non-Volatile memory controller: Sandisk Corp WD Black SN850X NVMe SSD (rev 01)
0003:06:00.0 Non-Volatile memory controller: Sandisk Corp WD Black SN850X NVMe SSD (rev 01)
0004:01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
0004:01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
0004:03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3416 Fusion-MPT Tri-Mode I/O Controller Chip (IOC) (rev 01)
000d:01:00.0 SATA controller: JMicron Technology Corp. JMB58x AHCI SATA controller
000d:03:00.0 SATA controller: JMicron Technology Corp. JMB58x AHCI SATA controller

NVIDIA released a new beta driver version 570.86.16 yesterday which has solved the problems I was having with an RTX 2000 Ada Generation. I was seeing many applications crash due to a memcpy call from an nvidia library.

570.86.16 is indeed a big improvement.

With the previous version, no kernel would work with my RTX 4060, not even kernel 6.10 that would work fine with my Quadro T1000. Now, I can use the latest Fedora kernel (6.12.11-200.fc41) just fine with my RTX 4060.

I had to enable the negativo17 repo again to get this update, as rpmfusion is slow to get updates. I will go back to rpmfusion as soon as they have it, as having both negativo17 and rpmfusion enabled causes issues, and I need rpmfusion for more than just Nvidia.

Still, I fear that issues like these will come back, as Nvidia and the Linux kernel are updated separately. If the Nvidia open source drivers do not become part of the Linux kernel, then my hope is that the Intel xe drivers will mature at some point and become usable on non-x86 platforms.
