NVIDIA GPU on Ampere with kernel 6.11 or 6.12

Hi, is anyone using an NVIDIA GPU on Ampere with kernel 6.11 or 6.12 without issues?

Not me: I have been getting kernel oopses ever since Fedora 40 moved to kernel 6.11. I recently upgraded to Fedora 41, which ships kernel 6.12 and NVIDIA driver 565.77, but the problem is still there.

For now I am still running the 6.10 kernel from Fedora 40, which keeps working fine.

Here’s an example backtrace from kernel 6.12:

Unable to handle kernel paging request at virtual address ffff8000a3616bcc
Mem abort info:
  ESR = 0x0000000096000021
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x21: alignment fault
Data abort info:
  ISV = 0, ISS = 0x00000021, ISS2 = 0x00000000
  CM = 0, WnR = 0, TnD = 0, TagAccess = 0
  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000081a2edd3000
[ffff8000a3616bcc] pgd=100008000033d003, p4d=100008000033d003, pud=100008000033e003, pmd=100008004c092003, pte=00683000028baf13
Internal error: Oops: 0000000096000021 [#1] SMP
Modules linked in: nvidia_uvm(OE) uinput snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core pppoe pppox ppp_generic slhc 8021q garp mrp stp llc cfg80211 rfkill nft_chain_nat xt_MASQUERADE nf_nat xt_helper xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT ipt_REJECT nf_reject_ipv6 nf_reject_ipv4 xt_set xt_multiport nft_compat nf_tables ip_set_hash_ip ip_set_hash_net ip_set binfmt_misc hid_logitech_hidpp cdc_ether usbnet joydev mii snd_seq_midi snd_seq_midi_event snd_usb_audio snd_usbmidi_lib snd_ump snd_rawmidi mc ftdi_sio usblp xfs nvidia_drm(OE) nvidia_modeset(OE) snd_hda_codec_hdmi dm_cache_smq dm_cache dm_persistent_data dm_bio_prison vfat fat snd_hda_intel raid456 snd_intel_dspcfg snd_hda_codec async_raid6_recov async_memcpy async_pq async_xor snd_hda_core nvidia(OE) async_tx snd_hwdep snd_seq snd_seq_device snd_pcm ses acpi_ipmi enclosure video snd_timer drm_ttm_helper ipmi_ssif arm_spe_pmu ttm snd ast igb soundcore ixgbe ipmi_devintf i2c_algo_bit mdio ipmi_msghandler
 acpiphp_ampere_altra arm_cmn arm_dmc620_pmu arm_dsu_pmu cppc_cpufreq acpi_tad loop dm_multipath nfsd auth_rpcgss nfs_acl lockd grace nfs_localio sunrpc nfnetlink zram hid_logitech_dj onboard_usb_dev crct10dif_ce polyval_ce mpt3sas nvme polyval_generic ghash_ce sbsa_gwdt nvme_core raid_class scsi_transport_sas nvme_auth xgene_hwmon scsi_dh_rdac scsi_dh_emc scsi_dh_alua fuse
CPU: 28 UID: 1000 PID: 18520 Comm: chrome_crashpad Tainted: G           OE      6.12.4-200.fc41.aarch64 #1
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name:  ALTRAD8UD-1L2T/ALTRAD8UD-1L2T, BIOS 2.06 04/17/2024
pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __memcpy+0x168/0x240
lr : nvidia_vma_access+0x17c/0x200 [nvidia]
sp : ffff8000ea5d38e0
x29: ffff8000ea5d38e0 x28: 0000ffff51dfa980 x27: 0000020040000000
x26: 0000000000000980 x25: 0000000000000000 x24: 0000000000000980
x23: ffff0801af61d000 x22: 0000000000000000 x21: ffff8000a3616980
x20: 000000000000028c x19: 000000000000028c x18: 0000000000000000
x17: 0000000000000000 x16: ffffc776353c52c0 x15: ffff800080000000
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
x5 : ffff0801af61d28c x4 : ffff8000a3616c0c x3 : ffff0801af61d200
x2 : fffffffffffffffc x1 : ffff8000a3616bc0 x0 : ffff0801af61d000
Call trace:
 __memcpy+0x168/0x240
 __access_remote_vm+0x2e0/0x420
 access_remote_vm+0x18/0x30
 mem_rw+0x248/0x320
 mem_read+0x1c/0x30
 vfs_read+0xcc/0x330
 __arm64_sys_pread64+0xb8/0xf0
 invoke_syscall+0x6c/0x100
 el0_svc_common.constprop.0+0x48/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x38/0x148
 el0t_64_sync_handler+0x120/0x138
 el0t_64_sync+0x194/0x198
Code: a984346c a9c4342c f1010042 54fffee8 (a97c3c8e)
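
For anyone who wants to cross-check the register dump, here is a minimal sketch that decodes the ESR value and the faulting instruction word from the oops above. The constants are copied from the log; the helper names (`decode_esr`, `decode_ldp64_signed_offset`) are made up for illustration, and only the few ESR fields the log itself prints are extracted.

```python
# Values copied verbatim from the oops above.
ESR = 0x96000021
FAULT_ADDR = 0xFFFF8000A3616BCC
X4 = 0xFFFF8000A3616C0C
INSN = 0xA97C3C8E  # the word in parentheses on the "Code:" line


def decode_esr(esr):
    """Extract a few ESR_ELx fields for a data abort (hypothetical helper)."""
    return {
        "EC": (esr >> 26) & 0x3F,   # exception class
        "IL": (esr >> 25) & 0x1,    # 1 = 32-bit instruction
        "WnR": (esr >> 6) & 0x1,    # 0 = read, 1 = write
        "FSC": esr & 0x3F,          # fault status code
    }


def decode_ldp64_signed_offset(insn):
    """Decode a 64-bit LDP with signed offset; return None for anything else."""
    if insn & 0xFFC00000 != 0xA9400000:
        return None
    imm7 = (insn >> 15) & 0x7F
    if imm7 & 0x40:                 # sign-extend the 7-bit immediate
        imm7 -= 0x80
    return {
        "rt": insn & 0x1F,
        "rt2": (insn >> 10) & 0x1F,
        "rn": (insn >> 5) & 0x1F,
        "offset": imm7 * 8,         # immediate is scaled by the register size
    }


fields = decode_esr(ESR)
print({name: hex(value) for name, value in fields.items()})

ldp = decode_ldp64_signed_offset(INSN)
print(ldp)

# The load address is base register + offset; compare it with the reported fault.
computed = X4 + ldp["offset"]
print(hex(computed), computed == FAULT_ADDR)
print("address modulo 8:", FAULT_ADDR % 8)
```

The decode agrees with the log: the fault is a read (WnR = 0) with FSC 0x21, and the instruction is a pair load of x14/x15 from x4 - 64, which is exactly the reported address and is only 4-byte aligned. On arm64, unaligned loads from Normal memory are generally allowed while Device-type mappings fault on them, so this would be consistent with `__memcpy` reading from a device mapping. That last part is an interpretation, not something the log proves.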

After such a kernel oops, the system becomes unstable: some USB devices hang, lsusb hangs, and the system will not reboot without a hard reset.

The GPU is a Quadro T1000, using the open-source drivers.

By the look of it, I wouldn’t be surprised if this is the PCIe bug finally affecting NVIDIA. :disappointed:

This was going through my mind as well. Perhaps we’ve just been lucky that the NVIDIA drivers have never triggered the PCIe bug before. That’s why I am posting here, and not on the NVIDIA or Fedora forums.

Perhaps it is now time to try to get the PCIe patches into the mainline kernel. For that to happen, it needs to be possible to enable or disable the patch at runtime rather than at compile time, either automatically or via a kernel parameter.

It would be informative if others on this forum could share their experience with NVIDIA on Linux kernels 6.11 or 6.12. On my system, I have successfully updated to kernel 6.12.5 with NVIDIA driver 565.77. I have not experienced any kernel errors, oopses, or panics. I have tested both the dual-licensed (“open”) kernel module and the proprietary-licensed kernel module, and I have tested both X11 and Wayland.

I have an RTX 3060 and currently use Ultramarine Linux, which is based upon Fedora. I am currently on the version 40 release.

Using Chromium triggers the kernel oops. The last oops happened while closing Chromium.

It would be nice to hear from other people on the forum, because I cannot reproduce this with Chromium on my Ultramarine 40 install and don’t currently have the disk space to put a vanilla Fedora install on my system.

Did you try to downgrade the NVIDIA driver?