Nasty AMDGPU initialization failure on kernel 6.12.15

When trying to bring up my PowerColor AMD RX 6800 GPU on my Ampere Altra Q64-22, I get the following kernel errors:

[    5.881353] amdgpu 0001:03:00.0: [drm] *ERROR* No EDID read.
[    6.312966] amdgpu 0001:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[    6.323979] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
[    6.331387] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v10_0> failed -110
[    6.340779] amdgpu 0001:03:00.0: amdgpu: amdgpu_device_ip_init failed
[    6.347210] amdgpu 0001:03:00.0: amdgpu: Fatal error during GPU init
[    6.534252] amdgpu 0001:03:00.0: probe with driver amdgpu failed with error -110
[    6.608299] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
[    6.617087] Mem abort info:
[    6.619868]   ESR = 0x0000000096000005
[    6.623610]   EC = 0x25: DABT (current EL), IL = 32 bits
[    6.628910]   SET = 0, FnV = 0
[    6.631954]   EA = 0, S1PTW = 0
[    6.635082]   FSC = 0x05: level 1 translation fault
[    6.639947] Data abort info:
[    6.642817]   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
[    6.648289]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    6.653331]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    6.658630] user pgtable: 64k pages, 48-bit VAs, pgdp=000008001d58c000
[    6.665148] [0000000000000058] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[    6.673838] Internal error: Oops: 0000000096000005 [#1] SMP
[    6.679399] Modules linked in: hid_logitech_dj(+) amdgpu(+) drm_ttm_helper ttm video drm_exec drm_suballoc_helper amdxcp drm_buddy nvme gpu_sched nvme_core drm_display_helper uas cec nvme_auth dm_mod dax zfs(PO) spl(O)
[    6.698775] CPU: 21 UID: 0 PID: 695 Comm: (udev-worker) Tainted: P        W  O       6.12.15 #1-NixOS
[    6.707983] Tainted: [P]=PROPRIETARY_MODULE, [W]=WARN, [O]=OOT_MODULE
[    6.714410] Hardware name:  ALTRAD8UD-1L2T/ALTRAD8UD-1L2T, BIOS 2.06 04/17/2024
[    6.721705] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    6.728654] pc : ttm_resource_move_to_lru_tail+0xb4/0x1a8 [ttm]
[    6.734568] lr : ttm_resource_move_to_lru_tail+0x98/0x1a8 [ttm]
[    6.740479] sp : ffff80008d38f6d0
[    6.743781] x29: ffff80008d38f6d0 x28: 0000000000000000 x27: ffffbdb013363db8
[    6.750905] x26: ffffbdb013363a80 x25: ffffbdb07144ef50 x24: ffff6dd27dc44000
[    6.758028] x23: 0000000000000000 x22: ffff6dd2203bd338 x21: 0000000000000020
[    6.765152] x20: 0000000000000050 x19: ffff6dd2203bd300 x18: 0000000000000000
[    6.772275] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[    6.779398] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[    6.786521] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffbdb0128f0e08
[    6.793644] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[    6.800767] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[    6.807890] x2 : 0000000000000050 x1 : ffff6dd2203bd338 x0 : ffff6dd2203bd338
[    6.815013] Call trace:
[    6.817447]  ttm_resource_move_to_lru_tail+0xb4/0x1a8 [ttm]
[    6.823011]  ttm_bo_move_to_lru_tail+0x20/0x50 [ttm]
[    6.827969]  amdgpu_bo_free_kernel+0xac/0x1c0 [amdgpu]
[    6.833625]  amdgpu_doorbell_fini+0x24/0x60 [amdgpu]
[    6.839097]  amdgpu_device_fini_sw+0x3b4/0x430 [amdgpu]
[    6.844829]  amdgpu_driver_release_kms+0x24/0x50 [amdgpu]
[    6.850734]  drm_dev_put.part.0+0xb0/0x130
[    6.854820]  devm_drm_dev_init_release+0x1c/0x50
[    6.859425]  devm_action_release+0x1c/0x40
[    6.863509]  release_nodes+0x6c/0x100
[    6.867160]  devres_release_all+0xa8/0x160
[    6.871244]  device_unbind_cleanup+0x20/0x80
[    6.875503]  really_probe+0x1e8/0x3c0
[    6.879155]  __driver_probe_device+0x84/0x180
[    6.883500]  driver_probe_device+0x44/0x140
[    6.887672]  __driver_attach+0xf4/0x270
[    6.891496]  bus_for_each_dev+0x84/0x110
[    6.895407]  driver_attach+0x2c/0x60
[    6.898971]  bus_add_driver+0x170/0x2c0
[    6.902796]  driver_register+0x70/0x168
[    6.906620]  __pci_register_driver+0x4c/0x80
[    6.910879]  amdgpu_init+0x74/0xfff8 [amdgpu]
[    6.915740]  do_one_initcall+0x60/0x2e0
[    6.919565]  do_init_module+0x90/0x280
[    6.923302]  load_module+0x1d28/0x22e8
[    6.927039]  __do_sys_init_module+0x218/0x2d8
[    6.931383]  __arm64_sys_init_module+0x24/0x48
[    6.935815]  invoke_syscall+0x50/0x160
[    6.939553]  el0_svc_common.constprop.0+0x48/0x130
[    6.944333]  do_el0_svc+0x24/0x50
[    6.947637]  el0_svc+0x38/0x140
[    6.950767]  el0t_64_sync_handler+0x140/0x150
[    6.955112]  el0t_64_sync+0x190/0x198
[    6.958763] Code: f9000001 8b1512f5 aa1403e2 aa1603e0 (f9401eb7) 
[    6.964843] ---[ end trace 0000000000000000 ]---
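For anyone triaging a log like the one above, here is a minimal sketch of how the key facts could be pulled out of dmesg-style text: the error lines, and the frames of the call trace. The function name `summarize` and both regular expressions are illustrative assumptions, not part of any kernel tooling, and they are only tuned to the line shapes seen in this log.

```python
import re

# Matches the "[    6.312966] " timestamp prefix on each dmesg line.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")
# Matches a call-trace frame such as " ttm_bo_move_to_lru_tail+0x20/0x50 [ttm]".
FRAME = re.compile(
    r"^\s*([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?\s*$"
)
# Substrings that mark a line as worth quoting in a bug report (assumed list).
ERROR_MARKERS = ("*ERROR*", "Fatal error", "Unable to handle")


def summarize(log_text):
    """Return (error_lines, trace_frames) extracted from dmesg-style text.

    trace_frames is a list of (symbol, module_or_None) tuples in the order
    they appear after a "Call trace:" line.
    """
    errors, frames, in_trace = [], [], False
    for raw in log_text.splitlines():
        m = TIMESTAMP.match(raw)
        if not m:
            continue  # skip lines without a timestamp prefix
        body = m.group(2)
        if any(marker in body for marker in ERROR_MARKERS):
            errors.append(body.strip())
        if body.strip() == "Call trace:":
            in_trace = True
            continue
        if in_trace:
            f = FRAME.match(body)
            if f:
                frames.append((f.group(1), f.group(2)))
            else:
                in_trace = False  # first non-frame line ends the trace
    return errors, frames
```

Run against the log in this post, the first frames returned should be the `ttm` functions followed by the `amdgpu` teardown path, which makes it easier to see that the oops happens during cleanup after the failed probe rather than during the ring test itself.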

The GPU comes up only some of the time. Specifying console=tty1 helps, but even then it works only about once every 3 tries, and systemd’s journalctl seems to get stuck on roughly 2 of every 3 tries. Getting the GPU to power up takes a series of attempts until it initializes cleanly. I’ve also noticed that the GPU can hang the system so badly that it resets itself.
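Since the result differs from boot to boot, it can help to record whether the driver actually bound on each attempt. The following is a minimal sketch that relies on the standard sysfs layout, where a bound PCI device has a `driver` symlink in its device directory. The helper name `bound_driver` is hypothetical, and the PCI address 0001:03:00.0 is simply the one shown in the log above.

```python
import os


def bound_driver(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None.

    sysfs exposes a "driver" symlink inside the device directory only
    while a driver is bound, so a missing link means the probe failed
    or no driver has claimed the device.
    """
    link = os.path.join(sysfs_root, pci_address, "driver")
    if not os.path.islink(link):
        return None
    # The link target ends in the driver name, e.g. ".../drivers/amdgpu".
    return os.path.basename(os.readlink(link))
```

Calling `bound_driver("0001:03:00.0")` after boot would be expected to give `"amdgpu"` on a good attempt and `None` after a failed probe like the one logged here. The `sysfs_root` parameter exists only so the function can be exercised against a test directory.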

These are pretty bad issues and I’d like to see them fixed. I hope that by reporting them here, someone more knowledgeable than I am can look at this and help me figure out a fix.

Search the forum here; this has been discussed at length.

I already have, but some of the issues, such as the kernel oops in ttm_resource_move_to_lru_tail and the “Unable to handle kernel NULL pointer dereference at virtual address” message, seem to be new to this forum.

Honestly, AMD GPU on arm64, and perhaps on Altra in particular, is an ongoing game of whack-a-mole that imho is not sustainable. The amdgpu driver code is written with an x86 focus and AMD doesn’t seem to support non-x86 in any meaningful way. Every amdgpu release risks a new breakage in rather inscrutable and complex code. Nvidia at least put dedicated effort into arm64 drivers and they work. Intel Xe also seems to work with little patching other than the Altra PCIe bug workarounds. See the posts from q66 here re: Intel Xe.

My AMD card seems to work perfectly fine on my AmpereOne system.

I’m definitely looking forward to the ASRock AmpereOne board.