Nasty AMDGPU initialization failure on kernel 6.12.15

TheComputerGuy · February 28, 2025, 2:17am

When trying to bring up my PowerColor AMD RX 6800 GPU on my Ampere Altra Q64-22, I get these nasty-looking kernel messages:

[    5.881353] amdgpu 0001:03:00.0: [drm] *ERROR* No EDID read.
[    6.312966] amdgpu 0001:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[    6.323979] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
[    6.331387] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v10_0> failed -110
[    6.340779] amdgpu 0001:03:00.0: amdgpu: amdgpu_device_ip_init failed
[    6.347210] amdgpu 0001:03:00.0: amdgpu: Fatal error during GPU init
[    6.534252] amdgpu 0001:03:00.0: probe with driver amdgpu failed with error -110
[    6.608299] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
[    6.617087] Mem abort info:
[    6.619868]   ESR = 0x0000000096000005
[    6.623610]   EC = 0x25: DABT (current EL), IL = 32 bits
[    6.628910]   SET = 0, FnV = 0
[    6.631954]   EA = 0, S1PTW = 0
[    6.635082]   FSC = 0x05: level 1 translation fault
[    6.639947] Data abort info:
[    6.642817]   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
[    6.648289]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    6.653331]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    6.658630] user pgtable: 64k pages, 48-bit VAs, pgdp=000008001d58c000
[    6.665148] [0000000000000058] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[    6.673838] Internal error: Oops: 0000000096000005 [#1] SMP
[    6.679399] Modules linked in: hid_logitech_dj(+) amdgpu(+) drm_ttm_helper ttm video drm_exec drm_suballoc_helper amdxcp drm_buddy nvme gpu_sched nvme_core drm_display_helper uas cec nvme_auth dm_mod dax zfs(PO) spl(O)
[    6.698775] CPU: 21 UID: 0 PID: 695 Comm: (udev-worker) Tainted: P        W  O       6.12.15 #1-NixOS
[    6.707983] Tainted: [P]=PROPRIETARY_MODULE, [W]=WARN, [O]=OOT_MODULE
[    6.714410] Hardware name:  ALTRAD8UD-1L2T/ALTRAD8UD-1L2T, BIOS 2.06 04/17/2024
[    6.721705] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    6.728654] pc : ttm_resource_move_to_lru_tail+0xb4/0x1a8 [ttm]
[    6.734568] lr : ttm_resource_move_to_lru_tail+0x98/0x1a8 [ttm]
[    6.740479] sp : ffff80008d38f6d0
[    6.743781] x29: ffff80008d38f6d0 x28: 0000000000000000 x27: ffffbdb013363db8
[    6.750905] x26: ffffbdb013363a80 x25: ffffbdb07144ef50 x24: ffff6dd27dc44000
[    6.758028] x23: 0000000000000000 x22: ffff6dd2203bd338 x21: 0000000000000020
[    6.765152] x20: 0000000000000050 x19: ffff6dd2203bd300 x18: 0000000000000000
[    6.772275] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[    6.779398] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[    6.786521] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffbdb0128f0e08
[    6.793644] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[    6.800767] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[    6.807890] x2 : 0000000000000050 x1 : ffff6dd2203bd338 x0 : ffff6dd2203bd338
[    6.815013] Call trace:
[    6.817447]  ttm_resource_move_to_lru_tail+0xb4/0x1a8 [ttm]
[    6.823011]  ttm_bo_move_to_lru_tail+0x20/0x50 [ttm]
[    6.827969]  amdgpu_bo_free_kernel+0xac/0x1c0 [amdgpu]
[    6.833625]  amdgpu_doorbell_fini+0x24/0x60 [amdgpu]
[    6.839097]  amdgpu_device_fini_sw+0x3b4/0x430 [amdgpu]
[    6.844829]  amdgpu_driver_release_kms+0x24/0x50 [amdgpu]
[    6.850734]  drm_dev_put.part.0+0xb0/0x130
[    6.854820]  devm_drm_dev_init_release+0x1c/0x50
[    6.859425]  devm_action_release+0x1c/0x40
[    6.863509]  release_nodes+0x6c/0x100
[    6.867160]  devres_release_all+0xa8/0x160
[    6.871244]  device_unbind_cleanup+0x20/0x80
[    6.875503]  really_probe+0x1e8/0x3c0
[    6.879155]  __driver_probe_device+0x84/0x180
[    6.883500]  driver_probe_device+0x44/0x140
[    6.887672]  __driver_attach+0xf4/0x270
[    6.891496]  bus_for_each_dev+0x84/0x110
[    6.895407]  driver_attach+0x2c/0x60
[    6.898971]  bus_add_driver+0x170/0x2c0
[    6.902796]  driver_register+0x70/0x168
[    6.906620]  __pci_register_driver+0x4c/0x80
[    6.910879]  amdgpu_init+0x74/0xfff8 [amdgpu]
[    6.915740]  do_one_initcall+0x60/0x2e0
[    6.919565]  do_init_module+0x90/0x280
[    6.923302]  load_module+0x1d28/0x22e8
[    6.927039]  __do_sys_init_module+0x218/0x2d8
[    6.931383]  __arm64_sys_init_module+0x24/0x48
[    6.935815]  invoke_syscall+0x50/0x160
[    6.939553]  el0_svc_common.constprop.0+0x48/0x130
[    6.944333]  do_el0_svc+0x24/0x50
[    6.947637]  el0_svc+0x38/0x140
[    6.950767]  el0t_64_sync_handler+0x140/0x150
[    6.955112]  el0t_64_sync+0x190/0x198
[    6.958763] Code: f9000001 8b1512f5 aa1403e2 aa1603e0 (f9401eb7) 
[    6.964843] ---[ end trace 0000000000000000 ]---
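
The numbers in the oops can be cross-checked by hand. The following is a minimal sketch, not a kernel tool: it assumes the standard AArch64 ESR_ELx field layout and the standard A64 encoding of a 64-bit LDR (immediate, unsigned offset). The helper names `decode_esr` and `decode_ldr_uimm64` are made up for illustration; the constants are copied from the log above.

```python
# Values copied from the oops above.
ESR = 0x96000005
FAULT_INSN = 0xF9401EB7  # the parenthesised word on the "Code:" line
X21 = 0x20               # x21 from the register dump
FAULT_ADDR = 0x58        # "virtual address 0000000000000058"


def decode_esr(esr):
    """Split an ESR_ELx value into the fields the kernel prints."""
    iss = esr & 0x1FFFFFF
    return {
        "EC": (esr >> 26) & 0x3F,   # exception class
        "IL": (esr >> 25) & 0x1,    # 1 means a 32-bit instruction
        "ISS": iss,
        "WnR": (iss >> 6) & 0x1,    # 0 means the access was a read
        "FSC": iss & 0x3F,          # fault status code
    }


def decode_ldr_uimm64(insn):
    """Decode a 64-bit LDR (immediate, unsigned offset).

    Returns None if the word is some other instruction.
    """
    if insn & 0xFFC00000 != 0xF9400000:
        return None
    return {
        "rt": insn & 0x1F,                    # destination register
        "rn": (insn >> 5) & 0x1F,             # base register
        "offset": ((insn >> 10) & 0xFFF) * 8, # imm12 scaled by 8 bytes
    }


if __name__ == "__main__":
    fields = decode_esr(ESR)
    print({name: hex(value) for name, value in fields.items()})
    ldr = decode_ldr_uimm64(FAULT_INSN)
    print("ldr x%d, [x%d, #%#x]" % (ldr["rt"], ldr["rn"], ldr["offset"]))
    print("computed address:", hex(X21 + ldr["offset"]))
```

Under these assumptions the faulting instruction is a read through x21 plus 0x38, and x21 holds 0x20, which gives the reported address 0x58. That looks more like a small offset from a NULL pointer than a wild pointer, although the sketch alone cannot say which pointer was NULL.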

Sometimes I can get the GPU to come up and sometimes I can’t. Specifying console=tty1 helps, but even that seems to work only about once every 3 tries. It seems systemd’s journalctl gets stuck every 2nd from the 3rd try. Getting the GPU to power up takes a series of attempts until the GPU becomes happy. I’ve also noticed the GPU can hang the system so badly that it resets itself.

These are pretty bad issues and I’d like to see them fixed. I hope that by reporting them here, someone more knowledgeable than I am can look at this and help me figure out a fix.

sevo · March 1, 2025, 6:16am

Search the forum here, this has been discussed at length.

TheComputerGuy · March 1, 2025, 6:31am

I already have, but some of the issues, like the kernel oops in ttm_resource_move_to_lru_tail and the “Unable to handle kernel NULL pointer dereference at virtual address” message, seem to be new to this forum.

sevo · March 3, 2025, 9:01pm

Honestly, AMD GPU on arm64, and perhaps on Altra in particular, is an ongoing game of whack-a-mole that imho is not sustainable. The amdgpu driver code is written with an x86 focus and AMD doesn’t seem to support non-x86 in any meaningful way. Every amdgpu release risks new breakage in rather inscrutable and complex code. Nvidia at least put dedicated effort into arm64 drivers and they work. Intel Xe also seems to work with little patching other than the Altra PCIe bug workarounds. See the posts from q66 here re: Intel Xe.

bexcran · March 19, 2025, 6:10pm

My AMD card seems to work perfectly fine on my AmpereOne system.

TheComputerGuy · March 21, 2025, 6:42am

I’m definitely looking forward to the ASRock AmpereOne board.

TheComputerGuy · May 5, 2025, 1:20am

I got an RTX 5070 recently, and COSMIC works perfectly after installing the open NVIDIA driver.

bexcran · July 26, 2025, 3:21pm

In case anyone else comes across this, I had the same problem with a new RX 6700 XT.
There were actually two problems: the first was a panic (see screenshot) that I fixed by enabling Resizable BAR in the UEFI settings.
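
One way to see whether the Resizable BAR setting took effect is to look at the region sizes Linux assigned to the card. This is a minimal sketch, assuming the usual sysfs `resource` file format of one `start end flags` line per region, in hex. The function names are made up for illustration, and the PCI address is the one from the log in the first post, so it will differ on other systems.

```python
from pathlib import Path


def parse_pci_resources(text):
    """Return (region_index, size_in_bytes) for each populated region."""
    regions = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, _flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # unused region
        regions.append((index, end - start + 1))
    return regions


def largest_region_mib(text):
    """Size of the largest populated region in MiB (0 if there is none)."""
    sizes = [size for _, size in parse_pci_resources(text)]
    return max(sizes, default=0) // (1024 * 1024)


if __name__ == "__main__":
    # Address taken from the first post's log; adjust for your own card.
    path = Path("/sys/bus/pci/devices/0001:03:00.0/resource")
    if path.exists():
        print("largest region:", largest_region_mib(path.read_text()), "MiB")
```

As a rough guide only: with Resizable BAR off, the largest region on these cards is typically 256 MiB, and with it on it is usually about the size of the card’s VRAM.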

Then I got the “*ERROR* KCQ enable failed” error, which I found could be worked around by unplugging the second monitor I’d connected.
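
For anyone comparing their own boot log against this thread, here is a minimal sketch that pulls the `*ERROR*` lines out of saved `dmesg` text and names the error code. It assumes the default `[ seconds.micros]` timestamp prefix; the function name `drm_errors` is made up, and the errno table only covers the one code seen in this thread (110, which is ETIMEDOUT on Linux).

```python
import re

# Only the code that appears in this thread; value as defined on Linux.
LINUX_ERRNO_NAMES = {110: "ETIMEDOUT"}

ERROR_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*\*ERROR\*.*)$")
# Matches a trailing " -110" or " (-110)".
CODE_RE = re.compile(r"\s\(?-(?P<code>\d+)\)?\s*$")


def drm_errors(log_text):
    """Return (timestamp, errno_or_None, message) for each *ERROR* line."""
    results = []
    for line in log_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if not match:
            continue
        message = match.group("msg")
        code = CODE_RE.search(message)
        errno_value = int(code.group("code")) if code else None
        results.append((float(match.group("ts")), errno_value, message))
    return results


if __name__ == "__main__":
    import sys

    for ts, errno_value, message in drm_errors(sys.stdin.read()):
        name = LINUX_ERRNO_NAMES.get(errno_value, "")
        print("%10.6f %s %s" % (ts, name, message))
```

Run on the log in the first post, this reports the ring test and the gfx_v10_0 hw_init failure with -110, i.e. timeouts, while the “KCQ enable failed” line itself carries no code.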
