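When I try to bring up my PowerColor AMD RX 6800 GPU on my Ampere Altra Q64-22, amdgpu fails to initialize (error -110) and the kernel then hits a NULL pointer dereference in ttm while the driver is cleaning up. These are the kernel messages: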
[ 5.881353] amdgpu 0001:03:00.0: [drm] *ERROR* No EDID read.
[ 6.312966] amdgpu 0001:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[ 6.323979] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
[ 6.331387] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v10_0> failed -110
[ 6.340779] amdgpu 0001:03:00.0: amdgpu: amdgpu_device_ip_init failed
[ 6.347210] amdgpu 0001:03:00.0: amdgpu: Fatal error during GPU init
[ 6.534252] amdgpu 0001:03:00.0: probe with driver amdgpu failed with error -110
[ 6.608299] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
[ 6.617087] Mem abort info:
[ 6.619868] ESR = 0x0000000096000005
[ 6.623610] EC = 0x25: DABT (current EL), IL = 32 bits
[ 6.628910] SET = 0, FnV = 0
[ 6.631954] EA = 0, S1PTW = 0
[ 6.635082] FSC = 0x05: level 1 translation fault
[ 6.639947] Data abort info:
[ 6.642817] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
[ 6.648289] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 6.653331] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 6.658630] user pgtable: 64k pages, 48-bit VAs, pgdp=000008001d58c000
[ 6.665148] [0000000000000058] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 6.673838] Internal error: Oops: 0000000096000005 [#1] SMP
[ 6.679399] Modules linked in: hid_logitech_dj(+) amdgpu(+) drm_ttm_helper ttm video drm_exec drm_suballoc_helper amdxcp drm_buddy nvme gpu_sched nvme_core drm_display_helper uas cec nvme_auth dm_mod dax zfs(PO) spl(O)
[ 6.698775] CPU: 21 UID: 0 PID: 695 Comm: (udev-worker) Tainted: P W O 6.12.15 #1-NixOS
[ 6.707983] Tainted: [P]=PROPRIETARY_MODULE, [W]=WARN, [O]=OOT_MODULE
[ 6.714410] Hardware name: ALTRAD8UD-1L2T/ALTRAD8UD-1L2T, BIOS 2.06 04/17/2024
[ 6.721705] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 6.728654] pc : ttm_resource_move_to_lru_tail+0xb4/0x1a8 [ttm]
[ 6.734568] lr : ttm_resource_move_to_lru_tail+0x98/0x1a8 [ttm]
[ 6.740479] sp : ffff80008d38f6d0
[ 6.743781] x29: ffff80008d38f6d0 x28: 0000000000000000 x27: ffffbdb013363db8
[ 6.750905] x26: ffffbdb013363a80 x25: ffffbdb07144ef50 x24: ffff6dd27dc44000
[ 6.758028] x23: 0000000000000000 x22: ffff6dd2203bd338 x21: 0000000000000020
[ 6.765152] x20: 0000000000000050 x19: ffff6dd2203bd300 x18: 0000000000000000
[ 6.772275] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 6.779398] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 6.786521] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffbdb0128f0e08
[ 6.793644] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 6.800767] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 6.807890] x2 : 0000000000000050 x1 : ffff6dd2203bd338 x0 : ffff6dd2203bd338
[ 6.815013] Call trace:
[ 6.817447] ttm_resource_move_to_lru_tail+0xb4/0x1a8 [ttm]
[ 6.823011] ttm_bo_move_to_lru_tail+0x20/0x50 [ttm]
[ 6.827969] amdgpu_bo_free_kernel+0xac/0x1c0 [amdgpu]
[ 6.833625] amdgpu_doorbell_fini+0x24/0x60 [amdgpu]
[ 6.839097] amdgpu_device_fini_sw+0x3b4/0x430 [amdgpu]
[ 6.844829] amdgpu_driver_release_kms+0x24/0x50 [amdgpu]
[ 6.850734] drm_dev_put.part.0+0xb0/0x130
[ 6.854820] devm_drm_dev_init_release+0x1c/0x50
[ 6.859425] devm_action_release+0x1c/0x40
[ 6.863509] release_nodes+0x6c/0x100
[ 6.867160] devres_release_all+0xa8/0x160
[ 6.871244] device_unbind_cleanup+0x20/0x80
[ 6.875503] really_probe+0x1e8/0x3c0
[ 6.879155] __driver_probe_device+0x84/0x180
[ 6.883500] driver_probe_device+0x44/0x140
[ 6.887672] __driver_attach+0xf4/0x270
[ 6.891496] bus_for_each_dev+0x84/0x110
[ 6.895407] driver_attach+0x2c/0x60
[ 6.898971] bus_add_driver+0x170/0x2c0
[ 6.902796] driver_register+0x70/0x168
[ 6.906620] __pci_register_driver+0x4c/0x80
[ 6.910879] amdgpu_init+0x74/0xfff8 [amdgpu]
[ 6.915740] do_one_initcall+0x60/0x2e0
[ 6.919565] do_init_module+0x90/0x280
[ 6.923302] load_module+0x1d28/0x22e8
[ 6.927039] __do_sys_init_module+0x218/0x2d8
[ 6.931383] __arm64_sys_init_module+0x24/0x48
[ 6.935815] invoke_syscall+0x50/0x160
[ 6.939553] el0_svc_common.constprop.0+0x48/0x130
[ 6.944333] do_el0_svc+0x24/0x50
[ 6.947637] el0_svc+0x38/0x140
[ 6.950767] el0t_64_sync_handler+0x140/0x150
[ 6.955112] el0t_64_sync+0x190/0x198
[ 6.958763] Code: f9000001 8b1512f5 aa1403e2 aa1603e0 (f9401eb7)
[ 6.964843] ---[ end trace 0000000000000000 ]---
The GPU comes up on some boots and not on others. Adding console=tty1 to the kernel command line helps, but even then it only seems to work about one boot in three. On roughly the other two of every three tries, systemd’s journalctl appears to get stuck. Getting the GPU to initialize takes repeated attempts until one finally succeeds. I have also seen the GPU hang the system so badly that it resets itself.
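Since the failure is intermittent, it may help to compare the kernel log of a good boot against a bad one. Below is a minimal sketch in Python, assuming the log has been saved as plain text in the same `[ seconds] message` format as the output above. The function name `summarize_kernel_log` and the matching rules are my own illustration, not part of any existing tool.

```python
import re

# Matches a kernel log line such as "[ 6.312966] amdgpu ...: *ERROR* ..."
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Substrings treated as a failure marker (an assumption for this sketch).
FAILURE_MARKERS = ("*ERROR*", "failed", "Fatal error")


def summarize_kernel_log(text):
    """Collect amdgpu failure lines and note whether an Oops was logged."""
    errors = []
    oops = False
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip lines without a "[ seconds]" timestamp
        timestamp = float(match.group(1))
        message = match.group(2)
        # Keep only amdgpu lines that also carry a failure marker, so plain
        # call-trace frames such as "amdgpu_bo_free_kernel+0xac" are ignored.
        if "amdgpu" in message and any(m in message for m in FAILURE_MARKERS):
            errors.append((timestamp, message))
        if "Internal error: Oops" in message:
            oops = True
    return {"errors": errors, "oops": oops}
```

Running this over the saved log from each boot attempt would show whether the failing boots always stop at the same ring test, and whether the Oops always follows.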
These are serious issues and I’d like to see them fixed. I’m reporting them here in the hope that someone more knowledgeable than I am can look at this and help me figure out a fix.