    drm / amd, issue #1519

    Open
    Created 6 months ago by Trung Lê

      [navi2][5.10.20] amdgpu module crash on RX 6900 XT card

      Hardware

      6900 XT, IBM POWER9

      Software

      • Fedora 33 (kernel 5.10.20, ppc64le, 64K page size) with amdgpu (SMU firmware 58.49.0)
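
The 58.49.0 listed above matches the SMU firmware version that the driver prints in the log further down (`smu fw version = 0x003a3100 (58.49.0)`). The sketch below shows how that packed value decodes into the dotted form. The bit layout (16-bit major, 8-bit minor, 8-bit debug) is an assumption inferred from that log line, and the function name is hypothetical.

```python
def decode_smu_fw_version(raw: int) -> str:
    """Split a packed SMU firmware version word into major.minor.debug.

    Assumed layout (inferred from the dmesg line, not from a spec):
    bits 31..16 = major, bits 15..8 = minor, bits 7..0 = debug.
    """
    major = (raw >> 16) & 0xFFFF
    minor = (raw >> 8) & 0xFF
    debug = raw & 0xFF
    return f"{major}.{minor}.{debug}"


# Value taken from the dmesg output in this report.
print(decode_smu_fw_version(0x003A3100))
```

This is only useful for cross-checking which firmware build is loaded; the log also reports that the driver and firmware interface versions differ (0x39 vs 0x3b).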

      Context

      Running modprobe amdgpu yields the following errors in dmesg:

      [  263.680735] [drm] amdgpu kernel modesetting enabled.
      [  263.682186] CRAT table error: (null)
      [  263.682187] DSDT table not found for OEM information
      [  263.682189] IO link not available for non x86 platforms
      [  263.682190] Virtual CRAT table created for CPU
      [  263.682199] amdgpu: Topology: Add CPU node
      [  263.683458] amdgpu 0001:03:00.0: enabling device (0140 -> 0142)
      [  263.683472] [drm] initializing kernel modesetting (SIENNA_CICHLID 0x1002:0x73BF 0x1DA2:0xE438 0xC0).
      [  263.683476] amdgpu 0001:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
      [  263.683489] [drm] register mmio base: 0x80000000
      [  263.683491] [drm] register mmio size: 1048576
      [  263.683493] [drm] PCI I/O BAR is not found.
      [  263.683505] [drm] PCIE atomic ops is not supported
      [  263.685953] [drm] add ip block number 0 <nv_common>
      [  263.685955] [drm] add ip block number 1 <gmc_v10_0>
      [  263.685957] [drm] add ip block number 2 <navi10_ih>
      [  263.685958] [drm] add ip block number 3 <psp>
      [  263.685960] [drm] add ip block number 4 <smu>
      [  263.685962] [drm] add ip block number 5 <gfx_v10_0>
      [  263.685963] [drm] add ip block number 6 <sdma_v5_2>
      [  263.685965] [drm] add ip block number 7 <vcn_v3_0>
      [  263.685966] [drm] add ip block number 8 <jpeg_v3_0>
      [  263.717433] amdgpu 0001:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
      [  263.717437] amdgpu: ATOM BIOS: 113-E438XTX-UO2
      [  263.717449] [drm] VCN(0) decode is enabled in VM mode
      [  263.717450] [drm] VCN(1) decode is enabled in VM mode
      [  263.717452] [drm] VCN(0) encode is enabled in VM mode
      [  263.717453] [drm] VCN(1) encode is enabled in VM mode
      [  263.717456] [drm] JPEG decode is enabled in VM mode
      [  263.717463] [drm] GPU posting now...
      [  263.717519] amdgpu 0001:03:00.0: amdgpu: HBM ECC is not presented.
      [  263.717523] amdgpu 0001:03:00.0: amdgpu: SRAM ECC is not presented.
      [  263.717530] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
      [  263.717575] amdgpu 0001:03:00.0: BAR 2: releasing [mem 0x6004010000000-0x60040101fffff 64bit pref]
      [  263.717580] amdgpu 0001:03:00.0: BAR 0: releasing [mem 0x6004000000000-0x600400fffffff 64bit pref]
      [  263.717615] pci 0001:02:00.0: BAR 15: releasing [mem 0x6004000000000-0x600403fffffff 64bit pref]
      [  263.717620] pci 0001:01:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [  263.717624] pci 0001:00:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [  263.717638] pci 0001:00:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [  263.717645] pci 0001:01:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [  263.717649] pci 0001:02:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [  263.717655] amdgpu 0001:03:00.0: BAR 0: assigned [mem 0x6004000000000-0x60043ffffffff 64bit pref]
      [  263.717667] amdgpu 0001:03:00.0: BAR 2: assigned [mem 0x6004400000000-0x60044001fffff 64bit pref]
      [  263.717680] pci 0001:00:00.0: PCI bridge to [bus 01-03]
      [  263.717687] pci 0001:00:00.0:   bridge window [mem 0x600c080000000-0x600c0ffefffff]
      [  263.717692] pci 0001:00:00.0:   bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [  263.717699] pci 0001:01:00.0: PCI bridge to [bus 02-03]
      [  263.717708] pci 0001:01:00.0:   bridge window [mem 0x600c080000000-0x600c0ffefffff]
      [  263.717713] pci 0001:01:00.0:   bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [  263.717720] pci 0001:02:00.0: PCI bridge to [bus 03]
      [  263.717727] pci 0001:02:00.0:   bridge window [mem 0x600c080000000-0x600c0807fffff]
      [  263.717732] pci 0001:02:00.0:   bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [  263.717747] amdgpu 0001:03:00.0: amdgpu: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
      [  263.717751] amdgpu 0001:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
      [  263.717755] [drm] Detected VRAM RAM=16368M, BAR=16384M
      [  263.717757] [drm] RAM width 256bits GDDR6
      [  263.717820] [drm] amdgpu: 16368M of VRAM memory ready
      [  263.717827] [drm] amdgpu: 16368M of GTT memory ready.
      [  263.717838] [drm] GART: num cpu pages 8192, num gpu pages 131072
      [  263.717950] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
      [  272.048495] [drm] use_doorbell being set to: [true]
      [  272.048552] [drm] use_doorbell being set to: [true]
      [  272.048605] [drm] use_doorbell being set to: [true]
      [  272.048662] [drm] use_doorbell being set to: [true]
      [  272.048976] [drm] Found VCN firmware Version ENC: 1.3 DEC: 2 VEP: 0 Revision: 17
      [  272.048986] [drm] PSP loading VCN firmware
      [  272.273424] [drm] reserve 0xa00000 from 0x83fe000000 for PSP TMR
      [  272.943503] amdgpu 0001:03:00.0: amdgpu: smu driver if version = 0x00000039, smu fw if version = 0x0000003b, smu fw version = 0x003a3100 (58.49.0)
      [  272.943507] amdgpu 0001:03:00.0: amdgpu: SMU driver if version not matched
      [  272.943517] amdgpu 0001:03:00.0: amdgpu: use vbios provided pptable
      [  273.018737] amdgpu 0001:03:00.0: amdgpu: SMU is initialized successfully!
      [  273.023894] [drm] kiq ring mec 2 pipe 1 q 0
      [  273.085574] [drm] VCN decode and encode initialized successfully(under DPG Mode).
      [  273.085784] [drm] JPEG decode initialized successfully.
      [  273.086032] kfd kfd: Allocated 3969056 bytes on gart
      [  273.086334] Virtual CRAT table created for GPU
      [  273.086837] amdgpu: Topology: Add dGPU node [0x73bf:0x1002]
      [  273.086845] kfd kfd: added device 1002:73bf
      [  273.086850] amdgpu 0001:03:00.0: amdgpu: SE 4, SH per SE 2, CU per SH 10, active_cu_number 80
      [  273.087044] amdgpu 0001:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
      [  273.087048] amdgpu 0001:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
      [  273.087051] amdgpu 0001:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
      [  273.087055] amdgpu 0001:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
      [  273.087058] amdgpu 0001:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
      [  273.087062] amdgpu 0001:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
      [  273.087065] amdgpu 0001:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
      [  273.087069] amdgpu 0001:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
      [  273.087072] amdgpu 0001:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
      [  273.087076] amdgpu 0001:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
      [  273.087079] amdgpu 0001:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
      [  273.087083] amdgpu 0001:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
      [  273.087086] amdgpu 0001:03:00.0: amdgpu: ring sdma2 uses VM inv eng 14 on hub 0
      [  273.087089] amdgpu 0001:03:00.0: amdgpu: ring sdma3 uses VM inv eng 15 on hub 0
      [  273.087093] amdgpu 0001:03:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
      [  273.087096] amdgpu 0001:03:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
      [  273.087100] amdgpu 0001:03:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
      [  273.087103] amdgpu 0001:03:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 5 on hub 1
      [  273.087106] amdgpu 0001:03:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 6 on hub 1
      [  273.087110] amdgpu 0001:03:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 7 on hub 1
      [  273.087113] amdgpu 0001:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 8 on hub 1
      [  273.094373] EEH: Recovering PHB#1-PE#0
      [  273.094380] EEH: PE location: UOPWR.D100020-Node0-SLOT1 PCIE 4.0 X16, PHB location: N/A
      [  273.094385] EEH: Frozen PHB#1-PE#0 detected
      [  273.094386] EEH: Call Trace:
      [  273.094393] EEH: [0000000088d68852] __eeh_send_failure_event+0x7c/0x160
      [  273.094396] EEH: [0000000053433783] eeh_dev_check_failure.part.0+0x254/0x5e0
      [  273.094499] EEH: [000000000f3ba7f6] amdgpu_device_rreg+0x180/0x210 [amdgpu]
      [  273.094627] EEH: [0000000069e7642c] mmhub_v2_0_set_clockgating+0x1f8/0x320 [amdgpu]
      [  273.094738] EEH: [00000000a554a501] gmc_v10_0_set_clockgating_state+0x44/0xb0 [amdgpu]
      [  273.094841] EEH: [0000000063a011e7] amdgpu_device_ip_late_init+0x150/0x7d0 [amdgpu]
      [  273.094947] EEH: [00000000294ed418] amdgpu_device_init+0x19a8/0x1fc0 [amdgpu]
      [  273.095051] EEH: [00000000273acd85] amdgpu_driver_load_kms+0x30/0x520 [amdgpu]
      [  273.095153] EEH: [00000000f91deff0] amdgpu_pci_probe+0x18c/0x340 [amdgpu]
      [  273.095158] EEH: [0000000028f6d7d4] local_pci_probe+0x68/0x110
      [  273.095161] EEH: [00000000b5bc188e] work_for_cpu_fn+0x38/0x60
      [  273.095163] EEH: [00000000bf267e16] process_one_work+0x300/0x5d0
      [  273.095166] EEH: [00000000ac280537] worker_thread+0x360/0x780
      [  273.095170] EEH: [00000000409ee3ee] kthread+0x1e4/0x1f0
      [  273.095176] EEH: [000000001c930e8a] ret_from_kernel_thread+0x5c/0x6c
      [  273.095178] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
      [  273.095180] EEH: Notify device drivers to shutdown
      [  273.095185] EEH: Beginning: 'error_detected(IO frozen)'
      [  273.356962] [drm] Initialized amdgpu 3.40.0 20150101 for 0001:03:00.0 on minor 1
      [  273.357162] PCI 0001:03:00.0#0000: EEH: Invoking amdgpu->error_detected(IO frozen)
      [  273.357165] [drm] PCI error: detected callback, state(2)!!
      [  273.357588] PCI 0001:03:00.0#0000: EEH: amdgpu driver reports: 'need reset'
      [  273.357593] PCI 0001:03:00.1#0000: EEH: driver not EEH aware
      [  273.357595] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
      [  273.357601] EEH: Collect temporary log
      [  273.357639] EEH: of node=0001:03:00.0
      [  273.357642] EEH: PCI device/vendor: 73bf1002
      [  273.357644] EEH: PCI cmd/status register: 00100546
      [  273.357646] EEH: PCI-E capabilities and status follow:
      [  273.357656] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04 
      [  273.357664] EEH: PCI-E 10: 11040040 00000000 00000000 00000000 
      [  273.357665] EEH: PCI-E 20: 00000000 
      [  273.357667] EEH: PCI-E AER capability register set follows:
      [  273.357676] EEH: PCI-E AER 00: 20020001 00000000 00000000 00462030 
      [  273.357684] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000 
      [  273.357691] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
      [  273.357695] EEH: PCI-E AER 30: 00000000 00000000 
      [  273.357697] EEH: of node=0001:03:00.1
      [  273.357700] EEH: PCI device/vendor: ab281002
      [  273.357703] EEH: PCI cmd/status register: 00100546
      [  273.357704] EEH: PCI-E capabilities and status follow:
      [  273.357713] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04 
      [  273.357721] EEH: PCI-E 10: 11040040 00000000 00000000 00000000 
      [  273.357722] EEH: PCI-E 20: 00000000 
      [  273.357724] EEH: PCI-E AER capability register set follows:
      [  273.357733] EEH: PCI-E AER 00: 2a020001 00000000 00000000 00462030 
      [  273.357740] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000 
      [  273.357748] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
      [  273.357751] EEH: PCI-E AER 30: 00000000 00000000 
      [  273.357754] PHB4 PHB#1 Diag-data (Version: 1)
      [  273.357755] brdgCtl:    00000002
      [  273.357757] RootSts:    00000020 00402000 a0440008 00100107 00001000
      [  273.357759] RootErrSts: 00000000 00008000 00000000
      [  273.357761] PhbSts:     0000001c00000000 0000001c00000000
      [  273.357762] Lem:        0000000100280000 0000000000000000 0000000100000000
      [  273.357764] PhbErr:     0000088000000000 0000008000000000 2148000098000240 a008400000000000
      [  273.357766] RxeArbErr:  8000200000000000 0000200000000000 00009fde30000000 0000000000000000
      [  273.357768] PblErr:     0000000008000000 0000000008000000 0000000000000000 0000000000000000
      [  273.357770] PcieDlp:    0000000000000000 0000000000000000 b000000000000000
      [  273.357771] RegbErr:    0000004000000000 0000004000000000 4800003c00000000 0000000000000200
      [  273.357773] PE[000] A/B: a480002a03000000 8000000000000000
      [  273.357776] EEH: Reset without hotplug activity
      [  273.357779] EEH: Removing 0001:03:00.1 without EEH sensitive driver
      [  273.463561] amdgpu 0001:03:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
      [  273.463564] amdgpu 0001:03:00.0: amdgpu: Failed to enable gfxoff!
      [  273.488713] snd_hda_intel 0001:03:00.1: CORB reset timeout#2, CORBRP = 65535
      [  273.948759] snd_hda_intel 0001:03:00.1: CORB reset timeout#2, CORBRP = 65535
      [  274.353721] snd_hda_codec_hdmi hdaudioC0D0: Unable to sync register 0x2f0d00. -5
      [  274.353738] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
      [  274.353755] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
      [  274.353769] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
      [  274.353782] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
      [  274.353795] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
      [  274.353807] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
      [  274.389593] [drm] Register(0) [mmUVD_PGFSM_STATUS] failed to reach value 0x00800000 != 0x00c00000
      [  274.389649] [drm:jpeg_v3_0_set_powergating_state [amdgpu]] *ERROR* amdgpu: JPEG enable power gating failed
      [  274.389694] [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <jpeg_v3_0> failed -110
      [  274.403707] amdgpu 0001:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
      [  274.403771] [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).
      [  274.625435] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
      [  274.861011] [drm] Register(0) [mmUVD_RBC_RB_RPTR] failed to reach value 0x7fffffff != 0xffffffff
      [  275.097223] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
      [  275.332748] [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
      [  275.568688] [drm] Register(1) [mmUVD_RBC_RB_RPTR] failed to reach value 0x7fffffff != 0xffffffff
      [  275.804270] [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
      [  275.804277] amdgpu 0001:03:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
      [  275.804279] amdgpu 0001:03:00.0: amdgpu: Failed to power gate VCN!
      [  275.804336] [drm:amdgpu_dpm_enable_uvd [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -5. 
      [  276.244073] pci 0001:03:00.1: Removing from iommu group 1
      [  278.395265] amdgpu 0001:03:00.0: enabling device (0140 -> 0142)
      [  278.401960] EEH: Sleep 5s ahead of partial hotplug
      [  283.434989] pci 0001:03:00.1: [1002:ab28] type 00 class 0x040300
      [  283.435009] pci 0001:03:00.1: reg 0x10: [mem 0x600c080120000-0x600c080123fff]
      [  283.435067] pci 0001:03:00.1: BAR0 [mem size 0x00004000]: requesting alignment to 0x10000
      [  283.435131] pci 0001:03:00.1: PME# supported from D1 D2 D3hot D3cold
      [  283.435698] pci 0001:03:00.1: can't claim BAR 0 [mem size 0x00004000]: no address assigned
      [  283.435706] pci 0001:03:00.1: BAR 0: assigned [mem 0x600c080120000-0x600c080123fff]
      [  283.435711] pci 0001:02:00.0: PCI bridge to [bus 03]
      [  283.435716] pci 0001:02:00.0:   bridge window [mem 0x600c080000000-0x600c0807fffff]
      [  283.435720] pci 0001:02:00.0:   bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [  283.435731] pci 0001:03:00.1: Added to existing PE#0
      [  283.435738] pci 0001:03:00.1: Adding to iommu group 1
      [  283.435833] pci 0001:03:00.1: D0 power state depends on 0001:03:00.0
      [  283.435903] snd_hda_intel 0001:03:00.1: enabling device (0140 -> 0142)
      [  283.435912] snd_hda_intel 0001:03:00.1: Force to snoop mode by module option
      [  283.435956] EEH: Beginning: 'slot_reset'
      [  283.435961] PCI 0001:03:00.0#0000: EEH: Invoking amdgpu->slot_reset()
      [  283.435963] [drm] PCI error: slot reset callback!!
      [  283.442319] input: HDA ATI HDMI HDMI/DP,pcm=3 as /devices/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:00.0/0001:03:00.1/sound/card0/input11
      [  283.442436] input: HDA ATI HDMI HDMI/DP,pcm=7 as /devices/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:00.0/0001:03:00.1/sound/card0/input12
      [  283.442513] input: HDA ATI HDMI HDMI/DP,pcm=8 as /devices/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:00.0/0001:03:00.1/sound/card0/input13
      [  283.442587] input: HDA ATI HDMI HDMI/DP,pcm=9 as /devices/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:00.0/0001:03:00.1/sound/card0/input14
      [  283.442658] input: HDA ATI HDMI HDMI/DP,pcm=10 as /devices/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:00.0/0001:03:00.1/sound/card0/input15
      [  283.442730] input: HDA ATI HDMI HDMI/DP,pcm=11 as /devices/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:00.0/0001:03:00.1/sound/card0/input16
      [  284.283468] [drm] free PSP TMR buffer
      [  284.304489] amdgpu 0001:03:00.0: amdgpu: GPU reset succeeded, trying to resume
      [  284.304576] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
      [  284.304600] [drm] VRAM is lost due to GPU reset!
      [  284.305078] [drm] PSP is resuming...
      [  284.544795] [drm] reserve 0xa00000 from 0x83fe000000 for PSP TMR
      [  285.204874] amdgpu 0001:03:00.0: amdgpu: SMU is resuming...
      [  285.204882] amdgpu 0001:03:00.0: amdgpu: smu driver if version = 0x00000039, smu fw if version = 0x0000003b, smu fw version = 0x003a3100 (58.49.0)
      [  285.204885] amdgpu 0001:03:00.0: amdgpu: SMU driver if version not matched
      [  285.275239] amdgpu 0001:03:00.0: amdgpu: failed send message: GetDpmFreqByIndex (31) 	param: 0x000500ff response 0xfffffffb
      [  285.275242] amdgpu 0001:03:00.0: amdgpu: [smu_v11_0_set_single_dpm_table] failed to get dpm levels!
      [  285.275244] amdgpu 0001:03:00.0: amdgpu: Failed to setup default dpm clock tables!
      [  285.275246] amdgpu 0001:03:00.0: amdgpu: Failed to setup default dpm clock tables!
      [  285.275248] amdgpu 0001:03:00.0: amdgpu: Failed to setup smc hw!
      [  285.275315] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -5
      [  285.275397] [drm:amdgpu_pci_slot_reset [amdgpu]] *ERROR* PCIe error recovery failed, err:-5
      [  285.275401] PCI 0001:03:00.0#0000: EEH: amdgpu driver reports: 'disconnect'
      [  285.275406] PCI 0001:03:00.1#0000: EEH: driver not EEH aware
      [  285.275408] EEH: Finished:'slot_reset' with aggregate recovery state:'disconnect'
      [  285.275410] EEH: Unable to recover from failure from PHB#1-PE#0.
                     Please try reseating or replacing it
      [  285.275455] EEH: of node=0001:03:00.0
      [  285.275458] EEH: PCI device/vendor: 73bf1002
      [  285.275461] EEH: PCI cmd/status register: 00100546
      [  285.275463] EEH: PCI-E capabilities and status follow:
      [  285.275474] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04 
      [  285.275483] EEH: PCI-E 10: 11040040 00000000 00000000 00000000 
      [  285.275484] EEH: PCI-E 20: 00000000 
      [  285.275486] EEH: PCI-E AER capability register set follows:
      [  285.275496] EEH: PCI-E AER 00: 20020001 00000000 00000000 00462030 
      [  285.275505] EEH: PCI-E AER 10: 00000000 00002000 000001f4 60008002 
      [  285.275513] EEH: PCI-E AER 20: 000000ff 00060044 00000458 00000000 
      [  285.275517] EEH: PCI-E AER 30: 00000000 00000000 
      [  285.275520] EEH: of node=0001:03:00.1
      [  285.275522] EEH: PCI device/vendor: ab281002
      [  285.275525] EEH: PCI cmd/status register: 00100546
      [  285.275527] EEH: PCI-E capabilities and status follow:
      [  285.275537] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04 
      [  285.275545] EEH: PCI-E 10: 11040000 00000000 00000000 00000000 
      [  285.275547] EEH: PCI-E 20: 00000000 
      [  285.275548] EEH: PCI-E AER capability register set follows:
      [  285.275558] EEH: PCI-E AER 00: 2a020001 00000000 00000000 00462030 
      [  285.275567] EEH: PCI-E AER 10: 00000000 00002000 000001f4 60008002 
      [  285.275575] EEH: PCI-E AER 20: 000000ff 00060044 00000458 00000000 
      [  285.275579] EEH: PCI-E AER 30: 00000000 00000000 
      [  285.275581] PHB4 PHB#1 Diag-data (Version: 1)
      [  285.275582] brdgCtl:    00000002
      [  285.275585] RootSts:    00000020 00402000 a0440008 00100107 00005000
      [  285.275587] RootErrSts: 00000024 00008000 00000000
      [  285.275588] sourceId:   03010000
      [  285.275590] PhbSts:     0000001c00000000 0000001c00000000
      [  285.275592] Lem:        0000000104280000 0000000000000000 0000000100000000
      [  285.275594] PhbErr:     0000088000000000 0000008000000000 2148000098000240 a008400000000000
      [  285.275596] RxeArbErr:  8000200000000020 0000200000000000 00009fde30000000 0000000000000000
      [  285.275598] PblErr:     0000000008000000 0000000008000000 0000000000000000 0000000000000000
      [  285.275600] PcieDlp:    0000000000000000 0000000000000000 b000000000000000
      [  285.275602] RegbErr:    0000004000000000 0000004000000000 4800003c00000000 0000000000000200
      [  285.275604] PE[000] A/B: a480002a03000000 8000000000000000
      [  285.275607] EEH: Beginning: 'error_detected(permanent failure)'
      [  285.275610] PCI 0001:03:00.0#0000: EEH: not actionable (1,1,1)
      [  285.275613] PCI 0001:03:00.1#0000: EEH: not actionable (1,1,1)
      [  285.275615] EEH: Finished:'error_detected(permanent failure)'
      [  286.001810] pci 0001:03:00.1: Removing from iommu group 1
      [  286.001983] [drm:amdgpu_pci_remove [amdgpu]] *ERROR* Hotplug removal is not supported
      [  286.002383] amdgpu 0001:03:00.0: amdgpu: amdgpu: finishing device.
      [  290.430911] amdgpu: cp queue pipe 4 queue 0 preemption failed
      [  290.871333] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d8d7cfd5; ring_buffer_end = 000000004bc2dd70; write_frame = 00000000415de82c
      [  290.871376] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
      [  291.201813] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d8d7cfd5; ring_buffer_end = 000000004bc2dd70; write_frame = 00000000415de82c
      [  291.201876] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
      [  292.408325] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d8d7cfd5; ring_buffer_end = 000000004bc2dd70; write_frame = 00000000415de82c
      [  292.408380] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
      [  292.848782] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d8d7cfd5; ring_buffer_end = 000000004bc2dd70; write_frame = 00000000415de82c
      [  292.848846] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
      [  293.179174] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d8d7cfd5; ring_buffer_end = 000000004bc2dd70; write_frame = 00000000415de82c
      [  293.179217] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
      [  293.179225] [drm] free PSP TMR buffer
      [  293.513528] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d8d7cfd5; ring_buffer_end = 000000004bc2dd70; write_frame = 00000000415de82c
      [  293.513593] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
      [  297.650869] BUG: Unable to handle kernel data access on read at 0xf0a803030303a898
      [  297.650872] Faulting instruction address: 0xc000000000cc8298
      [  297.650875] Oops: Kernel access of bad area, sig: 11 [#1]
      [  297.650877] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
      [  297.650879] Modules linked in: amdgpu mfd_core gpu_sched xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter rfkill ip6_tables iptable_filter snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_usb_audio snd_hda_codec at24 regmap_i2c snd_hda_core snd_usbmidi_lib snd_rawmidi snd_hwdep snd_seq joydev snd_seq_device crct10dif_vpmsum snd_pcm mc ofpart ipmi_powernv ipmi_devintf ipmi_msghandler powernv_flash snd_timer mtd rtc_opal snd opal_prd i2c_opal soundcore zram ip_tables ast drm_vram_helper drm_ttm_helper i2c_algo_bit ttm drm_kms_helper syscopyarea
      [  297.650935]  sysfillrect sysimgblt fb_sys_fops cec drm tg3 vmx_crypto i2c_core crc32c_vpmsum drm_panel_orientation_quirks nvme nvme_core sunrpc be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi fuse scsi_transport_iscsi
      [  297.650959] CPU: 23 PID: 177 Comm: eehd Not tainted 5.10.20-200.fc33.ppc64le #1
      [  297.650961] NIP:  c000000000cc8298 LR: c000000000cc8bb0 CTR: c000000000cc8b30
      [  297.650963] REGS: c000000010e67630 TRAP: 0380   Not tainted  (5.10.20-200.fc33.ppc64le)
      [  297.650965] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 84002822  XER: 00000000
      [  297.650973] CFAR: c000000000cc8bac IRQMASK: 0 
                     GPR00: c000000000cc8bb0 c000000010e678c0 c0000000023dc800 f0a803030303a880 
                     GPR04: 00000000000000c0 00000000c0000000 c00000000303a830 c00000000171f338 
                     GPR08: 003ffff800000201 c00000000171f338 c008000004190000 c008000005f28338 
                     GPR12: c000000000cc8b30 c000000fff6e7000 c0000000001af288 c000000010c704c0 
                     GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
                     GPR20: 0000000000000000 c00000001ee96d90 c00000001ee85b70 c00000001ee85b90 
                     GPR24: c00000001ee85b98 c00000001ee85b88 0000000000000000 c0080000060c8dc8 
                     GPR28: 0000000000000003 0000000000000000 c00000001ee80000 f0a803030303a880 
      [  297.651005] NIP [c000000000cc8298] free_fw_priv+0x28/0x280
      [  297.651007] LR [c000000000cc8bb0] release_firmware+0x80/0xe0
      [  297.651009] Call Trace:
      [  297.651011] [c000000010e67930] [c000000000cc8bb0] release_firmware+0x80/0xe0
      [  297.651062] [c000000010e67960] [c008000005b96b48] psp_sw_fini+0x90/0x120 [amdgpu]
      [  297.651116] [c000000010e679a0] [c008000005f1fe48] amdgpu_device_fini+0x3d0/0x630 [amdgpu]
      [  297.651151] [c000000010e67a60] [c008000005acce70] amdgpu_driver_unload_kms+0x1c8/0x330 [amdgpu]
      [  297.651185] [c000000010e67aa0] [c008000005ac08bc] amdgpu_pci_remove+0x64/0xa0 [amdgpu]
      [  297.651189] [c000000010e67b10] [c000000000b3c158] pci_device_remove+0x68/0x120
      [  297.651192] [c000000010e67b50] [c000000000c93688] device_release_driver_internal+0x2f8/0x410
      [  297.651195] [c000000010e67ba0] [c000000000b26668] pci_stop_and_remove_bus_device+0xb8/0x110
      [  297.651198] [c000000010e67be0] [c0000000000732f0] pci_hp_remove_devices+0x90/0x130
      [  297.651201] [c000000010e67c70] [c00000000004e9c0] eeh_handle_normal_event+0x510/0xa40
      [  297.651203] [c000000010e67d50] [c00000000004fdd8] eeh_event_handler+0x118/0x1a0
      [  297.651206] [c000000010e67db0] [c0000000001af464] kthread+0x1e4/0x1f0
      [  297.651208] [c000000010e67e20] [c00000000000d4f0] ret_from_kernel_thread+0x5c/0x6c
      [  297.651210] Instruction dump:
      [  297.651212] 60000000 4bffffd8 3c4c0171 38424590 7c0802a6 60000000 7c0802a6 fbe1fff8 
      [  297.651218] fbc1fff0 7c7f1b78 f8010010 f821ff91 <ebc30018> 7fc3f378 48601309 60000000 
      [  297.651226] ---[ end trace 87a3804e7d686ea3 ]---
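
One way to confirm which CPU page size the driver actually saw is the `GART: num cpu pages ..., num gpu pages ...` line in the log above. The sketch below reproduces that arithmetic for a 512M GART. It assumes the GPU page size is 4K, which is consistent with the reported 131072 GPU pages; the function name is hypothetical.

```python
def gart_page_counts(gart_bytes: int, cpu_page_bytes: int,
                     gpu_page_bytes: int = 4096) -> tuple[int, int]:
    """Return (num_cpu_pages, num_gpu_pages) covering a GART aperture.

    gpu_page_bytes defaults to 4K, an assumption that matches the
    131072 GPU pages reported for the 512M GART in this log.
    """
    return gart_bytes // cpu_page_bytes, gart_bytes // gpu_page_bytes


GART_SIZE = 512 * 1024 * 1024  # "PCIE GART of 512M enabled"

# 64K kernel pages: matches "num cpu pages 8192, num gpu pages 131072".
print(gart_page_counts(GART_SIZE, 64 * 1024))

# 4K kernel pages: CPU and GPU page counts would be equal.
print(gart_page_counts(GART_SIZE, 4 * 1024))
```

If the CPU page count printed by the driver does not match the expected value, the kernel being tested is probably not the one intended.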

      I suspected that the firmware might not load correctly when the kernel page size is 64K, so I tried again with a custom 4K page size kernel, but the result is the same:

      [   69.457441] amdgpu: Topology: Add CPU node
      [   69.458707] amdgpu 0001:03:00.0: enabling device (0140 -> 0142)
      [   69.458717] [drm] initializing kernel modesetting (SIENNA_CICHLID 0x1002:0x73BF 0x1DA2:0xE438 0xC0).
      [   69.458720] amdgpu 0001:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
      [   69.458732] [drm] register mmio base: 0x80000000
      [   69.458733] [drm] register mmio size: 1048576
      [   69.458735] [drm] PCI I/O BAR is not found.
      [   69.458744] [drm] PCIE atomic ops is not supported
      [   69.461020] [drm] add ip block number 0 <nv_common>
      [   69.461022] [drm] add ip block number 1 <gmc_v10_0>
      [   69.461023] [drm] add ip block number 2 <navi10_ih>
      [   69.461025] [drm] add ip block number 3 <psp>
      [   69.461026] [drm] add ip block number 4 <smu>
      [   69.461028] [drm] add ip block number 5 <gfx_v10_0>
      [   69.461029] [drm] add ip block number 6 <sdma_v5_2>
      [   69.461031] [drm] add ip block number 7 <vcn_v3_0>
      [   69.461032] [drm] add ip block number 8 <jpeg_v3_0>
      [   69.492308] amdgpu 0001:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
      [   69.492311] amdgpu: ATOM BIOS: 113-E438XTX-UO2
      [   69.492324] [drm] VCN(0) decode is enabled in VM mode
      [   69.492325] [drm] VCN(1) decode is enabled in VM mode
      [   69.492327] [drm] VCN(0) encode is enabled in VM mode
      [   69.492328] [drm] VCN(1) encode is enabled in VM mode
      [   69.492330] [drm] JPEG decode is enabled in VM mode
      [   69.492336] [drm] GPU posting now...
      [   69.492367] amdgpu 0001:03:00.0: amdgpu: HBM ECC is not presented.
      [   69.492370] amdgpu 0001:03:00.0: amdgpu: SRAM ECC is not presented.
      [   69.492374] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
      [   69.492401] amdgpu 0001:03:00.0: BAR 2: releasing [mem 0x6004010000000-0x60040101fffff 64bit pref]
      [   69.492404] amdgpu 0001:03:00.0: BAR 0: releasing [mem 0x6004000000000-0x600400fffffff 64bit pref]
      [   69.492432] pci 0001:02:00.0: BAR 15: releasing [mem 0x6004000000000-0x600403fffffff 64bit pref]
      [   69.492435] pci 0001:01:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [   69.492438] pci 0001:00:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [   69.492447] pci 0001:00:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   69.492451] pci 0001:01:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   69.492454] pci 0001:02:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   69.492458] amdgpu 0001:03:00.0: BAR 0: assigned [mem 0x6004000000000-0x60043ffffffff 64bit pref]
      [   69.492467] amdgpu 0001:03:00.0: BAR 2: assigned [mem 0x6004400000000-0x60044001fffff 64bit pref]
      [   69.492477] pci 0001:00:00.0: PCI bridge to [bus 01-03]
      [   69.492482] pci 0001:00:00.0:   bridge window [mem 0x600c080000000-0x600c0ffefffff]
      [   69.492485] pci 0001:00:00.0:   bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [   69.492490] pci 0001:01:00.0: PCI bridge to [bus 02-03]
      [   69.492495] pci 0001:01:00.0:   bridge window [mem 0x600c080000000-0x600c0ffefffff]
      [   69.492499] pci 0001:01:00.0:   bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [   69.492504] pci 0001:02:00.0: PCI bridge to [bus 03]
      [   69.492509] pci 0001:02:00.0:   bridge window [mem 0x600c080000000-0x600c0807fffff]
      [   69.492512] pci 0001:02:00.0:   bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   69.492523] amdgpu 0001:03:00.0: amdgpu: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
      [   69.492526] amdgpu 0001:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
      [   69.492529] [drm] Detected VRAM RAM=16368M, BAR=16384M
      [   69.492531] [drm] RAM width 256bits GDDR6
      [   69.492572] [drm] amdgpu: 16368M of VRAM memory ready
      [   69.492577] [drm] amdgpu: 16368M of GTT memory ready.
      [   69.492585] [drm] GART: num cpu pages 131072, num gpu pages 131072
      [   69.499431] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
      [   69.500569] EEH: Recovering PHB#1-PE#0
      [   69.500574] EEH: PE location: UOPWR.D100020-Node0-SLOT1 PCIE 4.0 X16, PHB location: N/A
      [   69.500576] EEH: Frozen PHB#1-PE#0 detected
      [   69.500578] EEH: Call Trace:
      [   69.500583] EEH: [00000000d9e7d323] __eeh_send_failure_event+0x7c/0x160
      [   69.500588] EEH: [00000000d61ba426] eeh_dev_check_failure.part.0+0x254/0x5e0
      [   69.500693] EEH: [0000000061d1df81] amdgpu_device_rreg+0x180/0x210 [amdgpu]
      [   69.500803] EEH: [00000000ed1fb3ed] gfxhub_v2_1_set_fault_enable_default+0x68/0x150 [amdgpu]
      [   69.500913] EEH: [000000001cce1aab] gmc_v10_0_hw_init+0x198/0x290 [amdgpu]
      [   69.501014] EEH: [0000000009744e54] amdgpu_device_init+0x1a74/0x1fc0 [amdgpu]
      [   69.501110] EEH: [000000005aac3e93] amdgpu_driver_load_kms+0x30/0x520 [amdgpu]
      [   69.501204] EEH: [0000000044cf3143] amdgpu_pci_probe+0x18c/0x340 [amdgpu]
      [   69.501208] EEH: [00000000827393ff] local_pci_probe+0x68/0x110
      [   69.501211] EEH: [00000000e5937af3] work_for_cpu_fn+0x38/0x60
      [   69.501214] EEH: [0000000027a7f486] process_one_work+0x300/0x5d0
      [   69.501217] EEH: [0000000041c5aee3] worker_thread+0x360/0x780
      [   69.501219] EEH: [00000000787f3030] kthread+0x1e4/0x1f0
      [   69.501222] EEH: [0000000021927c95] ret_from_kernel_thread+0x5c/0x6c
      [   69.501224] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
      [   69.501225] EEH: Notify device drivers to shutdown
      [   69.501228] EEH: Beginning: 'error_detected(IO frozen)'
      [   69.516456] [drm] use_doorbell being set to: [true]
      [   69.516536] [drm] use_doorbell being set to: [true]
      [   69.516639] [drm] use_doorbell being set to: [true]
      [   69.516739] [drm] use_doorbell being set to: [true]
      [   69.518119] [drm] Found VCN firmware Version ENC: 1.3 DEC: 2 VEP: 0 Revision: 17
      [   69.518135] [drm] PSP loading VCN firmware
      [   69.784609] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
      [   69.784671] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
      [   69.784725] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
      [   69.784727] amdgpu 0001:03:00.0: amdgpu: amdgpu_device_ip_init failed
      [   69.784738] amdgpu 0001:03:00.0: amdgpu: Fatal error during GPU init
      [   69.785890] amdgpu: probe of 0001:03:00.0 failed with error -22
      [   69.785920] PCI 0001:03:00.0#0000: EEH: no driver
      [   69.785923] PCI 0001:03:00.1#0000: EEH: driver not EEH aware
      [   69.785926] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'none'
      [   69.785931] EEH: Collect temporary log
      [   69.785972] EEH: of node=0001:03:00.0
      [   69.785976] EEH: PCI device/vendor: 73bf1002
      [   69.785979] EEH: PCI cmd/status register: 00100542
      [   69.785980] EEH: PCI-E capabilities and status follow:
      [   69.785991] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04 
      [   69.786000] EEH: PCI-E 10: 11040040 00000000 00000000 00000000 
      [   69.786002] EEH: PCI-E 20: 00000000 
      [   69.786003] EEH: PCI-E AER capability register set follows:
      [   69.786014] EEH: PCI-E AER 00: 20020001 00000000 00000000 00462030 
      [   69.786023] EEH: PCI-E AER 10: 00000000 00002000 000001f4 40008001 
      [   69.786033] EEH: PCI-E AER 20: 0000000f 8007f000 00000000 00000000 
      [   69.786036] EEH: PCI-E AER 30: 00000000 00000000 
      [   69.786039] EEH: of node=0001:03:00.1
      [   69.786042] EEH: PCI device/vendor: ab281002
      [   69.786045] EEH: PCI cmd/status register: 00100546
      [   69.786046] EEH: PCI-E capabilities and status follow:
      [   69.786057] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04 
      [   69.786065] EEH: PCI-E 10: 11040040 00000000 00000000 00000000 
      [   69.786067] EEH: PCI-E 20: 00000000 
      [   69.786070] EEH: PCI-E AER capability register set follows:
      [   69.786080] EEH: PCI-E AER 00: 2a020001 00000000 00000000 00462030 
      [   69.786089] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000 
      [   69.786097] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
      [   69.786101] EEH: PCI-E AER 30: 00000000 00000000 
      [   69.786103] PHB4 PHB#1 Diag-data (Version: 1)
      [   69.786105] brdgCtl:    00000002
      [   69.786107] RootSts:    00000020 00402000 a0440008 00100107 00004000
      [   69.786109] RootErrSts: 00000024 00000000 00000000
      [   69.786110] sourceId:   03000000
      [   69.786112] PhbSts:     0000001c00000000 0000001c00000000
      [   69.786114] Lem:        0000000004000000 0000000000000000 0000000004000000
      [   69.786116] PhbErr:     0000080000000000 0000080000000000 2148000098000240 a008400000000000
      [   69.786120] RxeArbErr:  0000000000000020 0000000000000020 4000030000000000 0000000000000000
      [   69.786122] PcieDlp:    0000000000000000 0000000000000000 7000000000000000
      [   69.786126] PE[000] A/B: 8720002503000000 8000000000000000
      [   69.786128] EEH: Reset with hotplug activity
      [   69.930197] snd_hda_intel 0001:03:00.1: CORB reset timeout#2, CORBRP = 65535
      [   70.400246] snd_hda_intel 0001:03:00.1: CORB reset timeout#2, CORBRP = 65535
      [   70.825252] snd_hda_codec_hdmi hdaudioC0D0: Unable to sync register 0x2f0d00. -5
      [   70.825264] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
      [   70.825275] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
      [   70.825283] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
      [   70.825291] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
      [   70.825299] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
      [   70.825307] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
      [   71.335457] pci 0001:03:00.1: Removing from iommu group 1
      [   71.335661] pci 0001:03:00.0: Removing from iommu group 1
      [   73.513323] EEH: Sleep 5s ahead of complete hotplug
      [   78.547139] pci 0001:03:00.0: [1002:73bf] type 00 class 0x030000
      [   78.547163] pci 0001:03:00.0: reg 0x10: [mem 0x6004000000000-0x600400fffffff 64bit pref]
      [   78.547175] pci 0001:03:00.0: reg 0x18: [mem 0x6004010000000-0x60040101fffff 64bit pref]
      [   78.547184] pci 0001:03:00.0: reg 0x20: [io  0x0000-0x00ff]
      [   78.547191] pci 0001:03:00.0: reg 0x24: [mem 0x600c080000000-0x600c0800fffff]
      [   78.547199] pci 0001:03:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
      [   78.547330] pci 0001:03:00.0: PME# supported from D1 D2 D3hot D3cold
      [   78.547423] pci 0001:03:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x4 link at 0001:00:00.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
      [   78.547495] pci 0001:03:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
      [   78.547991] pci 0001:03:00.1: [1002:ab28] type 00 class 0x040300
      [   78.548006] pci 0001:03:00.1: reg 0x10: [mem 0x600c080120000-0x600c080123fff]
      [   78.548118] pci 0001:03:00.1: PME# supported from D1 D2 D3hot D3cold
      [   78.548638] pci 0001:02:00.0: ASPM: current common clock configuration is inconsistent, reconfiguring
      [   78.548679] pci 0001:02:00.0: BAR 13: no space for [io  size 0x1000]
      [   78.548681] pci 0001:02:00.0: BAR 13: failed to assign [io  size 0x1000]
      [   78.548686] pci 0001:03:00.0: BAR 0: assigned [mem 0x6004000000000-0x600400fffffff 64bit pref]
      [   78.548696] pci 0001:03:00.0: BAR 2: assigned [mem 0x6004010000000-0x60040101fffff 64bit pref]
      [   78.548706] pci 0001:03:00.0: BAR 5: assigned [mem 0x600c080000000-0x600c0800fffff]
      [   78.548711] pci 0001:03:00.0: BAR 6: assigned [mem 0x600c080100000-0x600c08011ffff pref]
      [   78.548713] pci 0001:03:00.1: BAR 0: assigned [mem 0x600c080120000-0x600c080123fff]
      [   78.548718] pci 0001:03:00.0: BAR 4: no space for [io  size 0x0100]
      [   78.548720] pci 0001:03:00.0: BAR 4: failed to assign [io  size 0x0100]
      [   78.548724] pci 0001:02:00.0: PCI bridge to [bus 03]
      [   78.548728] pci 0001:02:00.0:   bridge window [mem 0x600c080000000-0x600c0807fffff]
      [   78.548732] pci 0001:02:00.0:   bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   78.548736] PCI: No. 2 try to assign unassigned res
      [   78.548740] pci 0001:02:00.0: BAR 13: no space for [io  size 0x1000]
      [   78.548743] pci 0001:02:00.0: BAR 13: failed to assign [io  size 0x1000]
      [   78.548745] pci 0001:03:00.0: BAR 4: no space for [io  size 0x0100]
      [   78.548748] pci 0001:03:00.0: BAR 4: failed to assign [io  size 0x0100]
      [   78.548750] pci 0001:02:00.0: PCI bridge to [bus 03]
      [   78.548755] pci 0001:02:00.0:   bridge window [mem 0x600c080000000-0x600c0807fffff]
      [   78.548758] pci 0001:02:00.0:   bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   78.548770] pci 0001:03:00.0: Added to existing PE#0
      [   78.548776] pci 0001:03:00.0: Adding to iommu group 1
      [   78.548914] amdgpu 0001:03:00.0: enabling device (0140 -> 0142)
      [   78.548921] [drm] initializing kernel modesetting (SIENNA_CICHLID 0x1002:0x73BF 0x1DA2:0xE438 0xC0).
      [   78.548925] amdgpu 0001:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
      [   78.548937] [drm] register mmio base: 0x80000000
      [   78.548939] [drm] register mmio size: 1048576
      [   78.548940] [drm] PCI I/O BAR is not found.
      [   78.548947] [drm] PCIE atomic ops is not supported
      [   78.551169] [drm] add ip block number 0 <nv_common>
      [   78.551171] [drm] add ip block number 1 <gmc_v10_0>
      [   78.551173] [drm] add ip block number 2 <navi10_ih>
      [   78.551174] [drm] add ip block number 3 <psp>
      [   78.551176] [drm] add ip block number 4 <smu>
      [   78.551178] [drm] add ip block number 5 <gfx_v10_0>
      [   78.551180] [drm] add ip block number 6 <sdma_v5_2>
      [   78.551181] [drm] add ip block number 7 <vcn_v3_0>
      [   78.551183] [drm] add ip block number 8 <jpeg_v3_0>
      [   78.582437] amdgpu 0001:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
      [   78.582440] amdgpu: ATOM BIOS: 113-E438XTX-UO2
      [   78.582453] [drm] VCN(0) decode is enabled in VM mode
      [   78.582455] [drm] VCN(1) decode is enabled in VM mode
      [   78.582456] [drm] VCN(0) encode is enabled in VM mode
      [   78.582458] [drm] VCN(1) encode is enabled in VM mode
      [   78.582459] [drm] JPEG decode is enabled in VM mode
      [   78.582489] amdgpu 0001:03:00.0: amdgpu: HBM ECC is not presented.
      [   78.582491] amdgpu 0001:03:00.0: amdgpu: SRAM ECC is not presented.
      [   78.582497] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
      [   78.582522] amdgpu 0001:03:00.0: BAR 2: releasing [mem 0x6004010000000-0x60040101fffff 64bit pref]
      [   78.582525] amdgpu 0001:03:00.0: BAR 0: releasing [mem 0x6004000000000-0x600400fffffff 64bit pref]
      [   78.582552] pci 0001:02:00.0: BAR 15: releasing [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   78.582555] pci 0001:01:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [   78.582558] pci 0001:00:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [   78.582565] pci 0001:00:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   78.582568] pci 0001:01:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   78.582571] pci 0001:02:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   78.582574] amdgpu 0001:03:00.0: BAR 0: assigned [mem 0x6004000000000-0x60043ffffffff 64bit pref]
      [   78.582584] amdgpu 0001:03:00.0: BAR 2: assigned [mem 0x6004400000000-0x60044001fffff 64bit pref]
      [   78.582593] pci 0001:00:00.0: PCI bridge to [bus 01-03]
      [   78.582597] pci 0001:00:00.0:   bridge window [mem 0x600c080000000-0x600c0ffefffff]
      [   78.582601] pci 0001:00:00.0:   bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [   78.582606] pci 0001:01:00.0: PCI bridge to [bus 02-03]
      [   78.582611] pci 0001:01:00.0:   bridge window [mem 0x600c080000000-0x600c0ffefffff]
      [   78.582615] pci 0001:01:00.0:   bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [   78.582620] pci 0001:02:00.0: PCI bridge to [bus 03]
      [   78.582624] pci 0001:02:00.0:   bridge window [mem 0x600c080000000-0x600c0807fffff]
      [   78.582628] pci 0001:02:00.0:   bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   78.582639] amdgpu 0001:03:00.0: amdgpu: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
      [   78.582642] amdgpu 0001:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
      [   78.582645] [drm] Detected VRAM RAM=16368M, BAR=16384M
      [   78.582647] [drm] RAM width 256bits GDDR6
      [   78.582826] [drm] amdgpu: 16368M of VRAM memory ready
      [   78.582831] [drm] amdgpu: 16368M of GTT memory ready.
      [   78.582839] [drm] GART: num cpu pages 131072, num gpu pages 131072
      [   78.589574] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
      [   78.596296] [drm] use_doorbell being set to: [true]
      [   78.596663] [drm] use_doorbell being set to: [true]
      [   78.597025] [drm] use_doorbell being set to: [true]
      [   78.597450] [drm] use_doorbell being set to: [true]
      [   78.597861] [drm] Found VCN firmware Version ENC: 1.3 DEC: 2 VEP: 0 Revision: 17
      [   78.597869] [drm] PSP loading VCN firmware
      [   78.853223] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
      [   78.853269] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
      [   78.853306] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
      [   78.853309] amdgpu 0001:03:00.0: amdgpu: amdgpu_device_ip_init failed
      [   78.853319] amdgpu 0001:03:00.0: amdgpu: Fatal error during GPU init
      [   78.853350] amdgpu: probe of 0001:03:00.0 failed with error -22
      [   78.853354] pci 0001:03:00.1: Added to existing PE#0
      [   78.853359] pci 0001:03:00.1: Adding to iommu group 1
      [   78.853444] pci 0001:03:00.1: D0 power state depends on 0001:03:00.0
      [   78.853479] snd_hda_intel 0001:03:00.1: enabling device (0140 -> 0142)
      [   78.853484] snd_hda_intel 0001:03:00.1: Force to snoop mode by module option
      [   78.853504] EEH: Notify device driver to resume
      [   78.853506] EEH: Beginning: 'resume'
      [   78.853508] PCI 0001:03:00.0#0000: EEH: no driver
      [   78.853509] PCI 0001:03:00.1#0000: EEH: driver not EEH aware
      [   78.853510] EEH: Finished:'resume'
      [   78.853511] EEH: Recovery successful.
      [   78.853514] EEH: Recovering PHB#1-PE#0
      [   78.853516] EEH: PE location: UOPWR.D100020-Node0-SLOT1 PCIE 4.0 X16, PHB location: N/A
      [   78.853517] EEH: Frozen PHB#1-PE#0 detected
      [   78.853518] EEH: Call Trace:
      [   78.853522] EEH: [00000000d9e7d323] __eeh_send_failure_event+0x7c/0x160
      [   78.853524] EEH: [00000000d61ba426] eeh_dev_check_failure.part.0+0x254/0x5e0
      [   78.853561] EEH: [0000000061d1df81] amdgpu_device_rreg+0x180/0x210 [amdgpu]
      [   78.853606] EEH: [00000000ed1fb3ed] gfxhub_v2_1_set_fault_enable_default+0x68/0x150 [amdgpu]
      [   78.853651] EEH: [000000001cce1aab] gmc_v10_0_hw_init+0x198/0x290 [amdgpu]
      [   78.853688] EEH: [0000000009744e54] amdgpu_device_init+0x1a74/0x1fc0 [amdgpu]
      [   78.853725] EEH: [000000005aac3e93] amdgpu_driver_load_kms+0x30/0x520 [amdgpu]
      [   78.853762] EEH: [0000000044cf3143] amdgpu_pci_probe+0x18c/0x340 [amdgpu]
      [   78.853764] EEH: [00000000827393ff] local_pci_probe+0x68/0x110
      [   78.853766] EEH: [00000000e5937af3] work_for_cpu_fn+0x38/0x60
      [   78.853768] EEH: [0000000027a7f486] process_one_work+0x300/0x5d0
      [   78.853769] EEH: [0000000041c5aee3] worker_thread+0x360/0x780
      [   78.853770] EEH: [00000000787f3030] kthread+0x1e4/0x1f0
      [   78.853772] EEH: [0000000021927c95] ret_from_kernel_thread+0x5c/0x6c
      [   78.853773] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures.
      [   78.853774] EEH: Notify device drivers to shutdown
      [   78.853775] EEH: Beginning: 'error_detected(IO frozen)'
      [   78.853777] PCI 0001:03:00.0#0000: EEH: no driver
      [   78.853778] PCI 0001:03:00.1#0000: EEH: driver not EEH aware
      [   78.853779] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'none'
      [   78.853782] EEH: Collect temporary log
      [   78.853812] EEH: of node=0001:03:00.0
      [   78.853814] EEH: PCI device/vendor: 73bf1002
      [   78.853816] EEH: PCI cmd/status register: 00100542
      [   78.853817] EEH: PCI-E capabilities and status follow:
      [   78.853824] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04 
      [   78.853830] EEH: PCI-E 10: 11040040 00000000 00000000 00000000 
      [   78.853831] EEH: PCI-E 20: 00000000 
      [   78.853832] EEH: PCI-E AER capability register set follows:
      [   78.853839] EEH: PCI-E AER 00: 20020001 00000000 00000000 00462030 
      [   78.853845] EEH: PCI-E AER 10: 00000000 00002000 000001f4 40008001 
      [   78.853851] EEH: PCI-E AER 20: 0000000f 8007f000 00000000 00000000 
      [   78.853853] EEH: PCI-E AER 30: 00000000 00000000 
      [   78.853854] EEH: of node=0001:03:00.1
      [   78.853856] EEH: PCI device/vendor: ab281002
      [   78.853858] EEH: PCI cmd/status register: 00100142
      [   78.853859] EEH: PCI-E capabilities and status follow:
      [   78.853866] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04 
      [   78.853871] EEH: PCI-E 10: 11040040 00000000 00000000 00000000 
      [   78.853872] EEH: PCI-E 20: 00000000 
      [   78.853873] EEH: PCI-E AER capability register set follows:
      [   78.853880] EEH: PCI-E AER 00: 2a020001 00000000 00000000 00462030 
      [   78.853886] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000 
      [   78.853891] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
      [   78.853894] EEH: PCI-E AER 30: 00000000 00000000 
      [   78.853895] PHB4 PHB#1 Diag-data (Version: 1)
      [   78.853896] brdgCtl:    00000002
      [   78.853897] RootSts:    00000020 00402000 a0440008 00100107 00004000
      [   78.853898] RootErrSts: 00000024 00000000 00000000
      [   78.853899] sourceId:   03000000
      [   78.853900] PhbSts:     0000001c00000000 0000001c00000000
      [   78.853901] Lem:        0000000004000000 0000000000000000 0000000004000000
      [   78.853903] PhbErr:     0000080000000000 0000080000000000 2148000098000240 a008400000000000
      [   78.853904] RxeArbErr:  0000000000000020 0000000000000020 4000030000000000 0000000000000000
      [   78.853905] PcieDlp:    0000000000000000 0000000000000000 7000000000000000
      [   78.853906] PE[000] A/B: 8720002503000000 8000000000000000
      [   78.853908] EEH: Reset with hotplug activity
      [   78.853919] Attempt to iounmap early bolted mapping at 0x0000000000000000
      [   78.853983] pci 0001:03:00.1: Removing from iommu group 1
      [   78.854055] pci 0001:03:00.0: Removing from iommu group 1
      [   80.954155] EEH: Sleep 5s ahead of complete hotplug
      [   85.987779] ------------[ cut here ]------------
      [   85.987788] WARNING: CPU: 0 PID: 177 at arch/powerpc/kernel/eeh_pe.c:438 eeh_pe_tree_remove+0xb8/0x260
      [   85.987789] Modules linked in: amdgpu mfd_core gpu_sched xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink rfkill ip6table_filter ip6_tables iptable_filter sunrpc snd_hda_codec_hdmi snd_hda_intel snd_usb_audio snd_intel_dspcfg snd_hda_codec at24 regmap_i2c snd_hda_core snd_usbmidi_lib snd_rawmidi snd_hwdep snd_seq joydev snd_seq_device crct10dif_vpmsum snd_pcm mc ofpart ipmi_powernv ipmi_devintf powernv_flash ipmi_msghandler mtd snd_timer rtc_opal opal_prd snd i2c_opal soundcore zram ip_tables ast drm_vram_helper drm_ttm_helper ttm i2c_algo_bit drm_kms_helper syscopyarea
      [   85.987888]  sysfillrect sysimgblt fb_sys_fops cec drm vmx_crypto crc32c_vpmsum tg3 i2c_core drm_panel_orientation_quirks nvme nvme_core fuse
      [   85.987907] CPU: 0 PID: 177 Comm: eehd Not tainted 5.10.21-200.4kpagesize.fc33.ppc64le #1
      [   85.987909] NIP:  c00000000004b778 LR: c00000000004b710 CTR: c00000000004ce90
      [   85.987912] REGS: c00000000d14f840 TRAP: 0700   Not tainted  (5.10.21-200.4kpagesize.fc33.ppc64le)
      [   85.987913] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28002842  XER: 00000000
      [   85.987926] CFAR: c00000000004b7b0 IRQMASK: 0 
                     GPR00: c00000000004cee8 c00000000d14fad0 c000000002310900 0000000000000001 
                     GPR04: c000000003ec94b0 c000000003ec94b0 0000000028008844 0000000000000100 
                     GPR08: c00000000d7d4068 0000000000000000 0000000000000008 0000000000000000 
                     GPR12: c00000000004ce90 c0000000024f1000 c0000000001a3be8 c00000000d04fcc0 
                     GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
                     GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000045 
                     GPR24: 0000000000000002 0000000000000000 0000000000000000 c00000000d7a1800 
                     GPR28: 5deadbeef0000100 5deadbeef0000122 c00000000d7d0000 c00000000d7d4000 
      [   85.987975] NIP [c00000000004b778] eeh_pe_tree_remove+0xb8/0x260
      [   85.987977] LR [c00000000004b710] eeh_pe_tree_remove+0x50/0x260
      [   85.987979] Call Trace:
      [   85.987982] [c00000000d14fad0] [0000000000000027] 0x27 (unreliable)
      [   85.987987] [c00000000d14fb50] [c00000000004cee8] eeh_pe_detach_dev+0x58/0xc0
      [   85.987990] [c00000000d14fb80] [c00000000004afbc] eeh_pe_traverse+0x6c/0xf0
      [   85.987994] [c00000000d14fbc0] [c00000000004fb54] eeh_reset_device+0x21c/0x2c8
      [   85.987998] [c00000000d14fc70] [c00000000004ebd0] eeh_handle_normal_event+0x7e0/0xa40
      [   85.988001] [c00000000d14fd50] [c00000000004fd18] eeh_event_handler+0x118/0x1a0
      [   85.988005] [c00000000d14fdb0] [c0000000001a3dc4] kthread+0x1e4/0x1f0
      [   85.988009] [c00000000d14fe20] [c00000000000d4f0] ret_from_kernel_thread+0x5c/0x6c
      [   85.988011] Instruction dump:
      [   85.988013] 67bdf000 639c0100 63bd0122 fb9e0070 fbbe0078 e95f0002 ebdf0038 71490002 
      [   85.988023] 41820038 480000c4 2c290000 40820008 <0fe00000> e93f0068 7c294040 418200dc 
      [   85.988033] ---[ end trace c7c7bf27e0e1201f ]---
      [   85.988035] ------------[ cut here ]------------
      [   85.988039] WARNING: CPU: 0 PID: 177 at arch/powerpc/kernel/eeh_pe.c:438 eeh_pe_tree_remove+0xb8/0x260
      [   85.988040] Modules linked in: amdgpu mfd_core gpu_sched xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink rfkill ip6table_filter ip6_tables iptable_filter sunrpc snd_hda_codec_hdmi snd_hda_intel snd_usb_audio snd_intel_dspcfg snd_hda_codec at24 regmap_i2c snd_hda_core snd_usbmidi_lib snd_rawmidi snd_hwdep snd_seq joydev snd_seq_device crct10dif_vpmsum snd_pcm mc ofpart ipmi_powernv ipmi_devintf powernv_flash ipmi_msghandler mtd snd_timer rtc_opal opal_prd snd i2c_opal soundcore zram ip_tables ast drm_vram_helper drm_ttm_helper ttm i2c_algo_bit drm_kms_helper syscopyarea
      [   85.988131]  sysfillrect sysimgblt fb_sys_fops cec drm vmx_crypto crc32c_vpmsum tg3 i2c_core drm_panel_orientation_quirks nvme nvme_core fuse
      [   85.988148] CPU: 0 PID: 177 Comm: eehd Tainted: G        W         5.10.21-200.4kpagesize.fc33.ppc64le #1
      [   85.988150] NIP:  c00000000004b778 LR: c00000000004b710 CTR: c00000000004ce90
      [   85.988152] REGS: c00000000d14f840 TRAP: 0700   Tainted: G        W          (5.10.21-200.4kpagesize.fc33.ppc64le)
      [   85.988153] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28002842  XER: 00000000
      [   85.988166] CFAR: c00000000004b7b0 IRQMASK: 0 
                     GPR00: c00000000004cee8 c00000000d14fad0 c000000002310900 0000000000000001 
                     GPR04: c000000003ec9e70 c000000003ec9e70 0000000028008844 0000000000000100 
                     GPR08: c00000000d7d4068 0000000000000000 0000000000000008 0000000000000000 
                     GPR12: c00000000004ce90 c0000000024f1000 c0000000001a3be8 c00000000d04fcc0 
                     GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
                     GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000045 
                     GPR24: 0000000000000002 0000000000000000 0000000000000000 c00000000d7a1800 
                     GPR28: 5deadbeef0000100 5deadbeef0000122 c00000000d7d0000 c00000000d7d4000 
      [   85.988213] NIP [c00000000004b778] eeh_pe_tree_remove+0xb8/0x260
      [   85.988216] LR [c00000000004b710] eeh_pe_tree_remove+0x50/0x260
      [   85.988217] Call Trace:
      [   85.988219] [c00000000d14fad0] [0000000000000027] 0x27 (unreliable)
      [   85.988223] [c00000000d14fb50] [c00000000004cee8] eeh_pe_detach_dev+0x58/0xc0
      [   85.988227] [c00000000d14fb80] [c00000000004afbc] eeh_pe_traverse+0x6c/0xf0
      [   85.988230] [c00000000d14fbc0] [c00000000004fb54] eeh_reset_device+0x21c/0x2c8
      [   85.988234] [c00000000d14fc70] [c00000000004ebd0] eeh_handle_normal_event+0x7e0/0xa40
      [   85.988237] [c00000000d14fd50] [c00000000004fd18] eeh_event_handler+0x118/0x1a0
      [   85.988240] [c00000000d14fdb0] [c0000000001a3dc4] kthread+0x1e4/0x1f0
      [   85.988244] [c00000000d14fe20] [c00000000000d4f0] ret_from_kernel_thread+0x5c/0x6c
      [   85.988246] Instruction dump:
      [   85.988248] 67bdf000 639c0100 63bd0122 fb9e0070 fbbe0078 e95f0002 ebdf0038 71490002 
      [   85.988258] 41820038 480000c4 2c290000 40820008 <0fe00000> e93f0068 7c294040 418200dc 
      [   85.988268] ---[ end trace c7c7bf27e0e12020 ]---
      [   85.988318] pci 0001:03:00.0: [1002:73bf] type 00 class 0x030000
      [   85.988340] pci 0001:03:00.0: reg 0x10: [mem 0x6004000000000-0x600400fffffff 64bit pref]
      [   85.988352] pci 0001:03:00.0: reg 0x18: [mem 0x6004010000000-0x60040101fffff 64bit pref]
      [   85.988359] pci 0001:03:00.0: reg 0x20: [io  0x0000-0x00ff]
      [   85.988367] pci 0001:03:00.0: reg 0x24: [mem 0x600c080000000-0x600c0800fffff]
      [   85.988375] pci 0001:03:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
      [   85.988505] pci 0001:03:00.0: PME# supported from D1 D2 D3hot D3cold
      [   85.988598] pci 0001:03:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x4 link at 0001:00:00.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
      [   85.988667] pci 0001:03:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
      [   85.989164] pci 0001:03:00.1: [1002:ab28] type 00 class 0x040300
      [   85.989178] pci 0001:03:00.1: reg 0x10: [mem 0x600c080120000-0x600c080123fff]
      [   85.989290] pci 0001:03:00.1: PME# supported from D1 D2 D3hot D3cold
      [   85.989808] pci 0001:02:00.0: ASPM: current common clock configuration is inconsistent, reconfiguring
      [   85.989849] pci 0001:02:00.0: BAR 13: no space for [io  size 0x1000]
      [   85.989851] pci 0001:02:00.0: BAR 13: failed to assign [io  size 0x1000]
      [   85.989856] pci 0001:03:00.0: BAR 0: assigned [mem 0x6004000000000-0x600400fffffff 64bit pref]
      [   85.989866] pci 0001:03:00.0: BAR 2: assigned [mem 0x6004010000000-0x60040101fffff 64bit pref]
      [   85.989875] pci 0001:03:00.0: BAR 5: assigned [mem 0x600c080000000-0x600c0800fffff]
      [   85.989880] pci 0001:03:00.0: BAR 6: assigned [mem 0x600c080100000-0x600c08011ffff pref]
      [   85.989883] pci 0001:03:00.1: BAR 0: assigned [mem 0x600c080120000-0x600c080123fff]
      [   85.989887] pci 0001:03:00.0: BAR 4: no space for [io  size 0x0100]
      [   85.989890] pci 0001:03:00.0: BAR 4: failed to assign [io  size 0x0100]
      [   85.989893] pci 0001:02:00.0: PCI bridge to [bus 03]
      [   85.989898] pci 0001:02:00.0:   bridge window [mem 0x600c080000000-0x600c0807fffff]
      [   85.989902] pci 0001:02:00.0:   bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   85.989906] PCI: No. 2 try to assign unassigned res
      [   85.989910] pci 0001:02:00.0: BAR 13: no space for [io  size 0x1000]
      [   85.989912] pci 0001:02:00.0: BAR 13: failed to assign [io  size 0x1000]
      [   85.989915] pci 0001:03:00.0: BAR 4: no space for [io  size 0x0100]
      [   85.989917] pci 0001:03:00.0: BAR 4: failed to assign [io  size 0x0100]
      [   85.989920] pci 0001:02:00.0: PCI bridge to [bus 03]
      [   85.989925] pci 0001:02:00.0:   bridge window [mem 0x600c080000000-0x600c0807fffff]
      [   85.989928] pci 0001:02:00.0:   bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   85.989940] pci 0001:03:00.0: Added to existing PE#0
      [   85.989946] pci 0001:03:00.0: Adding to iommu group 1
      [   85.990081] amdgpu 0001:03:00.0: enabling device (0140 -> 0142)
      [   85.990088] [drm] initializing kernel modesetting (SIENNA_CICHLID 0x1002:0x73BF 0x1DA2:0xE438 0xC0).
      [   85.990092] amdgpu 0001:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
      [   85.990104] [drm] register mmio base: 0x80000000
      [   85.990105] [drm] register mmio size: 1048576
      [   85.990107] [drm] PCI I/O BAR is not found.
      [   85.990113] [drm] PCIE atomic ops is not supported
      [   85.992344] [drm] add ip block number 0 <nv_common>
      [   85.992346] [drm] add ip block number 1 <gmc_v10_0>
      [   85.992347] [drm] add ip block number 2 <navi10_ih>
      [   85.992349] [drm] add ip block number 3 <psp>
      [   85.992351] [drm] add ip block number 4 <smu>
      [   85.992353] [drm] add ip block number 5 <gfx_v10_0>
      [   85.992354] [drm] add ip block number 6 <sdma_v5_2>
      [   85.992356] [drm] add ip block number 7 <vcn_v3_0>
      [   85.992357] [drm] add ip block number 8 <jpeg_v3_0>
      [   86.023918] amdgpu 0001:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
      [   86.023926] amdgpu: ATOM BIOS: 113-E438XTX-UO2
      [   86.023949] [drm] VCN(0) decode is enabled in VM mode
      [   86.023952] [drm] VCN(1) decode is enabled in VM mode
      [   86.023955] [drm] VCN(0) encode is enabled in VM mode
      [   86.023958] [drm] VCN(1) encode is enabled in VM mode
      [   86.023962] [drm] JPEG decode is enabled in VM mode
      [   86.024021] amdgpu 0001:03:00.0: amdgpu: HBM ECC is not presented.
      [   86.024024] amdgpu 0001:03:00.0: amdgpu: SRAM ECC is not presented.
      [   86.024033] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
      [   86.024071] amdgpu 0001:03:00.0: BAR 2: releasing [mem 0x6004010000000-0x60040101fffff 64bit pref]
      [   86.024075] amdgpu 0001:03:00.0: BAR 0: releasing [mem 0x6004000000000-0x600400fffffff 64bit pref]
      [   86.024112] pci 0001:02:00.0: BAR 15: releasing [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   86.024116] pci 0001:01:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [   86.024120] pci 0001:00:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [   86.024132] pci 0001:00:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   86.024137] pci 0001:01:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   86.024142] pci 0001:02:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   86.024147] amdgpu 0001:03:00.0: BAR 0: assigned [mem 0x6004000000000-0x60043ffffffff 64bit pref]
      [   86.024160] amdgpu 0001:03:00.0: BAR 2: assigned [mem 0x6004400000000-0x60044001fffff 64bit pref]
      [   86.024174] pci 0001:00:00.0: PCI bridge to [bus 01-03]
      [   86.024180] pci 0001:00:00.0:   bridge window [mem 0x600c080000000-0x600c0ffefffff]
      [   86.024185] pci 0001:00:00.0:   bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [   86.024192] pci 0001:01:00.0: PCI bridge to [bus 02-03]
      [   86.024200] pci 0001:01:00.0:   bridge window [mem 0x600c080000000-0x600c0ffefffff]
      [   86.024205] pci 0001:01:00.0:   bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
      [   86.024213] pci 0001:02:00.0: PCI bridge to [bus 03]
      [   86.024219] pci 0001:02:00.0:   bridge window [mem 0x600c080000000-0x600c0807fffff]
      [   86.024225] pci 0001:02:00.0:   bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
      [   86.024240] amdgpu 0001:03:00.0: amdgpu: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
      [   86.024244] amdgpu 0001:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
      [   86.024248] [drm] Detected VRAM RAM=16368M, BAR=16384M
      [   86.024251] [drm] RAM width 256bits GDDR6
      [   86.024256] list_add corruption. prev->next should be next (c00800000067e970), but was 0000000000000000. (prev=c0000000685455b8).
      [   86.024282] ------------[ cut here ]------------
      [   86.024284] kernel BUG at lib/list_debug.c:26!
      [   86.024291] Oops: Exception in kernel mode, sig: 5 [#1]
      [   86.024296] LE PAGE_SIZE=4K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
      [   86.024300] Modules linked in: amdgpu mfd_core gpu_sched xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink rfkill ip6table_filter ip6_tables iptable_filter sunrpc snd_hda_codec_hdmi snd_hda_intel snd_usb_audio snd_intel_dspcfg snd_hda_codec at24 regmap_i2c snd_hda_core snd_usbmidi_lib snd_rawmidi snd_hwdep snd_seq joydev snd_seq_device crct10dif_vpmsum snd_pcm mc ofpart ipmi_powernv ipmi_devintf powernv_flash ipmi_msghandler mtd snd_timer rtc_opal opal_prd snd i2c_opal soundcore zram ip_tables ast drm_vram_helper drm_ttm_helper ttm i2c_algo_bit drm_kms_helper syscopyarea
      [   86.024426]  sysfillrect sysimgblt fb_sys_fops cec drm vmx_crypto crc32c_vpmsum tg3 i2c_core drm_panel_orientation_quirks nvme nvme_core fuse
      [   86.024454] CPU: 0 PID: 189 Comm: kworker/0:2 Tainted: G        W         5.10.21-200.4kpagesize.fc33.ppc64le #1
      [   86.024461] Workqueue: events work_for_cpu_fn
      [   86.024466] NIP:  c000000000a4a424 LR: c000000000a4a420 CTR: 0000000000000000
      [   86.024470] REGS: c00000000e0fb380 TRAP: 0700   Tainted: G        W          (5.10.21-200.4kpagesize.fc33.ppc64le)
      [   86.024474] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28002444  XER: 20040000
      [   86.024492] CFAR: c000000000216098 IRQMASK: 0 
                     GPR00: c000000000a4a420 c00000000e0fb610 c000000002310900 0000000000000075 
                     GPR04: ffffffffffffffea c000000002099a88 0000000000000001 0000000000000027 
                     GPR08: c000000ffc6dcf90 ffffffffffffffd8 0000000000000023 3030303038303063 
                     GPR12: 0000000000002000 c0000000024f1000 c00000000d14f7b0 c0000000686e5b78 
                     GPR16: c0000000686e5b80 c0000000686e5b70 c0000000686f6d90 c0000000686e5b90 
                     GPR20: c0000000686e5b98 c0000000686e5b88 0000000000000001 c00800000067e970 
                     GPR24: c0080000034ae4c0 0000000000000000 c00000000cf66c58 c0000000686e55d0 
                     GPR28: c00800000067d998 c0000000685455b8 c00800000067e920 c0000000686e55b8 
      [   86.024564] NIP [c000000000a4a424] __list_add_valid+0xb4/0xc0
      [   86.024569] LR [c000000000a4a420] __list_add_valid+0xb0/0xc0
      [   86.024572] Call Trace:
      [   86.024577] [c00000000e0fb610] [c000000000a4a420] __list_add_valid+0xb0/0xc0 (unreliable)
      [   86.024592] [c00000000e0fb670] [c00800000066bf80] ttm_bo_device_init+0x158/0x2d0 [ttm]
      [   86.024728] [c00000000e0fb720] [c008000002ef4214] amdgpu_ttm_init+0xcc/0x620 [amdgpu]
      [   86.024874] [c00000000e0fb830] [c0080000033326d0] amdgpu_bo_init+0x80/0xa0 [amdgpu]
      [   86.025020] [c00000000e0fb8a0] [c008000002f9e750] gmc_v10_0_sw_init+0x338/0x480 [amdgpu]
      [   86.025158] [c00000000e0fb940] [c008000002edb3f8] amdgpu_device_init+0x1670/0x1fc0 [amdgpu]
      [   86.025294] [c00000000e0fba90] [c008000002edf108] amdgpu_driver_load_kms+0x30/0x520 [amdgpu]
      [   86.025431] [c00000000e0fbb10] [c008000002ed2a84] amdgpu_pci_probe+0x18c/0x340 [amdgpu]
      [   86.025439] [c00000000e0fbbb0] [c000000000b2d978] local_pci_probe+0x68/0x110
      [   86.025446] [c00000000e0fbc30] [c000000000192ac8] work_for_cpu_fn+0x38/0x60
      [   86.025453] [c00000000e0fbc60] [c000000000197c40] process_one_work+0x300/0x5d0
      [   86.025459] [c00000000e0fbd00] [c000000000198270] worker_thread+0x360/0x780
      [   86.025465] [c00000000e0fbdb0] [c0000000001a3dc4] kthread+0x1e4/0x1f0
      [   86.025472] [c00000000e0fbe20] [c00000000000d4f0] ret_from_kernel_thread+0x5c/0x6c
      [   86.025476] Instruction dump:
      [   86.025480] f8010070 4b7cbc59 60000000 0fe00000 7c0802a6 3c62ff34 7d465378 7d244b78 
      [   86.025494] 38638bd0 f8010070 4b7cbc35 60000000 <0fe00000> 60000000 60420000 3c4c018c 
      [   86.025512] ---[ end trace c7c7bf27e0e12021 ]---
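
For readers unfamiliar with the `list_add corruption` message in the log above: it reports that the node just before the insertion point (`prev`) had a `next` pointer of NULL instead of pointing at `next`. The following is a minimal userspace sketch of that kind of sanity check, not the kernel source. The names `list_add_valid` and `list_add_tail_checked` are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal doubly linked circular list node, modelled on the kernel's layout. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* An empty list is a head that points at itself in both directions. */
static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Hypothetical stand-in for a debug check: verify that prev and next are
 * still linked to each other before inserting entry between them. */
static bool list_add_valid(const struct list_head *entry,
                           const struct list_head *prev,
                           const struct list_head *next)
{
    if (next->prev != prev) {
        fprintf(stderr, "list_add corruption. next->prev should be prev (%p), but was %p.\n",
                (const void *)prev, (const void *)next->prev);
        return false;
    }
    if (prev->next != next) {
        fprintf(stderr, "list_add corruption. prev->next should be next (%p), but was %p.\n",
                (const void *)next, (const void *)prev->next);
        return false;
    }
    if (entry == prev || entry == next) {
        fprintf(stderr, "list_add double add: entry=%p\n", (const void *)entry);
        return false;
    }
    return true;
}

/* Append entry at the tail of the list, refusing to touch a corrupted list. */
static bool list_add_tail_checked(struct list_head *entry, struct list_head *head)
{
    struct list_head *prev = head->prev;

    if (!list_add_valid(entry, prev, head))
        return false;

    head->prev = entry;
    entry->next = head;
    entry->prev = prev;
    prev->next = entry;
    return true;
}
```

Adding a node `a`, zeroing `a.next`, and then adding a second node `b` makes the check fail on its second condition, which is the same condition the log reports (`prev->next` was `0000000000000000`). The sketch only shows what the check detects; it says nothing about why the entry in this report was in that state.
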
      Edited 6 months ago by Trung Lê

          • Trung Lê @trung.le changed title from [navi2] amdgpu module crash to [navi2] amdgpu module crash on RX 6900 XT card 6 months ago

          • Trung Lê @trung.le changed title from [navi2] amdgpu module crash on RX 6900 XT card to [navi2][5.10.20] amdgpu module crash on RX 6900 XT card 6 months ago

          • Trung Lê @trung.le changed the description 6 months ago

          • Trung Lê @trung.le changed the description 6 months ago

            • Daniel Pocock @pocock · 5 months ago

              Does this crash happen immediately when loading the module (e.g. using modprobe or insmod), or only when you try to start Xorg, Wayland or something else?

            • Trung Lê @trung.le · 5 months ago

              It crashes immediately when I run modprobe.

            • Daniel Pocock @pocock · 5 months ago

              If I understand correctly, the RX 5700 worked with some kernels, then a regression broke it for users with the 64K page size while it kept working with the 4K page size. Locating the root cause of that regression might also contribute to a solution for the RX 6800/6900. Is there a bug open for the RX 5700 issue?

            • MPC7500 @MPC7500 · 5 months ago

              Not Navi, but Vega and 64KiB: #1446

            • Daniel Pocock @pocock · 5 months ago

              This forum post, referencing issue #1293, suggests that 5.6 was good and 5.7 was broken: https://forums.raptorcs.com/index.php/topic,186.msg1365.html#msg1365

            • Trung Lê @trung.le · 5 months ago

              I am confident that there must be regressions between major kernel versions.

              For example, the Fiji-based card works only on 5.6 and 5.10 (both with 64K pages); other versions, such as 5.8, 5.9, 5.11 and 5.12, do not work correctly. Navi2-based cards are not so lucky: they crash on every kernel from 5.6 to 5.12, regardless of page size.
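
Since the reports in this thread hinge on the 4K versus 64K page size, it can help to confirm which one the running kernel actually uses (the oops above was captured on a 4K build, per its `PAGE_SIZE=4K` line). A minimal sketch using the POSIX `sysconf` call; the helper names `page_size_bytes` and `print_page_size` are made up for illustration:

```c
#include <stdio.h>
#include <unistd.h>

/* Returns the page size of the running kernel in bytes, or -1 on error. */
static long page_size_bytes(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Prints the page size in KiB, e.g. for pasting into a bug report. */
static void print_page_size(void)
{
    long bytes = page_size_bytes();

    if (bytes < 0) {
        perror("sysconf");
        return;
    }
    printf("page size: %ld KiB\n", bytes / 1024);
}
```

Calling `print_page_size()` from a small `main` reports the page size of whatever kernel is currently booted, which is the value to quote alongside the kernel version when comparing working and broken setups.
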

              Edited by Trung Lê 5 months ago
          Reference: drm/amd#1519