It seems that the Azure kernel update to 5.3.0-1022 on Ubuntu 18.04 breaks rocm-dkms: the amdgpu driver crashes on boot. I also tried creating a completely new 18.04 VM and following the ROCm installation instructions at
https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html (rocm-dkms 3.5.0)
The result is the same: the amdgpu driver crashes at startup, as shown in the serial console.
On the old VMs I can select 5.3.0-1020, and the machine boots normally, but every time I start the machine I have to be careful to select 5.3.0-1020 in the serial console.
Has anyone experienced the same problem, or does anyone know a better workaround or a fix? (PS: 5.3.0-1022 was rolled out this morning.)
Update: a note on selecting kernel 5.3.0-1020. It boots, but because /lib/firmware/5.3.0-1020-azure/amdgpu/vega10_gpu_info.bin is missing, the amdgpu driver does not actually load and the GPU is inaccessible. (In the serial console we see: [14.248233] amdgpu bdd6:00:00.0: Failed to load gpu_info firmware "amdgpu/vega10_gpu_info.bin")
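For anyone who wants to check their own VM, here is a minimal sketch that tests whether the gpu_info firmware file exists for a given kernel release. The function names and the optional firmware-root argument are my own (hypothetical helpers, not part of ROCm); only the path layout comes from the message above.

```shell
#!/bin/sh
# Succeeds if vega10_gpu_info.bin exists for the given kernel release.
# $1: kernel release, e.g. 5.3.0-1020-azure
# $2: firmware root (hypothetical override, defaults to /lib/firmware)
has_gpu_info_fw() {
    [ -f "${2:-/lib/firmware}/$1/amdgpu/vega10_gpu_info.bin" ]
}

# Prints one "release: present|missing" line per kernel release.
# $1: firmware root, remaining arguments: kernel releases to check
report_gpu_info_fw() {
    fw_root="$1"
    shift
    for release in "$@"; do
        if has_gpu_info_fw "$release" "$fw_root"; then
            echo "$release: present"
        else
            echo "$release: missing"
        fi
    done
}

# Example call for the two kernels discussed here:
# report_gpu_info_fw /lib/firmware 5.3.0-1020-azure 5.3.0-1022-azure
```

Running the example call on an affected VM should show which of the two kernels has the firmware file in place.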
Update 2: using the firmware from /lib/firmware/5.3.0-1022-azure/amdgpu/ on 5.3.0-1020 does not work either; it produces the same problems. In general, I am able to boot the VM by blacklisting the amdgpu driver with modprobe.blacklist=amdgpu
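In case it helps others stuck at boot, here is a rough sketch of two helpers around that blacklist workaround: one checks whether a kernel command line carries modprobe.blacklist=amdgpu, the other writes a persistent modprobe.d entry instead of passing the parameter on every boot. The function names, the file name blacklist-amdgpu.conf, and the directory argument are my own assumptions; I have not verified the persistent variant on an Azure image.

```shell
#!/bin/sh
# Succeeds if the given kernel command line blacklists amdgpu.
# $1: file to inspect (defaults to /proc/cmdline)
cmdline_blacklists_amdgpu() {
    grep -Eq '(^| )modprobe\.blacklist=([^ ]*,)?amdgpu(,| |$)' "${1:-/proc/cmdline}"
}

# Writes a persistent blacklist entry for amdgpu.
# $1: target directory (hypothetical parameter, defaults to /etc/modprobe.d)
write_amdgpu_blacklist() {
    conf_dir="${1:-/etc/modprobe.d}"
    printf 'blacklist amdgpu\n' > "$conf_dir/blacklist-amdgpu.conf"
}

# On a real system this would be run as root; on Ubuntu the initramfs
# likely needs regenerating afterwards:
# write_amdgpu_blacklist && update-initramfs -u
```

This only keeps the VM bootable; the GPU stays unusable until the driver problem itself is fixed.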
This is the main error when I try to boot with the amdgpu driver enabled:
[ 11.589569] amdgpu c7e9:00:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:217 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 11.591955] amdgpu c7e9:00:00.0: amdgpu: in page starting at address 0x000000f400100000 from client 27
[ 11.659995] amdgpu c7e9:00:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:217 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 11.659995] amdgpu c7e9:00:00.0: amdgpu: in page starting at address 0x000000f400101000 from client 27
[ 12.004451] amdgpu c7e9:00:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] ERROR ring kiq_2.1.0 test failed (-110)
[ 12.049879] [drm:amdgpu_gfx_enable_kcq [amdgpu]] ERROR KCQ enable failed
[ 12.092608] [drm:amdgpu_device_init [amdgpu]] ERROR hw_init of IP block <gfx_v9_0> failed -110
[ 12.141149] amdgpu c7e9:00:00.0: amdgpu: amdgpu_device_ip_init failed
[ 12.193592] amdgpu c7e9:00:00.0: amdgpu: Fatal error during GPU init
The same goes for CentOS 8.1: as soon as the rocm-dkms drivers are installed, they crash at the next reboot.