Azure update to Ubuntu 18.04 5.3.0-1022 amdgpu crash

Pietro Incardona 1 Reputation point
2020-06-06T13:29:12.9+00:00

It seems that the Azure update of Ubuntu 18.04 to kernel 5.3.0-1022 breaks rocm-dkms, and the amdgpu driver crashes on boot. I also tried creating a completely new 18.04 VM, following the ROCm installation instructions:

https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html (rocm-dkms3.5.0)

The result is that the amdgpu driver crashes at startup, as is visible in the serial console.

On the old VMs I can select 5.3.0-1020, with which the machine boots normally ... but every time I start the machine I have to be careful to select 5.3.0-1020 in the serial console.
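One way to avoid picking the kernel by hand at every boot is to make it the GRUB default. The Python sketch below is not from the thread: it assumes a grub.cfg in the layout Ubuntu's update-grub usually generates, and the helper name `grub_default_for` and the sample text are hypothetical. It only derives the value; putting it into GRUB_DEFAULT in /etc/default/grub and running update-grub remains a manual step, and cloud images may override that setting from files under /etc/default/grub.d.

```python
import re

# Matches the first quoted title on a GRUB `submenu` or `menuentry` line.
_ENTRY = re.compile(r"^\s*(submenu|menuentry)\s+'([^']+)'")


def grub_default_for(grub_cfg_text, release):
    """Return a GRUB_DEFAULT value for the normal boot entry of `release`.

    Assumption: versioned entries sit inside one submenu whose closing
    brace is at column 0, while nested entries close with an indented brace.
    """
    submenu = None
    for line in grub_cfg_text.splitlines():
        if line.startswith("}"):
            submenu = None  # a top-level block (e.g. the submenu) just closed
            continue
        m = _ENTRY.match(line)
        if not m:
            continue
        kind, title = m.groups()
        if kind == "submenu":
            submenu = title
        elif title.endswith("Linux " + release):
            # Recovery entries end in "(recovery mode)", so they never match.
            return f"{submenu}>{title}" if submenu else title
    return None


# Hypothetical, heavily shortened grub.cfg used only for illustration.
SAMPLE = """\
menuentry 'Ubuntu' --class ubuntu {
}
submenu 'Advanced options for Ubuntu' $menuentry_id_option 'gnulinux-advanced' {
\tmenuentry 'Ubuntu, with Linux 5.3.0-1022-azure' --class ubuntu {
\t}
\tmenuentry 'Ubuntu, with Linux 5.3.0-1020-azure' --class ubuntu {
\t}
\tmenuentry 'Ubuntu, with Linux 5.3.0-1020-azure (recovery mode)' --class ubuntu {
\t}
}
"""

print(grub_default_for(SAMPLE, "5.3.0-1020-azure"))
```

On a real VM the input would be the contents of /boot/grub/grub.cfg, and the exact menu titles there should be checked before relying on the result.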

Can someone say whether they have experienced the same, or know of a better workaround or fix? (PS: 5.3.0-1022 was rolled out this morning.)

...... A note: selecting kernel 5.3.0-1020 does boot, but because /lib/firmware/5.3.0-1020-azure/amdgpu/vega10_gpu_info.bin is missing, the GPU is in fact not loaded and is inaccessible. (In the serial console we see: [14.248233] amdgpu bdd6:00:00.0: Failed to load gpu_info firmware "amdgpu/vega10_gpu_info.bin")
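A quick way to see which installed kernels lack that firmware file is to test the per-kernel path directly. This is a minimal sketch that assumes every kernel follows the /lib/firmware/<release>/amdgpu/ layout quoted above; the function names are hypothetical.

```python
from pathlib import Path


def gpu_info_firmware(release, root="/lib/firmware", name="vega10_gpu_info.bin"):
    """Build the per-kernel firmware path in the layout quoted in the post."""
    return Path(root) / release / "amdgpu" / name


def kernels_missing_firmware(releases, root="/lib/firmware"):
    """Return the kernel releases whose gpu_info firmware file is absent."""
    return [r for r in releases if not gpu_info_firmware(r, root).is_file()]


if __name__ == "__main__":
    # The two kernel releases discussed in this thread.
    for rel in kernels_missing_firmware(["5.3.0-1020-azure", "5.3.0-1022-azure"]):
        print(f"{rel}: missing {gpu_info_firmware(rel)}")
```

A missing file only explains the "Failed to load gpu_info firmware" message; it says nothing about the page faults seen on 5.3.0-1022.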

.... More: trying to use the /lib/firmware/5.3.0-1022-azure/amdgpu/ firmware on 5.3.0-1020 does not work either; it produces the same problems. In general I am able to boot the VM by blacklisting the amdgpu driver with modprobe.blacklist=amdgpu.

This is the main error when I try to boot with the amdgpu driver enabled:

[ 11.589569] amdgpu c7e9:00:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:217 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 11.591955] amdgpu c7e9:00:00.0: amdgpu: in page starting at address 0x000000f400100000 from client 27
[ 11.659995] amdgpu c7e9:00:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:217 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 11.659995] amdgpu c7e9:00:00.0: amdgpu: in page starting at address 0x000000f400101000 from client 27
[ 12.004451] amdgpu c7e9:00:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] ERROR ring kiq_2.1.0 test failed (-110)
[ 12.049879] [drm:amdgpu_gfx_enable_kcq [amdgpu]] ERROR KCQ enable failed
[ 12.092608] [drm:amdgpu_device_init [amdgpu]] ERROR hw_init of IP block <gfx_v9_0> failed -110
[ 12.141149] amdgpu c7e9:00:00.0: amdgpu: amdgpu_device_ip_init failed
[ 12.193592] amdgpu c7e9:00:00.0: amdgpu: Fatal error during GPU init
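When comparing kernels or distributions, it can help to pull only the failing amdgpu lines out of a saved serial-console log. The sketch below assumes the "[ seconds] message" line format shown in the excerpt; the marker list and the function name are illustrative choices, not an official classification of amdgpu messages.

```python
import re

# "[ 12.004451] message": seconds since boot, then the message text.
_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
# Lower-case substrings treated as failure markers (an illustrative list).
_MARKERS = ("error", "page fault", "failed")


def amdgpu_failures(log_text):
    """Return (timestamp, message) pairs for amdgpu lines reporting a failure."""
    hits = []
    for line in log_text.splitlines():
        m = _LINE.match(line.strip())
        if not m:
            continue
        stamp, msg = float(m.group(1)), m.group(2)
        lowered = msg.lower()
        if "amdgpu" in lowered and any(k in lowered for k in _MARKERS):
            hits.append((stamp, msg))
    return hits


# Three lines copied from the excerpt above.
SAMPLE_LOG = """\
[ 11.591955] amdgpu c7e9:00:00.0: amdgpu: in page starting at address 0x000000f400100000 from client 27
[ 12.004451] amdgpu c7e9:00:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] ERROR ring kiq_2.1.0 test failed (-110)
[ 12.193592] amdgpu c7e9:00:00.0: amdgpu: Fatal error during GPU init
"""

for stamp, msg in amdgpu_failures(SAMPLE_LOG):
    print(f"{stamp:10.6f}  {msg}")
```

The same function can be fed the output of dmesg on a VM that boots, which makes it easier to diff the failure sequence between 5.3.0-1020 and 5.3.0-1022.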

The same goes for CentOS 8.1: as soon as the rocm-dkms drivers are installed, they crash at the next reboot.

Azure Virtual Machines
An Azure service that is used to provision Windows and Linux virtual machines.

4 answers

  1. Ronen Ariely 14,646 Reputation points MVP
    2020-06-06T19:12:08.75+00:00

    Good day,

    I also tried to create a completely new 18.04 VM, following the ROCm installation instructions

    I think that this makes no sense, and I must assume that you did not know that there are templates for the newer versions.

    Instead of creating a VM with the old Ubuntu 18 and then fighting the upgrade procedure, you should select the new Ubuntu 20.

    1) Start creating a new virtual machine and click on "Browse all public and private images":

    https://portal.azure.com/#create/Microsoft.VirtualMachine

    2) In the search box enter: ubuntu 20.04

    And... voilà... you can find the free Ubuntu Server 20.04 LTS and the paid Ubuntu Pro 20.04 LTS.

    This is a much, much better option than creating an old version and upgrading it :-)

    By the way, you can get this recommendation from the official Ubuntu site as well. The Ubuntu website has direct links to install these templates in Azure: https://ubuntu.com/azure

  2. Pietro Incardona 1 Reputation point
    2020-06-06T20:46:45.523+00:00

    Thanks for the answer.

    I think that this makes no sense, and I must assume that you did not know that there are templates for the newer versions.

    No, I did not know ... now I know.

    Instead of creating a VM with the old Ubuntu 18 and then fighting the upgrade procedure, you should select the new Ubuntu 20.

    Hmm ... in reality I do have and do not want to upgrade to 20.04: ROCm supports 18.04 (as you can read in the instructions) but does not support 20.04, as you can also see here:
    https://github.com/RadeonOpenCompute/ROCm/issues/1074
    Out of curiosity I gave it a try ... and it does not work (the same problem as the others).


  3. Pietro Incardona 1 Reputation point
    2020-06-07T11:33:49.957+00:00

    1) Ubuntu Server 19.10 and 19.04 fully support ROCm according to what I found, and there are templates for those versions in Azure as well. Please check whether Ubuntu 19.10 covers your needs.

    I do not know what you found, but I always refer to the "official" support, and the channels for me are here:

    https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html

    and here

    https://github.com/RadeonOpenCompute/ROCm

    And there is no mention of 19.04, 19.10, or 20.04.

    2) According to the link which you provided, there is a preview version which should work partially on Ubuntu 20.04.

    Not so sure what you mean by "there is a preview".

    I recommend moving the discussion to forums which focus on ROCm instead of virtual machines, as the issue seems related to ROCm and the Ubuntu version and not to the fact that it is on a virtual machine. Right?

    It is a good question and I do not know the answer ... What I know is that I also tried other VMs considered supported by the official channels (SLES 15 and CentOS 8.1) and none of them work. The amdgpu driver crashes at startup in the same manner. Is it the virtualization? ROCm? AMD? I do not know. What I know is that, starting from the day I posted, I have not been able to make the AMD GPU accessible or working on Azure.

    I will also try to post in a ROCm forum; thanks for the suggestion.

  4. Ronen Ariely 14,646 Reputation points MVP
    2020-06-07T16:17:07.257+00:00

    This is NOT an answer! I am writing it here in the answer box instead of as a comment, since comments have a 1000-character limit, which makes me split each message, and that drives me crazy.

    Hi,

    I do not know what you found, but I always refer to the "official" support, and the channels for me are here

    I found many posts and tutorials on the topic which show how to use it on version 19.10. Google is a great tool for this task. I have just made another search, and I found this, for example, which explicitly says that it is working perfectly for him on 19.10 but not on 20.04. I also see this recording on how to install on version 19.10 ... and these are the first two links which Google returned, out of many which you can check.

    As said, I am not an expert on using ROCm, and the links which you provided should obviously be taken into consideration. I would not suggest using a version which is not officially supported in production before fully testing it. I did not find any mention on the official site, but at the same time I did not see any official message that it is not supported on 19.10, so this is up to you to choose. I am just trying to help :-)

    Note: personally, if I needed it, I would probably test it on 19.10 according to what I see online, but would not use it in production before fully testing it.

    Not so sure what you mean by "there is a preview".

    From the GitHub discussion I understand from Goddard that "Official support for Ubuntu 20.04 hasn't landed yet", which means it is a preview test for version 20.04; this is at least what I meant :-)

    It is a good question and I do not know the answer

    You are right, it is a GREAT question 👍

    I simply think that you will find many more experts to help you in forums that focus on ROCm.
