Thank you again for your patience while we worked through this. We worked offline with the customer, and they were able to successfully deploy a RHEL 9 VM with the NVIDIA GPU Driver Extension. VM provisioning took longer than expected, but the VM ultimately worked as intended.
What worked for the customer
They started fresh with a new RHEL 9 virtual machine. The default 1 GB OS disk was expanded so the full allocated space could be used. After that, they deployed the NVIDIA GPU Driver Extension.
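For anyone who prefers the CLI to the portal, the disk expansion step could look roughly like the sketch below. The resource group, VM name, and target size are placeholders, and it assumes the VM uses a managed OS disk. The partition and filesystem inside the guest may still need to be grown afterwards, which depends on the image layout and is not covered here.

```shell
# Sketch: grow the managed OS disk of an existing VM.
# Arguments (all placeholders): resource group, VM name, new size in GB.
expand_os_disk() {
  local rg="$1" vm="$2" size_gb="$3"
  local disk

  # Look up the name of the OS disk attached to the VM.
  disk="$(az vm show --resource-group "$rg" --name "$vm" \
            --query "storageProfile.osDisk.name" --output tsv)" || return 1
  [ -n "$disk" ] || return 1

  # Resize with the VM deallocated, then start it again.
  az vm deallocate --resource-group "$rg" --name "$vm" || return 1
  az disk update --resource-group "$rg" --name "$disk" --size-gb "$size_gb" || return 1
  az vm start --resource-group "$rg" --name "$vm"
}

# Example (hypothetical names):
#   expand_os_disk my-rg my-gpu-vm 64
```

Deallocating first means the VM is offline during the resize, so plan for that along with the longer first boot described in the note below.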
Note: Longer boot times can be normal on N-series Linux VMs during GPU driver initialization. After a stop/deallocate, the first boot can take anywhere from 30 to 60 minutes. This is expected behavior while the drivers initialize.
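Because the first boot can take this long, it may help to poll rather than assume the VM is stuck. Below is a minimal sketch of a generic retry helper; the function name, the SSH user, and the VM address are all hypothetical, and elapsed time is approximated by the sleep intervals only.

```shell
# wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or roughly TIMEOUT_SECONDS of sleeping
# has accumulated. Time spent inside COMMAND itself is not counted.
wait_for() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Example (hypothetical user and address): wait up to 60 minutes, checking
# every 30 seconds, until nvidia-smi responds over SSH.
#   wait_for 3600 30 ssh -o BatchMode=yes -o ConnectTimeout=10 \
#     azureuser@<vm-ip> nvidia-smi
```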
Helpful tips for future GPU setups
If you are setting up a similar environment, these recommendations can help avoid common issues:
- Use at least a 30 GB OS disk. If the VM was created with a small default disk, expand it from the Azure Portal before proceeding.
- Disable Secure Boot and vTPM at creation time. Leaving them enabled can sometimes cause boot or driver installation hangs.
- Install the NVIDIA extension using the Azure CLI for more consistent results (replace the placeholders with your own resource group and VM name):
az vm extension set --resource-group <resource-group> --vm-name <vm-name> --name NvidiaGpuDriverLinux --publisher Microsoft.HpcCompute
- If you prefer manual installation, follow the official NVIDIA documentation carefully to avoid common errors such as nvidia-smi failing after reboot.
- Use the Azure Serial Console to monitor the VM during the first reboot. GPU driver initialization often explains what looks like a “stuck” VM.
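Putting the tips above together, a deployment could be sketched as follows. This is only an illustration: the image URN, VM size, disk size, and admin user name are assumptions rather than values from the customer case, so check them against what is available in your subscription and region. Setting the security type to Standard is used here as one way to create the VM without Secure Boot and vTPM.

```shell
# Sketch: create an N-series RHEL 9 VM with a larger OS disk and without
# Trusted Launch (so Secure Boot and vTPM are off), then add the extension.
# Image URN, size, and user name below are assumptions; adjust as needed.
create_gpu_vm() {
  local rg="$1" vm="$2"
  az vm create \
    --resource-group "$rg" \
    --name "$vm" \
    --image "RedHat:RHEL:9-lvm-gen2:latest" \
    --size "Standard_NC4as_T4_v3" \
    --os-disk-size-gb 64 \
    --security-type Standard \
    --admin-username azureuser \
    --generate-ssh-keys
}

# Install the NVIDIA GPU Driver Extension on an existing VM.
install_nvidia_extension() {
  local rg="$1" vm="$2"
  az vm extension set --resource-group "$rg" --vm-name "$vm" \
    --name NvidiaGpuDriverLinux --publisher Microsoft.HpcCompute
}

# Print the extension's provisioning state (for example "Succeeded").
extension_state() {
  local rg="$1" vm="$2"
  az vm extension show --resource-group "$rg" --vm-name "$vm" \
    --name NvidiaGpuDriverLinux --query provisioningState --output tsv
}

# Example (hypothetical names):
#   create_gpu_vm my-rg my-gpu-vm && install_nvidia_extension my-rg my-gpu-vm
#   extension_state my-rg my-gpu-vm
```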
Supported operating systems and driver guidance:
For CUDA workloads, the most reliable operating systems include:
- RHEL 8.6 through 9.5
- Ubuntu 20.04, 22.04, and 24.04 LTS
- Rocky Linux 8.4
Use the latest supported NVIDIA driver for your VM size. For NC-series VMs, driver support may be capped (for example, at 470.82.01 depending on the GPU generation). Always verify the installation after reboot using nvidia-smi.
For GRID / vGPU workloads, commonly supported operating systems include:
- RHEL 8.6 through 9.5
- Ubuntu 20.04, 22.04, and 24.04 LTS
- SLES 12 and 15 (SP2 through SP5)
For example, vGPU 17.55 (R550 branch) is supported in many scenarios, and NVads A10 v5 requires vGPU 14.1 or later.
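The post-reboot verification mentioned above can be scripted. The sketch below reads the installed driver version from nvidia-smi and compares it against a cap such as the 470.82.01 example. The helper names are hypothetical, and the comparison assumes a sort that supports version ordering with -V (GNU coreutils does).

```shell
# Print the driver version reported for the first GPU (empty if none).
installed_driver_version() {
  nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
}

# driver_within_cap INSTALLED CAP
# Succeeds when INSTALLED is less than or equal to CAP, comparing the
# dotted version numbers field by field via "sort -V".
driver_within_cap() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example:
#   v="$(installed_driver_version)"
#   if [ -n "$v" ] && driver_within_cap "$v" 470.82.01; then
#     echo "driver $v is within the cap"
#   fi
```

An empty version string usually means nvidia-smi could not talk to the driver, which is worth checking on the Serial Console before reinstalling anything.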
Additional note for RHEL 9 users
On RHEL 9 specifically, make sure to:
- Update the kernel and DKMS packages.
- Add the NVIDIA repository.
- Install the required drivers using:
yum install nvidia-driver-latest-dkms cuda-drivers
- Enable persistence mode:
nvidia-smi -pm 1
- Reboot and validate using nvidia-smi.
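The RHEL 9 steps above could be collected into one script, sketched below. Several details are assumptions rather than part of the original case: the repository URL (taken to be NVIDIA's CUDA repository file for RHEL 9), the availability of DKMS in an already enabled repository such as EPEL, and the config-manager plugin being installed. The driver package names are copied from the steps above. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Assumed repository file for RHEL 9; verify against NVIDIA's documentation.
NVIDIA_REPO_URL="${NVIDIA_REPO_URL:-https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run as root. Stops at the first failing step.
install_nvidia_rhel9() {
  run yum -y update kernel || return 1
  # DKMS is assumed to be available from an enabled repository (e.g. EPEL).
  run yum -y install kernel-devel kernel-headers dkms || return 1
  run yum config-manager --add-repo "$NVIDIA_REPO_URL" || return 1
  run yum -y install nvidia-driver-latest-dkms cuda-drivers || return 1
  run nvidia-smi -pm 1
}

# Example: preview the commands first, then run for real and reboot.
#   DRY_RUN=1; install_nvidia_rhel9
#   DRY_RUN=0; install_nvidia_rhel9 && reboot
```

After the reboot, run nvidia-smi again to validate, as in the last step of the list.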
References:
- https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup#supported-distributions-and-drivers
- https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/hpccompute-gpu-linux
The customer has since chosen an alternative solution and no longer requires the VM, so this case is being closed. We’re leaving this summary here for anyone who encounters similar GPU driver setup challenges on Azure N-series Linux VMs.