Linux VM will not boot

TMA Cloud 20 Reputation points
2026-02-13T18:53:24.81+00:00

After a Stop, the VM never boots again. It stays stuck in the Starting state for hours.

It worked fine until we installed the NVIDIA drivers. The VM has GPUs that we need in order to run AI LLMs.

Azure Virtual Machines

An Azure service that is used to provision Windows and Linux virtual machines.

Answer accepted by question author
  1. Ankit Yadav 12,205 Reputation points Microsoft External Staff Moderator
    2026-02-17T03:35:45.1033333+00:00

    Thank you again for your patience while we worked through this. We worked offline with the customer, and they were able to successfully deploy a RHEL 9 VM with the NVIDIA GPU Driver Extension. VM provisioning took longer than expected, but the VM ultimately worked as intended.

    What worked for the customer

    They started fresh with a new RHEL 9 virtual machine. The default 1 GB OS disk was expanded so the full allocated space could be used. After that, they deployed the NVIDIA GPU Driver Extension.

    Note: Longer boot times can be normal on N-series Linux VMs during GPU driver initialization. After a stop/deallocate, the first boot can take anywhere from 30 to 60 minutes. This is expected behavior while the drivers initialize.
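Because a first boot this long is hard to tell apart from a hang, it can help to poll the power state with a timeout instead of watching the portal. The following is a minimal sketch, not something taken from the case itself: the function name `wait_for_vm_running`, the 60-minute default timeout, and the 60-second polling interval are illustrative assumptions. It relies on `az vm show --show-details`, which reports a `powerState` field.

```shell
# Sketch: poll the VM power state until it reports "VM running" or a timeout expires.
# Arguments: resource group, VM name, timeout in seconds (assumed default 3600),
# polling interval in seconds (assumed default 60).
wait_for_vm_running() {
  local rg="$1" vm="$2" timeout_s="${3:-3600}" interval_s="${4:-60}"
  local waited=0 state
  while [ "$waited" -lt "$timeout_s" ]; do
    # --show-details adds the powerState field to the VM description
    state="$(az vm show --show-details --resource-group "$rg" --name "$vm" \
      --query powerState --output tsv)"
    echo "power state after ${waited}s: ${state:-unknown}"
    if [ "$state" = "VM running" ]; then
      return 0
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  # Timed out without ever seeing "VM running"
  return 1
}
```

A call such as `wait_for_vm_running MyResourceGroup MyVM` (both names are placeholders) returns 0 once the platform reports the VM as running and 1 on timeout. A running power state only reflects what the platform reports; it does not prove that the GPU driver has finished initializing inside the guest.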

    Helpful tips for future GPU setups

    If you are setting up a similar environment, these recommendations can help avoid common issues:

    • Use at least a 30 GB OS disk. If the VM was created with a small default disk, expand it from the Azure Portal before proceeding.
    • Disable Secure Boot and vTPM at creation time. Leaving them enabled can sometimes cause boot or driver installation hangs.
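    • Install the NVIDIA extension using the Azure CLI for more consistent results (MyResourceGroup and MyVM are placeholders for your own resource group and VM name):
        az vm extension set --resource-group MyResourceGroup --vm-name MyVM --name NvidiaGpuDriverLinux --publisher Microsoft.HpcCompute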
      
    • If you prefer manual installation, follow the official NVIDIA documentation carefully to avoid common errors such as nvidia-smi failing after reboot.
    • Use the Azure Serial Console to monitor the VM during the first reboot. GPU driver initialization often explains what looks like a “stuck” VM.
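The first tip mentions expanding the OS disk from the Azure Portal. The same change can be sketched with the Azure CLI, under a few assumptions: the OS disk is a managed disk in the same resource group as the VM, and the function name `expand_os_disk` and the 30 GB default are illustrative. The VM is deallocated before the resize and started again afterwards.

```shell
# Sketch: grow the OS disk of a VM to a target size (assumed default: 30 GB).
# Assumes a managed OS disk that lives in the same resource group as the VM.
expand_os_disk() {
  local rg="$1" vm="$2" size_gb="${3:-30}"
  local disk_name
  # Look up the name of the managed OS disk attached to the VM
  disk_name="$(az vm show --resource-group "$rg" --name "$vm" \
    --query "storageProfile.osDisk.name" --output tsv)" || return 1
  if [ -z "$disk_name" ]; then
    echo "could not resolve the OS disk name" >&2
    return 1
  fi
  # Deallocate the VM before resizing its OS disk
  az vm deallocate --resource-group "$rg" --name "$vm" || return 1
  az disk update --resource-group "$rg" --name "$disk_name" --size-gb "$size_gb" || return 1
  az vm start --resource-group "$rg" --name "$vm"
}
```

For example, `expand_os_disk MyResourceGroup MyVM 64` (placeholder names) would resize the disk to 64 GB. Resizing the managed disk does not necessarily grow the partition or filesystem inside the guest, so that may still need to be done separately before the extra space is usable.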

    Supported operating systems and driver guidance:

    For CUDA workloads, the most reliable operating systems include:

    • RHEL 8.6 through 9.5
    • Ubuntu 20.04, 22.04, and 24.04 LTS
    • Rocky Linux 8.4

    Driver guidance for CUDA workloads:

    • Use the latest supported NVIDIA driver for your VM size.
    • For NC-series VMs, driver support may be capped (for example, at 470.82.01 depending on the GPU generation). Always verify the installation after reboot using nvidia-smi.

    For GRID / vGPU workloads, commonly supported operating systems include:

    • RHEL 8.6 through 9.5
    • Ubuntu 20.04, 22.04, and 24.04 LTS
    • SLES 12 and 15 (SP2 through SP5)

    For example, vGPU 17.55 (R550 branch) is supported in many scenarios, and NVads A10 v5 requires vGPU 14.1 or later.

    Additional note for RHEL 9 users

    On RHEL 9 specifically, make sure to:

    • Update the kernel and DKMS packages, add the NVIDIA repository, and then install the required drivers using:
        yum install nvidia-driver-latest-dkms cuda-drivers
      
    • Enable persistence mode:
        nvidia-smi -pm 1
      
    • Reboot and validate using nvidia-smi
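The final validation step can be scripted so that a missing or failing driver is reported explicitly after the reboot. This is a minimal sketch: the function name `check_gpu_driver` and the return codes are illustrative choices, and it only relies on the `--query-gpu` and `--format` options of nvidia-smi.

```shell
# Sketch: report the GPU name and driver version, or fail with a distinct code.
# Return codes (illustrative): 0 = driver responds, 1 = nvidia-smi failed,
# 2 = nvidia-smi is not installed or not on PATH.
check_gpu_driver() {
  local out
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: the driver is probably not installed" >&2
    return 2
  fi
  if ! out="$(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader 2>&1)"; then
    echo "nvidia-smi failed: $out" >&2
    return 1
  fi
  # One line per GPU, in the form "<name>, <driver version>"
  echo "$out"
}
```

Running `check_gpu_driver` on the VM after the reboot prints one line per detected GPU. A return code of 1 matches the "nvidia-smi failing after reboot" symptom mentioned earlier in this answer.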

    The customer has since chosen an alternative solution and no longer requires the VM, so this case is being closed. We’re leaving this summary here for anyone who encounters similar GPU driver setup challenges on Azure N-series Linux VMs.


2 additional answers

  1. Manish Deshpande 4,450 Reputation points Microsoft External Staff Moderator
    2026-02-13T19:01:13.0533333+00:00

    Hello TMA Cloud

    If your Linux virtual machine (VM) in Azure encounters a boot or disk error, you may need to perform mitigation on the disk itself. A common example would be a failed application update that prevents the VM from being able to boot successfully.

    You can now use Azure Virtual Machine repair commands to change the OS disk for a VM, and you no longer need to delete and recreate the VM.

    Follow these steps to troubleshoot the VM issue:

    1. Launch Azure Cloud Shell
    2. Run az extension add/update
    3. Run az vm repair create
    4. Run az vm repair run, or perform mitigation steps.
    5. Run az vm repair restore

    To view all available VM repair commands and parameters, see az vm repair.

    To run the commands, you need a role that can create the following types of resources in the subscription:

    • Resource Groups
    • Virtual Machines
    • Resource Tags
    • Virtual Networks
    • Network Security Groups
    • Network Interfaces
    • Disks
    • Public IP Addresses (Optional)

    Steps to perform:

    1. Launch Azure Cloud Shell.

    If this is the first time you have used the az vm repair commands, add the vm-repair CLI extension.

    az extension add -n vm-repair
    

    If you have previously used the az vm repair commands, apply any updates to the vm-repair extension.

    az extension update -n vm-repair
    
    
    

    2. Run az vm repair create. This command makes a copy of the OS disk of the nonfunctional VM, creates a repair VM in a new resource group, and attaches the disk copy to it.

    The repair VM has the same size and region as the original VM. The resource group and VM name that you pass in every step are those of the nonfunctional VM. If your VM uses Azure Disk Encryption, add --unlock-encrypted-vm so that the encrypted disk is accessible on the repair VM.

    az vm repair create -g MyResourceGroup -n myVM --repair-username 
    

    3. Run az vm repair run. This command runs the specified repair script on the attached disk via the repair VM. If the troubleshooting guide you are using specifies a run-id, use it here. Otherwise, use az vm repair list-scripts to see the available repair scripts.

    az vm repair run -g MyResourceGroup -n MyVM --run-on-repair --run-id lin-hello-world --verbose
    
    

    4. Run az vm repair restore. This command swaps the repaired OS disk with the original OS disk of the VM.

    az vm repair restore -g MyResourceGroup -n MyVM --verbose
    
    

    5. Verify and enable boot diagnostics.

    az vm boot-diagnostics enable --name myVMDeployed --resource-group myResourceGroup --storage https://mystor.blob.core.windows.net/
    
    
    

    Links shared with the user in private messages:

    https://learn.microsoft.com/en-us/azure/virtual-machines/windows/n-series-driver-setup

    I suggested that the user increase the OS disk size, because the extension was stuck loading on the 1 GB disk, which did not have enough free space.

    Link for reference:
    https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/linux/repair-linux-vm-using-azure-virtual-machine-repair-commands#code-try-2

    Linux VM repair tool (ALAR)
    https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/linux/repair-linux-vm-using-ALAR

    Thanks,
    Manish Deshpande.


  2. Q&A Assist
    2026-02-13T18:53:32.92+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    If your Linux VM is stuck in the Starting state after installing NVIDIA drivers, it could be due to issues with the driver installation or configuration. Here are some steps you can take to troubleshoot and potentially resolve the issue:

    1. Access the Serial Console: Use the Azure Serial Console to check for any error messages during the boot process. This can provide insights into what might be causing the boot failure.
    2. Boot into a Previous Kernel: If the issue started after the driver installation, you may want to try booting the VM using a previous kernel version. This can be done through the Azure Serial Console.
    3. Repair the VM: You can use Azure VM repair commands to create a repair VM and attach the OS disk of the affected VM. This allows you to modify the OS disk to fix boot issues.
    4. Check the NVIDIA Driver Installation: Ensure that the NVIDIA drivers were installed correctly. If there were any errors during installation, you may need to reinstall the drivers. You can also refer to the manual installation instructions for the drivers if needed.
    5. Reboot the VM: After making any changes, reboot the VM to see if it resolves the issue.
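When the interactive Serial Console is inconvenient, the boot log can also be fetched with the Azure CLI and filtered for driver-related lines. The sketch below assumes that boot diagnostics is already enabled on the VM. The function name `scan_boot_log` and the search pattern are illustrative guesses at what is worth looking for, not an official list of error strings.

```shell
# Sketch: fetch the boot log and print lines that hint at driver or boot problems.
# Assumes boot diagnostics is enabled; the grep pattern is an illustrative guess.
scan_boot_log() {
  local rg="$1" vm="$2"
  local log
  log="$(az vm boot-diagnostics get-boot-log --resource-group "$rg" --name "$vm")" || {
    echo "could not fetch the boot log (is boot diagnostics enabled?)" >&2
    return 1
  }
  # Case-insensitive search for common driver and early-boot keywords
  printf '%s\n' "$log" \
    | grep -i -E 'nvidia|nouveau|dkms|dracut|kernel panic|emergency mode' \
    || echo "no matching lines found in the boot log"
}
```

For example, `scan_boot_log MyResourceGroup MyVM` (placeholder names) prints only the matching lines, which can show whether the boot stalls around driver loading or fails earlier.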

    If these steps do not resolve the problem, you may need to consider restoring from a backup or snapshot if available.


