AKS node pool upgrade failure on pool with Standard_NC6s_v3 GPU machines

John Macnamara 0 Reputation points
2024-06-24T19:00:14.4433333+00:00

I received the following error when attempting to run a node pool upgrade via the Azure CLI on an existing pool using the Standard_NC6s_v3 machine type:

(OperationNotAllowed) Code="OperationNotAllowed" Message="The 'Placement' option override for the ephemeral OS disk is not supported. Please upgrade the VM Size with desired placement option for provisioning the Ephemeral OS disk."
Code: OperationNotAllowed

I receive the same error when stopping and starting the node pool via the CLI/UI.

This node pool uses the Ephemeral OS disk type and is running Kubernetes 1.27.13.

This issue has arisen in the last 2 weeks and does not occur when running upgrades on other GPU-based node pools (Standard_NC24ads_A100_v4) or on non-GPU node pools.

Azure Kubernetes Service (AKS)

1 answer

  1. Ganeshkumar R 590 Reputation points
    2024-06-24T19:36:47.0333333+00:00

    The error you're encountering, (OperationNotAllowed) Code="OperationNotAllowed" Message="The 'Placement' option override for the ephemeral OS disk is not supported", points to a mismatch between the VM size and the ephemeral OS disk placement configured for your Azure Kubernetes Service (AKS) node pool.

    Background

    Ephemeral OS disks are ideal for stateless workloads where you need fast, temporary storage. However, not all VM sizes support ephemeral OS disks with certain placement options.

    Key Points

    1. Ephemeral OS Disks: These disks are stored on the VM's local storage, which gives lower read/write latency, but they are not supported by every VM size in every configuration.
    2. VM Size Compatibility: Not all VM sizes support ephemeral OS disks with all placement options. This seems to be the root cause of your issue.
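
One way to see what Azure reports for a given VM size is to query its SKU capabilities. The helper below is a minimal sketch, assuming the Azure CLI is installed and logged in. The function name sku_ephemeral_caps is hypothetical, and the capability names in the filter (CachedDiskBytes, MaxResourceVolumeMB, and anything containing "Ephemeral") are assumptions about what the output contains; if nothing comes back, inspect the unfiltered capabilities list instead.

```shell
# Print the ephemeral-OS-disk-related capabilities Azure reports for a VM size.
# Usage: sku_ephemeral_caps <location> <vm-size>
# Assumption: the capability names in the filter are illustrative and may
# differ by region or API version.
sku_ephemeral_caps() {
  location="$1"
  vm_size="$2"
  # --size accepts partial names, so the JMESPath filter pins the exact SKU.
  az vm list-skus \
    --location "$location" \
    --size "$vm_size" \
    --resource-type virtualMachines \
    --query "[?name=='$vm_size'].capabilities[] | [?contains(name, 'Ephemeral') || name=='CachedDiskBytes' || name=='MaxResourceVolumeMB'].[name, value]" \
    --output tsv
}
```

Running sku_ephemeral_caps <Location> Standard_NC6s_v3 and then the same call for Standard_NC24ads_A100_v4 gives a side-by-side view of the size that fails and a size that works.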

    Solutions

    1. Check VM Size Compatibility
      • Ensure that the VM size Standard_NC6s_v3 supports ephemeral OS disks with your desired placement option. You can refer to the official Azure VM documentation for compatibility details.
    2. Upgrade VM Size
      • If Standard_NC6s_v3 does not support ephemeral OS disks with the desired placement, you might need to switch to a different VM size that does. For instance, Standard_NC24ads_A100_v4 seems to work as per your description.

    Steps to Troubleshoot and Resolve

    Step 1: Verify Ephemeral OS Disk Support

    1. Check Current Configuration:
      • Use the Azure CLI to check the current VM size and ephemeral OS disk configuration:
        
             az aks nodepool show --resource-group <ResourceGroupName> --cluster-name <AKSClusterName> --name <NodePoolName>
        
        
    2. Verify Ephemeral OS Disk Support:
      • Refer to the official documentation to verify if Standard_NC6s_v3 supports ephemeral OS disks with your desired placement option.
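
The full output of az aks nodepool show is long, so it can help to narrow the configuration check in this step to the fields that matter for this error. This is a sketch only: the function name nodepool_disk_summary is hypothetical, and the property names in the query (vmSize, osDiskType, osDiskSizeGb, orchestratorVersion, powerState.code) are assumed to match the JSON your CLI version returns. If a column comes back empty, check the raw JSON for the actual name.

```shell
# Summarize the node pool fields relevant to the ephemeral OS disk error.
# Usage: nodepool_disk_summary <resource-group> <cluster-name> <node-pool-name>
# Assumption: property names follow the agent pool JSON as returned by the CLI.
nodepool_disk_summary() {
  az aks nodepool show \
    --resource-group "$1" \
    --cluster-name "$2" \
    --name "$3" \
    --query "{vmSize: vmSize, osDiskType: osDiskType, osDiskSizeGb: osDiskSizeGb, kubernetesVersion: orchestratorVersion, powerState: powerState.code}" \
    --output table
}
```

Comparing osDiskSizeGb with the local disk capacity the VM size offers is a reasonable first check, since an ephemeral OS disk has to fit on the VM's local storage.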

    Step 2: Upgrade VM Size

    1. Change VM Size:
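      • If Standard_NC6s_v3 is not supported, move to a supported VM size. AKS has not supported changing the VM size of an existing node pool in place, and az aks nodepool update does not take a --node-vm-size argument, so add a new node pool with the supported size and move workloads to it:
        
             az aks nodepool add \
               --resource-group <ResourceGroupName> \
               --cluster-name <AKSClusterName> \
               --name <NewNodePoolName> \
               --node-vm-size Standard_NC24ads_A100_v4 \
               --node-osdisk-type Ephemeral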
        
        
    2. Validate the Upgrade:
      • Ensure the node pool upgrade is successful and verify the configuration:
        
             az aks nodepool show --resource-group <ResourceGroupName> --cluster-name <AKSClusterName> --name <NodePoolName>
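
Moving to a different VM size generally means workloads end up on new nodes, so the old nodes need to be emptied once the replacement capacity is ready. A minimal sketch of that step follows, assuming kubectl is pointed at the cluster and that nodes carry the agentpool=<NodePoolName> label that AKS normally applies. The function name drain_nodepool is hypothetical.

```shell
# Cordon and drain every node in an AKS node pool so pods reschedule elsewhere.
# Usage: drain_nodepool <old-node-pool-name>
# Assumption: nodes are labeled agentpool=<node pool name>.
drain_nodepool() {
  old_pool="$1"
  # Stop new pods from landing on the old nodes, then evict what is running.
  kubectl cordon --selector "agentpool=$old_pool" &&
  kubectl drain --selector "agentpool=$old_pool" \
    --ignore-daemonsets \
    --delete-emptydir-data
}
```

Note that --delete-emptydir-data discards any data pods keep in emptyDir volumes, so run this only for workloads that tolerate it, and confirm the pods have rescheduled before removing the old node pool.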
        
        

    Step 3: Reconfigure Node Pool

    1. Stop and Start Node Pool:
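      • If you still encounter issues, try stopping and starting the node pool to reset the configuration. The question reports the same OperationNotAllowed error on stop and start, so this is likely to help only after the disk placement mismatch has been resolved: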
        
             az aks nodepool stop --resource-group <ResourceGroupName> --cluster-name <AKSClusterName> --name <NodePoolName>
        
             az aks nodepool start --resource-group <ResourceGroupName> --cluster-name <AKSClusterName> --name <NodePoolName>
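
Since the question reports that stop and start fail with the same error, it can be useful to tell that specific failure apart from any other. The wrapper below is a sketch under that assumption: it runs the start command from this step and matches the error code quoted in the question. The function name start_nodepool_checked and its status messages are hypothetical.

```shell
# Start a node pool and report whether the placement error from the question recurs.
# Usage: start_nodepool_checked <resource-group> <cluster-name> <node-pool-name>
start_nodepool_checked() {
  # Capture stdout and stderr so the error text can be inspected.
  if output=$(az aks nodepool start \
      --resource-group "$1" --cluster-name "$2" --name "$3" 2>&1); then
    echo "started"
  else
    case "$output" in
      *OperationNotAllowed*)
        echo "blocked: OperationNotAllowed (ephemeral OS disk placement)" ;;
      *)
        echo "failed: $output" ;;
    esac
    return 1
  fi
}
```

A "blocked" result means the placement problem is still in effect and a restart will not clear it on its own.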
        
        

    Final Considerations

    • Backup and Downtime: Ensure you have backups and are aware of possible downtime during the upgrade process.
    • Review Documentation: Always refer to the latest Azure documentation for any updates on VM sizes and ephemeral OS disk support.
    • Azure Support: If the issue persists, consider reaching out to Azure Support for personalized assistance.

    By following these steps, you should be able to resolve the issue and successfully upgrade your node pool in AKS.
