Deploy ML model to Kubernetes + overwrite previous endpoint

Question

I'm building a CI/CD pipeline in Azure DevOps for the deployment of my Machine Learning model to Azure Kubernetes Service. I have the following task in my YAML pipeline file (replaced some of the values with '...'):

- task: AzureCLI@1
  displayName: "Deploy to AKS"
  inputs:
    azureSubscription: '...'
    scriptLocation: inlineScript
    workingDirectory: $(Build.SourcesDirectory)/score
    inlineScript: |
      set -e # fail on error

      az ml model deploy --name 'aks-deploy-test' --model '$(MODEL_NAME):$(get_model.MODEL_VERSION)' \
      --compute-target $(AKS_COMPUTE_NAME) \
      --ic inference_config.yml \
      --dc deployment_config_aks.yml \
      -g ... --workspace-name ... \
      --overwrite -v

When I run the pipeline the first time, it successfully deployed the ML model and I can see the Endpoint in the Azure ML workspace. However, when I try to run the pipeline a second time (to deploy a newer version of the model), I get the error:

Error:
{
  "code": "KubernetesError",
  "statusCode": 400,
  "message": "Kubernetes Deployment Error",
  "details": [
    {
      "code": "Unschedulable",
      "message": "0/6 nodes are available: 4 Insufficient cpu, 6 Insufficient memory."
    },
    {
      "code": "DeploymentFailed",
      "message": "Couldn't schedule because the kubernetes cluster didn't have available resources after trying for 00:05:00.
You can address this error by either adding more nodes, changing the SKU of your nodes or changing the resource requirements of your service.
Please refer to https://aka.ms/debugimage#container-cannot-be-scheduled for more information."
    }
  ]
}

Isn't the --overwrite option in the az ml model deploy command supposed to completely overwrite the current deployment of the model? If so, why am I still getting this error, or is there a better way to deploy a newer version of the ML model to the same AKS cluster?

Answer

@Axel Vandevelde Thanks for the question. The error message literally indicates the issue, none of the nodes have CPUs available.
Can you please add more details about the sku's. If Possible please share the link to the tutorial/documentation you were following.

Enable AppInsights flag will start flowing the logs (for Application monitoring).
https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-enable-app-insights

For cluster monitoring you can use log analytics via AKS.
https://learn.microsoft.com/en-us/azure/aks/view-master-logs

Share via

Deploy ML model to Kubernetes + overwrite previous endpoint

1 answer

Your answer