ACI With GPU support fails

Egbertn 6 Reputation points
2022-09-12T07:52:53.8+00:00

Hi All,

I followed the Microsoft docs on using the GPU feature in Azure Container Instances, and I use this as my base image:
FROM nvidia/cuda:11.4.2-base-ubuntu20.04 AS base

When I then deploy with
New-AzResourceGroupDeployment

the deployment fails with:

Error: failed to start container "yocontainer": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: please update your driver to a newer version, or use an earlier cuda container: unknown
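For reference, the error says that the image declares a CUDA requirement which the driver on the host does not satisfy. A rough way to picture that check is sketched below in Python. This is a simplified model only, not the actual nvidia-container-cli logic: it assumes the image asks for a minimum CUDA version and that the host driver supports CUDA up to some maximum version. The function names and the version numbers are hypothetical and chosen purely for illustration.

```python
import re


def parse_version(text):
    """Turn a dotted version such as '11.4' into a tuple of ints: (11, 4)."""
    return tuple(int(part) for part in text.split("."))


def cuda_requirement_met(requirement, host_max_cuda):
    """Check a 'cuda>=X.Y' style requirement against the host's CUDA version.

    Only the '>=' form is modelled here (an assumption of this sketch);
    real requirement strings may carry additional conditions.
    """
    match = re.fullmatch(r"cuda>=(\d+(?:\.\d+)*)", requirement.strip())
    if match is None:
        raise ValueError(f"unsupported requirement: {requirement!r}")
    return parse_version(host_max_cuda) >= parse_version(match.group(1))


# Hypothetical values: an image that needs CUDA 11.4 on two different hosts.
print(cuda_requirement_met("cuda>=11.4", "11.0"))  # False: driver too old for this image
print(cuda_requirement_met("cuda>=11.4", "11.4"))  # True: requirement satisfied
```

In this model, the two remedies named in the error message correspond to raising the host value (a newer driver) or lowering the required value (an earlier CUDA image).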

Please advise. Trying an earlier CUDA container makes no sense, because Microsoft explicitly recommends the version I tried.

Of course, I also tried other versions (all CUDA 11), with the same result.

Below is my ARM template. For the GPU I use a K80.

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "availabilityZones": {
      "type": "array"
    },
    "location": {
      "type": "string"
    },
    "containerName": {
      "type": "string"
    },
    "ASPNETCORE_FORWARDEDHEADERS_ENABLED": {
      "type": "bool"
    },
    "AllowedHosts": {
      "type": "string"
    },
    "imageType": {
      "type": "string",
      "allowedValues": [
        "Public",
        "Private"
      ]
    },
    "imageName": {
      "type": "string"
    },
    "osType": {
      "type": "string",
      "allowedValues": [
        "Linux",
        "Windows"
      ]
    },
    "numberCpuCores": {
      "type": "string"
    },
    "memory": {
      "type": "string"
    },
    "restartPolicy": {
      "type": "string",
      "allowedValues": [
        "OnFailure",
        "Always",
        "Never"
      ]
    },
    "imageRegistryLoginServer": {
      "type": "string"
    },
    "imageUsername": {
      "type": "string"
    },
    "imagePassword": {
      "type": "securestring"
    },
    "gpuSku": {
      "type": "string"
    },
    "numberGpuCores": {
      "type": "string"
    },
    "ipAddressType": {
      "type": "string"
    },
    "ports": {
      "type": "array"
    },
    "storageKey": {
      "type": "securestring"
    },
    "ccureimage_cert_pw": {
      "type": "securestring"
    }
  },
  "resources": [
    {
      "location": "westeurope",
      "name": "[parameters('containerName')]",
      "type": "Microsoft.ContainerInstance/containerGroups",
      "apiVersion": "2021-10-01",
      "zones": "[parameters('availabilityZones')]",
      "properties": {
        "containers": [
          {
            "name": "[parameters('containerName')]",
            "properties": {
              "image": "[parameters('imageName')]",
              "resources": {
                "requests": {
                  "cpu": "[float(parameters('numberCpuCores'))]",
                  "memoryInGB": "[float(parameters('memory'))]",
                  "gpu": {
                    "count": "[int(parameters('numberGpuCores'))]",
                    "sku": "[parameters('gpuSku')]"
                  }
                }
              },
              "ports": "[parameters('ports')]",
              "volumeMounts": [],
              "environmentVariables": []
            }
          }
        ],
        "osType": "[parameters('osType')]",
        "imageRegistryCredentials": [
          {
            "server": "[parameters('imageRegistryLoginServer')]",
            "username": "[parameters('imageUsername')]",
            "password": "[parameters('imagePassword')]"
          }
        ],
        "ipAddress": {
          "type": "[parameters('ipAddressType')]",
          "ports": "[parameters('ports')]"
        },
        "subnetIds": []
      },
      "tags": {},
      "dependsOn": []
    }
  ]
}

Azure Container Instances

1 answer

  1. kobulloc-MSFT 23,416 Reputation points Microsoft Employee
    2022-09-13T19:09:35.423+00:00

    Hello, @Egbertn!

    How do I use GPU resources (preview) with Azure Container Instances (ACI)?
    Using GPU resources with Azure Container Instances (ACI) is still in preview, so the recommendation is to follow the documentation as closely as possible while keeping in mind that some features may not be implemented and that performance may not yet be at the level of production SLAs.

    Documentation:
    https://learn.microsoft.com/en-us/azure/container-instances/container-instances-gpu

    Following the documentation above, I was able to deploy both the V100 and the K80, although it took a couple of attempts because of availability and some intermittent execution errors/timeouts.

    az deployment group create --resource-group myResourceGroup --template-file gpudeploy.json  
    

    K80 modified ARM template:

    {  
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",  
        "contentVersion": "1.0.0.0",  
        "parameters": {  
          "containerGroupName": {  
            "type": "string",  
            "defaultValue": "gpucontainergrouprm",  
            "metadata": {  
              "description": "Container Group name."  
            }  
          }  
        },  
        "variables": {  
          "containername": "gpucontainer",  
          "containerimage": "mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu"  
        },  
        "resources": [  
          {  
            "name": "[parameters('containerGroupName')]",  
            "type": "Microsoft.ContainerInstance/containerGroups",  
            "apiVersion": "2021-09-01",  
            "location": "[resourceGroup().location]",  
            "properties": {  
              "containers": [
                {
                  "name": "[variables('containername')]",
                  "properties": {
                    "image": "[variables('containerimage')]",
                    "resources": {
                      "requests": {
                        "cpu": 4.0,
                        "memoryInGb": 12.0,
                        "gpu": {
                          "count": 1,
                          "sku": "K80"
                        }
                      }
                    }
                  }
                }
              ],
              "osType": "Linux",
              "restartPolicy": "OnFailure"
            }
          }  
        ]  
    }  
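
Before deploying a template like the one above, it can help to confirm that it parses as JSON and that the GPU request sits where the container group schema expects it. The following Python sketch is one possible way to do that, under stated assumptions: the helper name `gpu_requests` is hypothetical, the embedded template is a trimmed copy of the one above, and ARM expressions such as `[variables('containername')]` are not evaluated but returned as raw strings.

```python
import json

# Trimmed copy of the K80 template above, embedded so the sketch is self-contained.
TEMPLATE = """
{
  "resources": [
    {
      "type": "Microsoft.ContainerInstance/containerGroups",
      "properties": {
        "containers": [
          {
            "name": "[variables('containername')]",
            "properties": {
              "resources": {
                "requests": {
                  "cpu": 4.0,
                  "memoryInGb": 12.0,
                  "gpu": {"count": 1, "sku": "K80"}
                }
              }
            }
          }
        ],
        "osType": "Linux"
      }
    }
  ]
}
"""


def gpu_requests(template):
    """Return (container name, GPU count, GPU SKU) for every container that requests a GPU."""
    found = []
    for resource in template.get("resources", []):
        if resource.get("type") != "Microsoft.ContainerInstance/containerGroups":
            continue
        for container in resource.get("properties", {}).get("containers", []):
            requests = container.get("properties", {}).get("resources", {}).get("requests", {})
            gpu = requests.get("gpu")
            if gpu is not None:
                found.append((container.get("name"), gpu.get("count"), gpu.get("sku")))
    return found


# json.loads raises json.JSONDecodeError if braces or brackets are unbalanced.
print(gpu_requests(json.loads(TEMPLATE)))
```

To check a file on disk instead, the same function can be applied to the result of `json.load` on the opened template file, for example `gpudeploy.json` from the command above.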
    

    Successful deployment of the K80 in West Europe, after navigating to the directory containing the ARM template:

    240650-image.png
