Azure Container Instance takes very long to start, if it starts at all

Duv, Samir 0 Reputation points
2023-06-21T07:20:57.9766667+00:00

Hello,

I have an App Service API that takes a request and starts a Container Group as a background job. The issue is that the Container Instance takes more than half an hour to start, and sometimes it does not seem to start at all. The Docker image I use is rather large (~2.7 GB), but once the instance finally starts, pulling the image seems to be rather quick. Both the instance and the registry are in the East US region. The hardware is 1 K80 GPU and 1 vCPU with 6 GB of memory.

My questions are:

  1. Is this expected behavior?
  2. If it is, what would be an alternative for my use case? I need short bursts of GPU hardware to read a video from storage, analyze it, and then store it back in memory.
  3. I cannot access any logs until shortly before the instance actually starts. I am using the commands described at https://learn.microsoft.com/en-us/azure/container-instances/container-instances-get-logs. Are there any other diagnostics available to see what the issue might be?

1 answer

  1. Goncalo Correia 351 Reputation points Microsoft Employee
    2023-06-21T11:02:39.0533333+00:00

    Hi Duv,

    Thank you for your questions.

    From what you described, this delay seems to be related to the availability of the underlying infrastructure, as workloads that require GPU resources usually take longer to start.
    As the K80 GPUs are being retired soon, this might be aggravating the issue.
    https://learn.microsoft.com/en-us/azure/container-instances/container-instances-resource-and-quota-limits#gpu-resources-preview

    If this is the issue, and your workloads are pending while infrastructure is being provisioned, there are no other logs (besides the ones you mentioned) that you can gather.
    If you want to follow up on this, I encourage you to open a Support Request for a more detailed investigation of one of those deployments.
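Since no extra logs are available while a deployment is pending, one option is to bound the wait on the caller's side. The following is a minimal sketch, not an official pattern: `wait_until_running` and its `get_state` callback are hypothetical names, and the state strings ("Pending", "Running", "Failed") are assumptions about what the container group reports. In a real application `get_state` would wrap whatever call you already use to read the container group's state.

```python
import time
from typing import Callable


def wait_until_running(
    get_state: Callable[[], str],
    timeout_s: float = 600.0,
    poll_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_state() until it reports "Running" or the timeout expires.

    get_state is a hypothetical callback supplied by the caller; the state
    names checked here are assumptions, adjust them to what you observe.
    Returns True if the group reached "Running", False on failure or timeout.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == "Running":
            return True
        if state == "Failed":
            # Assumed terminal state: stop waiting immediately.
            return False
        if clock() >= deadline:
            # Still pending after the allowed time: give up so the caller
            # can delete the group and retry or fall back elsewhere.
            return False
        sleep(poll_s)
```

If this returns False, the API can delete the pending container group and report the failure to the client instead of leaving the job waiting indefinitely.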

    As an alternative, as the documentation mentions, you can use AKS to provision these jobs. I understand you only need them in short bursts, so consider having a node pool with GPUs that can autoscale with the demand of your workloads, to optimize your infrastructure costs.
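To illustrate the AKS suggestion, here is a rough sketch of a Kubernetes Job that requests one GPU, built as a plain Python dictionary. The helper name `gpu_job_manifest`, the job name, and the image reference are hypothetical placeholders. It assumes the GPU node pool exposes the `nvidia.com/gpu` resource (which normally requires the NVIDIA device plugin on those nodes). A pod that requests a GPU and cannot be scheduled is what prompts the cluster autoscaler to add a node.

```python
import json


def gpu_job_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a batch/v1 Job manifest for a one-off GPU workload."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,               # do not retry a failed run
            "ttlSecondsAfterFinished": 300,  # clean up the finished Job
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",
                            "image": image,
                            # GPU request; assumes the device plugin is installed.
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    manifest = gpu_job_manifest("video-analysis", "myregistry.azurecr.io/analyzer:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, or the same dictionary can be submitted through a Kubernetes client library from the App Service API.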

