How to Optimize Images for Deep Learning Deployments on AKS

Amith Adiraju 0 Reputation points

I have a Docker image of about 2 GB (a FastAPI app that loads a deep learning model for inference).

I followed this article : to deploy my app on AKS.

Because my image is large, I set higher resource requests on the pods, yet the pods always show the "Pending" state and terminate after several restarts. My deployment + service file looks like this:

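apiVersion: apps/v1
kind: Deployment
metadata:
  name: vsn-nw
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vsn-nw
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  minReadySeconds: 5
  template:
    metadata:
      labels:
        app: vsn-nw
    spec:
      nodeSelector:
        "": linux
      containers:
      - name: vsn-nw
        image: ACR IMAGE PATH
        ports:
        - containerPort: 8000
        resources:
          requests:
            cpu: 2500m
          limits:
            cpu: 5000m
---
apiVersion: v1
kind: Service
metadata:
  name: vsn-nw
spec:
  type: LoadBalancer
  ports:
  - port: 8000
  selector:
    app: vsn-nw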

My assumption is that the pod memory is not sufficient to hold my image. Is there a way to reduce my image size further, or any other best practice for working with large images?

(P.S.: one of the Docker layers (installing requirements) takes about 1.1 GB and the remaining layers together take 1 GB.)

2 answers

  1. vipullag-MSFT 25,441 Reputation points

    Hello Amith Adiraju

    Welcome to Microsoft Q&A Platform, thanks for posting your query here.

    Based on the information you provided, it seems like your pods are not able to start due to insufficient resources. You mentioned that you have already allocated higher resources to your pods, but they are still in a "Pending" state and terminate after several restarts.

    One possible solution to this issue is to optimize your Docker image size. You mentioned that one of the Docker layers takes about 1.1GB and the remaining layers together take 1GB. This is a large image size, and it can cause issues with pod scheduling and deployment.

    Here are some best practices to optimize your Docker image size for deep learning deployments on AKS:

    • Start with a smaller base image, such as Alpine Linux, instead of a full-fledged operating system.
    • Try to minimize the number of layers by combining multiple commands into a single RUN statement.
    • Remove any unnecessary files or directories from your Docker image.
    • Use multi-stage builds to separate the build environment from the runtime environment.
    • Use a Docker registry, such as Azure Container Registry (ACR), to store and manage your Docker images. This can help reduce the size of your deployment files and simplify the deployment process.

    Applied together, the above-mentioned practices can significantly reduce the size of your Docker image.

    I hope these best practices help you optimize your Docker image size and resolve your deployment issues.

    Hope this helps.

    If you need further help on this, tag me in a comment.

    If the suggested response helped you resolve your issue, please 'Accept as answer', so that it can help others in the community looking for help on similar topics.


  2. Andrei Barbu 2,576 Reputation points Microsoft Employee

    Hello Amith Adiraju

    Thank you for reaching out.

    The fact that the pod stays in the Pending state should have no relation to the image size, and the image size has no relation to the requests you set. I can see your requests are for CPU. The other resource for which you can set requests (and limits) is memory. Even if you had set requests for memory, that would not be related to the image size.

    In Kubernetes, "requests" refer to the minimum amount of resources that a container or pod requires to run. This value is used by the Kubernetes scheduler to allocate resources to the container or pod. If the requested resources are not available, the scheduler will not schedule the container or pod.
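To illustrate the paragraph above, here is a minimal Python sketch of the CPU part of that scheduling decision. It is a simplified model, not the actual Kubernetes scheduler: the `Node` class, the node names, and all numbers are hypothetical, and real scheduling also considers memory, taints, affinity, and more.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str            # hypothetical node name
    allocatable_m: int   # allocatable CPU in millicores
    requested_m: int     # CPU already requested by pods running on the node


def schedulable_nodes(request_m: int, nodes: list[Node]) -> list[str]:
    """Return the names of nodes whose free CPU covers the pod's request."""
    return [n.name for n in nodes if n.allocatable_m - n.requested_m >= request_m]


# Illustrative values only: one small node and one larger node.
nodes = [Node("node-a", 1900, 500), Node("node-b", 3860, 700)]

print(schedulable_nodes(2500, nodes))      # only the larger node qualifies
print(schedulable_nodes(2500, nodes[:1]))  # no node qualifies, so the pod stays Pending
```

If the returned list is empty for every node in the cluster, the pod has nowhere to go and remains Pending, which matches the behaviour described in the question.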

    That being said, please make sure your node has at least 2500m of CPU available.

    You can check that by running "kubectl describe node <nodename>" and looking for cpu under Allocatable.
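The same check can be scripted. The sketch below assumes the node description is fetched as JSON (for example with "kubectl get node <nodename> -o json") and reads the `status.allocatable.cpu` field. The embedded sample is a trimmed, illustrative stand-in, not output from a real cluster, and the helper names are hypothetical.

```python
import json

# Trimmed, illustrative stand-in for the node JSON; the CPU values are
# assumptions for the sake of the example, not measurements.
SAMPLE_NODE_JSON = """
{
  "status": {
    "capacity":    {"cpu": "2"},
    "allocatable": {"cpu": "1900m"}
  }
}
"""


def cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "2500m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def allocatable_cpu(node_json: str) -> int:
    """Read the node's allocatable CPU (in millicores) from its JSON description."""
    node = json.loads(node_json)
    return cpu_millicores(node["status"]["allocatable"]["cpu"])


available = allocatable_cpu(SAMPLE_NODE_JSON)
requested = cpu_millicores("2500m")  # the request from the deployment in the question
print(f"allocatable={available}m requested={requested}m fits={requested <= available}")
```

Note that this compares against the node's total allocatable CPU; pods already running on the node reduce what is actually free.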

    Please note that some system pods use part of the resources and there is also some resource reservation from AKS side.

    I suspect you are using the default SKU size, Standard_DS2_v2, which has 2 vCPUs. As mentioned, part of that is reserved, and even the full 2 vCPUs are less than the 2500m (2.5 vCPUs) you set as requests.

    Please check your SKU size and either choose a larger one or use lower requests.

    If you are still stuck after trying the above, please share the SKU size for your AKS cluster and the events for the Pending pod.
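As a rough way to reason about that choice, the Python sketch below estimates how many replicas fit on one node for a given CPU request. The function name and all numbers (1900m allocatable, 400m used by system pods) are assumptions for illustration; substitute the values reported for your own nodes.

```python
def replicas_per_node(allocatable_m: int, system_m: int, request_m: int) -> int:
    """Estimate how many pods with the given CPU request fit on one node.

    allocatable_m: node's allocatable CPU in millicores
    system_m:      CPU already requested by system pods (assumed value)
    request_m:     CPU request of one application pod
    """
    free = max(allocatable_m - system_m, 0)
    return free // request_m


# Hypothetical 2-vCPU node: 1900m allocatable, 400m taken by system pods.
for request in (2500, 1500, 750):
    print(f"request={request}m -> {replicas_per_node(1900, 400, request)} pod(s) per node")
```

With these assumed numbers, a 2500m request fits zero pods on the node, so either the node size has to go up or the request has to come down.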

    Hopefully this is what you are looking for! If you have additional questions, please let us know in the comments.

    If this has been helpful, please take a moment to accept answers as this helps increase visibility of this question for other members of the Microsoft Q&A community. Thank you for helping to improve Microsoft Q&A!
