How to Optimize Images for Deep Learning Deployments on AKS

Question

Amith Adiraju 0

I have a Docker image of about 2 GB (a FastAPI app that loads a deep learning model for inference).

I followed this article: https://learn.microsoft.com/en-us/azure/aks/tutorial-kubernetes-deploy-application?tabs=azure-cli to deploy my app on AKS.

Since my image is large, I set higher resource requests on the pods, yet the pods always show the "Pending" state and terminate after several restarts. My deployment and service file looks like this:




# DEPLOYMENT WITH 2 replicas
apiVersion: apps/v1
kind: Deployment

metadata:
  name: vsn-nw

spec:
  replicas: 2
  selector:
    matchLabels:
      app: vsn-nw

  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  
  minReadySeconds: 5 
  
  template:
    metadata:
      labels:
        app: vsn-nw
    
    spec:
      nodeSelector:
        "kubernetes.io/os": linux
      
      containers:
      - name: vsn-nw
        image: ACR IMAGE PATH
        ports:
        - containerPort: 8000
        
        resources:
          requests:
            cpu: 2500m
          limits:
            cpu: 5000m
  
---

# LOADBALANCER SERVICE
apiVersion: v1
kind: Service
metadata:
  name: vsn-nw
spec:
  type: LoadBalancer
  ports:
  - port: 8000
  selector:
    app: vsn-nw

My assumption is that the pod's memory is not sufficient to hold my image. Is there a way to compress the image further, or is there any other best practice for working with large images?

(P.S. One of the Docker layers (installing requirements) takes about 1.1 GB, and the remaining layers together take 1 GB.)

vipullag-MSFT 26,487 Reputation points Moderator

2023-05-10T04:35:42.3866667+00:00

Hello Amith Adiraju

Any update on the issue?

Just checking in to see if you got a chance to see previous response.

If the suggested response helped you resolve your issue, please 'Accept as answer', so that it can help others in the community looking for help on similar topics.

Answer 1

Hello Amith Adiraju

Welcome to Microsoft Q&A Platform, thanks for posting your query here.

Based on the information you provided, it seems like your pods are not able to start due to insufficient resources. You mentioned that you have already allocated higher resources to your pods, but they are still in a "Pending" state and terminate after several restarts.

One possible solution to this issue is to optimize your Docker image size. You mentioned that one of the Docker layers takes about 1.1GB and the remaining layers together take 1GB. This is a large image size, and it can cause issues with pod scheduling and deployment.

Here are some best practices to optimize your Docker image size for deep learning deployments on AKS:

Start with a smaller base image, such as Alpine Linux, instead of a full-fledged operating system.
Try to minimize the number of layers by combining multiple commands into a single RUN statement.
Remove any unnecessary files or directories from your Docker image.
Use multi-stage builds to separate the build environment from the runtime environment.
Use a Docker registry, such as Azure Container Registry (ACR), to store and manage your Docker images. This can help reduce the size of your deployment files and simplify the deployment process.

These best practices can significantly reduce the size of your Docker image, and I hope they help you resolve your deployment issues.

If you need further help on this, tag me in a comment.

If the suggested response helped you resolve your issue, please 'Accept as answer', so that it can help others in the community looking for help on similar topics.

Answer 2

Hello Amith Adiraju

Thank you for reaching out.

A pod staying in the Pending state is not related to the image size, and the image size is not related to the requests you set. I can see your requests are for CPU. The other resource for which you can set requests (and limits) is memory. Even if you had set requests for memory, they would not be related to the image size.

In Kubernetes, "requests" refer to the minimum amount of resources that a container or pod requires to run. This value is used by the Kubernetes scheduler to allocate resources to the container or pod. If the requested resources are not available, the scheduler will not schedule the container or pod.

That being said, please make sure your node has 2500m of CPU available.

You can check that by running "kubectl describe node <nodename>" and looking for cpu under Allocatable.

Please note that some system pods use part of the resources and there is also some resource reservation from the AKS side.

I suspect you are using the default SKU size, Standard_DS2_v2, which has 2 vCPUs. As mentioned, that is less than the 2500m (2.5 vCPUs) you set as requests.

Please check your SKU size and either choose a larger one or use lower requests.
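
As a rough sketch of what lower requests could look like, the resources section of the container might be changed along these lines. The numbers are assumptions chosen only so that a single pod fits on a 2-vCPU node; they are not measured values for this app, and the memory entries are hypothetical additions that were not in the original manifest.

```yaml
# Sketch only: all values below are illustrative assumptions.
# Compare them against the cpu and memory listed under Allocatable in
# "kubectl describe node <nodename>" before using them.
resources:
  requests:
    cpu: 1000m      # below the 2 vCPUs (2000m) of a Standard_DS2_v2 node
    memory: 1Gi     # hypothetical; size this to the loaded model, not to the image
  limits:
    cpu: 1500m      # hypothetical upper bound
    memory: 2Gi     # hypothetical upper bound
```

With replicas: 2, both pods have to fit within the allocatable CPU across your nodes, so on a single small node one replica may still remain Pending.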

If you are still stuck after checking the details above, please share the SKU size for your AKS cluster and the events for the Pending pod.

Hopefully this is what you are looking for! If you have additional questions, please let us know in the comments.

If this has been helpful, please take a moment to accept answers as this helps increase visibility of this question for other members of the Microsoft Q&A community. Thank you for helping to improve Microsoft Q&A!
