Hello Amith Adiraju,
Welcome to the Microsoft Q&A platform, and thanks for posting your query here.
Based on the information you provided, it seems like your pods are not able to start due to insufficient resources. You mentioned that you have already allocated higher resources to your pods, but they are still in a "Pending" state and terminate after several restarts.
One possible solution to this issue is to optimize your Docker image size. You mentioned that one of the Docker layers takes about 1.1GB and the remaining layers together take 1GB, which is about 2.1GB in total. This is a large image: it takes longer to pull onto each node and uses more node disk space, both of which can contribute to pod startup and deployment problems.
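Before trimming the image, it may help to confirm which build step produces the 1.1GB layer. The sketch below is a minimal example, not a definitive tool: it assumes you have saved the output of `docker history --no-trunc --format "{{.Size}}\t{{.CreatedBy}}" <your-image>` as text, and it ranks the layers by size. The function names and the sample lines are hypothetical and only illustrate the idea; your real layer commands will differ.

```python
import re

# Decimal multipliers, matching the human-readable sizes Docker prints (e.g. "1.1GB").
_UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
_SIZE_RE = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([kMGT]?B)\s*$")


def parse_size(text):
    """Convert a size string such as '1.1GB' or '0B' to a number of bytes."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(round(float(number) * _UNITS[unit]))


def rank_layers(history_text):
    """Return (size_in_bytes, created_by) tuples, largest layer first."""
    layers = []
    for line in history_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        size, _, created_by = line.partition("\t")
        layers.append((parse_size(size), created_by.strip()))
    return sorted(layers, key=lambda layer: layer[0], reverse=True)


# Hypothetical sample input; real input would come from the docker history command.
SAMPLE = (
    "0B\tCMD [\"python\", \"serve.py\"]\n"
    "1.1GB\tRUN pip install -r requirements.txt\n"
    "412MB\tCOPY model/ /app/model/\n"
    "588MB\tFROM base image layers\n"
)

if __name__ == "__main__":
    for size, created_by in rank_layers(SAMPLE):
        print(f"{size / 10**9:6.2f} GB  {created_by}")
```

The largest entries are usually the best candidates for the optimizations listed below, for example a dependency install step or a copied model directory.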
Here are some best practices to optimize your Docker image size for deep learning deployments on AKS:
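- Start with a smaller base image, such as Alpine Linux or a slim Debian-based image, instead of a full-fledged operating system image. Keep in mind that many deep learning packages publish prebuilt Python wheels only for glibc-based distributions, so a slim Debian-based image may be easier to work with than Alpine.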
- Try to minimize the number of layers by combining multiple commands into a single RUN statement.
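- Remove any unnecessary files or directories from your Docker image, such as package manager caches and build artifacts. Delete them in the same RUN statement that creates them, because files removed in a later layer still count toward the image size.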
- Use multi-stage builds to separate the build environment from the runtime environment.
- Use a Docker registry, such as Azure Container Registry (ACR), to store and manage your Docker images. A registry does not shrink the image itself, but keeping it in the same region as your AKS cluster can shorten image pull times and simplify the deployment process.
Together, these practices can significantly reduce the size of your Docker image. I hope they help you resolve your deployment issues.
If you need further help on this, tag me in a comment.
If the suggested response helped you resolve your issue, please 'Accept as answer', so that it can help others in the community looking for help on similar topics.