Scaling a FastAPI inference service with GPU nodes on AKS

Pedrojfb
2023-04-13

I have a FastAPI service that receives requests from a web app, performs inference on a GPU, and sends the results back to the web app; it accepts both images and videos (similar to this: https://huggingface.co/spaces/Testys/Human_Detector). Currently I have this API containerized and running on an AKS GPU node, but I'm exploring how to scale it to handle thousands of simultaneous requests. I want to scale the FastAPI service so that when too many requests are queued, more GPUs are added to process them. I'm deploying the API with gunicorn.

When it comes to pods, should I use a single pod with multiple gunicorn workers? Or should I have one gunicorn worker per pod and scale the pods, and if so, what metric would I use to scale them? Would appreciate any ideas and suggestions for this. Thank you!

Azure Kubernetes Service

Accepted answer
Goncalo Correia, Microsoft Employee
2023-04-18

Hi @Pedrojfb, thank you for your question.

Regarding single versus multiple gunicorn workers per pod: a good rule of thumb is to aim for smaller, simpler pods, which give you more control and granularity, especially for scaling. Based on that, one worker per pod might be ideal, though keep in mind that it ultimately comes down to your specific application.
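As an illustration, here is a minimal Deployment sketch for that layout. The image name, module path (main:app), port, and resource names are placeholders, and it assumes the NVIDIA device plugin is running on the GPU node pool so that one GPU can be requested per pod:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api
spec:
  replicas: 1                     # the autoscaler will adjust this
  selector:
    matchLabels:
      app: inference-api
  template:
    metadata:
      labels:
        app: inference-api
    spec:
      containers:
        - name: inference-api
          image: myregistry.azurecr.io/inference-api:latest  # placeholder image
          # One gunicorn worker per pod, using the uvicorn worker class for FastAPI
          command: ["gunicorn"]
          args: ["-w", "1", "-k", "uvicorn.workers.UvicornWorker",
                 "-b", "0.0.0.0:8000", "main:app"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1    # one GPU per pod
```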
Autoscaling your application would rely on the Horizontal Pod Autoscaler (HPA). You can scale your application pods based on one or multiple metrics, i.e. have the number of replicas increase when CPU utilization is high or when many requests are queued. Pod resource metrics are made available by the Kubernetes metrics server; for custom metrics, please refer to the Kubernetes documentation: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-metrics-apis

You can also check out KEDA, which can provide these custom metrics and integrates with Azure and multiple open-source tools. Again, it will depend on your application architecture, but with KEDA you would be able to scale your application pods based on, for example, the depth of a request queue. See https://learn.microsoft.com/en-us/azure/aks/keda-integrations and https://keda.sh/docs/2.10/scalers/
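To make that concrete, here is a minimal CPU-based HPA sketch targeting the Deployment above; the name and the 70% threshold are illustrative, not a recommendation:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

If the web app instead enqueued work on, say, an Azure Service Bus queue, a KEDA ScaledObject could scale on queue depth; the queue name and the TriggerAuthentication reference below are assumptions about your setup:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-api-scaler
spec:
  scaleTargetRef:
    name: inference-api              # the Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: inference-requests   # hypothetical queue
        messageCount: "5"               # target queued messages per replica
      authenticationRef:
        name: servicebus-auth           # hypothetical TriggerAuthentication
```

Note that KEDA creates and manages its own HPA under the hood, so you would use one approach or the other for a given Deployment, not both.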

In your question you mentioned: "so that when too many requests are queued, more GPUs are added to process them". Keep in mind that this can only be achieved with the cluster autoscaler: HPA will increase the number of pods only to the extent of the available cluster resources; it will not create new nodes. See https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler
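If the GPU node pool does not autoscale yet, it can be enabled with the Azure CLI; the resource group, cluster, and node pool names here are placeholders:

```bash
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpunodepool \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5
```

With this in place, when the HPA (or KEDA) schedules more GPU pods than the pool can hold, the cluster autoscaler adds nodes up to the configured maximum.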

Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.
