Scaling a FastAPI inference service with GPU nodes on AKS

Pedrojfb
2023-04-13

I have a FastAPI service that receives requests from a web app, performs inference on a GPU, and sends the results back to the web app; it accepts both images and videos (similar to this: https://huggingface.co/spaces/Testys/Human_Detector). Currently I have this API containerized and running on an AKS GPU node, but I'm exploring how to scale it to handle thousands of simultaneous requests. I want to scale the FastAPI service so that when too many requests are queued, more GPUs are added to process them. I'm deploying the API with gunicorn.

When it comes to pods, should I use a single pod with multiple gunicorn workers? Or should I have one gunicorn worker per pod and scale the pods, and if so, what metric would I use to scale them? Would appreciate any ideas and suggestions for this. Thank you!

Azure Kubernetes Service

Accepted answer
Goncalo Correia, Microsoft Employee
2023-04-18

Hi @Pedrojfb, thank you for your question.

Regarding single versus multiple gunicorn workers per pod: a good rule of thumb is to aim for smaller, simpler pods, which give you more control and granularity, especially for scaling. Based on that, one worker per pod might be ideal, though keep in mind that it ultimately comes down to your specific application.
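As an illustration, here is a minimal Deployment sketch for that layout. The image name, module path (main:app), port, and resource names are placeholders, and it assumes the NVIDIA device plugin is running on the GPU node pool so that one GPU can be requested per pod:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api
spec:
  replicas: 1                     # the autoscaler will adjust this
  selector:
    matchLabels:
      app: inference-api
  template:
    metadata:
      labels:
        app: inference-api
    spec:
      containers:
        - name: inference-api
          image: myregistry.azurecr.io/inference-api:latest  # placeholder image
          # One gunicorn worker per pod, using the uvicorn worker class for FastAPI
          command: ["gunicorn"]
          args: ["-w", "1", "-k", "uvicorn.workers.UvicornWorker",
                 "-b", "0.0.0.0:8000", "main:app"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1    # one GPU per pod
```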
Autoscaling your application would rely on the Horizontal Pod Autoscaler (HPA). You can scale your application pods based on one or multiple metrics, i.e. have the number of replicas increase when CPU utilization is high or when many requests are queued. Pod resource metrics are made available by the Kubernetes metrics server; for custom metrics, please refer to the Kubernetes documentation: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-metrics-apis

You can also check out KEDA, which can provide these custom metrics and integrates with Azure and multiple open-source tools. Again, it will depend on your application architecture, but with KEDA you would be able to scale your application pods based on, for example, the depth of a request queue. See https://learn.microsoft.com/en-us/azure/aks/keda-integrations and https://keda.sh/docs/2.10/scalers/
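To make that concrete, here is a minimal CPU-based HPA sketch targeting the Deployment above; the name and the 70% threshold are illustrative, not a recommendation:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

If the web app instead enqueued work on, say, an Azure Service Bus queue, a KEDA ScaledObject could scale on queue depth; the queue name and the TriggerAuthentication reference below are assumptions about your setup:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-api-scaler
spec:
  scaleTargetRef:
    name: inference-api              # the Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: inference-requests   # hypothetical queue
        messageCount: "5"               # target queued messages per replica
      authenticationRef:
        name: servicebus-auth           # hypothetical TriggerAuthentication
```

Note that KEDA creates and manages its own HPA under the hood, so you would use one approach or the other for a given Deployment, not both.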

In your question you mentioned: "so that when too many requests are queued, more GPUs are added to process them". Keep in mind that this can only be achieved with the cluster autoscaler: HPA will increase the number of pods only to the extent of the available cluster resources; it will not create new nodes. See https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler
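If the GPU node pool does not autoscale yet, it can be enabled with the Azure CLI; the resource group, cluster, and node pool names here are placeholders:

```bash
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpunodepool \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5
```

With this in place, when the HPA (or KEDA) schedules more GPU pods than the pool can hold, the cluster autoscaler adds nodes up to the configured maximum.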

Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.
