Hi @Pedrojfb
Thank you for your question,
Regarding single versus multiple Gunicorn workers per pod: a good rule of thumb is to aim for smaller, simpler pods, which give you more control and granularity, especially for scaling. Based on that, one worker per pod might be ideal; however, keep in mind that it really comes down to your specific application.
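As a rough sketch of the one-worker-per-pod approach (the names, image, resource values, and the `app:app` module path below are placeholders you would replace with your own), the Deployment could look something like:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp                  # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myregistry.azurecr.io/myapp:latest   # hypothetical image
        # one Gunicorn worker per pod; scale by adding replicas instead
        command: ["gunicorn", "--workers", "1", "--bind", "0.0.0.0:8000", "app:app"]
        resources:
          requests:            # resource requests are what HPA's CPU utilization is measured against
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
```

Setting explicit resource requests matters here, since HPA computes CPU utilization as a percentage of the requested amount.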
Autoscaling for your application would rely on the Horizontal Pod Autoscaler (HPA). You can scale your application pods based on one or multiple metrics, e.g. increase the number of replicas when CPU utilization is high or when there is a high number of requests in the queue.
Pod resource metrics are made available by the Kubernetes Metrics Server; for custom metrics, please refer to this Kubernetes documentation:
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-metrics-apis
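To illustrate the CPU-based case, a minimal HPA manifest could look like the following (the Deployment name and thresholds are assumptions to adapt to your setup):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp                # hypothetical Deployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU exceeds 70% of requests
```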
You can also check out KEDA, which can provide these custom metrics and integrates with Azure and multiple open-source tools. Again, it will depend on your application architecture, but with KEDA you would be able to scale your application pods based on, for example, the depth of a request queue.
https://learn.microsoft.com/en-us/azure/aks/keda-integrations
https://keda.sh/docs/2.10/scalers/
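As a sketch of queue-depth scaling, here is what a KEDA ScaledObject could look like using the Azure Service Bus scaler (the Deployment name, queue name, target message count, and the TriggerAuthentication reference are all hypothetical; other scalers follow the same shape):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myapp-scaledobject
spec:
  scaleTargetRef:
    name: myapp                  # hypothetical Deployment name
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: azure-servicebus
    metadata:
      queueName: requests        # hypothetical queue name
      messageCount: "20"         # target number of queued messages per replica
    authenticationRef:
      name: myapp-trigger-auth   # hypothetical TriggerAuthentication resource
```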
In your question you mentioned: "so that when there are too many requests in queue I add more GPUs to process these requests"
Keep in mind that this can only be achieved with the Cluster Autoscaler: HPA will increase the number of pods only to the extent of the available cluster resources; it will not create new nodes. https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler

Please don't forget to "Accept the answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.