Hello @Irwin, from a reliability point of view, load balancing across regions is much better, but load balancing across multiple instances in the same region is still reasonable. If your goal is to increase TPM (tokens per minute) to avoid HTTP 429 errors, however, load balancing between instances in the same region does not help, because:
- TPM is defined per model and region.
- Even if you have multiple instances in the same region, the total TPM per region and model does not change.
- If all instances are in the same region and the same models are deployed to them, you have to split the TPM quota among the instances (the split ratio is up to you).
https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits#quotas-and-limits-reference
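To make the arithmetic above concrete, here is a minimal Python sketch. The quota figure (240,000 TPM), the instance names, and the `split_tpm` helper are all hypothetical and chosen only for illustration; check the quotas page for the actual limit of your model and region.

```python
def split_tpm(regional_quota_tpm: int, weights: dict[str, float]) -> dict[str, int]:
    """Split one region's TPM quota for a model across instances by weight."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return {
        name: int(regional_quota_tpm * weight / total_weight)
        for name, weight in weights.items()
    }


# Hypothetical quota for one model in one region (not a real documented value).
REGIONAL_QUOTA_TPM = 240_000

# Two instances in the SAME region share the single regional quota.
same_region = split_tpm(REGIONAL_QUOTA_TPM, {"instance-a": 1, "instance-b": 1})
same_region_total = sum(same_region.values())

# An uneven split is also allowed; the ratio is up to you.
uneven = split_tpm(REGIONAL_QUOTA_TPM, {"instance-a": 3, "instance-b": 1})

# Two instances in DIFFERENT regions each get their own regional quota
# (assuming, for simplicity, the same quota figure in both regions).
multi_region_total = REGIONAL_QUOTA_TPM + REGIONAL_QUOTA_TPM

print(same_region, same_region_total)
print(uneven)
print(multi_region_total)
```

In this sketch, the same-region total stays at the regional quota no matter how it is split, while placing the second instance in another region doubles the combined TPM available.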