Episode

FastTrack for Azure Season 3 Ep10: Load Balancing Azure OpenAI instances using APIM and Container

with Andre Dewes, Srini Padala, Chris Ayers

In this session, we show how to load balance Azure OpenAI instances to mitigate throttling caused by tokens-per-minute (TPM) and requests-per-minute (RPM) limits, using API Management custom policies.

We also cover load balancing Azure OpenAI instances using a container deployed via Azure Container Apps.
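
To give a rough idea of what load balancing across Azure OpenAI instances involves, here is a minimal Python sketch of priority-based failover. It is not the API Management policy or the container shown in the session. The class names, the example URLs, and the choice to skip a throttled backend until its Retry-After window ends are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class Backend:
    """One Azure OpenAI endpoint (hypothetical URL and priority values)."""
    url: str
    priority: int              # lower number = preferred
    retry_after: float = 0.0   # clock time before which the backend counts as throttled


class PriorityBalancer:
    """Illustrative balancer: prefer low-priority numbers, skip throttled backends."""

    def __init__(self, backends, clock=time.monotonic):
        self.backends = sorted(backends, key=lambda b: b.priority)
        self.clock = clock

    def pick(self):
        """Return the most preferred backend that is not throttled, or None."""
        now = self.clock()
        for backend in self.backends:
            if backend.retry_after <= now:
                return backend
        return None

    def mark_throttled(self, backend, retry_after_seconds):
        """Record an HTTP 429: skip this backend until the Retry-After window ends."""
        backend.retry_after = self.clock() + retry_after_seconds

    def send(self, call):
        """Try each available backend once, in priority order.

        `call(url)` is a caller-supplied function (assumed here) that performs
        the request and returns (status_code, retry_after_seconds).
        """
        for backend in self.backends:
            if backend.retry_after > self.clock():
                continue  # still inside its throttling window
            status, retry_after_seconds = call(backend.url)
            if status != 429:
                return backend.url, status
            self.mark_throttled(backend, retry_after_seconds)
        return None, 429  # every backend is currently throttled
```

In this sketch, a request that hits a throttled instance falls through to the next instance instead of failing, and the throttled instance is left alone until its wait time has passed. A real deployment would also need authentication, timeouts, and handling for errors other than 429.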

Learning objectives

  • Discover strategies to enhance the performance and reliability of Azure OpenAI while minimizing throttling due to quota limitations.

Chapters

  • 00:00 - Welcome and introductions
  • 01:29 - Learning objectives
  • 02:50 - Tokens
  • 05:36 - Azure OpenAI Service quotas and limits
  • 11:16 - Tokens Per Minute (TPM)
  • 17:58 - Requests Per Minute (RPM)
  • 20:43 - Dynamic Quota
  • 24:35 - Best practices
  • 27:30 - Challenges
  • 30:24 - Load balancing multiple AOAI instances
  • 33:03 - Review challenges
  • 36:38 - Load balancing strategies
  • 40:10 - Load balancing AOAI with Azure API Management
  • 42:05 - Demo
  • 01:22:47 - Summary and conclusion

Intermediate
AI Engineer
Developer
Azure API Management
Azure Container Apps