Scalability
Scaling Consumers through Request Load Balancing: One of the unique problems enterprises encounter when they create a GenAI gateway is growing the number of consumers while each Azure OpenAI deployment is subject to TPM (tokens-per-minute) and RPM (requests-per-minute) limits. Here are some of the situations that can occur and some possible solutions that can be applied at the GenAI gateway.
1. Load Balancing for Multiple Pay-As-You-Go AOAI Instances
Supporting High Consumer Concurrency: To accommodate numerous consumers making LLM requests, it is advisable to segregate these consumers into distinct regions. Since Azure OpenAI quotas are enforced at the regional level, deploying in multiple regions allows these consumers to operate concurrently. The GenAI gateway can facilitate load balancing by distributing their requests across the various regional deployments. However, cross-region deployments might introduce latency for some consumers. This latency can be partially mitigated by implementing region affinity, whereby the GenAI gateway routes each consumer's requests to the regional deployment nearest to the requestor. Alternatively, suitable regions can be identified through benchmarking: simulate both normal and peak loads from representative requestors and evaluate which of the Azure OpenAI instances perform best for them.
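As a minimal sketch of region affinity, the snippet below maps a consumer's region to its nearest Azure OpenAI deployment; the region names, endpoint URLs, and the way the consumer's region is derived are illustrative assumptions, not prescribed configuration.

```python
# Nearest Azure OpenAI deployment per consumer region (hypothetical values).
REGION_AFFINITY = {
    "europe": "https://aoai-westeurope.openai.azure.com",
    "north-america": "https://aoai-eastus.openai.azure.com",
    "asia": "https://aoai-japaneast.openai.azure.com",
}
DEFAULT_BACKEND = "https://aoai-eastus.openai.azure.com"


def pick_backend(consumer_region: str) -> str:
    """Route the request to the regional deployment nearest the requestor."""
    return REGION_AFFINITY.get(consumer_region.lower(), DEFAULT_BACKEND)


# A gateway would derive the consumer's region from the request itself
# (e.g., a header or the source IP) before forwarding the call.
print(pick_backend("Europe"))  # -> https://aoai-westeurope.openai.azure.com
```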
For example, consider two scenarios: the first with a single deployment region and the second with deployments in two regions. Since quota is granted per region, the overall maximum RPM is higher in the second scenario, as shown below.
Description | Single-region deployment | Multi-region deployment
---|---|---
Total TPM limit | 240,000 | Region A: 240,000<br>Region B: 240,000
RPM enforced per 1,000 TPM | 6 | Region A: 6<br>Region B: 6
Total RPM per deployment | 1,440 | Region A: 1,440<br>Region B: 1,440
Total RPM across all deployments | 1,440 | 2,880
In a multi-region deployment scenario, the gateway can achieve potentially higher aggregate throughput and thus process more concurrent requests. Additionally, Azure OpenAI evaluates requests over a short period (1 second or 10 seconds), extrapolates the RPM and TPM from the traffic observed in that period, and throttles the overflow requests. By using multiple deployments, the load can be distributed across two or more resources, thereby reducing the probability of hitting the enforced limits on any one deployment.
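Because throttling is driven by these short-window extrapolations, the gateway can smooth bursts on its side so the extrapolated rate stays under each deployment's quota. Below is a minimal sliding-window sketch; the 10-second window and the 1,440 RPM figure are illustrative, and the actual evaluation intervals are enforced service-side.

```python
import time
from collections import deque


class ShortWindowLimiter:
    """Caps requests per evaluation window so the extrapolated RPM stays
    within a deployment's quota (the window length is an assumption)."""

    def __init__(self, rpm_limit: int, window_seconds: float = 10.0):
        # e.g., 1,440 RPM over a 10 s window -> at most 240 requests/window
        self.max_per_window = int(rpm_limit * window_seconds / 60)
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Drop timestamps that have fallen out of the current window.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_per_window:
            self.timestamps.append(now)
            return True
        return False  # would exceed the extrapolated limit; defer or reroute


limiter = ShortWindowLimiter(rpm_limit=1440)
if limiter.try_acquire():
    ...  # forward the request to the deployment
```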
2. Managing Spikes on PTUs with PAYG Endpoints
Enterprises often opt for Provisioned Throughput Units (PTUs) with Azure OpenAI (AOAI) for more stable and predictable performance compared to Pay-As-You-Go (PAYG). To handle sudden surges in consumer demand, a 'spillover' strategy can be effective. This strategy routes traffic to PTU-enabled deployments first; when PTU capacity is exhausted, the overflow is redirected to PAYG AOAI endpoints governed by TPM (tokens-per-minute) quotas. This redirection ensures all requests are processed.
If the PTU endpoint starts responding with status code 429, the PTU limit has been reached. Reaching the PTU limit can also be detected through proactive monitoring of PTU utilization.
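A minimal spillover sketch follows, using the `requests` library; the endpoint URLs, deployment name, and API version are placeholders, and a production gateway would also handle retries, timeouts, and authentication more carefully.

```python
import requests

# Placeholder endpoints: a PTU-backed deployment and a PAYG fallback.
PTU_URL = "https://aoai-ptu.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2024-02-01"
PAYG_URL = "https://aoai-paygo.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2024-02-01"


def route_with_spillover(payload: dict, headers: dict) -> requests.Response:
    """Send to the PTU deployment first; on 429 (capacity exhausted),
    spill over to the pay-as-you-go deployment."""
    response = requests.post(PTU_URL, json=payload, headers=headers, timeout=60)
    if response.status_code == 429:
        response = requests.post(PAYG_URL, json=payload, headers=headers, timeout=60)
    return response
```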
Strategies for Load Balancing across Multiple Azure OpenAI Deployments
Sometimes the gateway identifies multiple deployments as potential backend targets. When there are multiple potential targets, apply any of the following approaches to load balance the consumer requests:
Round Robin/Random: The GenAI Gateway can be configured to use a round-robin algorithm or random assignment to load balance the requests across the multiple AOAI deployments. This approach is recommended when the TPM limit is the same for each deployment.
Weighted Round Robin: Requests can also be load balanced across the multiple AOAI deployments in proportion to the TPM limit of each deployment. The GenAI Gateway can be configured to assign a weight to each AOAI deployment and then route requests based on that weight, where the weight is the TPM allocation for the specific deployment. For example, if there are 2 deployments within a PTU with 80% and 20% token allocations, the deployment with the 20% allocation should receive only 1 in 5 calls (see the first sketch after this list).
Dynamic routing based on AOAI Utilization: AOAI metrics can be used to continuously monitor the utilization of each AOAI resource, and the gateway can use this data to dynamically route each request to the least-utilized resource (see the second sketch after this list).
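A minimal weighted round-robin sketch, assuming weights proportional to each deployment's TPM allocation (the deployment names and the 80/20 split are illustrative). It uses the 'smooth' weighted round-robin scheme popularized by Nginx, which interleaves picks rather than emitting them in long bursts; with equal weights it degenerates to plain round robin.

```python
class SmoothWeightedRR:
    """Smooth weighted round robin: higher-weight backends are picked
    proportionally more often, without long consecutive runs."""

    def __init__(self, weights: dict[str, int]):
        self.weights = weights
        self.current = {name: 0 for name in weights}

    def next_backend(self) -> str:
        total = sum(self.weights.values())
        for name, weight in self.weights.items():
            self.current[name] += weight
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= total
        return chosen


# Weights mirror each deployment's TPM allocation (illustrative values):
# the 20% deployment receives exactly 1 call in 5.
balancer = SmoothWeightedRR({"aoai-a": 80, "aoai-b": 20})
print([balancer.next_backend() for _ in range(5)])
# -> ['aoai-a', 'aoai-a', 'aoai-b', 'aoai-a', 'aoai-a']
```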
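And a minimal sketch of utilization-based routing, assuming the gateway keeps a utilization snapshot per deployment, for example by periodically polling Azure Monitor metrics for each AOAI resource; the `UTILIZATION` values and endpoint names below are purely illustrative.

```python
# Hypothetical utilization snapshot (0.0-1.0) that a gateway might refresh
# periodically from Azure Monitor metrics for each deployment.
UTILIZATION = {
    "https://aoai-a.openai.azure.com": 0.85,
    "https://aoai-b.openai.azure.com": 0.40,
}


def pick_least_utilized() -> str:
    # Route the next request to the deployment with the most headroom.
    return min(UTILIZATION, key=UTILIZATION.get)


print(pick_least_utilized())  # -> https://aoai-b.openai.azure.com
```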