Performance Efficiency

1. Improving Consumer Latency

Consumer Latency acts as a critical factor when designing a GenAI gateway solution. This latency refers to the time taken for a user's request to travel from the client to the gateway, then to Azure OpenAI (AOAI) services, and back. Minimizing this latency ensures a responsive and efficient user experience.

Besides the common factors, one key way to reduce consumer latency is choosing AOAI Streaming endpoints. Streaming allows quicker responses to consumers as you can 'stream' the completions and process them in parts before the full completion is finished. Both OpenAI and AOAI use Server Sent Events(SSE)for streaming.

Having said that, here are some of the downsides of using streaming, which should be considered before making this choice. This list comes from OpenAI, but it also applies to AOAI.

Handling streaming responses: The GenAI gateway should have the capability to handle the streaming from the SSE. The gateway must read each chunk of the server sent event from AOAI and only process the "content" portion of it. It needs to stream "content" back to the application and close the connection on stream termination.

2. Quota Management

AOAI's quota feature enables assignment of rate limits to deployments, up to a global limit called "quota." It uses Tokens Per Minute (TPM) and Requests Per Minute (RPM) as units for this consumption management. Read more about the quota management provided by AOAI here.

In large enterprises hosting multiple business applications that access GenAI resources, it's crucial to manage quota distribution to ensure fair usage and optimized resource allocation. Prior to integration with the GenAI gateway, each consumer application should conduct a benchmark assessment of their TPM and RPM requirements. Based on this assessment, the GenAI gateway can then allocate consumers to appropriate backend AOAI resources.

Benchmarking Token Consumption (PTU and PAYG): Benchmarking consumers for TPM and RPM consumption is an important activity. It helps assess and optimize prompt complexity. It also helps anticipate latency in request processing, accurately estimate token usage, and ensure adherence to service limits.

The following steps outline how to conduct a benchmark assessment for TPM and RPM requirements.

  1. Estimate the frequency and volume of requests for each use case and scenario. For example, how many requests per minute, how many tokens per request, how many concurrent users, etc.

  2. Use a load testing tool or script to simulate the expected traffic pattern and measure the actual TPM and RPM consumption. For example, Azure Load Testing service, Apache JMeter, Locust, Artillery, etc.

  3. Analyze the results and identify the maximum and average TPM and RPM values for each use case and scenario. Also, note any errors or failures that occurred during the test.

  4. Compare the results with the available quota and the expected Service Level Agreement (SLA) for the consumer application. Based on these benchmark results, Provisioned Throughput Units (PTUs) can be procured if needed.

Microsoft is developing this tool to help in benchmarking: https://github.com/Azure/azure-openai-benchmark/

Here are some suggestions for the approaches for managing quota at a consumer level.

  • Setting up dedicated endpoints for consumers

    For scenarios with a limited number of consumers, it's recommended to assign dedicated endpoints to individual consumers or groups with similar requirements. The GenAI Gateway should be configured to route traffic to these designated endpoints based on the consumer's identity. This approach is effective for managing a smaller consumer base.

    In this model, quota distribution is determined at the time of endpoint creation, which requires continuous monitoring to ensure efficient utilization of quotas. It's common in such setups for some consumers to underutilize their allocated resources while others may experience a shortage, leading to an overall inefficient consumption of GenAI resources. Therefore, regular assessment and reallocation of quotas may be necessary to maintain optimal balance and effectiveness in resource usage.

    In this scenario, it is worth taking into the best practices of setting up Multi-tenancy for Azure OpenAI for the deployment configuration.

  • Assign rate limits at the consumer level

    An alternative approach is to apply rate limits at the consumer level in the GenAI Gateway. If a consumer surpasses their limit, then the GenAI gateway can do one of the following actions:

    • Restrict access to the GenAI resource until the quota is replenished

    • Degrade the consumer experience based on the defined contract with the consumer.

      This access restriction eliminates the need for deployment separation at the Azure OpenAI level. The consumer then can implement retry logic at their end to have better resiliency.

      Along with these approaches, a GenAI gateway can also be used to enforce rate limiting best practices at the consumer level. These best practices ensure that the consumers adhere to the conventions of setting max_tokens and a small best_of value to avoid draining the tokens.

3. Consumer-Based Request Prioritization

There could be multiple consumers with varying priorities trying to access the AOAI deployments. Since AOAI imposes hard constraints on the number of tokens that can be consumed per second, request prioritization must allow consumers with critical workloads access the GenAI resources first.

Each request can be categorized into different priorities. Low-priority requests can be deferred to a queue until the capacity becomes available. The AOAI resource should be continuously monitored to track the available capacity. As capacity becomes available, an automated process can start executing the requests from the queue. Different approaches to monitor PTU capacity are discussed here.

Leveraging Circuit-Breaker technique to prioritize requests:

The Circuit-Breaker technique in an API gateway can be employed to prioritize requests among consumers during peak loads. By designating certain consumers as prioritized and defaulting others as non-prioritized, the gateway monitors the backends' response code. When the backend returns specific response codes such as 429, implying the backend quota has reached, the circuit-breaker triggers a longer break for non-prioritized consumers. This break temporarily halts their requests to reduce stress on the backend. For prioritized consumers, the break will be shorter, ensuring a quicker resumption of service to maintain responsiveness for critical processes.