GenAI Gateway Reference Architectures
1. Reference Architectures using Azure API Management
See GenAI gateway reference architecture using APIM for possible reference architectures that use Azure API Management (APIM) as the GenAI gateway.
2. Explore approaches for maximizing PTU utilization
Azure OpenAI (AOAI) resources can have reserved capacity, measured in Provisioned Throughput Units (PTUs). Because this capacity is already allocated, it's advisable to use it as fully as possible. A single PTU instance can support different use cases for an enterprise, and many of them can be classified as either real-time (high-priority) or batch (low-priority). High-priority requests need to be handled right away, while low-priority requests can be delayed. A common scenario is to have more batch requests than real-time requests, in which case batch requests can easily consume all of the PTU capacity. To improve efficiency, a gateway should continuously balance the PTU's available capacity between high- and low-priority requests, dividing it between the two classes according to a target ratio. The gateway should be able to reduce the throughput of low-priority requests to make room for real-time requests and prioritize them when they arrive.
More detailed approaches are described in Maximizing PTU utilization.
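The balancing idea in this section can be illustrated with a minimal Python sketch. It is not the APIM implementation or any Azure API; the class name `PtuWindowBudget`, the per-window token accounting, and the fixed `low_priority_share` ratio are all assumptions made for illustration. High-priority requests may use any remaining capacity, while low-priority requests are additionally capped at their share of the window.

```python
from dataclasses import dataclass


@dataclass
class PtuWindowBudget:
    """Hypothetical token budget for one rate-limit window (for example, one minute)."""

    capacity: int              # assumed total tokens the PTU deployment serves per window
    low_priority_share: float  # assumed fraction of capacity that batch traffic may use
    used_high: int = 0
    used_low: int = 0

    def try_admit(self, tokens: int, high_priority: bool) -> bool:
        """Record the request and return True if it fits; otherwise return False."""
        remaining = self.capacity - self.used_high - self.used_low
        if tokens > remaining:
            return False  # the window is full for every priority class
        if high_priority:
            # Real-time traffic may take any capacity that is still free.
            self.used_high += tokens
            return True
        # Batch traffic is also limited to its configured share of the window.
        low_cap = int(self.capacity * self.low_priority_share)
        if self.used_low + tokens > low_cap:
            return False
        self.used_low += tokens
        return True

    def reset(self) -> None:
        """Start a new window with no recorded usage."""
        self.used_high = 0
        self.used_low = 0


# Illustrative values only: 1000 tokens per window, batch capped at half of it.
budget = PtuWindowBudget(capacity=1000, low_priority_share=0.5)
batch_admitted = budget.try_admit(400, high_priority=False)
batch_rejected = budget.try_admit(200, high_priority=False)
realtime_admitted = budget.try_admit(500, high_priority=True)
```

A fixed share is the simplest policy. A real gateway would more likely adjust the low-priority share dynamically as real-time demand rises and falls, and would queue or retry rejected batch requests rather than drop them.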
3. Reference design for key individual capabilities
This section describes the reference design for key GenAI gateway capabilities, which are covered in GenAI gateway reference architecture using APIM. The APIM service is used as a foundational technology.
4. Alternate reference designs
This section describes some of the alternate design options.