This section presents reference architectures for a GenAI gateway for an enterprise that needs to access both Azure OpenAI (AOAI) resources and custom LLM deployments hosted on its own premises. There are many possible ways to design a GenAI gateway using a combination of Azure services. This section demonstrates using the Azure API Management (APIM) service as the main component to build the necessary features of a GenAI gateway solution.
Reference Architectures using Azure API Management
The Azure API Management (APIM) Landing Zone accelerator provides a comprehensive solution for deploying a GenAI gateway using Azure API Management with best practices for security and operational excellence. A GenAI gateway using APIM is one of the reference scenarios implemented in this accelerator.
Cloud-based GenAI Gateway
This design shows how to use APIM to create a GenAI gateway that integrates seamlessly with AOAI services in the cloud and with any on-premises custom LLMs deployed and exposed as REST endpoints.
The architecture incorporates elements engineered for batch use cases, with the aim of optimizing PTU utilization, as described here.
Figure 1: Cloud-based GenAI gateway using APIM
APIM products and subscription features can enable various Generative AI scenarios in an enterprise. Different products can offer the following functionalities:
Creating content.
Producing embeddings.
Searching.
Subscriptions allow different teams to access these functionalities.
Considerations for the cloud-based approach: Keep in mind that the gateway component is cloud-based, meaning every request traverses the Azure network before gateway policies are applied. This can increase latency for on-premises services. Additionally, if LLM models are deployed on-premises, ensure that the network is configured to allow the required inbound connections.
On-premises GenAI Gateway using APIM Self-Hosted Gateways
Many enterprises want to use existing in-house capabilities while operating under network constraints that disallow inbound connections from Azure to their internal network.
Azure API Management (APIM) self-hosted gateways can be used to create a GenAI gateway that seamlessly integrates AOAI services with on-premises applications. The self-hosted APIM gateway acts as a crucial component, bridging AOAI services with the enterprise's internal network.
Figure 2: On-premises Self-Hosted APIM Gateway
With an APIM self-hosted gateway, requests from the enterprise's internal network stay within the network unless they reach out to the AOAI resource. This approach enables all the gateway features inside the network and eliminates the need for inbound connections from the cloud.
The gateway can use any existing on-premises queue deployment for scheduling requests and can connect to an enterprise-wide monitoring system. This integration enables gateway logs and metrics to be combined with existing consumer application logs and metrics.
Considerations for the on-premises approach: Organizations must deploy and maintain the self-hosted gateway, ensuring it scales horizontally to handle load and remains elastic during request surges. If using a custom metrics store, they must build their own monitoring and alerting solutions to support the following actions:
Dynamic scheduling of requests.
Generating charge-back reports.
Reference Design for Key Individual Capabilities
The following outlines the reference design for key GenAI gateway capabilities using Azure API Management (APIM) as the foundational technology.
1. Scalability
The Premium tier of APIM provides the capability to extend a single APIM instance across multiple Azure regions.
1.1 Supporting High Consumer Concurrency
A single Premium tier APIM service instance can do the following:
Support multi-region deployment.
Support multiple AOAI account configurations.
Facilitate efficient traffic routing across various regions.
Ensure support for high consumer concurrency.
The diagram below illustrates this setup, where APIM efficiently routes traffic to multiple AOAI instances deployed in distinct regions. This capability enhances the performance and availability of the service through geographical distribution of resources.
Scenario: Managing spikes with Provisioned Throughput Units (PTUs) and Pay As You Go (PAYG) endpoints
This diagram shows the implementation of a spillover strategy. The strategy involves initially routing traffic to PTU-enabled deployments. When PTU limits are reached, the overflow is redirected to pay-as-you-go AOAI endpoints governed by TPM (tokens per minute) limits. This redirection ensures that all requests are processed.
An alternate load-balancing strategy can be implemented by authoring custom policies within APIM. Refer to this implementation of such a strategy using custom APIM policies.
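As an illustration, below is a minimal sketch of a spillover policy. It assumes two APIM backends named ptu-backend and paygo-backend have already been configured (both names are illustrative). Traffic goes to the PTU deployment first; a 429 response triggers a single retry against the PAYG backend:

```xml
<policies>
    <inbound>
        <base />
        <!-- Route to the PTU-enabled deployment first -->
        <set-backend-service backend-id="ptu-backend" />
    </inbound>
    <backend>
        <!-- On 429 (throttled), retry once against the PAYG backend -->
        <retry condition="@(context.Response.StatusCode == 429)" count="1" interval="1" first-fast-retry="true">
            <choose>
                <!-- Re-evaluated on each attempt; switch backends after a 429 -->
                <when condition="@(context.Response?.StatusCode == 429)">
                    <set-backend-service backend-id="paygo-backend" />
                </when>
            </choose>
            <forward-request buffer-request-body="true" />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>
```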
2. Performance Efficiency
APIM policies can be used to apply rate limits based on requests per minute (RPM) and tokens per minute (TPM).
2.1 Quota Management for Consumers
Different rate limit values can be set for different use cases based on their subscription IDs. In the policy snippet below, rate limiting is applied on both RPM and TPM; throttling occurs when either of these limits is crossed.
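A minimal sketch of such a policy is shown below; the limit values are illustrative. It combines the built-in rate-limit-by-key policy for RPM with the azure-openai-token-limit policy for TPM, both keyed on the caller's subscription ID:

```xml
<inbound>
    <base />
    <!-- RPM: at most 60 calls per 60-second window per subscription -->
    <rate-limit-by-key calls="60" renewal-period="60" counter-key="@(context.Subscription.Id)" />
    <!-- TPM: at most 10,000 tokens per minute per subscription -->
    <azure-openai-token-limit tokens-per-minute="10000" counter-key="@(context.Subscription.Id)" estimate-prompt-tokens="true" />
</inbound>
```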
There will be scenarios where AOAI responds with 429s when the TPM limit for a specific deployment is exceeded. To mitigate the effect of AOAI quota limits, retries become an essential tool to ensure service availability. Request throttling typically lasts for a window of a few seconds or minutes, so a retry strategy with exponential back-off can be implemented at the gateway layer. This strategy ensures that consumer requests are eventually served.
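As a sketch, the backend section below retries throttled responses with exponentially growing intervals (the counts and intervals are illustrative):

```xml
<backend>
    <!-- Retry 429 responses up to 3 times; with interval, delta, and max-interval
         all set, APIM applies an exponential back-off between attempts -->
    <retry condition="@(context.Response.StatusCode == 429)" count="3" interval="1" delta="2" max-interval="30" first-fast-retry="false">
        <forward-request buffer-request-body="true" />
    </retry>
</backend>
```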
3. Security
3.1 Authentication
APIM with AOAI provides several options for authentication, including the following:
API keys.
Managed identities.
Service principal.
The managed identity approach can be used to authenticate between APIM and backend Azure services that support managed identities. The managed identity can be granted the appropriate access, "Azure AI Service User", on the AOAI instance, as described in How to configure OpenAI. APIM then transparently authenticates to the backend, that is, AOAI.
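As a sketch, the inbound policy below acquires a token for the Cognitive Services audience using the APIM instance's system-assigned managed identity and attaches it as a bearer token:

```xml
<inbound>
    <base />
    <!-- Acquire a token for AOAI using APIM's system-assigned managed identity -->
    <authentication-managed-identity resource="https://cognitiveservices.azure.com" output-token-variable-name="msi-access-token" ignore-error="false" />
    <!-- Forward the token to the AOAI backend as a bearer token -->
    <set-header name="Authorization" exists-action="override">
        <value>@("Bearer " + (string)context.Variables["msi-access-token"])</value>
    </set-header>
</inbound>
```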
3.2 PII and Data Masking
This diagram shows how PII detection and data masking are enabled using the GenAI gateway. Upon receiving a request, the data is sent to an Azure function for PII detection. This function can use services like the PII detection feature in Azure AI Language, Microsoft Presidio, or a custom machine learning model to identify PII. The detected information is then used to mask the request, and the masked data is forwarded to APIM, which sends it to AOAI.
Figure 3: PII and Data Masking
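One way to realize this flow inside the gateway itself is to call the masking function from an APIM policy. The sketch below assumes a hypothetical Azure Function at https://pii-fn.example.azurewebsites.net/api/mask that accepts a request body and returns a masked version of it:

```xml
<inbound>
    <base />
    <!-- Send the original body to the (hypothetical) PII-masking function -->
    <send-request mode="new" response-variable-name="maskedResponse" timeout="20" ignore-error="false">
        <set-url>https://pii-fn.example.azurewebsites.net/api/mask</set-url>
        <set-method>POST</set-method>
        <set-body>@(context.Request.Body.As<string>(preserveContent: true))</set-body>
    </send-request>
    <!-- Replace the request body with the masked payload before forwarding to AOAI -->
    <set-body>@(((IResponse)context.Variables["maskedResponse"]).Body.As<string>())</set-body>
</inbound>
```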
3.3 Data Sovereignty
This diagram shows how data is restricted to customer-specific regions using the GenAI gateway. Each region hosts AI-enabled applications, APIM, and AOAI. Traffic Manager routes traffic to the region-specific APIM instance, and APIM in turn routes requests to the region-specific Azure OpenAI instance.
Figure 5: Data Sovereignty via multi-instance APIM
4. Operational Excellence
4.1 Monitoring and Observability
Azure Monitor Integration
With APIM's native integration with Azure Monitor, requests, responses (payloads), and APIM metrics can be logged to Azure Monitor. Additionally, Azure Monitor can collect and log metrics from other Azure services like AOAI, making it the default choice for monitoring and observability.
Figure 6: Monitoring using Azure Monitor
Azure Monitor provides a low-code/no-code way of generating insights, but it has some limitations:
Latency can range from 30 seconds to 15 minutes, which is significant for real-time monitoring and decision-making.
Capturing request/response payloads requires configuring the sampling rate in APIM. A high sampling rate can impact APIM throughput and increase latency.
Large payloads may not be fully logged due to log size limitations. Azure Monitor has a log size limit of 32 KB, and if the combined size of all logged headers and payloads exceeds this limit, some logs may not be recorded.
Monitoring via Custom Events
Figure 7: Monitoring using Custom Events
In this approach, requests, responses, and other data from Azure API Management (APIM) can be logged as custom events to a messaging system like Event Hubs. The event stream from Event Hubs can be consumed by other services for near-real-time data aggregation, generating alerts, or performing other actions.
While this approach offers a near-real-time experience, it requires writing custom aggregation services.
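A minimal sketch of emitting such an event, assuming an APIM logger named genai-eventhub-logger has already been created against an Event Hub (the logger name and event shape are illustrative):

```xml
<outbound>
    <base />
    <!-- Emit a custom event with basic request/response metadata to Event Hubs -->
    <log-to-eventhub logger-id="genai-eventhub-logger">@{
        return new JObject(
            new JProperty("timestamp", DateTime.UtcNow.ToString("o")),
            new JProperty("subscriptionId", context.Subscription.Id),
            new JProperty("operation", context.Operation.Name),
            new JProperty("statusCode", context.Response.StatusCode)
        ).ToString();
    }</log-to-eventhub>
</outbound>
```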
5. Cost Optimization
5.1 Tracking Consumption
The Emit token metric policy allows users to track the token consumption of AOAI services by emitting the total, prompt, and completion token counts as custom metrics to Application Insights. These metrics can be aggregated to generate reports for internal charge-back of the consumers, and the policy supports both streaming and non-streaming AOAI responses.
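A sketch of this policy with illustrative dimensions (it assumes Application Insights integration is configured for the API):

```xml
<inbound>
    <base />
    <!-- Emit token counts as custom metrics, split by subscription for charge-back -->
    <azure-openai-emit-token-metric namespace="genai-gateway">
        <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
        <dimension name="API ID" value="@(context.Api.Id)" />
    </azure-openai-emit-token-metric>
</inbound>
```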