Operational Excellence
Context length is the number of input tokens that a model can handle. The field of LLMs is developing rapidly, with models supporting ever-longer context lengths. A longer context length means a larger request body. Beyond longer context, some models can work with multiple modalities of input data, and others can produce varied output types like images and videos.
The design of the GenAI Gateway must account for these advancements. It should efficiently manage large, mixed-content requests and support diverse output types, ensuring versatility and robustness in handling complex LLM functionalities.
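As an illustration, a gateway can guard against oversized requests before they reach a model. The following is a minimal sketch in Python, assuming the tiktoken tokenizer and a hypothetical per-model limits table; actual context windows should come from the model provider's documentation.

```python
# A minimal sketch of a context-length guard at the gateway layer.
import tiktoken

# Hypothetical context windows (in tokens) per model deployment.
MODEL_CONTEXT_LIMITS = {
    "gpt-35-turbo": 16_385,
    "gpt-4o": 128_000,
}

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with a tiktoken encoding (an approximation for chat payloads)."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

def validate_request(model: str, prompt: str, max_output_tokens: int) -> None:
    """Reject requests whose prompt plus requested completion exceed the window."""
    limit = MODEL_CONTEXT_LIMITS.get(model)
    if limit is None:
        raise ValueError(f"Unknown model: {model}")
    needed = count_tokens(prompt) + max_output_tokens
    if needed > limit:
        raise ValueError(f"Request needs {needed} tokens; {model} supports {limit}.")
```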
Monitoring and Observability are essential for creating robust and fault-tolerant systems. When building a GenAI gateway, it is key to measure and monitor the overall performance, which includes tracking facets such as:
- Error rates
- Total time for requests and responses
- Latency introduced by the gateway layer (a timing sketch follows this list)
- Latency introduced by cross-region calls between the gateway and AOAI instances
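One way to capture the gateway-introduced latency facet is to time the full round trip and the backend call separately. The sketch below assumes an asynchronous handler and a hypothetical forward_to_llm callable that proxies the request to the model backend.

```python
import time

async def handle_request(request, forward_to_llm):
    """Time the full round trip and the backend call separately."""
    total_start = time.perf_counter()

    # ... gateway work: authentication, quota checks, routing ...

    backend_start = time.perf_counter()
    response = await forward_to_llm(request)  # hypothetical proxy call
    backend_ms = (time.perf_counter() - backend_start) * 1000

    total_ms = (time.perf_counter() - total_start) * 1000
    gateway_overhead_ms = total_ms - backend_ms

    # Emit these as metrics or structured log fields for dashboards and alerts.
    print(f"total={total_ms:.1f}ms backend={backend_ms:.1f}ms "
          f"overhead={gateway_overhead_ms:.1f}ms")
    return response
```

The difference between the two durations approximates the latency added by the gateway layer itself.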
Before designing for monitoring and observability, some crucial aspects to address are the following (a sketch of one possible configuration follows the list):
- What type of information should be recorded (for example, request, response body/header info etc.)?
- Can we log only a sampled set, or do we have to log for all requests/responses?
- What time lag does the metric/event collector introduce between key events occurring and the processor acquiring them?
- How do the downstream system components depend on this data?
- What level of data freshness is required? Is near-real-time information necessary, or can some latency be tolerated?
- What actions will be taken using this information: scaling, throttling, or reporting only?
- What is the response mode? Streaming or Batch?
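One hypothetical way to make these decisions explicit is to capture them in a configuration object that the gateway's observability layer reads; the field names below are illustrative, not a standard schema.

```python
from dataclasses import dataclass

@dataclass
class ObservabilityConfig:
    log_request_body: bool = False       # what information is recorded
    log_response_headers: bool = True
    sample_rate: float = 1.0             # 1.0 logs everything; <1.0 samples
    max_collector_lag_seconds: int = 60  # tolerated collector time lag
    near_real_time: bool = True          # freshness required downstream
    drives_scaling: bool = True          # used for scaling/throttling vs. reporting
    streaming_responses: bool = True     # streaming vs. batch response mode

# Example: sample 10% of requests when near-real-time freshness is not needed.
config = ObservabilityConfig(sample_rate=0.1, near_real_time=False)
```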
This section lists different ways of measuring these metrics while interacting with GenAI resources.
Azure OpenAI Metrics via Azure Monitor: The Azure OpenAI service default metrics are available via Azure Monitor. Using these default metrics allows downstream systems (for example, the GenAI gateway) to access them for the following actions:
- Performing custom operations
- Building dashboards
- Setting up alerts
However, it's important to consider the latency involved with Azure Monitor -- typically ranging from 30 seconds to 15 minutes -- for the ingestion and availability of monitoring data to its consumers. This latency factor is a crucial aspect to account for in real-time monitoring and decision-making processes.
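As a sketch of consuming these default metrics programmatically, the following uses the azure-monitor-query SDK. The resource ID is a placeholder, and the metric name is an assumption; check the metrics your Azure OpenAI resource actually emits.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

client = MetricsQueryClient(DefaultAzureCredential())

# Placeholder resource ID of the Azure OpenAI account.
resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.CognitiveServices/accounts/<aoai-account>"
)

response = client.query_resource(
    resource_id,
    metric_names=["ProcessedPromptTokens"],  # assumed metric name
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.TOTAL],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.total)
```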
Generating Custom Metrics and Logs via GenAI Gateway: There can be scenarios where enterprises need information beyond what the AOAI metrics expose, for example, gateway-induced latency or custom business metrics. Additionally, downstream systems may need this information on a real-time or near-real-time basis to perform the following critical operations:
- Scaling
- Optimization
- Alerting
Here are some suggested approaches on how monitoring and observability can be achieved using GenAI gateway:
Emitting Custom Events to a Real-Time Messaging System: The GenAI gateway can intercept requests and responses, extract the relevant information, and package it into events that are pushed asynchronously into a real-time messaging system such as Kafka or Azure Event Hubs. A streaming event aggregator (for example, Azure Stream Analytics) can then consume these events on a near-real-time basis to do the following (a sketch of event emission follows this list):
- Populate a data store
- Feed data to dashboards
- Trigger actions based on certain rules
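As a sketch of this event-emission approach, the following uses the azure-eventhub SDK; the connection string, hub name, and event schema are illustrative assumptions. A production gateway would typically buffer and send events asynchronously, off the request path.

```python
import datetime
import json

from azure.eventhub import EventData, EventHubProducerClient

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>",
    eventhub_name="genai-gateway-events",  # hypothetical hub name
)

def emit_usage_event(consumer_id: str, model: str,
                     prompt_tokens: int, completion_tokens: int) -> None:
    """Package request metadata as an event and send it to Event Hubs."""
    event = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "consumer_id": consumer_id,
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    }
    batch = producer.create_batch()
    batch.add(EventData(json.dumps(event)))
    producer.send_batch(batch)
```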
Emitting Custom Metrics to a Metrics Collector: Alternatively, the GenAI gateway can emit custom metrics that support specific business needs to a metrics collector (backed by a time-series database). The metrics collector can power dashboards, alerts, and other custom functionality. Azure Monitor offers mechanisms for emitting and collecting custom metrics. Open-source alternatives like Prometheus can also be implemented, as described in this post.
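For the metrics-collector approach, the following is a minimal sketch using the open-source prometheus_client library; the metric names are illustrative.

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "genai_gateway_requests_total",
    "Total requests handled by the gateway",
    ["model", "status"],
)
GATEWAY_OVERHEAD = Histogram(
    "genai_gateway_overhead_seconds",
    "Latency added by the gateway layer",
)

def record(model: str, status: str, overhead_seconds: float) -> None:
    """Record one request outcome and its gateway-added latency."""
    REQUESTS.labels(model=model, status=status).inc()
    GATEWAY_OVERHEAD.observe(overhead_seconds)

# In a long-running gateway process, expose metrics for Prometheus to scrape.
start_http_server(9100)  # metrics served at :9100/metrics
record("gpt-4o", "200", 0.012)
```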
It's essential to understand that these custom metrics differ significantly from the metrics generated by the AOAI service. Hence, a careful assessment of when to use which is crucial.
For a high-level overview of the design, refer to this section.
The GenAI gateway in an enterprise acts as a frontend for all GenAI deployments. It covers both Azure OpenAI and custom LLM deployments, whether hosted in on-premises datacenters or on other cloud providers.
Accessing these differently hosted LLMs may vary in multiple aspects:
- Consumer authentication
- Emitted metrics
- Quota management
- Latency requirements
- Content moderation approaches
Hence, while designing the GenAI gateway, it's crucial to understand the organization's hybrid strategy considering the above-mentioned aspects. This understanding will dictate how the gateway interfaces with various LLMs and other hybrid services, ensuring efficient and secure access while meeting specific operational requirements.
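One hypothetical way to encode this understanding is a routing table that records, per model, the backend endpoint and the aspects listed above; all names and values below are illustrative.

```python
from dataclasses import dataclass

@dataclass
class BackendConfig:
    endpoint: str
    auth_mode: str   # e.g., "entra_id", "api_key", "mtls"
    region: str
    moderation: str  # e.g., "azure_content_safety", "custom_filter"

# Illustrative entries for an AOAI deployment and an on-premises LLM.
BACKENDS = {
    "gpt-4o": BackendConfig(
        endpoint="https://<aoai-account>.openai.azure.com",
        auth_mode="entra_id",
        region="eastus",
        moderation="azure_content_safety",
    ),
    "llama-3-onprem": BackendConfig(
        endpoint="https://llm.internal.example.com",
        auth_mode="mtls",
        region="on-premises",
        moderation="custom_filter",
    ),
}

def resolve_backend(model: str) -> BackendConfig:
    """Pick the backend for a requested model; the caller then applies the
    backend-specific authentication, quota, and moderation policies."""
    return BACKENDS[model]
```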
In the rapidly evolving landscape of LLMs, the capability to seamlessly transition between model versions is crucial for several reasons, such as rapid experimentation, swift adoption of cutting-edge performance improvements, or security upgrades.
The GenAI gateway should support Model Version Management, enabling the smooth integration of new LLM versions while maintaining operational continuity for consumer applications.
The gateway should facilitate the implementation of key model version management features, such as:
Testing and Rollout: It is common to execute a comprehensive test suite to ensure the performance, reliability, and compatibility of new LLM versions within the existing ecosystem before a broader rollout of model changes. The gateway must support these testing requirements by exposing test-specific endpoints and should also facilitate a controlled rollout to a subset of consumers.
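The following is a minimal sketch of a controlled rollout: a deterministic subset of consumers is routed to the new version. The hash-based split and the version names are assumptions, not a prescribed mechanism.

```python
import hashlib

CANARY_PERCENT = 10  # share of consumers routed to the new version

def choose_version(consumer_id: str,
                   stable: str = "model-v1",   # illustrative version names
                   canary: str = "model-v2") -> str:
    """Deterministically bucket consumers so each always sees the same version."""
    bucket = int(hashlib.sha256(consumer_id.encode()).hexdigest(), 16) % 100
    return canary if bucket < CANARY_PERCENT else stable
```

Because the bucketing is deterministic, a given consumer keeps seeing the same version throughout the test, which keeps results comparable.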
Ease of version upgrades and rollbacks: The gateway must have mechanisms to quickly roll forward to newer, stable versions or roll back to previous versions in response to any critical issues that may arise post-deployment.
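One simple pattern that supports fast roll-forward and rollback is an alias layer: consumers call a stable alias, and an upgrade or rollback is a single repointing of the alias. The alias and version names below are hypothetical.

```python
# Consumers request the alias; the gateway resolves it to a concrete version.
MODEL_ALIASES = {"chat-default": "model-v2"}

def resolve(alias: str) -> str:
    return MODEL_ALIASES[alias]

# Roll back in one step if critical issues appear post-deployment.
MODEL_ALIASES["chat-default"] = "model-v1"
```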
Resilience and fault tolerance are critical aspects of any GenAI gateway design. The gateway should be designed to handle failures gracefully and ensure minimal disruption to consumer applications. The following are some key considerations for building a resilient and fault-tolerant GenAI gateway:
- Backoff and Retry Mechanisms: Implementing backoff and retry mechanisms in the gateway can help manage transient failures and reduce the impact of service disruptions. The gateway should be able to intelligently retry requests based on the type of error and the current load on the system (see the combined sketch after this list).
- Backup models and fallback strategies: The gateway should have the ability to switch to backup models or fallback strategies in case of model failures or service outages. This ensures that consumer applications can continue to function even when primary models are unavailable.
- Regional Failover: The gateway should be designed to support regional failover to ensure high availability and reliability. In the event of a regional outage, the gateway should be able to redirect traffic to alternative regions to minimize downtime.
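The following sketch combines the three considerations above: jittered exponential backoff for transient failures, plus failover across a priority-ordered list of deployments (for example, a backup model or a deployment in another region). The call_llm callable and TransientError type are hypothetical stand-ins for the gateway's backend client.

```python
import random
import time

class TransientError(Exception):
    """Raised for retryable failures such as throttling or a regional outage."""

def call_with_resilience(request, deployments, call_llm,
                         max_retries=3, base_delay=0.5):
    """Try deployments in priority order (primary first, then backups),
    retrying transient failures with jittered exponential backoff."""
    last_error = None
    for deployment in deployments:
        for attempt in range(max_retries):
            try:
                return call_llm(deployment, request)
            except TransientError as err:
                last_error = err
                # Backoff: 0.5s, 1s, 2s, ... plus jitter to avoid thundering herds.
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
        # All retries on this deployment failed; fail over to the next one,
        # which may be a backup model or a deployment in another region.
    raise last_error
```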