Designing and implementing a gateway solution with Azure OpenAI resources

03/23/2024

1. Purpose

This document is a guide for engineering teams tasked with designing and implementing a gateway solution with Azure OpenAI resources. It aims to equip teams with the guidance and reference designs required to build a Generative AI (GenAI) gateway. A GenAI gateway can efficiently manage and optimize the utilization of GenAI resources, and facilitate seamless integration and distribution of workloads across various deployments. It can also enable fine-grained monitoring and billing.

It's important to note that this document does not delve into implementation-level details. It instead provides guidance on how teams can achieve the end goal of building a GenAI Gateway with the following key features:

Authentication
Observability
Compliance
Controls

2. Definition & Problem Statement

2.1. Problem Statement

In the landscape of applications consuming Large Language Models (LLMs), organizations face the challenge of efficiently federating and managing GenAI resources. As the demand for diverse and specialized LLMs grows, the need for a centralized solution becomes more pressing. This centralized solution must seamlessly integrate, optimize, and distribute workloads across a federated network of GenAI resources. Traditional gateway solutions often lack a unified approach for federating GenAI resources. This deficiency can result in suboptimal resource utilization, increased latency, and challenges in managing a collection of AI models.

What is a GenAI Gateway?

A "GenAI gateway" serves as an intelligent interface/middleware that dynamically balances incoming traffic across backend resources to optimize resource utilization. In addition to load balancing, a GenAI gateway can be equipped with extra capabilities to address challenges such as billing and monitoring.

Figure 1 shows some key benefits that can be achieved using a GenAI gateway.


Figure 1: Key Benefits of GenAI Gateway

2.2. Conceptual architecture of a GenAI gateway

Below is the conceptual architecture depicting high-level components of a GenAI gateway.


Figure 2: Conceptual Architecture

3. Recommended Pre-Reading

Readers should familiarize themselves with the key concepts and terminology of Azure OpenAI, which provide the foundation for the rest of this document.

4. Complexity in Building a GenAI Gateway

Large Language Models (LLMs) are exposed through a REST interface, allowing users to easily call their endpoints. In most large enterprises, the REST resources are typically hidden from the consumers with a Gateway component. By routing requests through a Gateway, enterprises get centralized control over access and usage. Centralized control enables effective implementation of the following policies:

Rate limiting
Authentication
Data privacy controls

In a traditional API Gateway, handling rate limiting and load balancing for resources typically involves regulating the number of requests over time. Techniques such as throttling or balancing the load across multiple backends are commonly employed.
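As a point of reference, request-based throttling of the kind described above might look like the following minimal sketch. It is only an illustration under stated assumptions: the class name `SlidingWindowRequestLimiter`, its parameters, and the choice of a sliding window are hypothetical and not taken from any specific gateway product. The current time is passed in explicitly so the behavior is easy to reason about.

```python
from collections import deque


class SlidingWindowRequestLimiter:
    """Hypothetical limiter: at most `max_requests` in any `window_seconds` span."""

    def __init__(self, max_requests: int, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()  # arrival times of accepted requests

    def allow(self, now: float) -> bool:
        # Forget accepted requests that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            return False  # a gateway would typically respond with HTTP 429 here
        self._timestamps.append(now)
        return True


# Usage: two requests per 60 seconds.
limiter = SlidingWindowRequestLimiter(max_requests=2, window_seconds=60.0)
decisions = [limiter.allow(t) for t in (0.0, 1.0, 2.0, 60.0, 60.5)]
```

Note that this limiter counts only requests. It has no notion of how expensive each request is, which is the gap that token-based limits expose.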

However, when using Azure OpenAI resources as a backend, the added dimension of TPM (Tokens Per Minute) limits introduces another layer of complexity: ensuring consistent and even load distribution across backends. Therefore, in addition to regulating the number of requests, the GenAI gateway must also account for the total tokens processed across multiple requests.
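One way to picture token-aware distribution is the sketch below, which routes each request to the backend with the most remaining token budget in the last 60 seconds. This is an illustrative assumption, not a reference design from this document: the `Backend` class, the `pick_backend` function, and the backend names and limits are all hypothetical. The token estimate per request is assumed to be supplied by the caller, and the gateway-side bookkeeping here is only an approximation of whatever the service itself enforces.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Backend:
    """Hypothetical view of one deployment and its tokens-per-minute budget."""
    name: str
    tpm_limit: int
    usage: deque = field(default_factory=deque, repr=False)  # (timestamp, tokens)

    def remaining(self, now: float) -> int:
        # Drop usage records older than one minute, then sum what is left.
        while self.usage and now - self.usage[0][0] >= 60.0:
            self.usage.popleft()
        return self.tpm_limit - sum(tokens for _, tokens in self.usage)

    def record(self, now: float, tokens: int) -> None:
        self.usage.append((now, tokens))


def pick_backend(backends: List[Backend], estimated_tokens: int,
                 now: float) -> Optional[Backend]:
    """Choose the backend with the most token headroom, or None if none fits."""
    candidates = [b for b in backends if b.remaining(now) >= estimated_tokens]
    if not candidates:
        return None  # every backend is out of budget; the caller should back off
    best = max(candidates, key=lambda b: b.remaining(now))
    best.record(now, estimated_tokens)
    return best


# Usage: two hypothetical deployments with different TPM budgets.
pool = [Backend("deployment-a", tpm_limit=1000), Backend("deployment-b", tpm_limit=500)]
choices = []
for t in (0.0, 1.0, 2.0, 3.0, 61.0):
    chosen = pick_backend(pool, estimated_tokens=400, now=t)
    choices.append(chosen.name if chosen else None)
```

In this sketch a request is rejected even when the request count is low, as long as the token budget of every backend is exhausted. That is the behavior a purely request-based limiter cannot express.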

5. Key Considerations while Building GenAI Gateway

The additional dimension of a tokens-per-minute constraint forces some changes to the traditional gateway design. The nature of these AI endpoints also introduces challenges that need to be addressed.

Here are the key considerations for building a GenAI gateway, in alignment with the Azure Well-Architected Framework.

6. Summary

This document not only provides a foundational understanding of key concepts but also offers practical insights and strategies for implementing a GenAI Gateway. It addresses the challenge of efficiently federating and managing GenAI resources, which is essential for applications utilizing AOAI (Azure OpenAI) and custom Large Language Models (LLMs). This document applies the industry-standard Well-Architected Framework to categorize and address the complexities of building a GenAI Gateway. It also provides comprehensive approaches and reference designs that are not only technically sound but also adhere to best practices.