LLMs and Azure OpenAI in Retrieval Augmented Generation (RAG) pattern (preview)

Important

This is a preview feature. This information relates to a prerelease feature that may be substantially modified before it's released. Microsoft makes no warranties, expressed or implied, with respect to the information provided here.

This article offers an illustrative example of using Large Language Models (LLMs) and Azure OpenAI within the context of the Retrieval Augmented Generation (RAG) pattern. Specifically, it explores how you can apply these technologies within Sovereign Landing Zones, while considering important guardrails.

Scenario

A common scenario is to use LLMs to engage in conversations using your own data through the Retrieval Augmented Generation (RAG) pattern. This pattern lets you use the reasoning abilities of LLMs to generate responses based on your specific data without fine-tuning the model. It facilitates the seamless integration of LLMs into your existing business processes or solutions.
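
As a minimal sketch of the pattern, the following Python snippet retrieves passages from your own data and uses them to ground a chat completion request against an Azure OpenAI deployment. The endpoint, key, deployment name, and the retrieve() helper are hypothetical placeholders, not part of the reference architecture.

```python
# Minimal RAG sketch: retrieve passages from your own data, then ground the
# prompt with them. Endpoint, key, and deployment names are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-02-01",
)

def retrieve(question: str) -> list[str]:
    # Placeholder: query a search service over your own data (covered later).
    return ["<passage retrieved from your data>"]

def answer(question: str) -> str:
    grounding = "\n\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="<chat-deployment>",  # the name of your GPT deployment
        messages=[
            {"role": "system", "content": "Answer only from these sources:\n" + grounding},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```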

Cloud for Sovereignty - AI and LLM Reference Architecture

Microsoft Cloud for Sovereignty provides a reference architecture that illustrates a typical Retrieval Augmented Generation (RAG) architecture within a Sovereign Landing Zone (SLZ). It provides an overview of common and recommended implementation technology choices, terminology, technology principles, common configuration environments, and composition of applicable services.

Reference architecture of AI and LLM configurations with sovereign guardrails.

Download a printable PDF of this reference architecture diagram.

The key stages and dataflow are as follows:

Application landing zones

In the management group hierarchy, these services are placed in a subscription within a nonconfidential management group.

Data sources and transformation pipelines

Data sources and transformation pipelines often already exist within an organization for line-of-business operations. LLM applications, such as RAG implementations, become new workloads when you integrate them with existing data.

To ensure data flow control, the reference architecture recommends data landing zones aligned with data domains for data sources, and places data transformation pipelines close to those sources to create data products consumed by LLM applications. This approach ensures precise management of the data provisioned to the LLM-based solution, which is hosted separately.

Data transformation components use different technologies and services to transform data into a format that an LLM-based application can search and use, through semantic or vector search, for grounding purposes. These pipelines can work on their own or might use AI services, such as Azure AI services or Azure OpenAI, to transform the data before it's loaded into a vector search or semantic search database.
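
For illustration, a transformation pipeline step might vectorize document chunks with an Azure OpenAI embedding deployment before loading them into the search database. This is a hedged sketch; the endpoint, key, and deployment name are placeholders.

```python
# Sketch of a pipeline step that vectorizes transformed document chunks with an
# Azure OpenAI embedding deployment before they're loaded into a vector index.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<api-key>",
    api_version="2024-02-01",
)

chunks = ["First document chunk...", "Second document chunk..."]
result = client.embeddings.create(
    model="<embedding-deployment>",  # for example, a text-embedding-ada-002 deployment
    input=chunks,
)
vectors = [item.embedding for item in result.data]  # one vector per chunk
# Each vector is stored alongside its source chunk in the search index.
```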

When AI services are used, network peering always makes them available (via the hub or directly) through their private endpoints. For governance, security, and compliance reasons, the data transformation components have the authority to determine what data, and in what format, is provided to a search database for the LLM workload.

Data transformation components can use various kinds of data sources to offer data with the optimal outcome to the search databases that LLM workloads rely on. These data sources can be SQL databases, data lakes, or even virtual machines hosting custom data solutions, depending on the customer environment.

Data sources shouldn't be accessible directly by the Orchestrator app; instead, these resources should only be available from within the private boundaries of the virtual network. This requires direct integration with Microsoft Azure Virtual Network (as is the case for VMs), Private Link services, or Virtual Network service endpoints (only if Private Link or direct Virtual Network integration isn't available).

AI and LLM-related components should be hosted as workloads in their own subscription under the Corp or Online management group, depending on whether public access is required. These components are:

Azure OpenAI Service encapsulates the operation of LLMs such as GPT and text embedding models such as Ada, making them accessible to the Orchestrator app through the standard APIs provided by Azure OpenAI.

An Orchestrator app acts as the front-end with an API or UX-based interface, and orchestrates the different steps required for building RAG-based experiences. Often, it's a web application or a web API. These steps typically include:

  • pulling data from semantic search engines for prompt grounding
  • pulling data from data sources for prompt grounding
  • correctly chaining different requests to the LLM

The Orchestrator app maintains the history of requests sent and responses received so that requests to the Azure OpenAI service are grounded in previous interactions. For example, in a chat-like experience such as ChatGPT or Bing Chat, the Orchestrator app maintains or caches the history of the conversational session so that the LLM service backend considers it in the conversation flow.
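
Here's a minimal sketch of such session handling, assuming the AzureOpenAI client from the earlier snippet; a production Orchestrator would persist history in a cache or database rather than in memory.

```python
# In-memory session store; a real Orchestrator would use a cache or database.
history: dict[str, list[dict]] = {}

def chat_turn(session_id: str, user_message: str, client, deployment: str) -> str:
    messages = history.setdefault(session_id, [
        {"role": "system", "content": "You are a helpful assistant grounded in company data."},
    ])
    messages.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(model=deployment, messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})  # keep the turn for grounding
    return reply
```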

In an Online environment, the Orchestrator app endpoint should be the only one provided through a public endpoint, protected by a Web Application Firewall and DDoS protection services. If hosted in a Corp environment, without public endpoints, the Orchestrator is either hosted on a service that's directly integrated into the Virtual Network, such as Virtual Machines or Virtual Machine Scale Sets, or on a service that supports Private Link or Virtual Network service endpoints, as is the case for Azure App Service.

Search services provide data from various data sources in a format optimized for efficient prompt grounding of LLM services. Microsoft proposes a combination of vectorization and semantic search, supported by Azure AI Search, to achieve the best prompt-grounding results. Semantic ranking measurably improves search relevance by using language understanding to rank search results. This improves the user experience of RAG applications, because prompt grounding becomes more accurate through better search results before a request is sent to the LLM.
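
The following hedged sketch shows what such a hybrid query could look like with the azure-search-documents Python SDK, combining a vector query with semantic ranking. The index name, field names, and semantic configuration name are assumptions about your index schema.

```python
# Hybrid retrieval sketch: vector search plus semantic ranking in Azure AI Search.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<search-key>"),
)

def embed(text: str) -> list[float]:
    # Placeholder: vectorize with your embedding deployment (see earlier sketch).
    return [0.0] * 1536  # text-embedding-ada-002 vectors have 1,536 dimensions

question = "user question here"
results = search_client.search(
    search_text=question,  # keyword part of the hybrid query
    vector_queries=[VectorizedQuery(
        vector=embed(question),
        k_nearest_neighbors=5,
        fields="contentVector",  # assumed vector field name
    )],
    query_type="semantic",
    semantic_configuration_name="<semantic-config>",
    top=5,
)
passages = [doc["content"] for doc in results]  # assumed text field name
```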

A combination of AI services might be used to create customized experiences for end users through the Orchestrator, or to optimize data ingestion processes. For example, a form recognizer service such as Azure AI Document Intelligence can extract structured information from forms and efficiently process and summarize user inputs; an LLM can then summarize the key findings from those recognized form inputs. Another scenario involves using a document recognizer service to convert documents in various formats, such as PDFs or Word documents, into text; an LLM text embedding service can then vectorize the recognized text for further analysis.
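
Here's a hedged sketch of the second scenario, using the azure-ai-formrecognizer SDK's prebuilt read model to extract text and an Azure OpenAI embedding deployment to vectorize it. All endpoints, keys, and deployment names are placeholders.

```python
# Convert a PDF to text with Azure AI Document Intelligence, then vectorize it.
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient
from openai import AzureOpenAI

doc_client = DocumentAnalysisClient(
    endpoint="https://<your-docintel>.cognitiveservices.azure.com",
    credential=AzureKeyCredential("<docintel-key>"),
)
openai_client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-02-01",
)

with open("document.pdf", "rb") as f:
    poller = doc_client.begin_analyze_document("prebuilt-read", document=f)
text = poller.result().content  # the full extracted text

vector = openai_client.embeddings.create(
    model="<embedding-deployment>",
    input=[text],
).data[0].embedding  # ready for further analysis or indexing
```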

Private Link services are deployed for all components so that all services are accessible only within the private environment. The only exception might be the Orchestrator app, which, if hosted in an Online landing zone, might be offered publicly behind a Web Application Firewall or a comparable service.

Infrastructure components

Infrastructure components can be hosted either as a part of the workload, or centrally in a hub or identity subscription.

The central infrastructure component of a Sovereign Landing Zone implementation is the Platform Connectivity Hub, which is a virtual network provided by every Sovereign Landing Zone deployment. It's placed in the connectivity subscription within the platform management group.

Shared networking components are placed in the hub virtual network. These components typically include:

  • ExpressRoute Circuits or VPN gateways for connectivity to the corporate network of a company, agency, or organization.

  • Firewalls can be implemented using appliances or a combination of Azure Firewall offerings, including Web Application Firewall. These solutions enable traffic inspection, filtering, and routing.

  • DDoS protection components for protecting workloads from distributed denial of service attacks.

  • Private DNS Zones for all types of services used across the entire virtual data center landscape implemented with landing zones.

  • Virtual Network Peering for connecting virtual networks of various workloads such as data sources, transformation, and LLM components through the hub network.

  • Policies that control traffic flow through the firewalls of the hub where needed.

Considerations

The reference architecture diagram shows a representative example architecture involving the typical components of an LLM RAG-based workload in the context of a Sovereign Landing Zone. There are several considerations to keep in mind that weren't covered in previous sections.

Alignment with principles from Well-Architected Framework and Cloud Adoption Framework

In previous sections, some alignment aspects related to Well-Architected Framework (WAF) and the Cloud Adoption Framework (CAF) were briefly mentioned. It's important to note that all architectural decisions should be fully aligned with the core principles of CAF and Azure Landing Zones, CAF cloud-scale analytics, and the WAF, including the WAF perspective on Azure OpenAI.

While dealing with guardrails is standard procedure in landing zone environments, several other areas require consideration for LLM and AI workloads. It's best to follow the Azure Security Baseline and the Sovereignty Baseline policy initiative standards when designing and defining the infrastructure for the workload subscription.

The top considerations to highlight for LLM RAG-based applications from these standards are:

Data residency and region selection

Sovereignty imposes strict requirements on data residency, and therefore might restrict deployments to specific Azure regions in an SLZ. Selecting a region for LLM workloads is limited by the availability of the services required:

  • Verify that Azure OpenAI and Azure AI Search are both available in the target region where you host your data and your workload, for data residency and proximity reasons. These services are also important for the end user's experience of the application from a performance perspective.

  • For Azure OpenAI, also check the availability of the required LLM models, since not all models are equally available across all regions.

  • If data sources or other cognitive services aren't available in your designated target region, you might be able to find and operate them in another region in alignment with your data residency requirements. However, Azure OpenAI service and Azure AI Search must be in the same region as the Orchestrator app for performance reasons.

Networking

Public endpoints aren't allowed in Corp environments, so all services must be encapsulated in a Virtual Network. Depending on the service, it might offer direct Virtual Network integration (as with VMs or AKS clusters), Private Link, or Virtual Network service endpoints. Replace Virtual Network service endpoints with Private Link wherever possible.

In addition to encapsulation, public access must be disabled for all services. Enable policy enforcement using Azure Policy so that public access can never be enabled accidentally. For services where corresponding deny policies can't be built, enable the corresponding auditing capabilities as a defense-in-depth strategy.
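
As one example of such auditing, the following hedged sketch uses the azure-mgmt-cognitiveservices SDK to flag Azure AI and Azure OpenAI accounts that still have public network access enabled. The subscription ID is a placeholder, and the attribute names follow that SDK's models.

```python
# Audit sketch: flag Azure AI / Azure OpenAI accounts with public access enabled.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

client = CognitiveServicesManagementClient(
    DefaultAzureCredential(), "<subscription-id>"
)

for account in client.accounts.list():
    access = account.properties.public_network_access  # "Enabled" or "Disabled"
    if access != "Disabled":
        print(f"Noncompliant: {account.name} has public network access {access}")
```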

Encryption at rest and in transit

Most Azure services support encryption both in transit and at rest. Enable encryption at rest and in transit across all services where available. Enable the latest TLS version, currently TLS 1.2, for encryption in transit.

Managed identities

Use managed identities for all services and service-to-service communication to avoid managing secrets for credentials.
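
For example, an Orchestrator app can authenticate to Azure OpenAI with its managed identity through Microsoft Entra ID instead of an API key. This hedged sketch uses DefaultAzureCredential from the azure-identity package; the endpoint is a placeholder.

```python
# Keyless authentication to Azure OpenAI using a managed identity.
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),  # resolves to the managed identity at runtime
    "https://cognitiveservices.azure.com/.default",
)

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    azure_ad_token_provider=token_provider,  # no secret to store or rotate
    api_version="2024-02-01",
)
```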

Key rotation in Key Vault

Whenever security assets such as keys, secrets, or certificates are required, enable rotation for those assets in Key Vault to maintain compliance.
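
As a hedged sketch, the azure-keyvault-keys SDK can configure an automatic rotation policy for a key. The vault URL, key name, and ISO 8601 durations are illustrative; align the actual periods with your compliance requirements.

```python
# Configure automatic rotation for a Key Vault key.
from azure.identity import DefaultAzureCredential
from azure.keyvault.keys import (
    KeyClient,
    KeyRotationLifetimeAction,
    KeyRotationPolicy,
    KeyRotationPolicyAction,
)

key_client = KeyClient(
    vault_url="https://<your-vault>.vault.azure.net",
    credential=DefaultAzureCredential(),
)

policy = KeyRotationPolicy(
    lifetime_actions=[
        KeyRotationLifetimeAction(
            KeyRotationPolicyAction.rotate,
            time_after_create="P90D",  # rotate 90 days after each key version is created
        )
    ],
    expires_in="P180D",  # key versions expire after 180 days
)
key_client.update_key_rotation_policy("<key-name>", policy)
```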

Network and Application Security Groups

In a sovereign, secure environment, the use of Network Security Groups (NSGs) and Application Security Groups (ASGs) is enforced. Missing security groups lead to noncompliant deployments. The usual TLS/HTTPS ports cover most services that LLM/RAG workloads rely on, since those services are HTTPS-based. Specific ports are required for data ingestion from the sources into the search and vector databases. Public IPs aren't allowed in Corp landing zones. All services must be accessible only within the Virtual Network, which requires the use of Private Link or Virtual Network service endpoints for PaaS services.
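
Here's a hedged sketch with the azure-mgmt-network SDK, adding an NSG rule that allows inbound HTTPS only from within the virtual network. The subscription ID, resource group, and NSG name are placeholders.

```python
# Allow inbound HTTPS (443) only from the virtual network on an existing NSG.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network_client = NetworkManagementClient(
    DefaultAzureCredential(), "<subscription-id>"
)

network_client.security_rules.begin_create_or_update(
    "<resource-group>",
    "<nsg-name>",
    "Allow-HTTPS-From-VNet",
    {
        "protocol": "Tcp",
        "direction": "Inbound",
        "access": "Allow",
        "priority": 100,
        "source_address_prefix": "VirtualNetwork",  # service tag: no public sources
        "source_port_range": "*",
        "destination_address_prefix": "*",
        "destination_port_range": "443",
    },
).result()
```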

More security and sovereignty guardrails

The most important and obvious guardrails covered in earlier sections for your infrastructure and application design are reusable even outside of Sovereign or Azure Landing Zones. Other global policies are tied to centrally managed resources, such as Log Analytics workspaces or Microsoft Sentinel deployments, in enterprise-scale landing zones. It's crucial to take these centrally managed resources into account during your infrastructure and application design process; neglecting to do so can result in additional effort and time after deployment. Thankfully, Azure's policy compliance feature can identify noncompliant resources after deployment. Moreover, both Sovereign and Azure Landing Zones provide DeployIfNotExists policies for numerous resources, further simplifying the process.

Some examples of such guardrails are:

  • Activation of logging into centrally managed Log Analytics Workspaces.

  • Integration with Azure Security Center or Microsoft Defender for Cloud.

  • Integration with security information and event management (SIEM) suites such as Microsoft Sentinel.

  • Integration with centrally managed firewalls, Web Application Firewalls, or DDoS protection.

These are just a few of the guardrails you might identify as requirements after your initial deployment. We recommend that you test your deployments in a test landing zone and iteratively integrate fulfillment of those guardrails into your infrastructure and application codebase. If that isn't entirely possible, many of these guardrails can be addressed after deployment with DeployIfNotExists policies.

Deploy this scenario

To take advantage of Large Language Models (LLMs) and Azure OpenAI based on the Retrieval Augmented Generation (RAG) pattern within Sovereign Landing Zones, you first need to deploy and configure a Sovereign Landing Zone (SLZ) and apply the Sovereignty Baseline policy initiatives. For a detailed overview of an SLZ and all its capabilities, see the Sovereign Landing Zone documentation on GitHub.

An SLZ provides an environment that offers guardrails through policies and policy sets, security enforcement, and a consistent baseline infrastructure for deploying workloads and applications. SLZ is based on Azure Landing Zones and extends them with guardrails and security controls specific to sovereignty requirements.

To help accelerate customers' time-to-value while assisting them in meeting their compliance objectives, Microsoft Cloud for Sovereignty includes ready-to-use workload templates that can be consistently deployed and operated in a repeatable manner. The workload templates are aligned with the Sovereignty Baseline policy initiatives, the Cloud for Sovereignty policy portfolio, and Azure Landing Zone default policies.

Information Assistant agent template

The Information Assistant agent template provides a starting point for organizations to build their own custom generative AI capability that extends the power of Azure OpenAI to organizational users and their domain data, without fine-tuning the model. You can deploy it within Microsoft Cloud for Sovereignty and align it with the reference architecture and guidance provided in this article. The Information Assistant agent template is compatible with the Sovereign Landing Zone Online management group scope using the secure mode deployment configuration. Support for the Corp management group scope is coming soon.

The agent template is a combination of code, documentation, and educational resources, provided at no charge to customers and partners, that can help accelerate time to value.

For more information on how to deploy and configure the Information Assistant, see the Information Assistant agent template documentation on GitHub. For use cases that you can achieve with this agent template, see Information Assistant Video.