This article provides a basic architecture to help you learn how to run chat applications by using Microsoft Foundry and Azure OpenAI in Microsoft Foundry models. The architecture includes a client user interface (UI) that runs in Azure App Service. To fetch grounding data for the language model, the UI uses an agent hosted in Foundry Agent Service to orchestrate the workflow from incoming prompts to data stores. The architecture runs in a single region.
Important
This architecture isn't for production. It's an introductory architecture for learning and proof of concept (POC) purposes. When you design production chat applications, use the Baseline Microsoft Foundry chat reference architecture, which adds production design decisions.
Important
An example implementation supports this guidance. It includes deployment steps for a basic end-to-end chat implementation. You can use this implementation as a foundation for your POC to work with chat applications that use Foundry agents.
Architecture
Download a Visio file of this architecture.
Workflow
The following workflow corresponds to the previous diagram:
An application user interacts with a web application that contains chat functionality. They issue an HTTPS request to the App Service default domain on `azurewebsites.net`. This domain automatically points to the App Service built-in public IP address. The Transport Layer Security (TLS) connection is established directly from the client to App Service, and Azure fully manages the certificate. The App Service feature Easy Auth ensures that the user who accesses the website is authenticated via Microsoft Entra ID.
The application code deployed to App Service handles the request and renders a chat UI for the application user. The chat UI code connects to APIs hosted in the same App Service instance. The API code connects to an agent in Foundry Agent Service by using the Azure AI Persistent Agents SDK.
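The API call in this step can be sketched as follows. This is a minimal, hypothetical sketch that assumes the Python `azure-ai-agents` package; the client type, method names, and message shape are assumptions that vary across SDK versions, and the endpoint and agent ID are placeholders. The imports sit inside the function so the sketch reads without the SDK installed.

```python
def ask_agent(project_endpoint: str, agent_id: str, question: str) -> str:
    """Hypothetical sketch: send one user message to a Foundry agent and
    return the agent's reply. Verify every name against the version of the
    Azure AI Persistent Agents SDK that you install."""
    # Imports inside the function so the sketch is readable without the SDK.
    from azure.ai.agents import AgentsClient
    from azure.identity import DefaultAzureCredential  # App Service managed identity

    client = AgentsClient(endpoint=project_endpoint,
                          credential=DefaultAzureCredential())
    thread = client.threads.create()
    client.messages.create(thread_id=thread.id, role="user", content=question)
    client.runs.create_and_process(thread_id=thread.id, agent_id=agent_id)

    # Return the newest agent message; the content shape varies by SDK version.
    for message in client.messages.list(thread_id=thread.id):
        if message.role == "assistant":
            return str(message.content)
    return ""
```

In this architecture, `DefaultAzureCredential` resolves to the App Service managed identity, so no keys are stored in application settings.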
Foundry Agent Service connects to Azure AI Search or requests up-to-date public knowledge to fetch grounding data for the query. The grounding data is added to the prompt that's sent to the model in the next step.
Foundry Agent Service connects to an Azure OpenAI model that's deployed in Microsoft Foundry and sends the prompt that includes relevant grounding data and chat context.
Application Insights logs information about the original request to App Service and agent interactions.
Components
Many of this architecture's components are the same as the basic App Service web application architecture because the chat UI is based on that architecture. This section highlights data services, components that you can use to build and orchestrate chat flows, and services that expose language models.
Microsoft Foundry is a platform that you use to build, test, and deploy AI solutions and models as a service (MaaS). This architecture uses Foundry to deploy an Azure OpenAI model.
Foundry projects establish connections to data sources, define agents, and invoke deployed models, including Azure OpenAI models. This architecture has only one Foundry project within the Foundry account.
Foundry Agent Service is a capability hosted in Foundry. You use this service to define and host agents to handle chat requests. It manages chat threads, orchestrates tool calls, enforces content safety, and integrates with identity, networking, and observability systems. In this architecture, Foundry Agent Service orchestrates the flow that fetches grounding data from AI Search and other connected knowledge sources and passes it with the prompt to the deployed model.
The agents defined in Foundry Agent Service are codeless and effectively nondeterministic. Your agent's system prompt, combined with the `temperature` and `top_p` parameters and constrained knowledge connections, defines how the agent behaves for all requests.

Foundry Models allow you to deploy flagship models, including OpenAI models, from the Azure AI catalog in a Microsoft-hosted environment. This approach is considered a MaaS deployment. This architecture deploys models by using the Global Standard configuration with a fixed quota.
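To make the role of these parameters concrete, the following self-contained Python sketch (illustrative only, not an Azure SDK call) shows how temperature scaling and `top_p` (nucleus) filtering reshape a toy token distribution before sampling:

```python
import math

def sample_distribution(logits, temperature=1.0, top_p=1.0):
    """Illustrative only: apply temperature scaling, then nucleus (top_p)
    filtering, and return the renormalized token probabilities."""
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [l / temperature for l in logits.values()]
    max_l = max(scaled)
    exps = [math.exp(l - max_l) for l in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = {tok: e / total for tok, e in zip(logits, exps)}

    # top_p: keep the smallest set of tokens whose cumulative probability
    # reaches top_p, then renormalize what remains.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}

# Lower temperature and top_p make the agent's responses more focused:
# the unlikely "maybe" token is filtered out entirely.
focused = sample_distribution({"yes": 2.0, "no": 1.0, "maybe": 0.1},
                              temperature=0.5, top_p=0.9)
```

Because the remaining choice is still random, two identical prompts can produce different responses, which is why the text above describes agents as effectively nondeterministic.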
AI Search is a cloud search service that supports full-text search, semantic search, vector search, and hybrid search. This architecture includes AI Search because it's commonly used in orchestrations behind chat applications. You use AI Search to retrieve indexed data relevant to user queries. AI Search serves as the knowledge store for the Retrieval Augmented Generation pattern. This pattern extracts a query from a prompt, queries AI Search, and uses the results as grounding data for a model.
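The pattern can be illustrated with a minimal, self-contained sketch. The keyword-overlap retriever below is a toy stand-in for AI Search, which would use full-text, vector, or hybrid ranking; the grounded prompt it builds mirrors what the orchestration passes to the model:

```python
def retrieve(query, documents, top_k=1):
    """Toy stand-in for AI Search: rank documents by keyword overlap
    with the query and return the best matches as grounding data."""
    terms = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_grounded_prompt(query, documents):
    """Combine retrieved grounding data with the user query, mirroring
    what the agent does before calling the deployed model."""
    grounding = retrieve(query, documents)
    context = "\n".join(f"- {doc}" for doc in grounding)
    return f"Use only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The return policy allows refunds within 30 days.",
    "Shipping is free on orders over 50 dollars.",
]
prompt = build_grounded_prompt("What is the return policy?", docs)
```

Only the relevant document reaches the model, which keeps the prompt small and grounds the answer in your own data rather than the model's training data.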
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that you can use to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.
This basic architecture isn't intended for production. It favors simplicity and cost efficiency over functionality so that you can learn how to build end-to-end chat applications. The following sections outline deficiencies and recommendations. These omissions are deliberate to minimize setup time. Don't use this topology in production; each omission increases risk.
Reliability
Reliability helps ensure that your application can meet the commitments that you make to your customers. For more information, see Design review checklist for Reliability.
The following list outlines critical reliability features that this architecture omits:
This architecture uses the App Service Basic tier, which doesn't have Azure availability zone support. The instance becomes unavailable if there are problems with the instance, rack, or datacenter. As you move toward production, follow the reliability guidance for App Service instances.
This architecture doesn't enable autoscaling for the client UI. To avoid capacity issues, overprovision compute during learning. Implement autoscale before production.
This architecture deploys Foundry Agent Service as a fully Microsoft-hosted solution. Microsoft hosts dependent services (Cosmos DB, Storage, AI Search) on your behalf. Your subscription doesn't show these resources. You don't control their reliability characteristics. For guidance on bringing your own dependencies, see the baseline architecture.
Note
The AI Search instance in the components section and diagram is different from the instance that's a dependency of Foundry Agent Service. The instance in the components section stores your grounding data. The dependency does real-time chunking of files that are uploaded within a chat session or as part of an agent's definition.
For learning, use the Global Standard model deployment type. Before production, estimate throughput and data residency needs. If you require reserved throughput, choose a Data Zone Provisioned or Global Provisioned deployment type. Use Data Zone Provisioned for explicit residency requirements.
This architecture uses the AI Search Basic tier, which doesn't support Azure availability zones. For zone redundancy, use the Standard tier or higher in a zone-enabled region and deploy three or more replicas.
For more information, see Baseline Microsoft Foundry chat reference architecture.
Security
Security provides assurances against deliberate attacks and the misuse of your valuable data and systems. For more information, see Design review checklist for Security.
This section describes key recommendations that this architecture implements. These recommendations include content filtering and abuse monitoring, identity and access management, and role-based access control. This architecture isn't designed for production deployments, so this section also includes network security considerations. Network security is a key security feature that this architecture doesn't implement.
Content filtering and abuse monitoring
Foundry includes a content filtering system that uses a combination of classification models. This filtering detects and blocks specific categories of potentially harmful content in input prompts and output completions. This potentially harmful content includes hate, sexual content, self-harm, violence, profanity, and jailbreak (content designed to bypass language model restrictions) categories. You can configure the filtering strictness for each category by using low, medium, or high options. This reference architecture uses the DefaultV2 content filter when deploying models. You should adjust the settings according to your requirements.
Identity and access management
The following guidance expands on the identity and access management guidance in the App Service baseline architecture. The chat UI's API code uses the App Service managed identity to authenticate to Foundry Agent Service through the Azure AI Persistent Agents SDK.
The Foundry project also has a managed identity. This identity authenticates to services such as AI Search through connection definitions. The project makes those connections available to Foundry Agent Service.
A Foundry account can contain multiple Foundry projects. Each project should use its own managed identity. If different workload components require isolated access to connected data sources, create separate Foundry projects within the same account and avoid sharing connections across them. If your workload doesn't require isolation, use a single project.
Role-based access control
You're responsible for creating the required role assignments for the managed identities. The following table summarizes the role assignments that you must add for App Service, the Foundry project, and individuals who use the portal:
| Resource | Role | Scope |
|---|---|---|
| App Service | Azure AI User | Foundry account |
| Foundry project | Search Index Data Reader | AI Search |
| Portal user (for each individual) | Azure AI Developer | Foundry account |
Network security
To simplify the learning experience for building an end-to-end chat solution, this architecture doesn't implement network security. It uses identity as its perimeter and uses public cloud constructs. Services such as AI Search, Foundry, and App Service are reachable from the internet. This setup increases the attack surface of the architecture.
This architecture also doesn't restrict egress traffic. For example, an agent can be configured to connect to any public endpoint based on the endpoint's OpenAPI specification. So data exfiltration of private grounding data can't be prevented through network controls.
For more information about network security as an extra perimeter in your architecture, see networking in the baseline architecture.
If you want some network security during your evaluation of this solution, you should use the network security perimeter support on your Foundry project. This approach provides ingress and egress control before you implement virtual network resources in your architecture. When the Foundry Agent Service is configured for standard, private deployment, the network security perimeter is replaced with Private Link connections.
Microsoft Defender for Cloud
For this basic architecture, you don't need to enable Microsoft Defender cloud workload protection plans for any services. When you move to production, follow the security guidance in the baseline architecture for Microsoft Defender, which uses multiple plans to cover your workload.
Governance through policy
This architecture doesn't implement governance through Azure Policy. As you move toward production, follow the governance recommendations in the baseline architecture. Those recommendations add Azure Policy across your workload's components.
Cost Optimization
Cost Optimization focuses on ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Design review checklist for Cost Optimization.
This basic architecture doesn't represent the costs of a production-ready solution, and it doesn't include controls to guard against cost overruns. The following considerations outline crucial cost-affecting features that this architecture omits.
This architecture assumes limited model calls. Use the Global Standard deployment type (pay-as-you-go) instead of provisioned throughput. As you move toward production, follow the cost optimization guidance in the baseline architecture.
Foundry Agent Service incurs costs for files uploaded during chat interactions. Don't make file upload functionality available to application users if it's not part of the desired user experience. Extra knowledge connections, such as the Grounding with Bing tool, have their own pricing structures.
Foundry Agent Service is a no-code solution. You can't deterministically control which tools or knowledge sources each request invokes. In cost modeling, assume maximum usage of each connection.
This architecture uses the App Service Basic pricing tier on a single instance. It doesn't provide protection from an availability zone outage. The baseline App Service architecture recommends Premium plans with three or more worker instances for high availability.
This architecture uses the AI Search Basic pricing tier with no added replicas. This topology can't withstand a zone failure. The baseline end-to-end chat architecture recommends the Standard tier or higher and three or more replicas.
This architecture doesn't include cost governance or containment controls. Set Azure budgets and alerts early to guard against unexpected token or tool usage.
For budgeting, modify the pricing calculator estimate of this architecture to fit your scenario.
Operational Excellence
Operational Excellence covers the operations processes that deploy an application and keep it running in production. For more information, see Design review checklist for Operational Excellence.
Monitoring
This architecture configures diagnostics for all services. App Service captures AppServiceHTTPLogs, AppServiceConsoleLogs, AppServiceAppLogs, and AppServicePlatformLogs. Foundry captures RequestResponse. During the POC phase, inventory available logs and metrics. Before production, remove sources that don't add value.
To use the monitoring capabilities in Foundry, connect an Application Insights resource to your Microsoft Foundry project.
This integration enables:
- Real-time monitoring of token usage, including prompt, completion, and total tokens
- Detailed request-response telemetry, including latency, exceptions, and response quality
You can also trace agents by using OpenTelemetry for distributed diagnostics.
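As a sketch of that tracing approach, the following function wraps an agent invocation in an OpenTelemetry span. It assumes the `opentelemetry-api` package (the import is inside the function so the sketch reads without it installed); exporter setup, such as sending spans to Application Insights, is omitted:

```python
def traced_agent_call(run_agent):
    """Sketch: wrap an agent invocation in an OpenTelemetry span so the call
    appears in distributed traces. Assumes the opentelemetry-api package;
    exporter configuration is out of scope for this sketch."""
    from opentelemetry import trace  # lazy import; package is an assumption

    tracer = trace.get_tracer("chat-api")
    with tracer.start_as_current_span("agent-invocation") as span:
        result = run_agent()
        # Attach workload-specific attributes for later analysis.
        span.set_attribute("agent.response_length", len(result))
        return result
```

The span name and attribute key here are hypothetical choices; pick names that match your workload's telemetry conventions.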
Model operations
This architecture is optimized for learning and isn't intended for production. Plan for model lifecycle management and model deprecation and retirement before promoting workloads.
Development
For the basic architecture, you can create agents by using the browser-based experience in the Foundry portal. When you move toward production, follow the development and source control guidance in the baseline architecture. When you no longer need an agent, be sure to delete it. If the agent that you delete is the last one that uses a connection, also remove the connection.
Evaluation
Evaluate your generative application in Foundry and learn how to use evaluators. Evaluation helps ensure that model, prompt, and data quality meet design requirements.
Performance Efficiency
Performance Efficiency refers to your workload's ability to scale to meet user demands efficiently. For more information, see Design review checklist for Performance Efficiency.
This architecture isn't designed for production deployments, so it omits critical performance efficiency features:
Use POC results to choose the right App Service tier. Meet demand through horizontal scaling by adjusting the instance count, and avoid designs that require changing the tier to handle routine demand.
This architecture uses pay-as-you-go components. Best-effort resource allocation can introduce noisy neighbor effects. Decide whether you need provisioned throughput to reserve capacity and achieve predictable performance.
Other design recommendations
Architects should design AI and machine learning workloads, such as this one, with the Well-Architected AI workloads on Azure guidance in mind. Combine insights from your POC using this architecture with broader AI and machine learning best practices when moving beyond POC.
Deploy this scenario
Deploy a reference implementation that applies these recommendations and considerations.