Episode

FastTrack for Azure Season 3 Ep11: Monitoring Azure OpenAI

with Victor Santana, Chris Ayers, Marc Mercier

In this session, we will explore the monitoring of Azure OpenAI. Starting with a big-picture overview of the entire solution, we will then zoom in on Azure OpenAI services through the lens of the Well-Architected Framework (WAF). We will discuss key concepts such as Tokens, Rate Limiting, Quotas, and Provisioned Throughput Units (PTU), as well as Metrics & Alerts for Azure overall, with a focus on reliability and SRE. We will also delve into SLA and performance, including response times.

Specifically, for OpenAI, we will cover concepts like Token Usage, Quota, and Response Times. As we focus on monitoring for resiliency, performance, and response times, we will discuss Metrics, Dashboards, and Alarms. Finally, we will take a detailed dive into diagnostic settings and log analytics, including the use of Kusto.

By the end of this session, you will have a comprehensive understanding of how to monitor Azure OpenAI, and be equipped with the knowledge and tools to do so.

Learning objectives

  • Gain a comprehensive overview of Azure OpenAI monitoring within the Well-Architected Framework.
  • Understand key operational concepts like Tokens, Rate Limiting, Quotas, and PTU relevant to Azure OpenAI.
  • Learn about setting up Metrics, Alerts, and SLAs for effective monitoring and reliability.
  • Master the use of Dashboards and Alarms for monitoring resiliency, performance, and response times in Azure OpenAI.
  • Delve into advanced diagnostic settings and log analytics with Kusto for in-depth monitoring insights.

Chapters

  • 00:00 - Introduction
  • 01:30 - Learning objectives
  • 02:01 - Agenda
  • 03:19 - OpenAI Terms
  • 09:38 - Tokens
  • 11:09 - OpenAI API Quotas
  • 12:41 - Rate Limiting
  • 13:51 - Azure API Management (APIM)
  • 15:23 - APIM Policies
  • 18:11 - APIM Backends
  • 19:00 - APIM Load Balancer & Circuit Breaker
  • 19:46 - Smart Load Balancing for OpenAI Endpoints and APIM
  • 20:47 - Monitoring Azure OpenAI
  • 30:43 - Demo
  • 58:44 - Langfuse on Azure
  • 01:00:06 - Telemetry in Semantic Kernel SDK
  • 01:02:23 - Model monitoring for generative AI applications
  • 01:03:06 - Monitoring published APIs using APIM
  • 01:03:20 - Importing Azure OpenAI APIs into APIM
  • 01:04:23 - Monitoring AI Search

Intermediate
AI Engineer
Developer
Support Engineer
Azure Monitor