Monitoring Generative AI applications

As the adoption of generative AI applications continues to grow, so does the necessity for robust monitoring. These applications, powered by intricate data models and algorithms, aren't exempt from the challenges faced by any other software system. Yet, their unique nature makes their monitoring requirements distinctive. Generative AI apps interact with a vast array of data, generate varied outputs, and often operate under tight performance constraints. The quality, performance, and efficiency of these applications directly impact user experience and operational costs. Therefore, a structured approach to monitoring and telemetry isn't only beneficial but critical.

Monitoring offers a real-time lens into an application's health, performance, and functionality. For generative AI, this means observing the model's accuracy, understanding user interactions, optimizing costs, and more. Telemetry provides the raw data necessary for such monitoring, encompassing everything from logs and traces to specific metrics.

This guide walks you through the essentials of monitoring generative AI applications. It offers a roadmap for capturing, analyzing, and acting on telemetry data to help your AI services run efficiently. We focus on key operational telemetry across the entire generative AI application.

Why monitor generative AI applications

Generative AI applications are reshaping how industries operate, making them invaluable assets. Without the right monitoring, even the most sophisticated generative AI application can stumble. Here's why it's paramount to keep a close watch on these systems:

  1. Ensuring model accuracy and reliability: Models evolve, and with evolution can come drifts in accuracy. Continuous monitoring ensures the outputs remain aligned with expectations and standards. Furthermore, as these models learn and adapt, monitoring helps in verifying the consistency and reliability of their predictions.

  2. Detecting anomalies and performance issues: Generative AI can occasionally produce unexpected results or behave erratically due to unforeseen data scenarios or underlying system issues. Monitoring can identify such anomalies, enabling quick mitigation.

  3. Understanding user interactions and feedback: Monitoring user interactions gives insights into how well the application meets user needs. By observing user queries, feedback, and behavior patterns, you can make iterative improvements to enhance the user experience.

  4. Validating costs and optimizing operations: Running AI models, especially at scale, can be resource-intensive. Monitoring provides visibility into resource consumption and operation costs, aiding in optimization and ensuring the most efficient use of available resources.

Basic concepts in telemetry

Telemetry is the process of collecting and transmitting data from remote sources to receiving stations for analysis. In the realm of generative AI applications, telemetry involves capturing key operational data to monitor and improve the system's performance and user experience. Here are some foundational concepts:

  1. Logs: Records of events that occur within an application. For generative AI, logs can capture information such as user input, model responses, and any errors or exceptions that arise.

  2. Traces: Traces offer a detailed path of a request as it moves through various components of a system. Tracing can be invaluable in understanding the flow of data from embeddings to chat completions, pinpointing bottlenecks, and troubleshooting issues.

  3. Metrics: These are quantitative measures that give insights into the performance, health, and costs of a system. In AI, metrics can encompass everything from request rate and error percentages to specific model evaluation measures.

Telemetry is the backbone of a well-monitored AI system, offering the insights necessary for continuous improvement. For a deeper dive into these concepts and more, check the Engineering Fundamentals for logging, tracing, and metrics.

Logging

In generative AI applications, logging plays a pivotal role in shedding light on interactions, system behavior, and overall health.

Here are some recommended logs for OpenAI services:

  1. Requests: Log request metrics, such as response times, stop reasons, and specific model parameters, to understand both the demand on and the performance of the system.

  2. Input prompts: Capturing user inputs helps developers grasp how users are engaging with the system, paving the way for potential model refinements.

  3. Model-generated responses: Logging model outputs facilitates auditing and quality checks, ensuring that the model behaves as intended.

Prompts and responses can be larger than what's appropriate for a log message. In that case, save the prompts and responses in a suitable database or storage service and log a reference ID for later retrieval and analysis, as sketched below.
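
As a minimal sketch of that pattern using Python's standard logging module: the request metadata is logged directly, while the full prompt and response go to a hypothetical store_payload helper (standing in for whatever blob store or database your team already operates) that returns a reference ID.

```python
import logging
import uuid

logger = logging.getLogger("genai.requests")

def store_payload(payload: dict) -> str:
    """Hypothetical helper: persist the full prompt/response elsewhere and return an ID."""
    reference_id = str(uuid.uuid4())
    # For example, upload the payload to blob storage or a database keyed by reference_id.
    return reference_id

def log_completion(prompt: str, response: str, model: str,
                   latency_ms: float, finish_reason: str) -> None:
    # Keep the potentially large prompt and response out of the log message itself.
    reference_id = store_payload({"prompt": prompt, "response": response})
    logger.info(
        "chat completion finished",
        extra={
            "model": model,
            "latency_ms": latency_ms,
            "finish_reason": finish_reason,
            "payload_reference_id": reference_id,  # use this ID to retrieve the full text later
        },
    )
```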

Developer teams should collect all errors and anomalies for diagnostic purposes. To manage log volume, sample informational logs or restrict them by setting log levels.

Be sure to anonymize and secure sensitive data to uphold user privacy and trust.

Tracing

In generative AI applications, tracing offers a granular, step-by-step view of a request's journey through the system. Each of these individual steps or operations is a "span." A collection of spans forms a trace that represents the complete path and lifecycle of a request.

Here are the primary spans you might typically see in AI workflows:

  1. API Call Span: This span represents the inception and duration of an API request. It provides insights into entry points, initial user intentions, and the overall time taken to process the entire request.

  2. Service Processing Span: This covers the time and operations when the request navigates through services. It's especially useful to highlight potential bottlenecks or areas in the system needing optimization.

  3. Model Inference Span: This critical span captures the actual time taken by the AI model to process the input and make a prediction or generate a response. It helps gauge the model's efficiency and performance. These spans can also be updated to capture evaluation metrics, whether user-driven or AI-driven.

  4. Data Fetching Span: Before model processing, there might be a need to fetch supplementary data from databases or other storage using embeddings or other methods of search. This span traces the duration and operation of that data retrieval and can capture accuracy metrics.

Remember to embed privacy and data-protection principles when implementing tracing, so that user data stays confidential and the system remains compliant with applicable regulations.
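
To make these spans concrete, here's a minimal sketch using the OpenTelemetry Python API. The span names, attribute keys, and placeholder retrieval and model calls are illustrative assumptions rather than a prescribed schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("genai.chat")

def handle_chat_request(user_query: str) -> str:
    # API Call Span: root span covering the entire request.
    with tracer.start_as_current_span("api.chat_request") as request_span:
        request_span.set_attribute("interaction.type", "query")

        # Data Fetching Span: retrieval of supplementary context (for example, via embeddings).
        with tracer.start_as_current_span("retrieval.fetch_context") as fetch_span:
            context_documents = ["..."]  # placeholder for an actual retrieval call
            fetch_span.set_attribute("retrieval.document_count", len(context_documents))

        # Model Inference Span: time spent waiting on the model itself.
        with tracer.start_as_current_span("llm.chat_completion") as inference_span:
            response_text = "..."  # placeholder for an actual model call
            inference_span.set_attribute("llm.finish_reason", "stop")
            inference_span.set_attribute("llm.total_tokens", 42)  # illustrative value

        return response_text
```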

Metrics

Metrics serve as quantifiable measures that shed light on various performance, health, and usage aspects of the system. In addition to tracking metrics from the perspective of the caller to a generative AI application, it is also important to track metrics for each dependency (such as an LLM or a data store). This ensures that spikes in errors or latencies seen by clients can be correlated with spikes observed in dependencies, expediting mitigation and debugging.

Here are some key metrics for generative AI applications:

  1. Request Rates (Requests Per Second): This metric provides insights into the load and demand on the system, enabling scalability planning and indicating popular usage times.

  2. Error Rates: Keeping tabs on the percentage of requests that result in errors is essential. A spike in error rates can indicate problems with the model, the infrastructure, or both.

  3. Latency Metrics: These measure the time taken to process a request. Typically, teams segment them into percentiles like P50 (median), P95, and P99 to show the range of experiences users might have. Monitoring these ensures users receive timely responses.

  4. Model-specific Metrics: Depending on the application, metrics such as BLEU score for translation quality or perplexity for language models might be essential. These offer a gauge of the model's predictive performance.

  5. Cost Metrics: Capturing costs is especially relevant when deploying models in cloud environments. Metrics such as cost per transaction or API call offer insight into operational expenses, and monitoring the number of tokens in prompts and completions matters because token counts directly affect costs.

Reviewing and acting upon these metrics facilitates proactive system tuning, ensures user satisfaction, and helps in maintaining cost-efficiency.
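
A minimal sketch of recording these measures with the OpenTelemetry metrics API follows; the instrument names are illustrative and should be aligned with your own naming conventions.

```python
from opentelemetry import metrics

meter = metrics.get_meter("genai.chat")

request_counter = meter.create_counter(
    "genai.requests", description="Number of chat completion requests")
error_counter = meter.create_counter(
    "genai.request.errors", description="Number of failed requests")
latency_histogram = meter.create_histogram(
    "genai.request.duration", unit="ms", description="End-to-end request latency")
token_histogram = meter.create_histogram(
    "genai.tokens.total", description="Prompt plus completion tokens per request")

def record_request(model: str, latency_ms: float, total_tokens: int, succeeded: bool) -> None:
    # Tagging each measurement with the model makes it easy to split dashboards per dependency.
    attributes = {"model": model}
    request_counter.add(1, attributes)
    if not succeeded:
        error_counter.add(1, attributes)
    latency_histogram.record(latency_ms, attributes)
    token_histogram.record(total_tokens, attributes)
```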

Tags

Tags, in the telemetry world, offer context to the data collected, enriching the metrics, logs, or traces with metadata that can give deeper insights or better filtering capabilities.

Here are some valuable tags for generative AI applications:

  1. Correlation IDs: These unique identifiers enable correlation of every piece of telemetry from a single user interaction, even across different components or services. These IDs are useful at both the request and session level and are invaluable for troubleshooting and understanding user journeys.

  2. Model Tags: These give context on which specific AI model version or configuration was used for a given interaction. Consider including tags such as:

    • Max Tokens: Max number of tokens specified for a response.
    • Frequency Penalty: Penalties associated with frequent output tokens.
    • Presence Penalty: Penalties related to the presence of specific output tokens.
    • Temperature: Determines the randomness of the model's response.
    • Model: Identifier for the model version or variant.
    • Prompt Template: The structure or pattern followed by the user's prompt.
    • Finish Reason: Indicates why the model finished the response.
    • Language: Language or locale information for the request.
  3. Operational Tags: These can include details about the application, infrastructure, or environment, such as App Id, Server Id, Server Location, or Deployment Version. They help pinpoint issues and explain performance variances between different deployments or regions of the application or the generative AI services it uses.

  4. User Interaction Tags: While ensuring privacy, tags that give insight into user behavior or the type of interaction can be beneficial. For example, an Interaction Type tag could be 'query', 'command', or 'feedback'.

  5. Agent/Tool Tags: Complex, multi-agent solutions are becoming more common in generative AI applications, so it is important to understand which agent or tool initiated a model call and which type of call it made (if the component can make different types of calls).

Always be sure that tagging respects user privacy regulations and best practices.
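
As an illustration, tags like these can be attached to spans as attributes. The sketch below uses the OpenTelemetry Python API; the attribute keys and values are examples only, not an established semantic convention.

```python
from opentelemetry import trace

tracer = trace.get_tracer("genai.chat")

def tag_model_call(span: trace.Span, params: dict, correlation_id: str, session_id: str) -> None:
    """Attach illustrative tags to a model-call span."""
    span.set_attribute("correlation.id", correlation_id)
    span.set_attribute("session.id", session_id)
    span.set_attribute("llm.model", params.get("model", "unknown"))
    span.set_attribute("llm.temperature", params.get("temperature", 0.0))
    span.set_attribute("llm.max_tokens", params.get("max_tokens", 0))
    span.set_attribute("llm.frequency_penalty", params.get("frequency_penalty", 0.0))
    span.set_attribute("llm.presence_penalty", params.get("presence_penalty", 0.0))
    span.set_attribute("llm.prompt_template", params.get("prompt_template", ""))
    span.set_attribute("agent.name", params.get("agent", "orchestrator"))
    span.set_attribute("app.deployment_version", "2024.06.1")  # illustrative operational tag

# Usage: tag the span created for a model call.
with tracer.start_as_current_span("llm.chat_completion") as span:
    tag_model_call(span, {"model": "gpt-4o", "temperature": 0.2}, "corr-123", "sess-456")
```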

Embedding telemetry

Telemetry for embeddings in an operational context is vital for ensuring that AI systems are providing accurate, relevant, and efficient responses to user queries in real-time. Capturing specific metrics related to embeddings can offer actionable insights into the system's behavior during live user interactions.

Here are key metrics tailored to operational telemetry for embeddings:

  1. Distance and Similarity Measures:

    • These measures provide insights into how closely user queries relate to the results fetched by the system.
    • Monitoring these measures in real-time can help identify if the system is returning highly relevant, diverse, or irrelevant content to users. For instance, consistently close embedding distances for varied user queries might indicate a lack of diversity in results.
  2. Frequency of Specific Embedding Uses:

    • By keeping tabs on which embeddings are accessed most frequently during live interactions, operators can discern current user preferences and system trends.
    • Frequent access to certain embeddings might indicate high relevance and popularity of specific content. On the flip side, rarely accessed embeddings might hint at content that isn't resonating with users or potential issues with the recommendation or search algorithm.

Incorporating telemetry for these embedding metrics in an operational setting facilitates swift adjustments, ensuring users consistently receive relevant and accurate content. Regular reviews of this telemetry also assist in fine-tuning AI systems to better align with evolving user needs and preferences.
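
A minimal sketch, assuming embeddings are available as plain float vectors: compute cosine similarity between the user query and each retrieved result, then record it as a histogram so dashboards can track relevance over time.

```python
import math
from opentelemetry import metrics

meter = metrics.get_meter("genai.embeddings")
similarity_histogram = meter.create_histogram(
    "genai.retrieval.similarity",
    description="Cosine similarity between the query and each retrieved result")

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def record_retrieval_similarity(query_embedding: list[float],
                                results: dict[str, list[float]]) -> None:
    # `results` maps a document ID to its embedding (an assumed shape for this sketch).
    for doc_id, doc_embedding in results.items():
        score = cosine_similarity(query_embedding, doc_embedding)
        similarity_histogram.record(score, {"document.id": doc_id})
```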

Special considerations for ChatCompletions

ChatCompletions, as a fundamental part of conversational AI, present unique challenges and opportunities in monitoring. These completions, being dynamic and tailored to individual user inputs, can vary widely in quality and relevance. Operational monitoring, therefore, requires specific considerations to ensure the system remains effective during live interactions.

Here are some areas of emphasis (a telemetry sketch follows the list):

  1. User Satisfaction Metrics:

    • Session Lengths: Monitoring the duration of user sessions can offer insights into engagement levels. Extended interactions may indicate user satisfaction, while abrupt session ends might hint at issues or frustrations.
    • Repeat Interactions: Tracking how often users return for multiple sessions can serve as a direct indicator of the perceived value and reliability of the chat system.
  2. Abandoned vs. Completed Interactions:

    • Keeping tabs on interactions where users drop off before receiving a response, or immediately after getting one, can help identify potential pitfalls or shortcomings in the AI's response quality or relevancy.
    • Analyzing reasons for abandonment (whether due to long response times, unsatisfactory answers, or system errors) can provide actionable insights for improvements.
  3. Context Switching Frequencies and Metrics:

    • Context is vital in conversations. Monitoring how often the AI system switches contexts within a session can offer clues about its ability to maintain topic consistency.
    • High context-switching might point to issues in the AI's understanding of user intent or its ability to support a coherent conversational flow.
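
A minimal sketch of capturing some of these signals as metrics is shown below; how a session is judged 'completed' versus 'abandoned', and how a context switch is detected, are application-specific assumptions.

```python
import time
from opentelemetry import metrics

meter = metrics.get_meter("genai.sessions")
session_duration = meter.create_histogram(
    "genai.session.duration", unit="s", description="Length of a chat session")
session_outcomes = meter.create_counter(
    "genai.session.outcomes", description="Completed vs. abandoned sessions")
context_switches = meter.create_counter(
    "genai.session.context_switches", description="Topic changes within a session")

class SessionTelemetry:
    """Tracks one chat session; create it when the session starts, close it when it ends."""

    def __init__(self) -> None:
        self._start = time.monotonic()

    def record_context_switch(self) -> None:
        context_switches.add(1)

    def close(self, completed: bool) -> None:
        # Detecting abandonment (timeouts, explicit exits, and so on) is left to the application.
        session_duration.record(time.monotonic() - self._start)
        session_outcomes.add(1, {"outcome": "completed" if completed else "abandoned"})
```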

Monitoring infrastructure and tools

Instrumenting applications with telemetry data is critical for understanding and optimizing system performance. A combination of OpenTelemetry and Azure Monitor provides a comprehensive framework for capturing, processing, and visualizing this telemetry. Here's a breakdown of the components and their functionalities (a minimal wiring sketch follows the list):

  1. OpenTelemetry

    • Client SDKs: OpenTelemetry offers client SDKs tailored for many programming languages and environments, such as C#, Java, and Python. These SDKs make it easy for developers to seamlessly integrate telemetry collection into their applications.
    • Collector: Serving as an intermediary, the OpenTelemetry collector orchestrates the telemetry data flow. It consolidates, processes, possibly redacts sensitive data, and then channels this telemetry to designated storage solutions.
  2. Azure Monitor

    • Metrics: Beyond merely storing metrics, Azure Monitor enriches them with visualization tools and alerting capabilities, ensuring teams are always cognizant of system health and performance.
    • Traces: Logs and traces ingested by Azure Monitor undergo detailed analysis, making it simpler to query and dissect the journey of requests and responses within the system. For more, see the Application Insights overview in Azure Monitor.
  3. AI Foundry

    • Flows: Prompt flow allows developers to string together a combination of tools, LLMs, prompts, and custom code into an executable workflow for building AI applications.
    • Tracing: Tracing provides the ability to capture and analyze traces end-to-end, tracking token counts, model responses, and more.
  4. LangSmith

    • LangChain Integration: LangSmith's primary claim to fame is its native integration with LangChain, which allows for monitoring of generative AI applications developed with that toolset.
    • Traces: LangSmith provides integrated tracing capabilities, which are automatically surfaced to a dashboard for easy monitoring and analysis.
    • Evaluation: LangSmith provides a suite of evaluation tools to help run both automatic and user-defined tests using datasets. These datasets can be curated, or generated via captured traces.
  5. Open-Source Tools

    • Prometheus: Renowned for its monitoring capabilities, Prometheus is an open-source system that provides insights by scrutinizing events and metrics. Its versatility allows integration with a range of platforms, including Azure.
    • Grafana: An open-source platform for monitoring and observability, Grafana meshes flawlessly with both OpenTelemetry and Azure Monitor, offering developers advanced visualization tools they can tailor to specific project needs.
    • Elasticsearch: A search and analytics engine, Elasticsearch is often chosen by teams who want a scalable search solution combined with log and event data analytics.
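
As one possible wiring of the pieces above, the sketch below assumes the azure-monitor-opentelemetry distro package is installed and an Application Insights connection string is available; once configured, telemetry emitted through the OpenTelemetry SDK flows into Azure Monitor.

```python
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Route OpenTelemetry logs, traces, and metrics to Azure Monitor / Application Insights.
configure_azure_monitor(
    connection_string="InstrumentationKey=...",  # placeholder; supply your own connection string
)

tracer = trace.get_tracer("genai.chat")

# Any spans created after configuration are exported to Azure Monitor.
with tracer.start_as_current_span("llm.chat_completion") as span:
    span.set_attribute("llm.model", "gpt-4o")  # illustrative attribute
```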

Data analysis and insights

Effective monitoring is just the first step. Extracting insights from the deluge of data is what drives meaningful improvements. Here's a brief overview of how to harness this data:

  1. Analyze Across Telemetry Types: Dive into logs, traces, and metrics to discern patterns and irregularities. This analysis paves the way for holistic system insights and decision-making.

  2. Use Dashboards to Spot Trends: Dashboards can be a powerful tool for answering operational questions at a glance. For example, are all model calls for a given purpose similar or do some cause spikes in latency, token counts, or failures? Which agents/tools are the most expensive and which are the cheapest? How are dependencies performing over time? Do bursts in traffic result in degraded behavior? In addition to surfacing the negative aspects, consider dashboards that provide visibility into nominal measures like concurrent conversation counts and min/mean/max turns per conversation.

  3. Automated Alerting: Set up automated alerts that notify the team of anomalies or potential issues, ensuring rapid response and mitigation.

  4. Correlate Metrics: Correlating disparate metrics can unveil deeper insights, spotlighting areas for enhancement that the team might have otherwise overlooked.

  5. Telemetry-driven Feedback Loop: By understanding how models interact with live data and user queries, data scientists and developers can enhance accuracy and user experience.

Conclusion

Monitoring Generative AI applications isn't just about system health. It's a gateway to refinement, understanding, and evolution. By embedding telemetry and leveraging modern tools, teams can illuminate the intricate workings of their AI systems. This insight, when acted upon, results in applications that aren't only robust and efficient but also aligned with user needs. Embrace these practices to be sure your AI applications are always at the forefront of delivering exceptional value.