Model monitoring for generative AI applications (preview)
Monitoring models in production is an essential part of the AI lifecycle. Changes in data and consumer behavior can influence your generative AI application over time, resulting in outdated systems that negatively affect business outcomes and expose organizations to compliance, economic, and reputational risks.
Important
Model monitoring for generative AI applications is currently in public preview. These previews are provided without a service-level agreement, and are not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Azure Machine Learning model monitoring for generative AI applications makes it easier for you to monitor your LLM applications in production for safety and quality on a recurring cadence, helping ensure that they deliver maximum business impact. Monitoring ultimately helps maintain the quality and safety of your generative AI applications. Capabilities and integrations include:
- Collection of production data by using the Model Data Collector.
- Responsible AI evaluation metrics such as groundedness, coherence, fluency, relevance, and similarity, which are interoperable with Azure Machine Learning prompt flow evaluation metrics.
- The ability to configure alerts for violations based on organizational targets and to run monitoring on a recurring basis.
- Consumption of results in a rich dashboard within a workspace in Azure Machine Learning studio.
- Integration with Azure Machine Learning prompt flow evaluation metrics, analysis of collected production data to provide timely alerts, and visualization of the metrics over time.
For overall model monitoring concepts, see Model monitoring with Azure Machine Learning (preview). In this article, you learn how to monitor a generative AI application backed by a managed online endpoint.
Evaluation metrics
Metrics are generated by the following state-of-the-art GPT language models, configured with specific evaluation instructions (prompt templates), which act as evaluator models for sequence-to-sequence tasks. This technique has shown strong empirical results and high correlation with human judgment when compared with standard generative AI evaluation metrics. For more information about prompt flow evaluation, see Submit bulk test and evaluate a flow (preview). A minimal sketch of this evaluator pattern appears after the metric descriptions below.
The following GPT models are supported and are configured as your Azure OpenAI resource:
- GPT-3.5 Turbo
- GPT-4
- GPT-4-32k
The following metrics are supported. For more detailed information about each metric, see Monitoring evaluation metrics descriptions and use cases.
- Groundedness: evaluates how well the model's generated answers align with information from the input source.
- Relevance: evaluates the extent to which the model's generated responses are pertinent and directly related to the given questions.
- Coherence: evaluates how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language.
- Fluency: evaluates the language proficiency of a generative AI's predicted answer. It assesses how well the generated text adheres to grammatical rules, syntactic structures, and appropriate usage of vocabulary, resulting in linguistically correct and natural-sounding responses.
- Similarity: evaluates the similarity between a ground truth sentence (or document) and the prediction sentence generated by an AI model.
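To make the evaluator pattern concrete, the following is a minimal, hypothetical sketch of scoring a single response for groundedness by sending an evaluation prompt template to an Azure OpenAI chat deployment with the openai Python package. The endpoint, key, deployment name, template wording, and 1-5 scale are illustrative assumptions; in practice, the monitoring service runs these evaluations for you.

```python
# Hedged sketch, not the monitoring service's implementation: score one response
# for groundedness by sending an evaluation prompt template to an Azure OpenAI
# chat deployment. The endpoint, key, deployment name, template wording, and
# 1-5 scale are illustrative assumptions.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com",  # assumption
    api_key="<your-api-key>",                                        # assumption
    api_version="2023-03-15-preview",
)

GROUNDEDNESS_TEMPLATE = (
    "You are an evaluator. Rate from 1 (not grounded) to 5 (fully grounded) how "
    "well the answer is supported by the context.\n"
    "Context: {context}\nQuestion: {prompt}\nAnswer: {completion}\n"
    "Return only the integer score."
)

def score_groundedness(prompt: str, completion: str, context: str) -> int:
    """Ask the evaluator deployment for a 1-5 groundedness score."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumption: the name of your evaluator deployment
        temperature=0,
        messages=[{
            "role": "user",
            "content": GROUNDEDNESS_TEMPLATE.format(
                context=context, prompt=prompt, completion=completion),
        }],
    )
    return int(response.choices[0].message.content.strip())
```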
Metric configuration requirements
The following inputs (data column names) are required to measure generation safety & quality:
- prompt text - the original prompt given (also known as "inputs" or "question")
- completion text - the final completion from the API call that is returned (also known as "outputs" or "answer")
- context text - any context data sent to the API call together with the original prompt. For example, if you want to get search results only from certain certified information sources or websites, you can define those sources in the evaluation steps. This is an optional input that can be configured through prompt flow.
- ground truth text - the user-defined text as the "source of truth" (optional)
The parameters configured in your data asset dictate which metrics you can produce, according to the following table. (A sample production record is sketched after the table.)
| Metric | Prompt | Completion | Context | Ground truth |
|---|---|---|---|---|
| Coherence | Required | Required | - | - |
| Fluency | Required | Required | - | - |
| Groundedness | Required | Required | Required | - |
| Relevance | Required | Required | Required | - |
| Similarity | Required | Required | - | Required |
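As an illustration of these requirements, a single collected production record might look like the following sketch. All values, including the correlationid join key, are invented; your actual schema comes from your flow and the Model Data Collector.

```python
# Hedged illustration: one production record containing the column names used in
# this article. All values, including the correlationid join key, are invented;
# your actual schema comes from your flow and the Model Data Collector.
import json

record = {
    "correlationid": "<guid>",  # default join key between inputs and outputs
    "prompt": "What is the return policy?",
    "completion": "Items can be returned within 30 days.",
    "context": "Returns are accepted within 30 days with a receipt.",  # optional
    "ground_truth": "30-day return window with proof of purchase",     # optional
}

print(json.dumps(record))
```

Because this record carries prompt, completion, context, and ground truth, it satisfies the column requirements for all five metrics in the table above.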
Prerequisites
- Azure OpenAI resource: You must have an Azure OpenAI resource created with sufficient quota. This resource is used as your evaluation endpoint.
- Managed identity: Create a user-assigned managed identity (UAI) and attach it to your workspace by using the guidance in Attach user assigned managed identity using CLI v2, with sufficient role access as defined in the next step.
- Role access: To assign a role with the required permissions, you must have the Owner role or the Microsoft.Authorization/roleAssignments/write permission on your resource. Updating connections and permissions might take several minutes to take effect. These additional roles must be assigned to your UAI:
- Resource: Workspace
- Role: Azure Machine Learning Data Scientist
- Workspace connection: Following this guidance, you use a managed identity that represents the credentials to the Azure OpenAI endpoint used to calculate the monitoring metrics. Don't delete the connection after it's used in the flow.
- API version: 2023-03-15-preview
- Prompt flow deployment: Create a prompt flow runtime by following this guidance, run your flow, and ensure that your deployment is configured by using this article as a guide.
- Flow inputs & outputs: You need to name your flow outputs appropriately and remember these column names when creating your monitor. In this article, we use the following:
- Inputs (required): "prompt"
- Outputs (required): "completion"
- Outputs (optional): "context" | "ground truth"
- Data collection: In the Deployment step (step 2 of the prompt flow deployment wizard), enable the 'inference data collection' toggle, which uses the Model Data Collector.
- Outputs: In the Outputs step (step 3 of the prompt flow deployment wizard), confirm that you selected the required outputs listed above (for example, completion | context | ground_truth) that meet your metric configuration requirements.
Note
If your compute instance is behind a VNet, see Network isolation in prompt flow.
Create your monitor
Create your monitor in the Monitoring overview page
Configure basic monitoring settings
In the monitoring creation wizard, change model task type to prompt & completion, as shown by (A) in the screenshot.
Configure data asset
If you have used Model Data Collector, select your two data assets (inputs & outputs).
Select monitoring signals
- Configure workspace connection (A) in the screenshot.
- Enter your Azure OpenAI evaluator deployment name (B).
- (Optional) Join your production data inputs & outputs: your production model inputs and outputs are automatically joined by the monitoring service (C). You can customize the join if needed, but no action is required. By default, the join column is correlationid. (A sketch of this join appears after this list.)
- (Optional) Configure metric thresholds: the acceptable per-instance score is fixed at 3/5. You can adjust the acceptable overall percentage passing rate in the range 1% to 99%.
- Manually enter the column names from your prompt flow (E). The standard names are ("prompt" | "completion" | "context" | "ground_truth"), but you can configure them according to your data asset.
- (Optional) Set the sampling rate (F).
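For intuition, the following is a hedged pandas sketch of what the automatic join on correlationid looks like. The data, column values, and asset contents are invented for illustration; this isn't the monitoring service's actual implementation.

```python
# Hedged sketch of the automatic join the monitoring service performs: production
# inputs and outputs matched on correlationid. The data and column values are
# invented for illustration only.
import pandas as pd

inputs = pd.DataFrame([
    {"correlationid": "abc-123", "prompt": "What is the return policy?"},
])
outputs = pd.DataFrame([
    {"correlationid": "abc-123", "completion": "Items can be returned within 30 days."},
])

joined = inputs.merge(outputs, on="correlationid", how="inner")
print(joined)
```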
Configure notifications
No action is required. You can configure more recipients if needed.
Confirm monitoring signal configuration
When successfully configured, your monitor should look like this:
Confirm monitoring status
If successfully configured, your monitoring pipeline job shows the following:
Consume results
Monitor overview page
The monitor overview page summarizes your signal performance. You can open a signal's details page for more information.
Signal details page
The signal details page allows you to view metrics over time (A) and view histograms of distribution (B).
Resolve alerts
You can only adjust signal thresholds. The acceptable per-instance score is fixed at 3/5, and only the 'acceptable overall % passing rate' field can be adjusted.
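To illustrate how the fixed per-instance score and the adjustable passing rate interact, here is a small hypothetical sketch of the alerting logic. The function name, example scores, and target value are assumptions for illustration only.

```python
# Hedged illustration of the thresholding described above: each instance receives
# a 1-5 evaluator score, an instance passes at 3 or above (fixed), and an alert
# fires when the overall passing rate drops below your configured target.
# Function and variable names are assumptions, not the service's code.
def passing_rate(scores: list[int], per_instance_threshold: int = 3) -> float:
    """Return the percentage of instances scoring at or above the fixed threshold."""
    passed = sum(1 for score in scores if score >= per_instance_threshold)
    return 100.0 * passed / len(scores)

scores = [5, 4, 2, 3, 1, 5]     # example evaluator scores for one signal
target_passing_rate = 70.0       # the configurable "acceptable overall % passing rate"
rate = passing_rate(scores)
print(f"passing rate = {rate:.1f}% -> alert: {rate < target_passing_rate}")
```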