Monitor quality and safety of deployed prompt flow applications

Monitoring models deployed in production is an essential part of the generative AI application lifecycle. Changes in data and consumer behavior can influence your application over time, resulting in outdated systems that negatively affect business outcomes and expose organizations to compliance, economic, and reputational risks.

Azure AI model monitoring for generative AI applications makes it easier to monitor your applications in production for safety and quality on a regular cadence, so you can ensure they continue to deliver maximum business value.

Capabilities and integrations for monitoring a prompt flow deployment include:

  • Collect production data by using the model data collector (see the sketch after this list).
  • Apply Responsible AI evaluation metrics such as groundedness, coherence, fluency, relevance, and similarity, which are interoperable with prompt flow evaluation metrics.
  • Use preconfigured alerts and defaults to run monitoring on a recurring basis.
  • Consume results and configure advanced behavior in Azure AI Studio.
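
If you deploy your flow with the Azure Machine Learning Python SDK rather than the studio wizard, the model data collector mentioned above can be enabled through the deployment's data collector settings. The following is only a minimal sketch: the endpoint, deployment, and model names are placeholders, and other required deployment settings (such as the environment) are omitted.

```python
# Minimal sketch (not a complete deployment): enabling the model data collector
# on a managed online deployment with the Azure ML Python SDK (azure-ai-ml).
# Endpoint, deployment, and model names below are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    DataCollector,
    DeploymentCollection,
    ManagedOnlineDeployment,
)
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-or-project-name>",
)

# Collect both inputs (prompt, context, ...) and outputs (completion) so the
# monitor can compute quality metrics over them later.
data_collector = DataCollector(
    collections={
        "model_inputs": DeploymentCollection(enabled="true"),
        "model_outputs": DeploymentCollection(enabled="true"),
    },
    sampling_rate=1.0,  # collect every request; the monitor samples separately
)

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-flow-endpoint",        # placeholder
    model="azureml:my-prompt-flow-model:1",  # placeholder model reference
    instance_type="Standard_DS3_v2",
    instance_count=1,
    data_collector=data_collector,
    # environment, request settings, and other options omitted for brevity
)

ml_client.online_deployments.begin_create_or_update(deployment).result()
```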

Set up monitoring for prompt flow

Follow these steps to set up monitoring for your prompt flow deployment:

  1. Confirm your flow runs successfully, and that the required inputs and outputs are configured for the metrics you want to assess. Collecting only the minimum required parameters (inputs and outputs) provides only two metrics: coherence and fluency. You must configure your flow according to the flow and metric configuration requirements.

    Screenshot of prompt flow editor with deploy button.

  2. Deploy your flow. By default, both inferencing data collection and Application Insights are enabled automatically. These settings are required to create your monitor.

    Screenshot of basic settings in the deployment wizard.

  3. By default, all outputs of your deployment are collected using Azure AI's Model Data Collector. As an optional step, you can enter the advanced settings to confirm that your desired columns (for example, context or ground truth) are included in the endpoint response.

    Your deployed flow needs to be configured in the following way:

    • Flow inputs & outputs: You need to name your flow outputs appropriately and remember these column names when creating your monitor. In this article, we use the following settings:

      • Inputs (required): "prompt"
      • Outputs (required): "completion"
      • Outputs (optional): "context" and/or "ground truth"
    • Data collection: The inferencing data collection toggle must be enabled by using the Model Data Collector.

    • Outputs: In the prompt flow deployment wizard, confirm that the outputs that meet your metric configuration requirements (such as completion, context, and ground_truth) are selected.

  4. Test your deployment in the deployment Test tab.

    Screenshot of the deployment test page.

    Note

    Monitoring requires the endpoint to be used at least 10 times to collect enough data to provide insights. If you'd like to test sooner, manually send about 50 rows in the Test tab before running the monitor, or score the endpoint programmatically as shown in the sketch after this procedure.

  5. Create your monitor by enabling it either from the deployment details page or from the Monitoring tab.

    Screenshot of the button to enable monitoring.

  6. Ensure your columns are mapped from your flow as defined in the previous requirements.

    Screenshot of columns mapped for monitoring metrics.

  7. View your monitor in the Monitor tab.

    Screenshot of the monitoring result metrics.
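
If you'd rather generate the test traffic mentioned in step 4 programmatically instead of through the Test tab, you can score the endpoint over a small batch of prompts. This is a minimal sketch that assumes the input and output names from step 3 ("prompt" and "completion"); the scoring URL and key are placeholders that you can find on your deployment's Consume tab.

```python
# Minimal sketch: send a batch of test requests to the deployed flow so the
# model data collector has enough rows for the monitor to work with.
# The scoring URL, key, and the "prompt" input name are placeholders.
import requests

scoring_url = "https://<your-endpoint>.<region>.inference.ml.azure.com/score"  # placeholder
api_key = "<endpoint-key>"                                                      # placeholder

test_prompts = [
    "What is the capital of France?",
    "Summarize the benefits of model monitoring.",
    # ... add enough prompts (for example, ~50) to satisfy the note in step 4
]

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}

for prompt in test_prompts:
    # The request body keys must match your flow's input names ("prompt" here).
    response = requests.post(scoring_url, headers=headers, json={"prompt": prompt})
    response.raise_for_status()
    # The response includes "completion" plus any optional columns from step 3.
    print(response.json())
```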

By default, operational metrics such as requests per minute and request latency appear. The default safety and quality monitoring signal is configured with a 10% sample rate and runs on your default workspace Azure OpenAI connection.

Your monitor is created with default settings:

  • 10% sample rate
  • Metric thresholds of 4/5
  • Weekly recurrence on Monday mornings
  • Alerts are delivered to the inbox of the person who created the monitor.

To view more details about your monitoring metrics, follow the link to monitoring in Azure Machine Learning studio, a separate studio that allows for more customization.

Evaluation metrics

Metrics are generated by state-of-the-art GPT language models that are configured with specific evaluation instructions (prompt templates) and act as evaluator models for sequence-to-sequence tasks. This technique shows strong empirical results and high correlation with human judgment when compared to standard generative AI evaluation metrics. For more information about prompt flow evaluation, see Submit bulk test and evaluate a flow and Evaluation and monitoring metrics for generative AI.

The following GPT models are supported for monitoring and are configured through your Azure OpenAI resource:

  • GPT-3.5 Turbo
  • GPT-4
  • GPT-4-32k

The following metrics are supported for monitoring:

| Metric | Description |
| --- | --- |
| Groundedness | Measures how well the model's generated answers align with information from the source data (user-defined context). |
| Relevance | Measures the extent to which the model's generated responses are pertinent and directly related to the given questions. |
| Coherence | Measures the extent to which the model's generated responses are logically consistent and connected. |
| Fluency | Measures the grammatical proficiency of a generative AI's predicted answer. |
| Similarity | Measures the similarity between a source data (ground truth) sentence and the generated response by an AI model. |
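
To make the evaluator-model approach described above concrete, the sketch below asks an Azure OpenAI GPT deployment to rate groundedness on a 1-5 scale with a simple prompt template. The template wording is illustrative only and isn't the exact template the monitoring service uses; the endpoint, API key, and deployment name are placeholders.

```python
# Illustrative sketch of the evaluator-model idea: ask a GPT deployment to rate
# groundedness on a 1-5 scale using a prompt template. Not the exact template
# used by Azure AI monitoring; endpoint, key, and deployment name are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com",  # placeholder
    api_key="<api-key>",                                              # placeholder
    api_version="2024-02-01",
)

GROUNDEDNESS_TEMPLATE = """You are an evaluator. Rate how well the ANSWER is
supported by the CONTEXT on a scale of 1 (not grounded) to 5 (fully grounded).
Reply with a single integer.

CONTEXT: {context}
QUESTION: {prompt}
ANSWER: {completion}
"""

def score_groundedness(prompt: str, completion: str, context: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4",  # the name of your Azure OpenAI deployment (placeholder)
        messages=[{
            "role": "user",
            "content": GROUNDEDNESS_TEMPLATE.format(
                context=context, prompt=prompt, completion=completion
            ),
        }],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

print(score_groundedness(
    prompt="When was the workspace created?",
    completion="The workspace was created in 2023.",
    context="The workspace was provisioned on 4 May 2023.",
))
```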

Flow and metric configuration requirements

When creating your flow, you need to ensure your column names are mapped. The following input data column names are used to measure generation safety and quality:

| Input column name | Definition | Required |
| --- | --- | --- |
| Prompt text | The original prompt given (also known as "inputs" or "question") | Required |
| Completion text | The final completion that is returned from the API call (also known as "outputs" or "answer") | Required |
| Context text | Any context data sent to the API call together with the original prompt. For example, if you want answers drawn only from certain certified information sources or websites, you can define those sources in the evaluation steps. | Optional |
| Ground truth text | The user-defined text that serves as the "source of truth" | Optional |

The parameters configured in your data asset determine which metrics you can produce, according to this table:

| Metric | Prompt | Completion | Context | Ground truth |
| --- | --- | --- | --- | --- |
| Coherence | Required | Required | - | - |
| Fluency | Required | Required | - | - |
| Groundedness | Required | Required | Required | - |
| Relevance | Required | Required | Required | - |
| Similarity | Required | Required | - | Required |
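
As a quick illustration of the table above, the hypothetical helper below (not part of any Azure SDK) reports which metrics a given set of collected columns can support.

```python
# Illustrative helper (not part of any Azure SDK): given the columns your flow
# collects, list which monitoring metrics from the table above can be computed.
METRIC_REQUIREMENTS = {
    "coherence":    {"prompt", "completion"},
    "fluency":      {"prompt", "completion"},
    "groundedness": {"prompt", "completion", "context"},
    "relevance":    {"prompt", "completion", "context"},
    "similarity":   {"prompt", "completion", "ground_truth"},
}

def supported_metrics(available_columns: set) -> list:
    return [
        metric
        for metric, required in METRIC_REQUIREMENTS.items()
        if required <= available_columns
    ]

# Collecting only the required prompt and completion yields coherence and
# fluency; adding "context" unlocks groundedness and relevance as well.
print(supported_metrics({"prompt", "completion"}))
print(supported_metrics({"prompt", "completion", "context"}))
```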

For more information, see question answering metric requirements.

Next steps