
Add data evaluation results to a dashboard for comparison.

Kriti Kumari 40 Reputation points Microsoft Employee
2026-03-09T09:20:42.2333333+00:00
Foundry Tools

Formerly known as Azure AI Services or Azure Cognitive Services, Foundry Tools is a unified collection of prebuilt AI capabilities within the Microsoft Foundry platform.


Answer accepted by question author
  1. Anshika Varshney 9,740 Reputation points Microsoft External Staff Moderator
    2026-03-11T17:36:34.2666667+00:00

    Hi Kriti Kumari,

    At the moment, the Foundry dashboard does not support a single consolidated view where you can add and compare evaluation results from multiple datasets or multiple evaluation runs in one persistent dashboard, like the external comparison site you shared.

    What Foundry supports today is comparison at the evaluation run level, not a custom dashboard view.

    To compare different models or datasets, you need to run evaluations separately for each dataset and model combination. After that, open the Evaluation details page in the Foundry portal and select multiple evaluation runs. There is a built-in Compare option that shows the results side by side. This comparison helps you see which model or dataset performed better and highlights improvements or regressions based on statistical significance. The comparison view works well for analysis, but it is temporary and not saved as a reusable dashboard.
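    If you prefer to script these per-combination runs rather than configure them in the portal, a minimal sketch is shown below, assuming the azure-ai-evaluation Python package; the dataset files, evaluators, deployment, and project identifiers are placeholders to replace with your own.

    ```python
    # pip install azure-ai-evaluation
    from azure.ai.evaluation import evaluate, GroundednessEvaluator, RelevanceEvaluator

    # Model configuration used by the AI-assisted evaluators (placeholder values).
    model_config = {
        "azure_endpoint": "https://<your-resource>.openai.azure.com",
        "api_key": "<your-api-key>",
        "azure_deployment": "<your-deployment>",
    }

    # Project details so the runs are logged and visible in the portal (placeholders).
    azure_ai_project = {
        "subscription_id": "<subscription-id>",
        "resource_group_name": "<resource-group>",
        "project_name": "<project-name>",
    }

    # One evaluation run per model/dataset combination; each run then becomes
    # selectable in the portal's Compare view.
    for run_name, data_path in [
        ("model-a-dataset-1", "model_a_dataset_1.jsonl"),
        ("model-b-dataset-1", "model_b_dataset_1.jsonl"),
    ]:
        evaluate(
            data=data_path,  # JSONL with query/response/ground_truth columns
            evaluators={
                "relevance": RelevanceEvaluator(model_config),
                "groundedness": GroundednessEvaluator(model_config),
            },
            evaluation_name=run_name,           # run name shown in the portal
            azure_ai_project=azure_ai_project,  # omit to keep results local only
            output_path=f"{run_name}_results.json",  # local copy of the results
        )
    ```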

    If you want deeper insight, you can open each evaluation run and review the aggregated metrics and row level results. This shows the prompt, response, ground truth, and evaluator scores for each record, which is useful when you want to understand why one dataset or model performed better than another.

    For agent-based scenarios, where you want to compare performance over time rather than across offline datasets, the Monitor section of the agent gives you charts and metrics based on live traffic. This is meant for observing trends and behavior, not for direct dataset-to-dataset comparison.

    So, in short, comparison is possible today through the evaluation Compare feature, but adding multiple evaluation results to a single custom dashboard view is not yet available in Foundry. If your use case needs long-term tracking or custom visuals, many users export evaluation results and build their own comparisons outside the portal.
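    As a rough illustration of that export-and-compare approach, the sketch below loads locally saved result files (for example, the output_path files written by the evaluation SDK) and lines the aggregate metrics up in one table; the file names and the "metrics" key are assumptions about the result layout, so adjust them to match your exports.

    ```python
    import json

    import pandas as pd

    # Placeholder result files saved locally for each evaluation run.
    runs = {
        "model-a": "model_a_results.json",
        "model-b": "model_b_results.json",
    }

    rows = []
    for run_name, path in runs.items():
        with open(path) as f:
            result = json.load(f)
        # Aggregate scores are assumed to sit under a "metrics" key; adjust as needed.
        rows.append({"run": run_name, **result.get("metrics", {})})

    # One row per run, one column per metric; export for your own dashboard tooling.
    comparison = pd.DataFrame(rows).set_index("run")
    print(comparison)
    comparison.to_csv("evaluation_comparison.csv")
    ```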

    Official documentation that explains these capabilities can be found here:

    • Self help for Evaluation setup & best practices:

    https://learn.microsoft.com/azure/ai-foundry/concepts/observability?view=foundry-classic

    • See evaluation results in Foundry portal & troubleshooting:

    https://learn.microsoft.com/azure/ai-foundry/how-to/evaluate-results?view=foundry

    • Compare the evaluation results (stat-sig side-by-side):

    https://learn.microsoft.com/azure/foundry/how-to/evaluate-results?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider#compare-the-evaluation-results

    • Foundry Project REST API reference (evaluation_comparison): https://learn.microsoft.com/azure/foundry/reference/foundry-project-rest-preview#components

    • Available tools / example prompts for Foundry MCP Server:

    https://learn.microsoft.com/azure/foundry/mcp/available-tools?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider#evaluation-operations

    Hope this clarifies the current behavior and available options.

    Thank you!


Answer accepted by question author
  1. Q&A Assist
    2026-03-09T09:20:58.4766667+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    Azure services provide several built-in ways to compare evaluation results and visualize them on dashboards, but they differ from the external sample site.

    To compare evaluation results for different models and datasets and surface them in a dashboard, use these options:

    1. Use Microsoft Foundry evaluation comparison
      For comparing evaluation runs (for example, different models or datasets) in the Foundry portal:
      1. Run evaluations for each model/dataset combination.
      2. Open the Evaluation details page for the evaluation.
      3. Select two or more runs in the evaluation detail page.
      4. Select Compare.
        This generates a side‑by‑side comparison view for all selected runs using statistical t‑testing. You can:
        • Set a baseline run and see how other runs deviate from it.
        • Use the color‑coded stat‑sig legend (ImprovedStrong, ImprovedWeak, DegradedStrong, etc.) to understand which model/dataset combinations are significantly better or worse.
          Note: The comparison view is not persisted; if you leave the page, reselect the runs and choose Compare again.
    2. Inspect evaluation results and metrics in Foundry
      To understand and compare results more deeply:
      1. On the Evaluation details page, review aggregate scores for each evaluator per run.
      2. Select a specific run to drill into row‑level results (query, response, ground truth, evaluator scores and explanations).
      3. Use Learn more about metrics to understand how each metric is calculated and what it represents when comparing models.
    3. Use the Agent Monitoring Dashboard for continuous evaluation (agents)
      If comparing agents or models under live traffic rather than offline datasets:
      1. Generate traffic to the agent (run the app or test in the portal).
      2. In the Foundry portal, open the agent and select Monitor.
      3. Review the evaluation‑related charts for the selected time range.
        Successful setup shows evaluation scores over time and a list of evaluation runs with status Completed.
        For programmatic access, list recent evaluation runs and open the report_url from the SDK (for example, using openai_client.evals.runs.list(...) in Python), then use those reports as the basis for manual or custom dashboard comparisons; a minimal sketch of this appears after this list.
    4. Build custom dashboards for training/evaluation metrics in Azure Machine Learning
      If evaluation metrics are logged as part of Azure Machine Learning jobs (for example, via MLflow or the Azure ML SDK), use the Azure Machine Learning studio dashboard (a minimal logging sketch appears after this list):
      1. Open the job dashboard view in Azure Machine Learning studio.
      2. Use Visualize training jobs and Add charts to plot metrics (accuracy, latency, cost, etc.) across runs. Supported chart types include line, bar, scatter, and parallel coordinates.
      3. Use Compare Tile to compare logged metrics, parameters, and tags between jobs side‑by‑side. A baseline run is set by default; change it by hovering over the display name and selecting the baseline icon.
      4. Enable Show differences only to surface only rows where metrics/parameters differ, making it easier to see which model/dataset combinations perform best.
      5. Optionally, use Resource metrics to monitor resource usage across jobs on a 30‑day rolling basis.
    5. Use Azure Monitor dashboards for Azure OpenAI resource‑level metrics
      For high‑level monitoring (throughput, latency, token usage) across models deployed in Azure OpenAI:
      1. In the Azure portal, open the Azure OpenAI resource and its Overview pane.
      2. Use the built‑in dashboards:
        • Metrics dashboard in the AI Foundry Azure OpenAI resource view.
        • Overview dashboard in the Azure portal.
      3. Dashboards are grouped into: HTTP Requests, Tokens‑Based Usage, PTU Utilization, and Fine‑tuning.
        These dashboards allow comparison of operational metrics across deployments/models, which can complement evaluation‑score comparisons from Foundry or Azure ML.
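    Building on step 3 above, a minimal sketch of that programmatic access might look like the following; it assumes an Azure OpenAI client that exposes the evals API, and the endpoint, API version, and eval ID are placeholders.

    ```python
    # pip install openai
    from openai import AzureOpenAI

    # Placeholder connection details; in practice, build the client the way your
    # Foundry project documentation describes.
    openai_client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<your-api-key>",
        api_version="<api-version>",
    )

    EVAL_ID = "<your-eval-id>"  # placeholder: the evaluation whose runs you want to review

    # List recent runs for the evaluation and collect their report URLs, which can
    # then be opened manually or fed into a custom dashboard.
    for run in openai_client.evals.runs.list(eval_id=EVAL_ID):
        print(run.id, run.status, getattr(run, "report_url", None))
    ```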
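    Similarly, for step 4, a minimal MLflow logging sketch is shown below; it assumes the MLflow tracking URI already points at your Azure Machine Learning workspace, and the experiment name, run names, and metric values are illustrative only.

    ```python
    # pip install mlflow azureml-mlflow
    import mlflow

    mlflow.set_experiment("model-dataset-comparison")  # placeholder experiment name

    # Illustrative metrics for two model/dataset combinations; once logged, they can
    # be plotted with Add charts and compared with the Compare tile in studio.
    example_runs = {
        "model-a-dataset-1": {"accuracy": 0.91, "latency_ms": 420.0},
        "model-b-dataset-1": {"accuracy": 0.88, "latency_ms": 310.0},
    }

    for run_name, metrics in example_runs.items():
        with mlflow.start_run(run_name=run_name):
            mlflow.log_param("model", run_name.split("-dataset")[0])
            for metric_name, value in metrics.items():
                mlflow.log_metric(metric_name, value)
    ```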

    To approximate the external sample visualization, combine:

    • Foundry evaluation comparison (for quality metrics across models/datasets), and
    • Azure ML dashboards or Azure Monitor metrics charts (for latency, throughput, and usage), using multiple charts and compare tiles to visualize differences across models.
