Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Note
This feature is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Use the Voice Live evaluation harness to measure the quality of voice agents built on the Voice Live API. The harness sends pre-recorded audio through Voice Live, collects transcriptions and responses, and then scores the results with Microsoft Foundry built-in evaluators.
How it works:
Audio Dataset → Voice Live API → Transcript + Response → Foundry Evaluators → Quality Scores
The harness supports two modes:
- Python CLI (code-first) — Run evaluations locally with full control over configuration. The CLI sends audio through Voice Live, generates responses, and submits them to Foundry for scoring.
- Foundry portal — After the CLI generates evaluation output, you can view and compare results in the Microsoft Foundry portal.
Note
The evaluation harness measures conversational quality, such as task completion, intent resolution, and groundedness. It doesn't measure speech-specific metrics like Mean Opinion Score (MOS) or Word Error Rate (WER). It also isn't designed for competitive benchmarking.
For the full source code and developer reference, see the Voice Live evaluation harness on GitHub.
Why evaluate your Voice Live agent?
Voice Live supports multiple models, voice activity detection (VAD) modes, voices, and session configurations. Evaluation helps you make data-driven decisions about which combination works best for your scenario. Common reasons to evaluate include:
- Compare configurations. Measure how different models (such as
gpt-realtimeversusgpt-5), VAD types, or voice settings affect response quality for your specific use case. - Catch regressions. Run evaluations before and after configuration changes to confirm that updates improve quality and don't introduce new problems.
- Identify weak points. Find where your agent struggles, whether it's intent recognition, task completion, tool calling, or response accuracy, so you can improve prompts, tool definitions, or session settings.
- Validate before deployment. Establish a quality baseline across a representative dataset before you move a configuration to production.
Prerequisites
- A Microsoft Foundry project in a supported region. For current availability, see Evaluation regions and limits. Confirmed regions: Sweden Central and East US 2.
- A Foundry resource or Azure Speech in Foundry Tools Services resource that supports Voice Live.
- The Cognitive Services User role on the Voice Live resource for API access. The Azure AI User role on the Foundry project for evaluation. For more information, see Role-based access control for Microsoft Foundry.
- An Azure OpenAI connection with a deployed GPT model that supports chat completion, such as
gpt-5-mini. Voice Live uses this connection for the AI-assisted quality evaluators that score your agent's responses. - Python 3.10 or later.
- Azure CLI with an authenticated session (
az login) for Microsoft Entra authentication (recommended). You can also use an API key. - A test dataset with audio files and ground truth answers. For more information, see Prepare your dataset.
Prepare your dataset
Your dataset is a JSON Lines (JSONL) file where each line represents one evaluation turn. Each turn includes an audio file and optional metadata for scoring.
Required and recommended fields
| Field | Required? | Description |
|---|---|---|
WavPath or input_audio |
Yes | Path to a WAV audio file (16-bit PCM or float32, any sample rate). The harness resamples automatically. You can also use inline base64 audio data. Paths are relative to the JSONL file location. Use clear speech with minimal background noise. |
Answer or expected_output |
Strongly recommended | The expected ground truth answer. Without this field, factual evaluators like groundedness and relevance have no baseline for comparison. |
Question |
Recommended | The text of the user's question. Used for logging and evaluation context. |
system_prompt |
Recommended | System instructions for the Voice Live session. These instructions must match the prompt your agent uses in production. |
conversationID |
For multi-turn | Groups turns into multi-turn conversations. The harness processes all turns with the same ID sequentially in one session. |
tool_definitions |
For tool calling | Function and tool definitions to register with the Voice Live session. Required if your scenario uses function calling. |
Example dataset
{"WavPath": "audio/greeting.wav", "Answer": "Hello! How can I help you today?", "Question": "Hi there", "conversationID": "conv-001", "system_prompt": "You are a helpful customer support agent."}
{"WavPath": "audio/order_status.wav", "Answer": "Your order #12345 shipped yesterday and should arrive by Friday.", "Question": "Where is my order?", "conversationID": "conv-001"}
{"WavPath": "audio/weather.wav", "Answer": "The weather in Seattle is currently 65°F and partly cloudy.", "Question": "What's the weather like?", "conversationID": "conv-002", "system_prompt": "You are a weather assistant.", "tool_definitions": [{"type": "function", "name": "get_weather", "description": "Get current weather", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}}}]}
Tip
Dataset quality has a strong effect on evaluation scores. Datasets with specific questions, matching ground truth, and aligned system prompts typically produce higher scores. Datasets with open-ended questions and no ground truth typically produce lower scores, even with the same Voice Live configuration.
Tip
The evaluation harness repository includes a helper script to download audio datasets from HuggingFace as evaluation-ready JSONL files. For details, see the helper scripts documentation.
Multi-turn conversations
To create a multi-turn conversation, group related turns by using the same conversationID. The harness processes all turns in a conversation sequentially within a single Voice Live session and maintains conversation context across turns. The system_prompt and tool_definitions fields apply once per conversation, based on the first turn.
Configure the evaluation
Voice Live session parameters control how the harness processes audio and how the agent responds. You can set these parameters in a JSON config file or as CLI arguments.
Recommended configurations
| Configuration | Model | VAD type | EOU detection | Best for |
|---|---|---|---|---|
| VAD + Realtime (recommended) | gpt-realtime |
azure_semantic_vad_multilingual |
Enabled | Lowest latency. Recommended for most evaluations. |
| VAD + Cascaded | gpt-5 |
azure_semantic_vad_multilingual |
Enabled | Broader model selection (GPT-5, GPT-4.1, Phi). |
| PTT + Realtime (experimental) | gpt-realtime |
server_vad |
Disabled | Testing client-controlled speech boundaries. |
| PTT + Cascaded (experimental) | gpt-5 |
server_vad |
Disabled | Testing client-controlled speech with cascaded models. |
Important
Use VAD mode for all production evaluations. PTT mode is experimental and has a lower audio response rate due to known platform limitations.
Example config file
{
"model": "gpt-realtime",
"voice": "en-US-Ava:DragonHDLatestNeural",
"voice_type": "azure-standard",
"noise_reduction": "azure_deep_noise_suppression",
"echo_cancellation": "server_echo_cancellation",
"vad_type": "azure_semantic_vad_multilingual",
"use_eou_detection": true,
"push_to_talk": false,
"enable_barge_in": true
}
Key parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
string | gpt-realtime |
The Voice Live model to use. Options include gpt-realtime, gpt-5, gpt-4.1, gpt-5-mini, and phi4-mini. |
voice |
string | en-US-Ava:DragonHDLatestNeural |
The Azure text to speech voice. HD voices use the :DragonHDLatestNeural suffix. |
vad_type |
string | azure_semantic_vad_multilingual |
The turn detection type. Options include server_vad, azure_semantic_vad, and azure_semantic_vad_multilingual. |
use_eou_detection |
bool | true |
When set to true, enables semantic end-of-utterance detection. |
noise_reduction |
string | azure_deep_noise_suppression |
The noise reduction type. Set to none to disable. |
push_to_talk |
bool | false |
When set to true, enables push-to-talk mode instead of VAD. |
Pre-built sample configs are available in the evaluation harness repository.
Note
The config file uses a simplified flat-key format for readability. For the full Voice Live API session parameters, see How to use the Voice Live API.
Run the evaluation
CLI quickstart
# Clone the repository
git clone https://github.com/microsoft-foundry/voicelive-evaluation.git
cd voicelive-evaluation/evaluation_harness
# Install dependencies
pip install -r requirements.txt
# Set environment variables
# AZURE_VOICELIVE_ENDPOINT: base resource endpoint (the harness appends the path and API version)
export AZURE_VOICELIVE_ENDPOINT="wss://<your-resource>.services.ai.azure.com"
# PROJECT_ENDPOINT: Foundry project endpoint for evaluation submission
export PROJECT_ENDPOINT="https://<your-resource>.services.ai.azure.com/api/projects/<your-project>"
# Authenticate with Azure CLI (recommended)
az login
# Run evaluation with a config file
python voice_agent_audio_input_evaluation.py \
-f datasets/my-dataset.jsonl \
--config configs/sample_vad_realtime.json \
-o output/my-eval-run
The harness authenticates by using DefaultAzureCredential, which supports Azure CLI login and managed identity. You can also pass an API key with --api-key or set the AZURE_VOICELIVE_API_KEY environment variable.
The harness runs the full pipeline. It sends audio to Voice Live, collects responses, and submits results to Foundry for scoring.
View results in Foundry portal
After the CLI generates evaluation output and submits it to Foundry, you can view and compare results in the Foundry portal. For details, see View evaluation results.
Batch processing
To evaluate multiple datasets in parallel, use the batch processor:
python batch_processor.py --test-files-folder datasets/ --max-workers 4
The batch processor creates subprocesses that write to a shared aggregated JSONL file. It then runs a single evaluation on the combined results.
Understand your results
The evaluation harness uses Microsoft Foundry built-in evaluators to score Voice Live responses. Each evaluator measures a specific aspect of conversational quality.
Default evaluators
| Evaluator | Category | What it measures | Learn more |
|---|---|---|---|
| Intent Resolution | Agent | Whether the agent correctly identified the user's intent | Agent evaluators |
| Task Adherence | Agent | Whether the agent followed system instructions and rules | Agent evaluators |
| Task Completion | Agent | Whether the agent fully completed the requested task | Agent evaluators |
| Response Completeness | RAG | Whether the response covers all critical information from ground truth | RAG evaluators |
| Tool Call Accuracy | Agent (tool) | Whether the agent made correct tool calls with correct parameters | Agent evaluators |
| Tool Selection | Agent (tool) | Whether the agent selected the correct tools | Agent evaluators |
| Tool Input Accuracy | Agent (tool) | Whether tool call parameters are correct | Agent evaluators |
| Tool Output Utilization | Agent (tool) | Whether tool results were used correctly in the response | Agent evaluators |
Additional evaluators available with --evaluators all: groundedness, relevance, tool_call_success, fluency, coherence. For details on each evaluator, see Built-in evaluators reference.
Selecting evaluators
By default, the harness runs the eight evaluators listed in the preceding table. You can customize evaluator selection with the --evaluators flag:
# Run all available evaluators (default + groundedness, relevance, fluency, coherence, tool_call_success)
python voice_agent_audio_input_evaluation.py -f dataset.jsonl --evaluators all
# Run only specific evaluators
python voice_agent_audio_input_evaluation.py -f dataset.jsonl --evaluators intent_resolution task_completion response_completeness
Note
Custom evaluators aren't supported in this release. You can only select from the built-in evaluator set provided by Microsoft Foundry.
Score interpretation
Evaluators produce different output types depending on their category.
Agent and tool evaluators like task_adherence, task_completion, tool_selection, tool_input_accuracy, and tool_call_success produce binary Pass/Fail results. When you aggregate these results across a dataset, they appear as a fraction. For example, 0.67 means 67% of turns passed.
GPT-judge evaluators like intent_resolution, response_completeness, groundedness, relevance, tool_call_accuracy, and tool_output_utilization produce scores on a 1 to 5 scale. The system then converts these scores to Pass/Fail based on a threshold.
| Score range | Interpretation |
|---|---|
| 4.0–5.0 | Strong — response matches ground truth and task requirements |
| 3.0–3.9 | Acceptable — mostly correct with minor gaps |
| 2.0–2.9 | Weak — partial or incomplete response |
| 1.0–1.9 | Poor — incorrect or off-topic response |
Recommended primary metrics: Focus on response_completeness, groundedness, and relevance as primary evaluation metrics.
How many runs are enough? GPT-judge evaluators show natural run-to-run variance of 0.3 to 0.5 points. Run at least three evaluations and average the results before you draw conclusions. Record the evaluator model version for reproducibility.
To view detailed results in the Foundry portal, see View evaluation results.
Troubleshooting low scores
If your evaluation scores are unexpectedly low, complete these steps before you conclude there's a problem with your voice agent:
Check your dataset quality. Missing or mismatched ground truth is the most common cause of low scores. Verify that
Answerfields are accurate and thatsystem_promptmatches your agent's actual configuration.Separate evaluator issues from agent issues. If the Voice Live transcription is accurate but evaluator scores are low, the problem might be in the evaluator or ground truth, not in Voice Live. Check the
responsefield in the output JSONL to see what Voice Live produced.Run multiple evaluations. A single run isn't enough to draw conclusions because of evaluator variance. Average at least three runs before you compare configurations.
Try different session configurations. Different model and VAD settings produce different response quality. Test VAD mode compared to PTT mode, and try different models.
Evaluation data and output
The harness generates structured output at each stage of the pipeline that you can use for debugging and analysis.
Local output
Each evaluation run produces an output folder with the following artifacts:
- Output JSONL — One line per evaluation turn, containing the original dataset fields, the Voice Live response, transcriptions, latency measurements, and evaluator scores.
- Aggregated metrics — Summary statistics across all turns, including per-evaluator averages.
- Run metadata — Configuration, timestamps, API version, and environment details for reproducibility.
Data submitted to Foundry
When the harness submits results to Foundry for scoring, it sends the conversation transcripts, ground truth answers, and system prompts. Audio files aren't uploaded to Foundry. The evaluators score the text-based conversation data and return results to the local output.
API version
The harness uses the Voice Live preview API. The API version is pinned in the harness source code and appended automatically to the endpoint URL. For the current API version, see the harness repository configuration.
Note
Session-level logging and OpenTelemetry integration for Voice Live evaluation aren't available in this release. These capabilities are planned for a future update.
Known issues
Some Foundry built-in evaluators have known issues that can affect Voice Live evaluation results.
| Evaluator | Issue | Impact | Status |
|---|---|---|---|
| Task Adherence | May overemphasize agent self-introduction in multi-turn conversations. | Can inflate scores for greetings and penalize natural follow-ups. | Known limitation |
| Task Completion | Shows high variance across identical runs. | Run-to-run comparison is unreliable for this metric. Average multiple runs. | Known limitation |
| Fluency / Coherence | Tends to return the maximum score regardless of actual response quality. | Provides limited useful signal. Consider excluding these evaluators from analysis. | Known limitation |
| Response Completeness | A known issue can cause misleading scores when comparing certain model configurations. | Validate results by inspecting raw Voice Live responses in the output JSONL. | Mitigated |
For the complete list of known issues and workarounds, see the evaluation harness README.
Limitations
- Text-based evaluators only. Current evaluators assess conversational quality from transcribed text. Speech-specific metrics like MOS, WER, and voice naturalness aren't available.
- Not a competitive benchmarking tool. The harness evaluates Voice Live configurations. It doesn't support comparisons against other platforms.
- Regional availability. The Foundry Evaluations API isn't available in all regions. For current availability, see Evaluation regions and limits.
- Current scope. This release covers Phase 1 functionality. Session logging, portal-native evaluation, and speech-specific evaluators are planned for future releases.