Note
This feature is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
The voice live API provides a WebSocket interface that is compatible with the Azure OpenAI Realtime API and extends it with additional capabilities.
Unless otherwise noted, the voice live API uses the same events as the Azure OpenAI Realtime API. This document provides a reference for the event message properties that are specific to the voice live API.
Supported models and regions
For a table of supported models and regions, see the voice live API overview.
Authentication
An Azure AI Foundry resource is required to access the voice live API.
WebSocket endpoint
The WebSocket endpoint for the voice live API is wss://<your-ai-foundry-resource-name>.cognitiveservices.azure.com/voice-live/realtime?api-version=2025-05-01-preview.
The endpoint is the same for all models. The only difference is the required model query parameter.
For example, an endpoint for a resource with a custom domain would be wss://<your-ai-foundry-resource-name>.cognitiveservices.azure.com/voice-live/realtime?api-version=2025-05-01-preview&model=gpt-4o-mini-realtime-preview.
Credentials
The voice live API supports two authentication methods:
- Microsoft Entra (recommended): Use token-based authentication for an Azure AI Foundry resource. Apply a retrieved authentication token as a Bearer token in the Authorization header.
- API key: An api-key can be provided in one of two ways:
  - Using an api-key connection header on the prehandshake connection. This option isn't available in a browser environment.
  - Using an api-key query string parameter on the request URI. Query string parameters are encrypted when using https/wss.
For the recommended keyless authentication with Microsoft Entra ID, you need to:
- Assign the Cognitive Services User role to your user account or a managed identity. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.
- Generate a token using the Azure CLI or Azure SDKs. The token must be generated with the https://cognitiveservices.azure.com/.default scope.
- Use the token in the Authorization header of the WebSocket connection request, with the format Bearer <token>.
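Putting the endpoint and Microsoft Entra authentication together, here's a minimal Python connection sketch. It assumes the azure-identity and websockets packages (websockets 14 or later; earlier releases name the header keyword extra_headers), and the resource name and model are placeholders you need to replace.

```python
import asyncio

import websockets
from azure.identity import DefaultAzureCredential

RESOURCE = "<your-ai-foundry-resource-name>"  # placeholder
MODEL = "gpt-4o-mini-realtime-preview"        # any supported model

URL = (
    f"wss://{RESOURCE}.cognitiveservices.azure.com/voice-live/realtime"
    f"?api-version=2025-05-01-preview&model={MODEL}"
)

async def main():
    # Acquire a Microsoft Entra token with the Cognitive Services scope.
    credential = DefaultAzureCredential()
    token = credential.get_token("https://cognitiveservices.azure.com/.default")

    # Open the WebSocket with the token in the Authorization header.
    # Note: websockets releases before 14.0 name this keyword extra_headers.
    async with websockets.connect(
        URL, additional_headers={"Authorization": f"Bearer {token.token}"}
    ) as ws:
        print(await ws.recv())  # the first server event is typically session.created

asyncio.run(main())
```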
Session configuration
Often, the first event sent by the caller on a newly established voice live API session is the session.update event. This event controls a wide set of input and output behavior; output and response generation properties can later be overridden by using the response.create event.
Here's an example session.update message that configures several aspects of the session, including turn detection, input audio processing, and voice output. Most session parameters are optional and can be omitted if not needed.
{
  "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
  "turn_detection": {
    "type": "azure_semantic_vad",
    "threshold": 0.3,
    "prefix_padding_ms": 200,
    "silence_duration_ms": 200,
    "remove_filler_words": false,
    "end_of_utterance_detection": {
      "model": "semantic_detection_v1",
      "threshold": 0.01,
      "timeout": 2
    }
  },
  "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"},
  "input_audio_echo_cancellation": {"type": "server_echo_cancellation"},
  "voice": {
    "name": "en-US-Ava:DragonHDLatestNeural",
    "type": "azure-standard",
    "temperature": 0.8
  }
}
The server responds with a session.updated event to confirm the session configuration.
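For illustration, here's a minimal Python sketch of that round trip, run inside the async connection block from the earlier example (so ws is an open WebSocket). The event wrapper with type and session fields follows the example above; the abbreviated session body is shortened for brevity.

```python
import json

# Runs inside the async connection block where ws is an open WebSocket.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
        "turn_detection": {"type": "azure_semantic_vad"},
        "voice": {"name": "en-US-Ava:DragonHDLatestNeural", "type": "azure-standard"},
    },
}
await ws.send(json.dumps(session_update))

# Read events until the server confirms the new configuration.
while True:
    event = json.loads(await ws.recv())
    if event["type"] == "session.updated":
        break
```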
Session properties
The following sections describe the properties of the session object that can be configured in the session.update message.
Tip
For comprehensive descriptions of supported events and properties, see the Azure OpenAI Realtime API events reference documentation. This document is a reference for the event message properties that the voice live API adds or enhances.
Input audio properties
You can use input audio properties to configure the input audio stream.
Property | Type | Required or optional | Description |
---|---|---|---|
input_audio_sampling_rate | integer | Optional | The sampling rate of the input audio. The supported values are 16000 and 24000. The default value is 24000. |
input_audio_echo_cancellation | object | Optional | Enhances the input audio quality by removing the echo from the model's own voice without requiring any client-side echo cancellation. Set the type property of input_audio_echo_cancellation to enable echo cancellation. The supported value for type is server_echo_cancellation, which is used when the model's voice is played back to the end-user through a speaker and the microphone picks up the model's own voice. |
input_audio_noise_reduction | object | Optional | Enhances the input audio quality by suppressing or removing environmental background noise. Set the type property of input_audio_noise_reduction to enable noise suppression. The supported value for type is azure_deep_noise_suppression, which optimizes for speakers closest to the microphone. |
Here's an example of input audio properties in a session object:
{
  "input_audio_sampling_rate": 24000,
  "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"},
  "input_audio_echo_cancellation": {"type": "server_echo_cancellation"}
}
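The input audio itself is streamed with the standard Realtime API input_audio_buffer.append event, carrying base64-encoded audio (16-bit PCM at the configured sampling rate, assuming the default pcm16 format). Here's a rough sketch; pcm_chunks stands in for your own audio capture code and isn't part of the API.

```python
import base64
import json

async def stream_audio(ws, pcm_chunks):
    # pcm_chunks: an iterable of raw 16-bit mono PCM chunks at the configured
    # sampling rate (24 kHz by default). Audio capture itself isn't shown here.
    for chunk in pcm_chunks:
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
```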
Noise suppression and echo cancellation
Noise suppression enhances the input audio quality by suppressing or removing environmental background noise. Noise suppression helps the model understand the end-user with higher accuracy and improves the accuracy of signals like interruption detection and end-of-turn detection.
Server echo cancellation enhances the input audio quality by removing the echo from the model's own voice. In this way, client-side echo cancellation isn't required. Server echo cancellation is useful when the model's voice is played back to the end-user through a speaker and the microphone picks up the model's own voice.
Note
The service assumes the client plays response audio as soon as it's received. If playback is delayed for more than 3 seconds, echo cancellation quality is impacted.
{
"session": {
"input_audio_noise_reduction": {
"type": "azure_deep_noise_suppression"
},
"input_audio_echo_cancellation": {
"type": "server_echo_cancellation"
}
}
}
Conversational enhancements
The voice live API offers conversational enhancements that make the natural end-user conversation flow more robust.
Turn detection parameters
Turn detection is the process of detecting when the end-user started or stopped speaking. The voice live API builds on the Azure OpenAI Realtime API turn_detection property to configure turn detection. The azure_semantic_vad type is one differentiator between the voice live API and the Azure OpenAI Realtime API.
Property | Type | Required or optional | Description |
---|---|---|---|
type | string | Optional | The type of turn detection system to use. Type server_vad detects the start and end of speech based on audio volume. Type azure_semantic_vad detects the start and end of speech based on semantic meaning. Azure semantic voice activity detection (VAD) improves turn detection by removing filler words to reduce the false alarm rate. The current list of filler words is ['ah', 'umm', 'mm', 'uh', 'huh', 'oh', 'yeah', 'hmm']. The service ignores these words when there's an ongoing response. The remove filler words feature assumes the client plays response audio as soon as it's received. The default value is server_vad. |
threshold | number | Optional | A higher threshold requires a higher confidence signal that the user is trying to speak. |
prefix_padding_ms | integer | Optional | The amount of audio, measured in milliseconds, to include before the start-of-speech detection signal. |
silence_duration_ms | integer | Optional | The duration of the user's silence, measured in milliseconds, to detect the end of speech. |
end_of_utterance_detection | object | Optional | Configuration for end-of-utterance detection. The voice live API offers advanced end-of-turn detection to indicate when the end-user stopped speaking while allowing for natural pauses. End-of-utterance detection can significantly reduce premature end-of-turn signals without adding user-perceivable latency. End-of-utterance detection is only available when using azure_semantic_vad. Properties of end_of_utterance_detection include: model (the model to use for end-of-utterance detection; the supported value is semantic_detection_v1), threshold (threshold to determine the end of utterance, 0.0 to 1.0; the default value is 0.01), and timeout (timeout in seconds; the default value is 2 seconds). |
Here's an example of end of utterance detection in a session object:
{
"session": {
"instructions": "You are a helpful AI assistant responding in natural, engaging language.",
"turn_detection": {
"type": "azure_semantic_vad",
"threshold": 0.3,
"prefix_padding_ms": 300,
"silence_duration_ms": 500,
"remove_filler_words": false,
"end_of_utterance_detection": {
"model": "semantic_detection_v1",
"threshold": 0.01,
"timeout": 2
}
}
}
}
Audio input through Azure speech to text
Phrase list
Use a phrase list for lightweight just-in-time customization on audio input. To configure the phrase list, you can set the phrase_list in the session.update message.
{
"session": {
"input_audio_transcription": {
"model": "azure-fast-transcription",
"phrase_list": ["Neo QLED TV", "TUF Gaming", "AutoQuote Explorer"]
}
}
}
Note
Phrase list isn't currently supported for gpt-4o-realtime-preview, gpt-4o-mini-realtime-preview, and phi4-mm-realtime. To learn more about phrase list, see phrase list for speech to text.
Audio output through Azure text to speech
You can use the voice parameter to specify a standard or custom voice. The voice is used for audio output.
The voice object has the following properties:
Property | Type | Required or optional | Description |
---|---|---|---|
name | string | Required | Specifies the name of the voice. For example, en-US-AvaNeural. |
type | string | Required | Configuration of the type of Azure voice, either azure-standard or azure-custom. |
temperature | number | Optional | Specifies the temperature applicable to Azure HD voices. Higher values provide higher levels of variability in intonation, prosody, and so on. |
Azure standard voices
Here's a partial message example for a standard (azure-standard) voice:
{
"voice": {
"name": "en-US-AvaNeural",
"type": "azure-standard"
}
}
For the full list of standard voices, see Language and voice support for the Speech service.
Azure high definition voices
Here's an example session.update message for a standard high definition voice:
{
"voice": {
"name": "en-US-Ava:DragonHDLatestNeural",
"type": "azure-standard",
"temperature": 0.8 // optional
}
}
For the full list of standard high definition voices, see the high definition voices documentation.
Azure custom voices
You can use a custom voice for audio output. For information about how to create a custom voice, see What is custom voice.
{
"voice": {
"name": "en-US-CustomNeural",
"type": "azure-custom",
"endpoint_id": "your-endpoint-id", // a guid string
"temperature": 0.8 // optional, value range 0.0-1.0, only take effect when using HD voices
}
}
Custom lexicon
Use the custom_lexicon_url string property to customize pronunciation for both standard Azure text to speech voices and custom voices. To learn more about how to format the custom lexicon (the same format that's used with Speech Synthesis Markup Language (SSML)), see custom lexicon for text to speech.
{
"voice": {
"name": "en-US-Ava:DragonHDLatestNeural",
"type": "azure-standard",
"temperature": 0.8, // optional
"custom_lexicon_url": "<custom lexicon url>"
}
}
Speaking rate
Use the rate string property to adjust the speaking speed for any standard Azure text to speech voice or custom voice.
The rate value should range from 0.5 to 1.5, with higher values indicating faster speeds.
{
"voice": {
"name": "en-US-Ava:DragonHDLatestNeural",
"type": "azure-standard",
"temperature": 0.8, // optional
"rate": "1.2"
}
}
Audio timestamps
When you use Azure voices and output_audio_timestamp_types is configured, the service returns response.audio_timestamp.delta events in the response, and a response.audio_timestamp.done event when all timestamp messages are returned.
To configure the audio timestamps, you can set the output_audio_timestamp_types in the session.update message.
{
"session": {
"output_audio_timestamp_types": ["word"]
}
}
The service returns the audio timestamps in the response when the audio is generated.
{
"event_id": "<event_id>",
"type": "response.audio_timestamp.delta",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0,
"audio_offset_ms": 490,
"audio_duration_ms": 387,
"text": "end",
"timestamp_type": "word"
}
And a response.audio_timestamp.done message is sent when all timestamps are returned.
{
"event_id": "<event_id>",
"type": "response.audio_timestamp.done",
"response_id": "<response_id>",
"item_id": "<item_id>",
}
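For example, a client can accumulate word-level timing as these events arrive. Here's a minimal sketch; handle_timestamp_event is a hypothetical helper you'd call from your event-reading loop.

```python
word_timings = []

def handle_timestamp_event(event: dict) -> None:
    # Collect word-level timing from the audio timestamp deltas.
    if event["type"] == "response.audio_timestamp.delta":
        word_timings.append({
            "text": event["text"],
            "offset_ms": event["audio_offset_ms"],
            "duration_ms": event["audio_duration_ms"],
        })
    elif event["type"] == "response.audio_timestamp.done":
        # All timestamps for this response have been delivered.
        print(f"Received {len(word_timings)} word timestamps")
```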
Viseme
You can use an Azure standard voice or Azure custom voice with animation.outputs set to ["viseme_id"]. The service returns response.animation_viseme.delta events in the response and a response.animation_viseme.done event when all viseme messages are returned.
To configure visemes, you can set the animation.outputs property in the session.update message. The animation.outputs parameter is optional. It configures which animation outputs should be returned. Currently, it only supports viseme_id.
{
  "type": "session.update",
  "event_id": "your-session-id",
  "session": {
    "voice": {
      "name": "en-US-AvaNeural",
      "type": "azure-standard"
    },
    "modalities": ["text", "audio"],
    "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
    "turn_detection": {
      "type": "server_vad"
    },
    "output_audio_timestamp_types": ["word"], // optional
    "animation": {
      "outputs": ["viseme_id"] // optional
    }
  }
}
The output_audio_timestamp_types parameter is optional. It configures which audio timestamps should be returned for generated audio. Currently, it only supports word.
The service returns the viseme alignment in the response when the audio is generated.
{
"event_id": "<event_id>",
"type": "response.animation_viseme.delta",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0,
"audio_offset_ms": 455,
"viseme_id": 20
}
And a response.animation_viseme.done message is sent when all viseme messages are returned.
{
"event_id": "<event_id>",
"type": "response.animation_viseme.done",
"response_id": "<response_id>",
"item_id": "<item_id>",
}
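A client that drives lip-sync animation can collect these events the same way. Here's a minimal sketch; handle_viseme_event is a hypothetical helper called from your event-reading loop.

```python
viseme_frames = []

def handle_viseme_event(event: dict) -> None:
    # Build a (time, viseme) timeline that can drive lip-sync animation.
    if event["type"] == "response.animation_viseme.delta":
        viseme_frames.append((event["audio_offset_ms"], event["viseme_id"]))
    elif event["type"] == "response.animation_viseme.done":
        print(f"Received {len(viseme_frames)} viseme frames")
```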
Azure text to speech avatar
Text to speech avatar converts text into a digital video of a photorealistic human (either a standard avatar or a custom text to speech avatar) speaking with a natural-sounding voice.
You can use the avatar parameter to specify a standard or custom avatar. Avatar output is synchronized with the audio output:
{
"session": {
"avatar": {
"character": "lisa",
"style": "casual-sitting",
"customized": false,
"ice_servers": [
{
"urls": ["REDACTED"],
"username": "",
"credential": ""
}
],
"video": {
"bitrate": 2000000,
"codec": "h264",
"crop": {
"top_left": [560, 0],
"bottom_right": [1360, 1080],
},
"resolution": {
"width": 1080,
"height": 1920,
},
"background": {
"color": "#00FF00FF"
// "image_url": "https://example.com/example.jpg"
}
}
}
}
}
The ice_servers field is optional. If you don't specify it, the service returns server-specific ICE servers in the session.updated response, and you need to use those server-specific ICE servers to generate the local ICE candidates.
Send the client SDP after ICE candidates are gathered.
{
"type": "session.avatar.connect",
"client_sdp": "your-client-sdp"
}
And the service responds with the server SDP.
{
"type": "session.avatar.connecting",
"server_sdp": "your-server-sdp"
}
Then you can connect to the avatar by using the server SDP.
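Here's a minimal sketch of the signaling side of that exchange over the same WebSocket. Creating the WebRTC peer connection, generating the offer, and gathering ICE candidates are up to your WebRTC stack and aren't shown; connect_avatar is a hypothetical helper.

```python
import json

async def connect_avatar(ws, client_sdp: str) -> str:
    # Send the locally gathered SDP offer over the existing voice live WebSocket.
    await ws.send(json.dumps({
        "type": "session.avatar.connect",
        "client_sdp": client_sdp,
    }))

    # Wait for the answer carrying the server SDP, then hand it back to the
    # WebRTC stack to complete the connection.
    while True:
        event = json.loads(await ws.recv())
        if event["type"] == "session.avatar.connecting":
            return event["server_sdp"]
```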
Related content
- Try out the voice live API quickstart
- See the audio events reference