Voice Live `2026-06-01-preview` API Reference

The Voice Live API provides real-time, bidirectional communication for voice-enabled applications using WebSocket connections.

The API uses JSON-formatted events sent over WebSocket connections to manage conversations, audio streams, avatar interactions, and real-time responses. Events are categorized into client events (sent from client to server) and server events (sent from server to client).

Note

2026-06-01-preview is a preview API version. Features and properties marked preview are subject to change before the next stable release.

What's new in 2026-06-01-preview

This API version adds the following capabilities on top of 2026-04-10:

azure-realtime-native voice type: A new structured voice object used exclusively with the azure-realtime model. The voice is specified as {"type": "azure-realtime-native", "name": "<voice>"} where <voice> is one of aarti, andrew, ava (default), denise, elsa, florian, francisca, meera, ximena, xiaoxiao, or yunxi.
Streaming text input client events: New input_text.delta and input_text.done client events let you stream text input into a conversation item incrementally, similar to how audio is streamed with input_audio_buffer.append.
Smart end-of-turn detection: A new audio-based end-of-utterance (EOU) detection variant, selected with "model": "smart_end_of_turn_detection". It operates directly on the input audio stream and exposes the threshold_level (low, medium, high, default) and timeout_ms properties.
Parallel tool calls: New optional parallel_tool_calls boolean on the session object (default true). Set to false to require the model to issue tool calls sequentially.
Hosted agent invocation events: New server events for surfacing hosted agent invocation lifecycle and tool activity.
WebRTC feature events: Additional events that support the Voice Live WebRTC transport.

Endpoint and authentication

WebSocket endpoint

The WebSocket endpoint for the Voice Live API is:

wss://<your-ai-foundry-resource-name>.services.ai.azure.com/voice-live/realtime?api-version=2026-06-01-preview

For older resources that use the legacy domain, use:

wss://<your-ai-foundry-resource-name>.cognitiveservices.azure.com/voice-live/realtime?api-version=2026-06-01-preview

The endpoint is the same for all models. The only difference is the required model query parameter, or, when using the Microsoft Foundry Agent Service, the agent-name and agent-project-name query parameters. For more information about agent connection parameters, see Integrate Voice Live API with a Microsoft Foundry agent.

For example, an endpoint for a Microsoft Foundry resource that uses a model would be:

wss://<your-ai-foundry-resource-name>.services.ai.azure.com/voice-live/realtime?api-version=2026-06-01-preview&model=gpt-realtime
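As a minimal sketch of how the URL above is assembled, the following Python helper builds the endpoint from a resource name and a model name. The function name `build_voice_live_url` and its parameters are illustrative assumptions, not part of any SDK.

```python
from urllib.parse import urlencode

API_VERSION = "2026-06-01-preview"


def build_voice_live_url(resource_name: str, model: str,
                         legacy_domain: bool = False) -> str:
    """Return the Voice Live WebSocket URL for a model-based session.

    Hypothetical helper: the name and signature are illustrative only.
    """
    domain = ("cognitiveservices.azure.com" if legacy_domain
              else "services.ai.azure.com")
    # urlencode escapes any characters that aren't safe in a query string.
    query = urlencode({"api-version": API_VERSION, "model": model})
    return f"wss://{resource_name}.{domain}/voice-live/realtime?{query}"


print(build_voice_live_url("my-resource", "gpt-realtime"))
```

For an agent-based session, the same pattern would apply with the agent-name and agent-project-name parameters in place of model.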

Note

The Voice Live API is optimized for Microsoft Foundry resources, which are recommended for full feature availability. Azure AI Speech resources don't support Microsoft Foundry Agent Service integration or bring-your-own-model (BYOM).

Authentication

The Voice Live API supports two authentication methods:

Microsoft Entra ID (recommended): Use token-based authentication for a Microsoft Foundry resource. Pass the retrieved access token in one of two ways:
- As a Bearer token in the Authorization header on the prehandshake connection. This option isn't available in a browser environment.
- As an Authorization query string parameter on the request URI, with the value Bearer <token>. URL-encode the value as needed. Query string parameters are encrypted by the wss:// transport.
API key: Provide an api-key in one of two ways:
- As an api-key connection header on the prehandshake connection. This option isn't available in a browser environment.
- As an api-key query string parameter on the request URI. Query string parameters are encrypted by the wss:// transport.

For the recommended keyless authentication with Microsoft Entra ID:

Assign the Cognitive Services User and Azure AI User roles to your user account or managed identity. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.
Acquire an access token using the Azure CLI or an Azure SDK. The token must be issued for the https://ai.azure.com/.default scope (or the legacy https://cognitiveservices.azure.com/.default scope).
Send the token on the WebSocket upgrade request, either in the Authorization header in the format Bearer <token>, or as an Authorization query string parameter with the same Bearer <token> value.
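The two ways of sending the token can be sketched as follows. This example assumes the access token was already acquired elsewhere (for example, with the Azure CLI); the helper names `bearer_header` and `with_bearer_query` are hypothetical.

```python
from urllib.parse import quote


def bearer_header(token: str) -> dict:
    """Header form, for clients that can set headers on the upgrade request."""
    return {"Authorization": f"Bearer {token}"}


def with_bearer_query(url: str, token: str) -> str:
    """Query-string form, usable where headers can't be set (such as browsers).

    The value "Bearer <token>" is URL-encoded, so the space becomes %20.
    """
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}Authorization={quote('Bearer ' + token, safe='')}"


url = "wss://example.services.ai.azure.com/voice-live/realtime?api-version=2026-06-01-preview"
print(with_bearer_query(url, "abc.def"))
```

Passing the header dictionary to a WebSocket client depends on the client library you choose, so that step is left out here.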

Client Events

The Voice Live API supports the following client events that can be sent from the client to the server:

Event	Description
session.update	Update the session configuration including voice, output modalities, turn detection, and other settings
session.avatar.connect	Establish avatar connection by providing client SDP for WebRTC negotiation
input_audio_buffer.append	Append audio bytes to the input audio buffer
input_audio_buffer.commit	Commit the input audio buffer for processing
input_audio_buffer.clear	Clear the input audio buffer
input_text.delta	Append a chunk of text to a streamed user-text input
input_text.done	Signal that streamed user-text input is complete
conversation.item.create	Add a new item to the conversation context
conversation.item.retrieve	Retrieve a specific item from the conversation
conversation.item.truncate	Truncate an assistant audio message
conversation.item.delete	Remove an item from the conversation
response.create	Instruct the server to create a response via model inference
response.cancel	Cancel an in-progress response
output_audio_buffer.clear	Stop the avatar from speaking by clearing the server-side output audio buffer (avatar mode only)

session.update

Update the session's configuration. This event can be sent at any time to modify settings such as voice, output modalities, turn detection, tools, and other session parameters. The model is the exception: after a session is initialized with a particular model, it can't be switched to another model.

Event Structure

{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "voice": {
      "type": "openai",
      "name": "alloy"
    },
    "instructions": "You are a helpful assistant. Be concise and friendly.",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_sampling_rate": 24000,
    "turn_detection": {
      "type": "azure_semantic_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500
    },
    "temperature": 0.8,
    "max_response_output_tokens": "inf"
  }
}

Properties

Field	Type	Description
type	string	Must be `"session.update"`
session	RealtimeRequestSession	Session configuration object with fields to update
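Because the session object carries only the fields to update, a client can build the event from keyword arguments. The following is a minimal sketch; the `session_update` helper is a hypothetical name, and the field values are taken from the event structure above.

```python
import json


def session_update(**session_fields) -> str:
    """Serialize a session.update event carrying only the given fields."""
    return json.dumps({"type": "session.update", "session": session_fields})


# Change only the voice and temperature; other settings are not included.
event = session_update(
    voice={"type": "openai", "name": "alloy"},
    temperature=0.8,
)
print(event)
```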

Example with Azure Custom Voice

{
  "type": "session.update",
  "session": {
    "voice": {
      "type": "azure-custom",
      "name": "my-custom-voice",
      "endpoint_id": "12345678-1234-1234-1234-123456789012",
      "temperature": 0.7,
      "style": "cheerful"
    },
    "input_audio_noise_reduction": {
      "type": "azure_deep_noise_suppression"
    },
    "avatar": {
      "character": "lisa",
      "customized": false,
      "video": {
        "resolution": {
          "width": 1920,
          "height": 1080
        },
        "bitrate": 2000000
      }
    }
  }
}

session.avatar.connect

Establish an avatar connection by providing the client's SDP (Session Description Protocol) offer for WebRTC media negotiation. This event is required when using avatar features.

Event Structure

{
  "type": "session.avatar.connect",
  "client_sdp": "<client_sdp>"
}

Properties

Field	Type	Description
type	string	Must be `"session.avatar.connect"`
client_sdp	string	The client's SDP offer for WebRTC connection establishment, encoded with base64

input_audio_buffer.append

Append audio bytes to the input audio buffer.

Event Structure

{
  "type": "input_audio_buffer.append",
  "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YQAAAAA="
}

Properties

Field	Type	Description
type	string	Must be `"input_audio_buffer.append"`
audio	string	Base64-encoded audio data

input_audio_buffer.commit

Commit the input audio buffer for processing.

Event Structure

{
  "type": "input_audio_buffer.commit"
}

Properties

Field	Type	Description
type	string	Must be `"input_audio_buffer.commit"`

input_audio_buffer.clear

Clear the input audio buffer.

Event Structure

{
  "type": "input_audio_buffer.clear"
}

Properties

Field	Type	Description
type	string	Must be `"input_audio_buffer.clear"`

input_text.delta

Append a chunk of text to the current streamed user-text input. Use this event to stream text into a conversation item incrementally, similar to how audio is streamed with input_audio_buffer.append. The streamed text is finalized by sending an input_text.done event.

Event Structure

{
  "type": "input_text.delta",
  "delta": "Hello, "
}

Properties

Field	Type	Description
type	string	Must be `"input_text.delta"`
delta	string	The incremental text content to append to the current streamed input.

input_text.done

Signal that the streamed user-text input is complete. The accumulated text becomes a user message item in the conversation.

Event Structure

{
  "type": "input_text.done"
}

Properties

Field	Type	Description
type	string	Must be `"input_text.done"`
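The delta-then-done sequence described above can be sketched as a generator that turns text chunks into client events. The function name `stream_text_events` is an illustrative assumption; skipping empty chunks is a design choice of this sketch, not a documented requirement.

```python
import json
from typing import Iterable, Iterator


def stream_text_events(chunks: Iterable[str]) -> Iterator[str]:
    """Yield one input_text.delta per non-empty chunk, then input_text.done."""
    for chunk in chunks:
        if chunk:  # this sketch skips empty chunks
            yield json.dumps({"type": "input_text.delta", "delta": chunk})
    # Finalize the streamed input so it becomes a user message item.
    yield json.dumps({"type": "input_text.done"})


for event in stream_text_events(["Hello, ", "how are you?"]):
    print(event)
```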

conversation.item.create

Add a new item to the conversation context. This can include messages, function calls, and function call responses. Items can be inserted at specific positions in the conversation history.

Event Structure

{
  "type": "conversation.item.create",
  "previous_item_id": "item_ABC123",
  "item": {
    "id": "item_DEF456",
    "type": "message",
    "role": "user",
    "content": [
      {
        "type": "input_text",
        "text": "Hello, how are you?"
      }
    ]
  }
}

Properties

Field	Type	Description
type	string	Must be `"conversation.item.create"`
previous_item_id	string	Optional. ID of the item after which to insert this item. If not provided, appends to end
item	RealtimeConversationRequestItem	The item to add to the conversation

Example with Audio Content

{
  "type": "conversation.item.create",
  "item": {
    "type": "message",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YQAAAAA=",
        "transcript": "Hello there"
      }
    ]
  }
}

Example with Function Call output

{
  "type": "conversation.item.create",
  "item": {
    "type": "function_call_output",
    "call_id": "call_123",
    "output": "{\"location\": \"San Francisco\", \"temperature\": \"70\"}"
  }
}
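In the example above, output is a string that itself contains JSON, so the function result is serialized twice. A minimal sketch of that double encoding follows; the helper name `function_call_output` is hypothetical.

```python
import json


def function_call_output(call_id: str, result: dict) -> str:
    """Build a conversation.item.create event that returns a function result."""
    item = {
        "type": "function_call_output",
        "call_id": call_id,
        # The result is serialized to a JSON string before being embedded.
        "output": json.dumps(result),
    }
    return json.dumps({"type": "conversation.item.create", "item": item})


print(function_call_output("call_123", {"location": "San Francisco", "temperature": "70"}))
```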

Example with MCP approval response

{
  "type": "conversation.item.create",
  "item": {
    "type": "mcp_approval_response",
    "approval_request_id": "mcp_approval_req_456",
    "approve": true
  }
}

conversation.item.retrieve

Retrieve a specific item from the conversation history. This is useful for inspecting processed audio after noise cancellation and VAD.

Event Structure

{
  "type": "conversation.item.retrieve",
  "item_id": "item_ABC123"
}

Properties

Field	Type	Description
type	string	Must be `"conversation.item.retrieve"`
item_id	string	The ID of the item to retrieve

conversation.item.truncate

Truncate an assistant message's audio content. This is useful for stopping playback at a specific point and synchronizing the server's understanding with the client's state.

Event Structure

{
  "type": "conversation.item.truncate",
  "item_id": "item_ABC123",
  "content_index": 0,
  "audio_end_ms": 5000
}

Properties

Field	Type	Description
type	string	Must be `"conversation.item.truncate"`
item_id	string	The ID of the assistant message item to truncate
content_index	integer	The index of the content part to truncate
audio_end_ms	integer	The duration up to which to truncate the audio, in milliseconds
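A client typically derives audio_end_ms from how much audio it has actually played. The sketch below assumes 16-bit mono PCM output at 24,000 Hz, which is an assumption of this example rather than something stated for the output format; the helper name `truncate_event` is hypothetical.

```python
import json


def truncate_event(item_id: str, bytes_played: int,
                   sample_rate: int = 24000, content_index: int = 0) -> str:
    """Build conversation.item.truncate from the number of bytes played.

    Assumes pcm16 mono audio: 2 bytes per sample, one channel.
    """
    bytes_per_second = sample_rate * 2
    audio_end_ms = bytes_played * 1000 // bytes_per_second
    return json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": content_index,
        "audio_end_ms": audio_end_ms,
    })


# 240,000 bytes at 48,000 bytes per second is 5 seconds of audio.
print(truncate_event("item_ABC123", 240000))
```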

conversation.item.delete

Remove an item from the conversation history.

Event Structure

{
  "type": "conversation.item.delete",
  "item_id": "item_ABC123"
}

Properties

Field	Type	Description
type	string	Must be `"conversation.item.delete"`
item_id	string	The ID of the item to delete

response.create

Instruct the server to create a response via model inference. This event can specify response-specific configuration that overrides session defaults.

Event Structure

{
  "type": "response.create",
  "response": {
    "modalities": ["text", "audio"],
    "instructions": "Be extra helpful and detailed.",
    "voice": {
      "type": "openai",
      "name": "alloy"
    },
    "output_audio_format": "pcm16",
    "temperature": 0.7,
    "max_response_output_tokens": 1000
  }
}

Properties

Field	Type	Description
type	string	Must be `"response.create"`
response	RealtimeResponseOptions	Optional response configuration that overrides session defaults

Example with Tool Choice

{
  "type": "response.create",
  "response": {
    "modalities": ["text"],
    "tools": [
      {
        "type": "function",
        "name": "get_current_time",
        "description": "Get the current time",
        "parameters": {
          "type": "object",
          "properties": {}
        }
      }
    ],
    "tool_choice": "get_current_time",
    "temperature": 0.3
  }
}

Example with Animation

{
  "type": "response.create",
  "response": {
    "modalities": ["audio", "animation"],
    "animation": {
      "model_name": "default",
      "outputs": ["blendshapes", "viseme_id"]
    },
    "voice": {
      "type": "azure-custom",
      "name": "my-expressive-voice",
      "endpoint_id": "12345678-1234-1234-1234-123456789012",
      "style": "excited"
    }
  }
}

Example with pre-generated assistant message

In some scenarios, you might want to generate an audio response for predefined text instead of having the model generate the text response. Use the pre_generated_assistant_message parameter in the response.create message. You can only include one text entry in the content field.

{
  "type": "response.create",
  "response": {
    "pre_generated_assistant_message": {
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "repeat what I say"
        }
      ]
    }
  }
}

When the service receives this message, it generates an audio response for the predefined text. The message is also added to the conversation context history.
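As a minimal sketch, the event above can be generated from a plain string. The helper name `speak_text_event` is hypothetical; it emits exactly one text entry, matching the single-entry restriction on the content field.

```python
import json


def speak_text_event(text: str) -> str:
    """Build a response.create event that voices predefined text."""
    message = {
        "type": "message",
        "role": "assistant",
        # Only one text entry is allowed in the content field.
        "content": [{"type": "text", "text": text}],
    }
    return json.dumps({
        "type": "response.create",
        "response": {"pre_generated_assistant_message": message},
    })


print(speak_text_event("Your order has shipped."))
```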

response.cancel

Cancel an in-progress response. This immediately stops response generation and related audio output.

Event Structure

{
  "type": "response.cancel"
}

Properties

Field	Type	Description
type	string	Must be `"response.cancel"`

output_audio_buffer.clear

Clear the server-side output audio buffer. In the current preview, this event is only supported in avatar mode and is used to stop the avatar from speaking by clearing any audio (and corresponding avatar video) that the server has queued for playback. The server responds with an output_audio_buffer.cleared event.

Event Structure

{
  "type": "output_audio_buffer.clear"
}

Properties

Field	Type	Description
type	string	Must be `"output_audio_buffer.clear"`

input_audio_buffer.append

The client input_audio_buffer.append event is used to append audio bytes to the input audio buffer. The audio buffer is temporary storage you can write to and later commit.

In server VAD (voice activity detection) mode, the server uses the audio buffer to detect speech and decides when to commit. When server VAD is disabled, the client chooses how much audio to place in each event, up to a maximum of 15 MiB. Streaming smaller chunks from the client can allow the VAD to be more responsive.

Unlike most other client events, the server doesn't send a confirmation response to the client input_audio_buffer.append event.

Event structure

{
  "type": "input_audio_buffer.append",
  "audio": "<audio>"
}

Properties

Field	Type	Description
type	string	The event type must be `input_audio_buffer.append`.
audio	string	Base64-encoded audio bytes. This value must be in the format specified by the `input_audio_format` field in the session configuration.
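The following sketch shows one way to split raw audio into small append events. It assumes pcm16 mono input at 24,000 Hz, so the default of 4,800 bytes corresponds to about 100 ms per event; the function name `append_events` and the chunk size are illustrative choices, not documented values.

```python
import base64
import json
from typing import Iterator


def append_events(pcm: bytes, chunk_bytes: int = 4800) -> Iterator[str]:
    """Yield input_audio_buffer.append events for successive slices of pcm."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            # Audio bytes are base64-encoded before being placed in the event.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


silence = b"\x00\x00" * 6000  # 12,000 bytes of 16-bit silence
print(sum(1 for _ in append_events(silence)))
```

Because the server sends no confirmation for this event, the sketch doesn't wait for a reply between chunks.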

input_audio_buffer.clear

The client input_audio_buffer.clear event is used to clear the audio bytes in the buffer.

The server responds with an input_audio_buffer.cleared event.

Event structure

{
  "type": "input_audio_buffer.clear"
}

Properties

Field	Type	Description
type	string	The event type must be `input_audio_buffer.clear`.

input_audio_buffer.commit

The client input_audio_buffer.commit event is used to commit the user input audio buffer, which creates a new user message item in the conversation. Audio is transcribed if input_audio_transcription is configured for the session.

In server VAD mode, the client doesn't need to send this event because the server commits the audio buffer automatically. Without server VAD, the client must commit the audio buffer to create a user message item. This client event produces an error if the input audio buffer is empty.

Committing the input audio buffer doesn't create a response from the model.

The server responds with an input_audio_buffer.committed event.

Event structure

{
  "type": "input_audio_buffer.commit"
}

Properties

Field	Type	Description
type	string	The event type must be `input_audio_buffer.commit`.
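Without server VAD, a turn therefore needs three steps from the client: append audio, commit the buffer, and request a response, because committing alone doesn't trigger the model. The sketch below tracks whether any audio was appended so that it never commits an empty buffer. The `ManualTurn` class is a hypothetical helper, and sending response.create immediately after the commit is an assumption of this example.

```python
import base64
import json
from typing import List


class ManualTurn:
    """Track one user turn when server VAD is disabled."""

    def __init__(self) -> None:
        self._bytes_appended = 0

    def append(self, pcm: bytes) -> str:
        """Return an append event and remember that audio was buffered."""
        self._bytes_appended += len(pcm)
        return json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm).decode("ascii"),
        })

    def finish(self) -> List[str]:
        """Return commit + response.create, or nothing if the buffer is empty."""
        if self._bytes_appended == 0:
            return []  # committing an empty buffer would produce an error
        self._bytes_appended = 0
        return [
            json.dumps({"type": "input_audio_buffer.commit"}),
            json.dumps({"type": "response.create"}),
        ]


turn = ManualTurn()
turn.append(b"\x00\x00" * 2400)
print(turn.finish())
```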

Server Events

The Voice Live API sends the following server events to communicate status, responses, and data to the client:

Event	Description
error	Indicates an error occurred during processing
warning	Indicates a warning occurred that doesn't interrupt the conversation flow
session.created	Sent when a new session is successfully established
session.updated	Sent when session configuration is updated
session.avatar.connecting	Indicates avatar WebRTC connection is being established
conversation.item.created	Sent when a new item is added to the conversation
conversation.item.retrieved	Response to conversation.item.retrieve request
conversation.item.truncated	Confirms item truncation
conversation.item.deleted	Confirms item deletion
conversation.item.input_audio_transcription.completed	Input audio transcription is complete
conversation.item.input_audio_transcription.delta	Streaming input audio transcription
conversation.item.input_audio_transcription.failed	Input audio transcription failed
input_audio_buffer.committed	Input audio buffer was committed for processing
input_audio_buffer.cleared	Input audio buffer was cleared
input_audio_buffer.speech_started	Speech detected in input audio buffer (VAD)
input_audio_buffer.speech_stopped	Speech ended in input audio buffer (VAD)
response.created	New response generation started
response.done	Response generation is complete
response.output_item.added	New output item added to response
response.output_item.done	Output item is complete
response.content_part.added	New content part added to output item
response.content_part.done	Content part is complete
response.text.delta	Streaming text content from the model
response.text.done	Text content is complete
response.audio_transcript.delta	Streaming audio transcript
response.audio_transcript.done	Audio transcript is complete
response.audio.delta	Streaming audio content from the model
response.audio.done	Audio content is complete
response.animation_blendshapes.delta	Streaming animation blendshapes data
response.animation_blendshapes.done	Animation blendshapes data is complete
response.audio_timestamp.delta	Streaming audio timestamp information
response.audio_timestamp.done	Audio timestamp information is complete
response.animation_viseme.delta	Streaming animation viseme data
response.animation_viseme.done	Animation viseme data is complete
response.function_call_arguments.delta	Streaming function call arguments
response.function_call_arguments.done	Function call arguments are complete
mcp_list_tools.in_progress	MCP tool listing is in progress
mcp_list_tools.completed	MCP tool listing is completed
mcp_list_tools.failed	MCP tool listing has failed
response.mcp_call_arguments.delta	Streaming MCP call arguments
response.mcp_call_arguments.done	MCP call arguments are complete
response.mcp_call.in_progress	MCP call is in progress
response.mcp_call.completed	MCP call is completed
response.mcp_call.failed	MCP call has failed
response.foundry_agent_call_arguments.delta	Streaming foundry agent call arguments
response.foundry_agent_call_arguments.done	Foundry agent call arguments are complete
response.foundry_agent_call.in_progress	Foundry agent call is in progress
response.foundry_agent_call.completed	Foundry agent call is completed
response.foundry_agent_call.failed	Foundry agent call has failed
session.avatar.switch_to_speaking	Avatar transitioned to the speaking state
session.avatar.switch_to_idle	Avatar transitioned to the idle state
response.video.delta	Streaming avatar video frame data
response.web_search_call.searching	Web search tool call is searching
response.web_search_call.in_progress	Web search tool call is in progress
response.web_search_call.completed	Web search tool call completed
response.file_search_call.searching	File search tool call is searching
response.file_search_call.in_progress	File search tool call is in progress
response.file_search_call.completed	File search tool call completed
output_audio_buffer.cleared	Output audio buffer was cleared
response.audio_transcript.annotation.added	An annotation was added to an audio transcript
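With this many event types, clients commonly route each incoming message by its type field. The following is a minimal sketch of such a router; the `EventRouter` class is hypothetical and isn't part of any SDK.

```python
import json
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventRouter:
    """Dispatch server events to handlers registered per event type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self.unhandled: List[str] = []

    def on(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type] = handler

    def dispatch(self, raw: str) -> None:
        """Parse one JSON message and call the matching handler, if any."""
        event = json.loads(raw)
        handler = self._handlers.get(event["type"])
        if handler is None:
            # Record unknown types instead of failing, since preview
            # API versions may add events.
            self.unhandled.append(event["type"])
        else:
            handler(event)


router = EventRouter()
router.on("session.created", lambda e: print("session", e["session"]["id"]))
router.dispatch('{"type": "session.created", "session": {"id": "sess_ABC123DEF456"}}')
```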

session.created

Sent when a new session is successfully established. This is the first event received after connecting to the API.

Event Structure

{
  "type": "session.created",
  "session": {
    "id": "sess_ABC123DEF456",
    "object": "realtime.session",
    "model": "gpt-realtime",
    "modalities": ["text", "audio"],
    "instructions": "You are a helpful assistant.",
    "voice": {
      "type": "openai",
      "name": "alloy"
    },
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_sampling_rate": 24000,
    "turn_detection": {
      "type": "azure_semantic_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500
    },
    "temperature": 0.8,
    "max_response_output_tokens": "inf"
  }
}

Properties

Field	Type	Description
type	string	Must be `"session.created"`
session	RealtimeResponseSession	The created session object

session.updated

Sent when session configuration is successfully updated in response to a session.update client event.

Event Structure

{
  "type": "session.updated",
  "session": {
    "id": "sess_ABC123DEF456",
    "voice": {
      "type": "azure-custom",
      "name": "my-voice",
      "endpoint_id": "12345678-1234-1234-1234-123456789012"
    },
    "temperature": 0.7,
    "avatar": {
      "character": "lisa",
      "customized": false
    }
  }
}

Properties

Field	Type	Description
type	string	Must be `"session.updated"`
session	RealtimeResponseSession	The updated session object

session.avatar.connecting

Indicates that an avatar WebRTC connection is being established. This event is sent in response to a session.avatar.connect client event.

Event Structure

{
  "type": "session.avatar.connecting",
  "server_sdp": "<server_sdp>"
}

Properties

Field	Type	Description
type	string	Must be `"session.avatar.connecting"`
server_sdp	string	The server's SDP answer for WebRTC connection establishment

conversation.item.created

Sent when a new item is added to the conversation, either through a client conversation.item.create event or automatically during response generation.

Event Structure

{
  "type": "conversation.item.created",
  "previous_item_id": "item_ABC123",
  "item": {
    "id": "item_DEF456",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_text",
        "text": "Hello, how are you?"
      }
    ]
  }
}

Properties

Field	Type	Description
type	string	Must be `"conversation.item.created"`
previous_item_id	string	ID of the item after which this item was inserted
item	RealtimeConversationResponseItem	The created conversation item

Example with Audio Item

{
  "type": "conversation.item.created",
  "item": {
    "id": "item_GHI789",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "audio": null,
        "transcript": "What's the weather like today?"
      }
    ]
  }
}

conversation.item.retrieved

Sent in response to a conversation.item.retrieve client event, providing the requested conversation item.

Event Structure

{
  "type": "conversation.item.retrieved",
  "item": {
    "id": "item_ABC123",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "assistant",
    "content": [
      {
        "type": "audio",
        "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YQAAAAA=",
        "transcript": "Hello! I'm doing well, thank you for asking. How can I help you today?"
      }
    ]
  }
}

Properties

Field	Type	Description
type	string	Must be `"conversation.item.retrieved"`
item	RealtimeConversationResponseItem	The retrieved conversation item

conversation.item.truncated

The server conversation.item.truncated event is returned when the client truncates an earlier assistant audio message item with a conversation.item.truncate event. This event is used to synchronize the server's understanding of the audio with the client's playback.

This event truncates the audio and removes the server-side text transcript to ensure there's no text in the context that the user doesn't know about.

Event structure

{
  "type": "conversation.item.truncated",
  "item_id": "<item_id>",
  "content_index": 0,
  "audio_end_ms": 0
}

Properties

Field	Type	Description
type	string	The event type must be `conversation.item.truncated`.
item_id	string	The ID of the assistant message item that was truncated.
content_index	integer	The index of the content part that was truncated.
audio_end_ms	integer	The duration up to which the audio was truncated, in milliseconds.

conversation.item.deleted

Sent in response to a conversation.item.delete client event, confirming that the specified item was removed from the conversation.

Event Structure

{
  "type": "conversation.item.deleted",
  "item_id": "item_ABC123"
}

Properties

Field	Type	Description
type	string	Must be `"conversation.item.deleted"`
item_id	string	ID of the deleted item

response.created

Sent when a new response generation begins. This is the first event in a response sequence.

Event Structure

{
  "type": "response.created",
  "response": {
    "id": "resp_ABC123",
    "object": "realtime.response",
    "status": "in_progress",
    "status_details": null,
    "output": [],
    "usage": {
      "total_tokens": 0,
      "input_tokens": 0,
      "output_tokens": 0
    }
  }
}

Properties

Field	Type	Description
type	string	Must be `"response.created"`
response	RealtimeResponse	The response object that was created

response.done

Sent when response generation is complete. This event contains the final response with all output items and usage statistics.

Event Structure

{
  "type": "response.done",
  "response": {
    "id": "resp_ABC123",
    "object": "realtime.response",
    "status": "completed",
    "status_details": null,
    "output": [
      {
        "id": "item_DEF456",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "Hello! I'm doing well, thank you for asking. How can I help you today?"
          }
        ]
      }
    ],
    "usage": {
      "total_tokens": 87,
      "input_tokens": 52,
      "output_tokens": 35,
      "input_token_details": {
        "cached_tokens": 0,
        "text_tokens": 45,
        "audio_tokens": 7
      },
      "output_token_details": {
        "text_tokens": 15,
        "audio_tokens": 20
      }
    }
  }
}

Properties

Field	Type	Description
type	string	Must be `"response.done"`
response	RealtimeResponse	The completed response object

response.output_item.added

Sent when a new output item is added to the response during generation.

Event Structure

{
  "type": "response.output_item.added",
  "response_id": "resp_ABC123",
  "output_index": 0,
  "item": {
    "id": "item_DEF456",
    "object": "realtime.item",
    "type": "message",
    "status": "in_progress",
    "role": "assistant",
    "content": []
  }
}

Properties

Field	Type	Description
type	string	Must be `"response.output_item.added"`
response_id	string	ID of the response this item belongs to
output_index	integer	Index of the item in the response's output array
item	RealtimeConversationResponseItem	The output item that was added

response.output_item.done

Sent when an output item is complete.

Event Structure

{
  "type": "response.output_item.done",
  "response_id": "resp_ABC123",
  "output_index": 0,
  "item": {
    "id": "item_DEF456",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "Hello! I'm doing well, thank you for asking."
      }
    ]
  }
}

Properties

Field	Type	Description
type	string	Must be `"response.output_item.done"`
response_id	string	ID of the response this item belongs to
output_index	integer	Index of the item in the response's output array
item	RealtimeConversationResponseItem	The completed output item

response.content_part.added

The server response.content_part.added event is returned when a new content part is added to an assistant message item during response generation.

Event Structure

{
  "type": "response.content_part.added",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "part": {
    "type": "text",
    "text": ""
  }
}

Properties

Field	Type	Description
type	string	Must be `"response.content_part.added"`
response_id	string	ID of the response
item_id	string	ID of the item this content part belongs to
output_index	integer	Index of the item in the response
content_index	integer	Index of this content part in the item
part	RealtimeContentPart	The content part that was added

response.content_part.done

The server response.content_part.done event is returned when a content part is done streaming in an assistant message item.

This event is also returned when a response is interrupted, incomplete, or canceled.

Event Structure

{
  "type": "response.content_part.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "part": {
    "type": "text",
    "text": "Hello! I'm doing well, thank you for asking."
  }
}

Properties

Field	Type	Description
type	string	Must be `"response.content_part.done"`
response_id	string	ID of the response
item_id	string	ID of the item this content part belongs to
output_index	integer	Index of the item in the response
content_index	integer	Index of this content part in the item
part	RealtimeContentPart	The completed content part

response.text.delta

Streaming text content from the model. Sent incrementally as the model generates text.

Event Structure

{
  "type": "response.text.delta",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "delta": "Hello! I'm"
}

Properties

Field	Type	Description
type	string	Must be `"response.text.delta"`
response_id	string	ID of the response
item_id	string	ID of the item
output_index	integer	Index of the item in the response
content_index	integer	Index of the content part
delta	string	Incremental text content

response.text.done

Sent when text content generation is complete.

Event Structure

{
  "type": "response.text.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "text": "Hello! I'm doing well, thank you for asking. How can I help you today?"
}

Properties

Field	Type	Description
type	string	Must be `"response.text.done"`
response_id	string	ID of the response
item_id	string	ID of the item
output_index	integer	Index of the item in the response
content_index	integer	Index of the content part
text	string	The complete text content
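
The response.text.delta and response.text.done events can be combined on the client by buffering each delta under its item and content part, then treating the text field of the done event as the final value. The following is a minimal sketch, assuming events arrive as already-parsed dictionaries; the TextAssembler class and its method names are hypothetical and not part of the API.

```python
from collections import defaultdict


class TextAssembler:
    """Hypothetical helper: buffers response.text.delta chunks per content part."""

    def __init__(self):
        # Key: (item_id, content_index) -> list of delta strings.
        self._chunks = defaultdict(list)
        # Key: (item_id, content_index) -> final text from response.text.done.
        self.completed = {}

    def handle(self, event: dict) -> None:
        key = (event.get("item_id"), event.get("content_index"))
        if event["type"] == "response.text.delta":
            self._chunks[key].append(event["delta"])
        elif event["type"] == "response.text.done":
            # Prefer the server's final text and drop the partial buffer.
            self._chunks.pop(key, None)
            self.completed[key] = event["text"]

    def partial(self, item_id: str, content_index: int) -> str:
        """Return the text streamed so far for a content part that is not done."""
        return "".join(self._chunks.get((item_id, content_index), []))
```

A UI could call partial() after each delta to refresh a live caption and switch to completed once the done event arrives.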

response.audio.delta

Streaming audio content from the model. Audio is provided as base64-encoded data.

Event structure

{
  "type": "response.audio.delta",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "delta": "UklGRiQAAABXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YQAAAAA="
}

Properties

Field	Type	Description
type	string	Must be `"response.audio.delta"`
response_id	string	ID of the response
item_id	string	ID of the item
output_index	integer	Index of the item in the response
content_index	integer	Index of the content part
delta	string	Base64-encoded audio data chunk
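
Because each delta is base64-encoded, a client typically decodes every chunk and appends the bytes to a playback buffer. The following is a minimal sketch using the sample delta value from the event structure above; the append_audio_delta function and the buffer variable are hypothetical names, and the sketch makes no assumption about the audio format negotiated for the session.

```python
import base64


def append_audio_delta(buffer: bytearray, event: dict) -> int:
    """Decode one response.audio.delta event and append its bytes to buffer.

    Returns the number of bytes appended.
    """
    chunk = base64.b64decode(event["delta"])
    buffer.extend(chunk)
    return len(chunk)


# The delta value copied from the sample event in this section.
sample_event = {
    "type": "response.audio.delta",
    "delta": "UklGRiQAAABXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YQAAAAA=",
}
playback_buffer = bytearray()
appended = append_audio_delta(playback_buffer, sample_event)
```

Keeping one buffer per item_id is a reasonable extension when several output items can carry audio.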

response.audio.done

Sent when audio content generation is complete.

Event structure

{
  "type": "response.audio.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0
}

Properties

Field	Type	Description
type	string	Must be `"response.audio.done"`
response_id	string	ID of the response
item_id	string	ID of the item
output_index	integer	Index of the item in the response
content_index	integer	Index of the content part

response.audio_transcript.delta

Streaming transcript of the generated audio content.

Event structure

{
  "type": "response.audio_transcript.delta",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "delta": "Hello! I'm doing"
}

Properties

Field	Type	Description
type	string	Must be `"response.audio_transcript.delta"`
response_id	string	ID of the response
item_id	string	ID of the item
output_index	integer	Index of the item in the response
content_index	integer	Index of the content part
delta	string	Incremental transcript text

response.audio_transcript.done

Sent when audio transcript generation is complete.

Event structure

{
  "type": "response.audio_transcript.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "transcript": "Hello! I'm doing well, thank you for asking. How can I help you today?"
}

Properties

Field	Type	Description
type	string	Must be `"response.audio_transcript.done"`
response_id	string	ID of the response
item_id	string	ID of the item
output_index	integer	Index of the item in the response
content_index	integer	Index of the content part
transcript	string	The complete transcript text

conversation.item.input_audio_transcription.completed

The server conversation.item.input_audio_transcription.completed event is the result of audio transcription for speech written to the audio buffer.

Transcription begins when the input audio buffer is committed by the client or server (in server_vad mode). Transcription runs asynchronously with response creation, so this event can come before or after the response events.

Realtime API models accept audio natively, so input transcription runs as a separate process on a separate speech recognition model such as whisper-1. The transcript can therefore diverge somewhat from the model's interpretation and should be treated as a rough guide.

Event structure

{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "<item_id>",
  "content_index": 0,
  "transcript": "<transcript>"
}

Properties

Field	Type	Description
type	string	The event type must be `conversation.item.input_audio_transcription.completed`.
item_id	string	The ID of the user message item containing the audio.
content_index	integer	The index of the content part containing the audio.
transcript	string	The transcribed text.
logprobs	array of LogProbProperties	Optional. The log probabilities of the transcription tokens.
phrases	array of TranscriptionPhrase	Optional. The transcription phrases with timing information.

conversation.item.input_audio_transcription.delta

The server conversation.item.input_audio_transcription.delta event is returned when input audio transcription is configured, and a transcription request for a user message is in progress. This event provides partial transcription results as they become available.

Event structure

{
  "type": "conversation.item.input_audio_transcription.delta",
  "item_id": "<item_id>",
  "content_index": 0,
  "delta": "<delta>"
}

Properties

Field	Type	Description
type	string	The event type must be `conversation.item.input_audio_transcription.delta`.
item_id	string	The ID of the user message item.
content_index	integer	The index of the content part containing the audio.
delta	string	The incremental transcription text.

conversation.item.input_audio_transcription.failed

The server conversation.item.input_audio_transcription.failed event is returned when input audio transcription is configured, and a transcription request for a user message failed. This event is separate from other error events so that the client can identify the related item.

Event structure

{
  "type": "conversation.item.input_audio_transcription.failed",
  "item_id": "<item_id>",
  "content_index": 0,
  "error": {
    "type": "<type>",
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}

Properties

Field	Type	Description
type	string	The event type must be `conversation.item.input_audio_transcription.failed`.
item_id	string	The ID of the user message item.
content_index	integer	The index of the content part containing the audio.
error	object	Details of the transcription error. See nested properties in the next table.

Error properties

Field	Type	Description
type	string	The type of error.
code	string	Error code, if any.
message	string	A human-readable error message.
param	string	Parameter related to the error, if any.
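
The three input transcription events (delta, completed, and failed) all carry the item_id of the user message, so a client can keep one record per item. The following is a minimal sketch, assuming parsed event dictionaries; the InputTranscripts class is a hypothetical helper, not part of the API.

```python
class InputTranscripts:
    """Hypothetical per-item store for input audio transcription results."""

    PREFIX = "conversation.item.input_audio_transcription."

    def __init__(self):
        self.partial = {}  # item_id -> text accumulated from delta events
        self.final = {}    # item_id -> transcript from the completed event
        self.errors = {}   # item_id -> error object from the failed event

    def handle(self, event: dict) -> None:
        if not event["type"].startswith(self.PREFIX):
            return  # Not a transcription event; ignore it here.
        kind = event["type"][len(self.PREFIX):]
        item_id = event["item_id"]
        if kind == "delta":
            self.partial[item_id] = self.partial.get(item_id, "") + event["delta"]
        elif kind == "completed":
            self.partial.pop(item_id, None)
            self.final[item_id] = event["transcript"]
        elif kind == "failed":
            self.partial.pop(item_id, None)
            self.errors[item_id] = event["error"]
```

Because transcription runs asynchronously with response creation, the store is keyed by item rather than by arrival order.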

response.animation_blendshapes.delta

The server response.animation_blendshapes.delta event is returned when the model generates animation blendshapes data as part of a response. This event provides incremental blendshapes data as it becomes available.

Event structure

{
  "type": "response.animation_blendshapes.delta",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "frame_index": 0,
  "frames": [
    [0.0, 0.1, 0.2, ..., 1.0]
    ...
  ]
}

Properties

Field	Type	Description
type	string	The event type must be `response.animation_blendshapes.delta`.
response_id	string	ID of the response
item_id	string	ID of the item
output_index	integer	Index of the item in the response
content_index	integer	Index of the content part
frame_index	integer	Index of the first frame in this batch of frames
frames	array of array of float	Array of blendshape frames; each frame is an array of blendshape values
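
Since frame_index gives the index of the first frame in each batch, a client can place every frame at an absolute position regardless of how the batches are split. The following is a minimal sketch, assuming parsed event dictionaries; store_blendshape_frames and the timeline dictionary are hypothetical names.

```python
def store_blendshape_frames(timeline: dict, event: dict) -> None:
    """Place each frame of a delta event at its absolute frame index."""
    first = event["frame_index"]
    for offset, frame in enumerate(event["frames"]):
        timeline[first + offset] = frame
```

A renderer can then read frames from the timeline by index while later batches are still arriving.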

response.animation_blendshapes.done

The server response.animation_blendshapes.done event is returned when the model has finished generating animation blendshapes data as part of a response.

Event structure

{
  "type": "response.animation_blendshapes.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.animation_blendshapes.done`.
response_id	string	ID of the response
item_id	string	ID of the item
output_index	integer	Index of the item in the response

response.audio_timestamp.delta

The server response.audio_timestamp.delta event is returned when the model generates audio timestamp data as part of a response. This event provides incremental timestamp data for output audio and text alignment as it becomes available.

Event structure

{
  "type": "response.audio_timestamp.delta",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "audio_offset_ms": 0,
  "audio_duration_ms": 500,
  "text": "Hello",
  "timestamp_type": "word"
}

Properties

Field	Type	Description
type	string	The event type must be `response.audio_timestamp.delta`.
response_id	string	ID of the response
item_id	string	ID of the item
output_index	integer	Index of the item in the response
content_index	integer	Index of the content part
audio_offset_ms	integer	Audio offset in milliseconds from the start of the audio
audio_duration_ms	integer	Duration of the audio segment in milliseconds
text	string	The text segment corresponding to this audio timestamp
timestamp_type	string	The type of timestamp; currently only "word" is supported
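
Each timestamp event describes a span that starts at audio_offset_ms and lasts audio_duration_ms, which is enough to highlight the word being spoken at a given playback position. The following is a minimal sketch, assuming the client keeps the received events in a list; word_at is a hypothetical helper, and treating the end of the span as exclusive is an assumption.

```python
def word_at(timestamps: list, position_ms: int):
    """Return the text whose audio span contains position_ms, or None."""
    for event in timestamps:
        start = event["audio_offset_ms"]
        end = start + event["audio_duration_ms"]
        # Assumption: spans are half-open, [start, end).
        if start <= position_ms < end:
            return event["text"]
    return None
```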

response.audio_timestamp.done

Sent when audio timestamp generation is complete.

Event structure

{
  "type": "response.audio_timestamp.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.audio_timestamp.done`.
response_id	string	ID of the response
item_id	string	ID of the item
output_index	integer	Index of the item in the response
content_index	integer	Index of the content part

response.animation_viseme.delta

The server response.animation_viseme.delta event is returned when the model generates animation viseme data as part of a response. This event provides incremental viseme data as it becomes available.

Event structure

{
  "type": "response.animation_viseme.delta",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0,
  "audio_offset_ms": 0,
  "viseme_id": 1
}

Properties

Field	Type	Description
type	string	The event type must be `response.animation_viseme.delta`.
response_id	string	ID of the response
item_id	string	ID of the item
output_index	integer	Index of the item in the response
content_index	integer	Index of the content part
audio_offset_ms	integer	Audio offset in milliseconds from the start of the audio
viseme_id	integer	The viseme ID corresponding to the mouth shape for animation
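
To drive mouth animation, a client needs to know which viseme applies at the current playback position. The following is a minimal sketch that keeps visemes sorted by audio_offset_ms and looks them up with a binary search. The VisemeTrack class is hypothetical, and the sketch assumes that a viseme stays active until the next viseme's offset.

```python
import bisect


class VisemeTrack:
    """Hypothetical lookup table for response.animation_viseme.delta events."""

    def __init__(self):
        self._offsets = []  # sorted audio_offset_ms values
        self._ids = []      # viseme_id stored at the matching position

    def add(self, event: dict) -> None:
        # Insert in sorted order so lookups stay correct even if events
        # are recorded out of order.
        pos = bisect.bisect_right(self._offsets, event["audio_offset_ms"])
        self._offsets.insert(pos, event["audio_offset_ms"])
        self._ids.insert(pos, event["viseme_id"])

    def active_at(self, position_ms: int):
        """Return the viseme_id in effect at position_ms, or None before the first."""
        pos = bisect.bisect_right(self._offsets, position_ms)
        return self._ids[pos - 1] if pos else None
```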

response.animation_viseme.done

The server response.animation_viseme.done event is returned when the model has finished generating animation viseme data as part of a response.

Event structure

{
  "type": "response.animation_viseme.done",
  "response_id": "resp_ABC123",
  "item_id": "item_DEF456",
  "output_index": 0,
  "content_index": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.animation_viseme.done`.
response_id	string	ID of the response
item_id	string	ID of the item
output_index	integer	Index of the item in the response
content_index	integer	Index of the content part

error

The server error event is returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session stays open.

Event structure

{
  "type": "error",
  "error": {
    "type": "<type>",
    "code": "<code>",
    "message": "<message>",
    "param": "<param>",
    "event_id": "<event_id>"
  }
}

Properties

Field	Type	Description
type	string	The event type must be `error`.
error	object	Details of the error. See nested properties in the next table.

Error properties

Field	Type	Description
type	string	The type of error. For example, "invalid_request_error" and "server_error" are error types.
code	string	Error code, if any.
message	string	A human-readable error message.
param	string	Parameter related to the error, if any.
event_id	string	The ID of the client event that caused the error, if applicable.
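
The event_id property makes it possible to trace an error back to the client event that triggered it. The following is a minimal sketch of that correlation. It assumes the client attaches its own event_id to each outgoing event and keeps a record of what it sent; the SentEventLog class and the evt_ prefix are hypothetical.

```python
import uuid


class SentEventLog:
    """Hypothetical record of client events, keyed by event_id."""

    def __init__(self):
        self._sent = {}

    def prepare(self, event: dict) -> dict:
        """Return a copy of event with an event_id attached, and remember it."""
        # Assumption: client events accept a client-chosen event_id field.
        tagged = dict(event, event_id=f"evt_{uuid.uuid4().hex}")
        self._sent[tagged["event_id"]] = tagged
        return tagged

    def describe_error(self, server_event: dict) -> str:
        """Build a log line that names the client event behind an error."""
        err = server_event["error"]
        source = self._sent.get(err.get("event_id"))
        origin = source["type"] if source else "unknown client event"
        return f'{err.get("type", "error")}: {err["message"]} (caused by {origin})'
```

Because most errors leave the session open, logging the message and continuing is often sufficient; whether to retry the originating event is an application decision.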

warning

The server warning event is returned when a warning occurs that doesn't interrupt the conversation flow. Warnings are informational and the session continues normally.

Event structure

{
  "type": "warning",
  "warning": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}

Properties

Field	Type	Description
type	string	The event type must be `warning`.
warning	object	Details of the warning. See nested properties in the next table.

Warning properties

Field	Type	Description
message	string	A human-readable warning message.
code	string	Optional. Warning code, if any.
param	string	Optional. Parameter related to the warning, if any.

input_audio_buffer.cleared

The server input_audio_buffer.cleared event is returned when the client clears the input audio buffer with an input_audio_buffer.clear event.

Event structure

{
  "type": "input_audio_buffer.cleared"
}

Properties

Field	Type	Description
type	string	The event type must be `input_audio_buffer.cleared`.

input_audio_buffer.committed

The server input_audio_buffer.committed event is returned when an input audio buffer is committed, either by the client or automatically in server VAD mode. The item_id property is the ID of the user message item that is created, so a conversation.item.created event is also sent to the client.

Event structure

{
  "type": "input_audio_buffer.committed",
  "previous_item_id": "<previous_item_id>",
  "item_id": "<item_id>"
}

Properties

Field	Type	Description
type	string	The event type must be `input_audio_buffer.committed`.
previous_item_id	string	The ID of the preceding item after which the new item is inserted.
item_id	string	The ID of the user message item created.

input_audio_buffer.speech_started

The server input_audio_buffer.speech_started event is returned in server_vad mode when speech is detected in the audio buffer. This event can happen any time audio is added to the buffer (unless speech is already detected).

Note

The client might want to use this event to interrupt audio playback or provide visual feedback to the user.

The client should expect to receive an input_audio_buffer.speech_stopped event when speech stops. The item_id property is the ID of the user message item created when speech stops. The item_id is also included in the input_audio_buffer.speech_stopped event unless the client manually commits the audio buffer during VAD activation.

Event structure

{
  "type": "input_audio_buffer.speech_started",
  "audio_start_ms": 0,
  "item_id": "<item_id>"
}

Properties

Field	Type	Description
type	string	The event type must be `input_audio_buffer.speech_started`.
audio_start_ms	integer	Milliseconds from the start of all audio written to the buffer during the session when speech was first detected. This property corresponds to the beginning of audio sent to the model, and thus includes the `prefix_padding_ms` configured in the session.
item_id	string	The ID of the user message item created when speech stops.

input_audio_buffer.speech_stopped

The server input_audio_buffer.speech_stopped event is returned in server_vad mode when the server detects the end of speech in the audio buffer.

The server also sends a conversation.item.created event with the user message item created from the audio buffer.

Event structure

{
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 0,
  "item_id": "<item_id>"
}

Properties

Field	Type	Description
type	string	The event type must be `input_audio_buffer.speech_stopped`.
audio_end_ms	integer	Milliseconds since the session started when speech stopped. This property corresponds to the end of audio sent to the model, and thus includes the `min_silence_duration_ms` configured in the session.
item_id	string	The ID of the user message item created.
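
The speech_started and speech_stopped events can be paired by item_id to interrupt playback and to estimate how long the user spoke. The following is a minimal sketch, assuming parsed event dictionaries and assuming that audio_start_ms and audio_end_ms are measured on the same timeline; the SpeechSegments class is hypothetical.

```python
class SpeechSegments:
    """Hypothetical tracker that pairs speech_started with speech_stopped."""

    def __init__(self):
        self._open = {}      # item_id -> audio_start_ms
        self.durations = {}  # item_id -> approximate speech duration in ms

    def handle(self, event: dict) -> bool:
        """Return True when the client should stop assistant audio playback."""
        if event["type"] == "input_audio_buffer.speech_started":
            self._open[event["item_id"]] = event["audio_start_ms"]
            return True
        if event["type"] == "input_audio_buffer.speech_stopped":
            # item_id can be absent if the client committed the buffer manually.
            item_id = event.get("item_id")
            start = self._open.pop(item_id, None)
            if start is not None:
                self.durations[item_id] = event["audio_end_ms"] - start
        return False
```

The computed duration includes the configured padding and silence windows, so it is an upper estimate of the spoken audio.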

rate_limits.updated

The server rate_limits.updated event is emitted at the beginning of a response to indicate the updated rate limits.

When a response is created, some tokens are reserved for the output tokens. The rate limits shown here reflect that reservation, which is then adjusted accordingly once the response is completed.

Event structure

{
  "type": "rate_limits.updated",
  "rate_limits": [
    {
      "name": "<name>",
      "limit": 0,
      "remaining": 0,
      "reset_seconds": 0
    }
  ]
}

Properties

Field	Type	Description
type	string	The event type must be `rate_limits.updated`.
rate_limits	array of RealtimeRateLimitsItem	The list of rate limit information.
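
A client can reduce the rate_limits array to a lookup table to decide when to slow down. The following is a minimal sketch that relies only on the fields shown in the event structure above (name, limit, remaining, reset_seconds); summarize_rate_limits is a hypothetical helper, and the limit names used to call it are placeholders.

```python
def summarize_rate_limits(event: dict) -> dict:
    """Map each limit name to (remaining fraction, reset_seconds)."""
    summary = {}
    for item in event["rate_limits"]:
        limit = item["limit"]
        # Guard against a zero limit to avoid division by zero.
        fraction = item["remaining"] / limit if limit else 0.0
        summary[item["name"]] = (fraction, item["reset_seconds"])
    return summary
```

An application might, for example, delay new responses by reset_seconds when the remaining fraction falls below a threshold of its choosing.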

response.audio.delta

The server response.audio.delta event is returned when the model-generated audio is updated.

Event structure

{
  "type": "response.audio.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "delta": "<delta>"
}

Properties

Field	Type	Description
type	string	The event type must be `response.audio.delta`.
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	integer	The index of the output item in the response.
content_index	integer	The index of the content part in the item's content array.
delta	string	Base64-encoded audio data delta.

response.audio.done

The server response.audio.done event is returned when the model-generated audio is done.

This event is also returned when a response is interrupted, incomplete, or cancelled.

Event structure

{
  "type": "response.audio.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.audio.done`.
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	integer	The index of the output item in the response.
content_index	integer	The index of the content part in the item's content array.

response.audio_transcript.delta

The server response.audio_transcript.delta event is returned when the model-generated transcription of audio output is updated.

Event structure

{
  "type": "response.audio_transcript.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "delta": "<delta>"
}

Properties

Field	Type	Description
type	string	The event type must be `response.audio_transcript.delta`.
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	integer	The index of the output item in the response.
content_index	integer	The index of the content part in the item's content array.
delta	string	The transcript delta.

response.audio_transcript.done

The server response.audio_transcript.done event is returned when the model-generated transcription of audio output is done streaming.

This event is also returned when a response is interrupted, incomplete, or cancelled.

Event structure

{
  "type": "response.audio_transcript.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "transcript": "<transcript>"
}

Properties

Field	Type	Description
type	string	The event type must be `response.audio_transcript.done`.
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	integer	The index of the output item in the response.
content_index	integer	The index of the content part in the item's content array.
transcript	string	The final transcript of the audio.

response.function_call_arguments.delta

The server response.function_call_arguments.delta event is returned when the model-generated function call arguments are updated.

Event structure

{
  "type": "response.function_call_arguments.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "call_id": "<call_id>",
  "delta": "<delta>"
}

Properties

Field	Type	Description
type	string	The event type must be `response.function_call_arguments.delta`.
response_id	string	The ID of the response.
item_id	string	The ID of the function call item.
output_index	integer	The index of the output item in the response.
call_id	string	The ID of the function call.
delta	string	The arguments delta as a JSON string.

response.function_call_arguments.done

The server response.function_call_arguments.done event is returned when the model-generated function call arguments are done streaming.

This event is also returned when a response is interrupted, incomplete, or cancelled.

Event structure

{
  "type": "response.function_call_arguments.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "call_id": "<call_id>",
  "arguments": "<arguments>"
}

Properties

Field	Type	Description
type	string	The event type must be `response.function_call_arguments.done`.
response_id	string	The ID of the response.
item_id	string	The ID of the function call item.
output_index	integer	The index of the output item in the response.
call_id	string	The ID of the function call.
arguments	string	The final arguments as a JSON string.
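
Argument deltas are fragments of a JSON string, so they are only useful for progress display; the arguments field of the done event is the value to parse. The following is a minimal sketch, assuming parsed event dictionaries; the FunctionCallBuffer class is hypothetical. Because the done event is also returned for interrupted or cancelled responses, the sketch assumes the final string might not be valid JSON and records None in that case.

```python
import json


class FunctionCallBuffer:
    """Hypothetical accumulator for function call argument events."""

    def __init__(self):
        self._partial = {}  # call_id -> concatenated argument deltas
        self.ready = {}     # call_id -> parsed arguments, or None if unparsable

    def handle(self, event: dict) -> None:
        call_id = event.get("call_id")
        if event["type"] == "response.function_call_arguments.delta":
            self._partial[call_id] = self._partial.get(call_id, "") + event["delta"]
        elif event["type"] == "response.function_call_arguments.done":
            self._partial.pop(call_id, None)
            try:
                self.ready[call_id] = json.loads(event["arguments"])
            except json.JSONDecodeError:
                # Assumption: an interrupted response can leave truncated JSON.
                self.ready[call_id] = None
```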

mcp_list_tools.in_progress

The server mcp_list_tools.in_progress event is returned when the service starts listing available tools from an MCP server.

Event structure

{
  "type": "mcp_list_tools.in_progress",
  "item_id": "<mcp_list_tools_item_id>"
}

Properties

Field	Type	Description
type	string	The event type must be `mcp_list_tools.in_progress`.
item_id	string	The ID of the MCP list tools item being processed.

mcp_list_tools.completed

The server mcp_list_tools.completed event is returned when the service completes listing available tools from an MCP server.

Event structure

{
  "type": "mcp_list_tools.completed",
  "item_id": "<mcp_list_tools_item_id>"
}

Properties

Field	Type	Description
type	string	The event type must be `mcp_list_tools.completed`.
item_id	string	The ID of the MCP list tools item being processed.

mcp_list_tools.failed

The server mcp_list_tools.failed event is returned when the service fails to list available tools from an MCP server.

Event structure

{
  "type": "mcp_list_tools.failed",
  "item_id": "<mcp_list_tools_item_id>"
}

Properties

Field	Type	Description
type	string	The event type must be `mcp_list_tools.failed`.
item_id	string	The ID of the MCP list tools item being processed.

response.mcp_call_arguments.delta

The server response.mcp_call_arguments.delta event is returned when the model-generated MCP tool call arguments are updated.

Event structure

{
  "type": "response.mcp_call_arguments.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "delta": "<delta>"
}

Properties

Field	Type	Description
type	string	The event type must be `response.mcp_call_arguments.delta`.
response_id	string	The ID of the response.
item_id	string	The ID of the MCP tool call item.
output_index	integer	The index of the output item in the response.
delta	string	The arguments delta as a JSON string.

response.mcp_call_arguments.done

The server response.mcp_call_arguments.done event is returned when the model-generated MCP tool call arguments are done streaming.

Event structure

{
  "type": "response.mcp_call_arguments.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "arguments": "<arguments>"
}

Properties

Field	Type	Description
type	string	The event type must be `response.mcp_call_arguments.done`.
response_id	string	The ID of the response.
item_id	string	The ID of the MCP tool call item.
output_index	integer	The index of the output item in the response.
arguments	string	The final arguments as a JSON string.

response.mcp_call.in_progress

The server response.mcp_call.in_progress event is returned when an MCP tool call starts processing.

Event structure

{
  "type": "response.mcp_call.in_progress",
  "item_id": "<item_id>",
  "output_index": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.mcp_call.in_progress`.
item_id	string	The ID of the MCP tool call item.
output_index	integer	The index of the output item in the response.

response.mcp_call.completed

The server response.mcp_call.completed event is returned when an MCP tool call completes successfully.

Event structure

{
  "type": "response.mcp_call.completed",
  "item_id": "<item_id>",
  "output_index": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.mcp_call.completed`.
item_id	string	The ID of the MCP tool call item.
output_index	integer	The index of the output item in the response.

response.mcp_call.failed

The server response.mcp_call.failed event is returned when an MCP tool call fails.

Event structure

{
  "type": "response.mcp_call.failed",
  "item_id": "<item_id>",
  "output_index": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.mcp_call.failed`.
item_id	string	The ID of the MCP tool call item.
output_index	integer	The index of the output item in the response.
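
The three response.mcp_call lifecycle events share the same shape, so a client can track the status of each MCP tool call with a small lookup table. The following is a minimal sketch, assuming parsed event dictionaries; MCP_CALL_STATES and update_mcp_call_status are hypothetical names.

```python
# Maps each lifecycle event type to a short status label.
MCP_CALL_STATES = {
    "response.mcp_call.in_progress": "in_progress",
    "response.mcp_call.completed": "completed",
    "response.mcp_call.failed": "failed",
}


def update_mcp_call_status(statuses: dict, event: dict) -> None:
    """Record the latest status of an MCP tool call, keyed by item_id."""
    state = MCP_CALL_STATES.get(event["type"])
    if state is not None:
        statuses[event["item_id"]] = state
```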

response.foundry_agent_call_arguments.delta

The server response.foundry_agent_call_arguments.delta event is returned when the model-generated foundry agent call arguments are updated.

Event structure

{
  "type": "response.foundry_agent_call_arguments.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "delta": "<delta>"
}

Properties

Field	Type	Description
type	string	The event type must be `response.foundry_agent_call_arguments.delta`.
response_id	string	The ID of the response.
item_id	string	The ID of the foundry agent call item.
output_index	integer	The index of the output item in the response.
delta	string	The arguments delta as a JSON string.

response.foundry_agent_call_arguments.done

The server response.foundry_agent_call_arguments.done event is returned when the model-generated foundry agent call arguments are done streaming.

Event structure

{
  "type": "response.foundry_agent_call_arguments.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "arguments": "<arguments>"
}

Properties

Field	Type	Description
type	string	The event type must be `response.foundry_agent_call_arguments.done`.
response_id	string	The ID of the response.
item_id	string	The ID of the foundry agent call item.
output_index	integer	The index of the output item in the response.
arguments	string	The final arguments as a JSON string.

response.foundry_agent_call.in_progress

The server response.foundry_agent_call.in_progress event is returned when a foundry agent call starts processing.

Event structure

{
  "type": "response.foundry_agent_call.in_progress",
  "item_id": "<item_id>",
  "agent_response_id": "<agent_response_id>",
  "output_index": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.foundry_agent_call.in_progress`.
item_id	string	The ID of the foundry agent call item.
agent_response_id	string	The response ID from the foundry agent.
output_index	integer	The index of the output item in the response.

response.foundry_agent_call.completed

The server response.foundry_agent_call.completed event is returned when a foundry agent call completes successfully.

Event structure

{
  "type": "response.foundry_agent_call.completed",
  "item_id": "<item_id>",
  "agent_response_id": "<agent_response_id>",
  "output_index": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.foundry_agent_call.completed`.
item_id	string	The ID of the foundry agent call item.
agent_response_id	string	The response ID from the foundry agent.
output_index	integer	The index of the output item in the response.

response.foundry_agent_call.failed

The server response.foundry_agent_call.failed event is returned when a foundry agent call fails.

Event structure

{
  "type": "response.foundry_agent_call.failed",
  "item_id": "<item_id>",
  "output_index": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.foundry_agent_call.failed`.
item_id	string	The ID of the foundry agent call item.
output_index	integer	The index of the output item in the response.

response.output_item.added

The server response.output_item.added event is returned when a new item is created during response generation.

Event structure

{
  "type": "response.output_item.added",
  "response_id": "<response_id>",
  "output_index": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.output_item.added`.
response_id	string	The ID of the response to which the item belongs.
output_index	integer	The index of the output item in the response.
item	RealtimeConversationResponseItem	The item that was added.

response.output_item.done

The server response.output_item.done event is returned when an item is done streaming.

This event is also returned when a response is interrupted, incomplete, or cancelled.

Event structure

{
  "type": "response.output_item.done",
  "response_id": "<response_id>",
  "output_index": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.output_item.done`.
response_id	string	The ID of the response to which the item belongs.
output_index	integer	The index of the output item in the response.
item	RealtimeConversationResponseItem	The item that is done streaming.

response.text.delta

The server response.text.delta event is returned when the model-generated text is updated. The text corresponds to the text content part of an assistant message item.

Event structure

{
  "type": "response.text.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "delta": "<delta>"
}

Properties

Field	Type	Description
type	string	The event type must be `response.text.delta`.
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	integer	The index of the output item in the response.
content_index	integer	The index of the content part in the item's content array.
delta	string	The text delta.

response.text.done

The server response.text.done event is returned when the model-generated text is done streaming. The text corresponds to the text content part of an assistant message item.

This event is also returned when a response is interrupted, incomplete, or cancelled.

Event structure

{
  "type": "response.text.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "text": "<text>"
}

Properties

Field	Type	Description
type	string	The event type must be `response.text.done`.
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	integer	The index of the output item in the response.
content_index	integer	The index of the content part in the item's content array.
text	string	The final text content.

session.avatar.switch_to_speaking

Returned when the avatar transitions to the speaking state. Use this event to coordinate UI changes such as showing a speaking indicator.

Event structure

{
  "type": "session.avatar.switch_to_speaking",
  "turn_id": "<turn_id>"
}

Properties

Field	Type	Description
type	string	The event type must be `session.avatar.switch_to_speaking`.
turn_id	string	Optional. The ID of the turn associated with the avatar state change.

session.avatar.switch_to_idle

Returned when the avatar transitions to the idle state.

Event structure

{
  "type": "session.avatar.switch_to_idle",
  "turn_id": "<turn_id>"
}

Properties

Field	Type	Description
type	string	The event type must be `session.avatar.switch_to_idle`.
turn_id	string	Optional. The ID of the turn associated with the avatar state change.

response.video.delta

Returned when avatar video frame data is streamed to the client. The frame payload is base64-encoded and uses the codec indicated by the codec field.

Event structure

{
  "type": "response.video.delta",
  "output_index": 0,
  "codec": "h264",
  "delta": "<base64_encoded_video_frame>"
}

Properties

Field	Type	Description
type	string	The event type must be `response.video.delta`.
output_index	integer	The index of the output item in the response.
codec	string	The codec used for the video data (for example, `h264`).
delta	string	The base64-encoded video frame data.

response.web_search_call.searching

Returned when a web search tool call enters the searching state.

Event structure

{
  "type": "response.web_search_call.searching",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "sequence_number": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.web_search_call.searching`.
response_id	string	The ID of the response.
item_id	string	The ID of the web search call item.
output_index	integer	The index of the output item in the response.
sequence_number	integer	The sequence number of the web search call.

response.web_search_call.in_progress

Returned when a web search tool call is in progress.

Event structure

{
  "type": "response.web_search_call.in_progress",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "sequence_number": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.web_search_call.in_progress`.
response_id	string	The ID of the response.
item_id	string	The ID of the web search call item.
output_index	integer	The index of the output item in the response.
sequence_number	integer	The sequence number of the web search call.

response.web_search_call.completed

Returned when a web search tool call has completed.

Event structure

{
  "type": "response.web_search_call.completed",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "sequence_number": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.web_search_call.completed`.
response_id	string	The ID of the response.
item_id	string	The ID of the web search call item.
output_index	integer	The index of the output item in the response.
sequence_number	integer	The sequence number of the web search call.

response.file_search_call.searching

Returned when a file search tool call enters the searching state.

Event structure

{
  "type": "response.file_search_call.searching",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "sequence_number": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.file_search_call.searching`.
response_id	string	The ID of the response.
item_id	string	The ID of the file search call item.
output_index	integer	The index of the output item in the response.
sequence_number	integer	The sequence number of the file search call.

response.file_search_call.in_progress

Returned when a file search tool call is in progress.

Event structure

{
  "type": "response.file_search_call.in_progress",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "sequence_number": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.file_search_call.in_progress`.
response_id	string	The ID of the response.
item_id	string	The ID of the file search call item.
output_index	integer	The index of the output item in the response.
sequence_number	integer	The sequence number of the file search call.

response.file_search_call.completed

Returned when a file search tool call has completed.

Event structure

{
  "type": "response.file_search_call.completed",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "sequence_number": 0
}

Properties

Field	Type	Description
type	string	The event type must be `response.file_search_call.completed`.
response_id	string	The ID of the response.
item_id	string	The ID of the file search call item.
output_index	integer	The index of the output item in the response.
sequence_number	integer	The sequence number of the file search call.
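
The web search and file search events above share one shape, so a client can track both with a single handler. The following sketch records the most recent state per `item_id`; `record_search_event` and the `states` dictionary are hypothetical names, and the sketch makes no assumption about the order in which the states arrive.

```python
import re

# Matches the six search lifecycle event types documented above.
_SEARCH_EVENT = re.compile(
    r"^response\.(web_search|file_search)_call\.(searching|in_progress|completed)$"
)


def record_search_event(states: dict, event: dict) -> bool:
    """Store the latest state for the call item; return False for other events."""
    match = _SEARCH_EVENT.match(event.get("type", ""))
    if match is None:
        return False
    tool, state = match.groups()
    states[event["item_id"]] = {
        "tool": tool,
        "state": state,
        "response_id": event["response_id"],
        "output_index": event["output_index"],
    }
    return True


states = {}
record_search_event(states, {
    "type": "response.web_search_call.searching",
    "response_id": "resp_1", "item_id": "item_1",
    "output_index": 0, "sequence_number": 0,
})
print(states["item_1"]["state"])  # searching
```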

output_audio_buffer.cleared

Returned when the output audio buffer is cleared in response to a client output_audio_buffer.clear event. In the current preview, this event is only emitted in avatar mode.

Event structure

{
  "type": "output_audio_buffer.cleared"
}

Properties

Field	Type	Description
type	string	The event type must be `output_audio_buffer.cleared`.

response.audio_transcript.annotation.added

Returned when an annotation (for example, a citation produced by a web or file search tool) is added to an audio transcript content part.

Event structure

{
  "type": "response.audio_transcript.annotation.added",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "annotation_index": 0,
  "annotation": {}
}

Properties

Field	Type	Description
type	string	The event type must be `response.audio_transcript.annotation.added`.
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	integer	The index of the output item in the response.
content_index	integer	The index of the content part in the item's content array.
annotation_index	integer	The index of the annotation.
annotation	object	The annotation object. The schema depends on the annotation source (for example, web search citation).
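
One possible way to consume these events is to group annotations by content part and order them by `annotation_index`. The helper below is a sketch under that assumption; `collect_annotations` is a hypothetical name, and the annotation objects are passed through untouched because their schema depends on the source.

```python
from collections import defaultdict


def collect_annotations(events):
    """Group annotations by (item_id, content_index), ordered by annotation_index."""
    grouped = defaultdict(dict)
    for event in events:
        if event.get("type") != "response.audio_transcript.annotation.added":
            continue  # ignore every other event type
        key = (event["item_id"], event["content_index"])
        grouped[key][event["annotation_index"]] = event["annotation"]
    return {
        key: [by_index[i] for i in sorted(by_index)]
        for key, by_index in grouped.items()
    }
```

Keying by index rather than appending keeps the result stable even if events are handled out of order.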

Components

Audio Formats

RealtimeAudioFormat

Base audio format used for input audio.

Allowed Values:

pcm16 - 16-bit PCM audio format
g711_ulaw - G.711 μ-law audio format
g711_alaw - G.711 A-law audio format

RealtimeOutputAudioFormat

Audio format used for output audio with specific sampling rates.

Allowed Values:

pcm16 - 16-bit PCM audio format at default sampling rate (24kHz)
pcm16_8000hz - 16-bit PCM audio format at 8kHz sampling rate
pcm16_16000hz - 16-bit PCM audio format at 16kHz sampling rate
g711_ulaw - G.711 μ-law (mu-law) audio format at 8kHz sampling rate
g711_alaw - G.711 A-law audio format at 8kHz sampling rate
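
The sampling rates above are enough to estimate playback duration from a byte count. The sketch below assumes mono audio, which this section doesn't state, and derives bytes per sample from the encodings (2 bytes for 16-bit PCM, 1 byte for G.711). `OUTPUT_FORMATS` and `audio_duration_ms` are illustrative names.

```python
# (sample_rate_hz, bytes_per_sample) for each RealtimeOutputAudioFormat value.
# Sample rates come from the list above. Mono audio is an ASSUMPTION.
OUTPUT_FORMATS = {
    "pcm16": (24000, 2),
    "pcm16_8000hz": (8000, 2),
    "pcm16_16000hz": (16000, 2),
    "g711_ulaw": (8000, 1),
    "g711_alaw": (8000, 1),
}


def audio_duration_ms(audio_format: str, num_bytes: int) -> float:
    """Estimate the duration in milliseconds of decoded (not base64) audio bytes."""
    sample_rate, bytes_per_sample = OUTPUT_FORMATS[audio_format]
    return num_bytes * 1000 / (bytes_per_sample * sample_rate)


print(audio_duration_ms("pcm16", 48000))  # 1000.0
```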

RealtimeAudioInputTranscriptionSettings

Configuration for input audio transcription.

Field	Type	Description
model	string	The transcription model. Supported with `gpt-realtime` and `gpt-realtime-mini`: `whisper-1`, `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-transcribe-diarize`, `mai-transcribe-1`. Supported with all other models and agents: `azure-speech` and `mai-transcribe-1`
language	string	Optional. Language code in BCP-47 format (for example, `en-US`) or ISO 639-1 format (for example, `en`), or a comma-separated list of languages for automatic detection (for example, `en,zh`). See Azure speech to text supported languages for recommended usage of this setting.
custom_speech	object	Optional configuration for custom speech models, only valid for `azure-speech` model.
phrase_list	string[]	Optional list of phrase hints to bias recognition, only valid for `azure-speech` model.
prompt	string	Optional prompt text to guide transcription, only valid for `whisper-1`, `gpt-4o-transcribe`, `gpt-4o-mini-transcribe` and `gpt-4o-transcribe-diarize` models.
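
Several fields in this table are valid only for particular transcription models. A client-side check along the following lines can catch mismatches before a session update is sent. This is a sketch: `check_transcription_settings` is a hypothetical helper, and the service remains the authority on what it accepts.

```python
# Fields that the table above marks as valid only for the azure-speech model.
AZURE_SPEECH_ONLY = ("custom_speech", "phrase_list")
# Models for which the table above allows the prompt field.
PROMPT_MODELS = {
    "whisper-1",
    "gpt-4o-transcribe",
    "gpt-4o-mini-transcribe",
    "gpt-4o-transcribe-diarize",
}


def check_transcription_settings(settings: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    model = settings.get("model")
    for field in AZURE_SPEECH_ONLY:
        if field in settings and model != "azure-speech":
            problems.append(f"{field} is only valid for the azure-speech model")
    if "prompt" in settings and model not in PROMPT_MODELS:
        problems.append(f"prompt is not valid for model {model!r}")
    return problems


print(check_transcription_settings(
    {"model": "azure-speech", "language": "en-US", "phrase_list": ["Voice Live"]}
))  # []
```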

RealtimeInputAudioNoiseReductionSettings

This can be:

A RealtimeOpenAINoiseReduction object
A RealtimeAzureDeepNoiseSuppression object

RealtimeOpenAINoiseReduction

OpenAI noise reduction configuration with explicit type field, only available for gpt-realtime and gpt-realtime-mini models.

Field	Type	Description
type	string	`near_field` or `far_field`

RealtimeAzureDeepNoiseSuppression

Configuration for input audio noise reduction.

Field	Type	Description
type	string	Must be `"azure_deep_noise_suppression"`

RealtimeInputAudioEchoCancellationSettings

Echo cancellation configuration for server-side audio processing.

Field	Type	Description
type	string	Must be `"server_echo_cancellation"`

Voice Configuration

RealtimeVoice

Union of all supported voice configurations.

This can be:

A RealtimeOpenAIVoice object
A RealtimeAzureVoice object

RealtimeOpenAIVoice

OpenAI voice configuration with explicit type field.

Field	Type	Description
type	string	Must be `"openai"`
name	string	OpenAI voice name: `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse`, `marin`, `cedar`

RealtimeAzureVoice

Base for Azure voice configurations. This is a discriminated union with different types:

RealtimeAzureStandardVoice

Azure standard voice configuration.

Field	Type	Description
type	string	Must be `"azure-standard"`
name	string	Voice name (can't be empty)
temperature	number	Optional. Temperature between 0.0 and 1.0
custom_lexicon_url	string	Optional. URL to custom lexicon
custom_text_normalization_url	string	Optional. URL to custom text normalization
prefer_locales	string[]	Optional. Preferred locales, which change the accent used for each language. If the value isn't set, TTS uses the default accent of each language. For example, when speaking English, TTS uses the American English accent, and when speaking Spanish, it uses the Mexican Spanish accent. If you set `prefer_locales` to `["en-GB", "es-ES"]`, the English accent is British English and the Spanish accent is European Spanish. TTS can still speak other languages, such as French and Chinese.
locale	string	Optional. Enforces the locale for TTS output. If set, TTS always uses the given locale to speak. For example, if you set `locale` to `en-US`, TTS always uses the American English accent to speak the text content, even if the text content is in another language. In that case, TTS outputs silence if the text content is in Chinese.
style	string	Optional. Voice style
pitch	string	Optional. Pitch adjustment for the voice output. Follows the same rules as the `pitch` attribute of the SSML `prosody` element (see Adjust prosody). Typical values: a named level (`x-low`, `low`, `medium`, `high`, `x-high`, `default`), a relative change (for example `+10%`, `-5%`, `+50Hz`, `-2st`), or an absolute frequency (for example `200Hz`).
rate	string	Optional. Speaking rate adjustment for the voice output. Follows the same rules as the `rate` attribute of the SSML `prosody` element (see Adjust prosody). Typical values: a named level (`x-slow`, `slow`, `medium`, `fast`, `x-fast`, `default`), a relative percentage (for example `+20%`, `-10%`), or a non-negative multiplier (for example `0.5`, `1.5`).
volume	string	Optional. Volume adjustment for the voice output. Follows the same rules as the `volume` attribute of the SSML `prosody` element (see Adjust prosody). Typical values: a named level (`silent`, `x-soft`, `soft`, `medium`, `loud`, `x-loud`, `default`), an absolute number from 0.0 to 100.0, or a relative change (for example `+10`, `-6dB`).
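
As a sketch of how the fields in this table fit together, the helper below assembles an `azure-standard` voice object and enforces the two constraints the table states: a non-empty name and a temperature between 0.0 and 1.0. `azure_standard_voice` is a hypothetical helper, and the voice name in the usage line is only an illustrative value.

```python
# Optional fields listed in the RealtimeAzureStandardVoice table above.
OPTIONAL_FIELDS = {
    "temperature", "custom_lexicon_url", "custom_text_normalization_url",
    "prefer_locales", "locale", "style", "pitch", "rate", "volume",
}


def azure_standard_voice(name: str, **options) -> dict:
    """Build an azure-standard voice object, rejecting obviously invalid input."""
    if not name:
        raise ValueError("name can't be empty")
    unknown = set(options) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    temperature = options.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    return {"type": "azure-standard", "name": name, **options}


# Illustrative values; pitch and rate use the SSML prosody syntax described above.
voice = azure_standard_voice("en-US-AvaNeural", temperature=0.8, rate="+20%", pitch="-2st")
```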

RealtimeAzureCustomVoice

Azure custom voice configuration (preferred for custom voices).

Field	Type	Description
type	string	Must be `"azure-custom"`
name	string	Voice name (can't be empty)
endpoint_id	string	Endpoint ID (can't be empty)
temperature	number	Optional. Temperature between 0.0 and 1.0
custom_lexicon_url	string	Optional. URL to custom lexicon
custom_text_normalization_url	string	Optional. URL to custom text normalization
prefer_locales	string[]	Optional. Preferred locales, which change the accent used for each language. If the value isn't set, TTS uses the default accent of each language. For example, when speaking English, TTS uses the American English accent, and when speaking Spanish, it uses the Mexican Spanish accent. If you set `prefer_locales` to `["en-GB", "es-ES"]`, the English accent is British English and the Spanish accent is European Spanish. TTS can still speak other languages, such as French and Chinese.
locale	string	Optional. Enforces the locale for TTS output. If set, TTS always uses the given locale to speak. For example, if you set `locale` to `en-US`, TTS always uses the American English accent to speak the text content, even if the text content is in another language. In that case, TTS outputs silence if the text content is in Chinese.
style	string	Optional. Voice style
pitch	string	Optional. Pitch adjustment for the voice output. Follows the same rules as the `pitch` attribute of the SSML `prosody` element (see Adjust prosody). Typical values: a named level (`x-low`, `low`, `medium`, `high`, `x-high`, `default`), a relative change (for example `+10%`, `-5%`, `+50Hz`, `-2st`), or an absolute frequency (for example `200Hz`).
rate	string	Optional. Speaking rate adjustment for the voice output. Follows the same rules as the `rate` attribute of the SSML `prosody` element (see Adjust prosody). Typical values: a named level (`x-slow`, `slow`, `medium`, `fast`, `x-fast`, `default`), a relative percentage (for example `+20%`, `-10%`), or a non-negative multiplier (for example `0.5`, `1.5`).
volume	string	Optional. Volume adjustment for the voice output. Follows the same rules as the `volume` attribute of the SSML `prosody` element (see Adjust prosody). Typical values: a named level (`silent`, `x-soft`, `soft`, `medium`, `loud`, `x-loud`, `default`), an absolute number from 0.0 to 100.0, or a relative change (for example `+10`, `-6dB`).

Example:

{
  "type": "azure-custom",
  "name": "my-custom-voice",
  "endpoint_id": "12345678-1234-1234-1234-123456789012",
  "temperature": 0.7,
  "style": "cheerful",
  "locale": "en-US"
}

RealtimeAzurePersonalVoice

Azure personal voice configuration.

Field	Type	Description
type	string	Must be `"azure-personal"`
name	string	Voice name (can't be empty)
temperature	number	Optional. Temperature between 0.0 and 1.0
model	string	Underlying base model: `DragonLatestNeural`, `DragonHDOmniLatestNeural`, `MAI-Voice-1`
custom_lexicon_url	string	Optional. URL to custom lexicon
custom_text_normalization_url	string	Optional. URL to custom text normalization
prefer_locales	string[]	Optional. Preferred locales, which change the accent used for each language. If the value isn't set, TTS uses the default accent of each language. For example, when speaking English, TTS uses the American English accent, and when speaking Spanish, it uses the Mexican Spanish accent. If you set `prefer_locales` to `["en-GB", "es-ES"]`, the English accent is British English and the Spanish accent is European Spanish. TTS can still speak other languages, such as French and Chinese.
locale	string	Optional. Enforces the locale for TTS output. If set, TTS always uses the given locale to speak. For example, if you set `locale` to `en-US`, TTS always uses the American English accent to speak the text content, even if the text content is in another language. In that case, TTS outputs silence if the text content is in Chinese.
pitch	string	Optional. Pitch adjustment for the voice output. Follows the same rules as the `pitch` attribute of the SSML `prosody` element (see Adjust prosody). Typical values: a named level (`x-low`, `low`, `medium`, `high`, `x-high`, `default`), a relative change (for example `+10%`, `-5%`, `+50Hz`, `-2st`), or an absolute frequency (for example `200Hz`).
rate	string	Optional. Speaking rate adjustment for the voice output. Follows the same rules as the `rate` attribute of the SSML `prosody` element (see Adjust prosody). Typical values: a named level (`x-slow`, `slow`, `medium`, `fast`, `x-fast`, `default`), a relative percentage (for example `+20%`, `-10%`), or a non-negative multiplier (for example `0.5`, `1.5`).
volume	string	Optional. Volume adjustment for the voice output. Follows the same rules as the `volume` attribute of the SSML `prosody` element (see Adjust prosody). Typical values: a named level (`silent`, `x-soft`, `soft`, `medium`, `loud`, `x-loud`, `default`), an absolute number from 0.0 to 100.0, or a relative change (for example `+10`, `-6dB`).

RealtimeAzureRealtimeNativeVoice

Voice configuration for the azure-realtime model. The azure-realtime model accepts only azure-realtime-native voices, and azure-realtime-native voices aren't accepted by other models.

Field	Type	Description
type	string	Must be `"azure-realtime-native"`
name	string	Voice name. One of `aarti`, `andrew`, `ava` (default), `denise`, `elsa`, `florian`, `francisca`, `meera`, `ximena`, `xiaoxiao`, `yunxi`. If not specified, `ava` is used.

Example:

{
  "voice": {
    "type": "azure-realtime-native",
    "name": "ava"
  }
}

Turn Detection

RealtimeTurnDetection

Configuration for turn detection. This is a discriminated union supporting multiple VAD types.

RealtimeServerVAD

Base VAD-based turn detection.

Field	Type	Description
type	string	Must be `"server_vad"`
threshold	float	Optional. Activation threshold (0.0-1.0) (default: 0.5)
prefix_padding_ms	integer	Optional. Audio padding before speech starts (default: 300)
silence_duration_ms	integer	Optional. Silence duration to detect speech end (default: 500)
speech_duration_ms	integer	Optional. Minimum speech duration (default: 200)
end_of_utterance_detection	RealtimeEOUDetection	Optional. End-of-utterance detection config
create_response	boolean	Optional. Enable or disable whether a response is generated (default: true).
interrupt_response	boolean	Optional. Enable or disable barge-in interruption (default: true).
auto_truncate	boolean	Optional. Auto-truncate on interruption (default: false)

RealtimeOpenAISemanticVAD

OpenAI semantic VAD configuration which uses a model to determine when the user has finished speaking. Only available for gpt-realtime and gpt-realtime-mini models.

Field	Type	Description
type	string	Must be `"semantic_vad"`
eagerness	string	Optional. Controls how eager the model is to interrupt the user by tuning the maximum wait timeout. In transcription mode, even if the model doesn't reply, it affects how the audio is chunked. Allowed values: `auto` (default), which is equivalent to `medium`; `low`, which lets the user take their time to speak; and `high`, which chunks the audio as soon as possible. Set eagerness to `high` if you want the model to respond more often in conversation mode, or to return transcription events faster in transcription mode. Set eagerness to `low` if you want to let the user speak uninterrupted in conversation mode, or if you want larger transcript chunks in transcription mode.
create_response	boolean	Optional. Enable or disable whether a response is generated (default: true).
interrupt_response	boolean	Optional. Enable or disable barge-in interruption (default: true).

RealtimeAzureSemanticVAD

Azure semantic VAD, which determines when the user starts and stops speaking by using a semantic speech model, providing more robust detection in noisy environments.

Field	Type	Description
type	string	Must be `"azure_semantic_vad"`
threshold	float	Optional. Activation threshold (default: 0.5)
prefix_padding_ms	integer	Optional. Audio padding before speech (default: 300)
silence_duration_ms	integer	Optional. Silence duration for speech end (default: 500)
end_of_utterance_detection	RealtimeEOUDetection	Optional. EOU detection config
speech_duration_ms	integer	Optional. Minimum speech duration (default: 80)
remove_filler_words	boolean	Optional. Remove filler words (default: false)
languages	string[]	Optional. Supports English. Other languages are ignored (default: none).
create_response	boolean	Optional. Enable or disable whether a response is generated (default: true).
interrupt_response	boolean	Optional. Enable or disable barge-in interruption (default: true).
auto_truncate	boolean	Optional. Auto-truncate on interruption (default: false)

RealtimeAzureSemanticVADMultilingual

Azure semantic VAD (default variant).

Field	Type	Description
type	string	Must be `"azure_semantic_vad_multilingual"`
threshold	float	Optional. Activation threshold (default: 0.5)
prefix_padding_ms	integer	Optional. Audio padding before speech (default: 300)
silence_duration_ms	integer	Optional. Silence duration for speech end (default: 500)
end_of_utterance_detection	RealtimeEOUDetection	Optional. EOU detection config
speech_duration_ms	integer	Optional. Minimum speech duration (default: 80)
remove_filler_words	boolean	Optional. Remove filler words (default: false)
languages	string[]	Optional. Supports English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, Hindi. Other languages are ignored (default: none).
create_response	boolean	Optional. Enable or disable whether a response is generated (default: true).
interrupt_response	boolean	Optional. Enable or disable barge-in interruption (default: true).
auto_truncate	boolean	Optional. Auto-truncate on interruption (default: false)

SmartEndOfTurnDetection

Audio-based end-of-turn (EOU) detection. Operates directly on the input audio stream rather than text. Use threshold_level and timeout_ms to tune detection.

Field	Type	Description
model	string	Must be `"smart_end_of_turn_detection"`
threshold_level	string	Optional. Threshold level setting. One of `low`, `medium`, `high`, or `default`.
timeout_ms	integer	Optional. Maximum time in milliseconds to wait for more user speech before triggering end-of-turn.

RealtimeEOUDetection

Azure end-of-utterance (EOU) detection indicates when the end user has stopped speaking while allowing for natural pauses. It can significantly reduce premature end-of-turn signals without adding user-perceivable latency.

Field	Type	Description
model	string	One of `semantic_detection_v1`, which supports English, or `semantic_detection_v1_multilingual`, which supports English, Spanish, French, Italian, German (DE), Japanese, Portuguese, Chinese, Korean, and Hindi.
threshold_level	string	Optional. Detection threshold level: `low`, `medium`, `high`, or `default`. `default` is equivalent to `medium`. With a lower setting, the probability that the sentence is complete will be higher.
timeout_ms	number	Optional. Maximum time in milliseconds to wait for more user speech. Defaults to 1000 ms.
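
To show how an EOU detection object nests inside a turn detection configuration, here is a minimal sketch that builds an Azure semantic VAD object. `azure_semantic_vad` is a hypothetical helper; pairing the multilingual VAD type with the multilingual EOU model is a choice made for this example, not a documented requirement.

```python
THRESHOLD_LEVELS = {"low", "medium", "high", "default"}


def azure_semantic_vad(multilingual: bool = False,
                       threshold_level: str = "default",
                       timeout_ms: int = 1000,
                       **overrides) -> dict:
    """Build a turn_detection object with end-of-utterance detection enabled."""
    if threshold_level not in THRESHOLD_LEVELS:
        raise ValueError(f"threshold_level must be one of {sorted(THRESHOLD_LEVELS)}")
    suffix = "_multilingual" if multilingual else ""
    config = {
        "type": "azure_semantic_vad" + suffix,
        "end_of_utterance_detection": {
            "model": "semantic_detection_v1" + suffix,
            "threshold_level": threshold_level,
            "timeout_ms": timeout_ms,
        },
    }
    # Other documented fields, such as silence_duration_ms or remove_filler_words.
    config.update(overrides)
    return config


turn_detection = azure_semantic_vad(multilingual=True, remove_filler_words=True)
```

The resulting dictionary is intended as the value of the session's `turn_detection` property.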

Avatar Configuration

RealtimeAvatarConfig

Configuration for avatar streaming and behavior.

Field	Type	Description
type	string	Optional. Avatar type. Allowed values: `video-avatar`, `photo-avatar`. Default is `video-avatar`
ice_servers	RealtimeIceServer[]	Optional. ICE servers for WebRTC
character	string	Character name or ID for the avatar
style	string	Optional. Avatar style (emotional tone, speaking style)
customized	boolean	Whether the avatar is customized
model	string	Optional. Base model name for the photo avatar, required if type is `photo-avatar`, valid value is `vasa-1`
video	RealtimeVideoParams	Optional. Video configuration
scene	RealtimeAvatarScene	Optional. Configuration for the avatar's zoom level, position, rotation and movement amplitude in the video frame
output_protocol	string	Optional. Output protocol for avatar streaming. Allowed values: `websocket` and `webrtc`. Default is `webrtc`
output_audit_audio	boolean	Optional. When enabled, forwards audit audio via WebSocket for review/debugging purposes, even when avatar output is delivered via WebRTC. Default is `false`

RealtimeIceServer

ICE server configuration for WebRTC connection negotiation.

Field	Type	Description
urls	string[]	ICE server URLs (TURN or STUN endpoints)
username	string	Optional. Username for authentication
credential	string	Optional. Credential for authentication

RealtimeVideoParams

Video streaming parameters for avatar.

Field	Type	Description
bitrate	integer	Optional. Bitrate in bits per second (default: 2000000)
codec	string	Optional. Video codec, currently only `h264` (default: `h264`)
crop	RealtimeVideoCrop	Optional. Cropping settings
resolution	RealtimeVideoResolution	Optional. Resolution settings
background	RealtimeVideoBackground	Optional. Background settings
gop_size	integer	Optional. Group of Pictures size (default: 10, range: 1–2000)

RealtimeVideoCrop

Video crop rectangle definition.

Field	Type	Description
top_left	integer[]	Top-left corner [x, y], non-negative integers
bottom_right	integer[]	Bottom-right corner [x, y], non-negative integers

RealtimeVideoResolution

Video resolution specification.

Field	Type	Description
width	integer	Width in pixels (must be > 0)
height	integer	Height in pixels (must be > 0)

RealtimeVideoBackground

Video background configuration. Only one of image_url or color can be set.

Field	Type	Description
image_url	string	Optional. URL to a background image
color	string	Optional. Background color value

RealtimeAvatarScene

Configuration for avatar's zoom level, position, rotation and movement amplitude in the video frame.

Field	Type	Description
zoom	number	Optional. Zoom level of the avatar. Range is (0, +∞). Values less than 1 zoom out, values greater than 1 zoom in. Default is 0
position_x	number	Optional. Horizontal position of the avatar. Range is [-1, 1], as a proportion of frame width. Negative values move left, positive values move right. Default is 0
position_y	number	Optional. Vertical position of the avatar. Range is [-1, 1], as a proportion of frame height. Negative values move up, positive values move down. Default is 0
rotation_x	number	Optional. Rotation around the X-axis (pitch). Range is [-π, π] in radians. Negative values rotate up, positive values rotate down. Default is 0
rotation_y	number	Optional. Rotation around the Y-axis (yaw). Range is [-π, π] in radians. Negative values rotate left, positive values rotate right. Default is 0
rotation_z	number	Optional. Rotation around the Z-axis (roll). Range is [-π, π] in radians. Negative values rotate anticlockwise, positive values rotate clockwise. Default is 0
amplitude	number	Optional. Amplitude of the avatar movement. Range is (0, 1]. Values in (0, 1) mean reduced amplitude, 1 means full amplitude. Default is 0
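
The ranges in this table can be checked on the client before a scene is sent. The following sketch encodes each documented range as a predicate; `SCENE_RULES` and `check_avatar_scene` are hypothetical names, and the check covers only the fields that are present.

```python
import math

# One (predicate, rule text) pair per field, taken from the ranges listed above.
SCENE_RULES = {
    "zoom": (lambda v: v > 0, "must be in (0, +inf)"),
    "position_x": (lambda v: -1 <= v <= 1, "must be in [-1, 1]"),
    "position_y": (lambda v: -1 <= v <= 1, "must be in [-1, 1]"),
    "rotation_x": (lambda v: -math.pi <= v <= math.pi, "must be in [-pi, pi]"),
    "rotation_y": (lambda v: -math.pi <= v <= math.pi, "must be in [-pi, pi]"),
    "rotation_z": (lambda v: -math.pi <= v <= math.pi, "must be in [-pi, pi]"),
    "amplitude": (lambda v: 0 < v <= 1, "must be in (0, 1]"),
}


def check_avatar_scene(scene: dict) -> list[str]:
    """Return a list of range violations for the fields present in scene."""
    problems = []
    for field, value in scene.items():
        if field not in SCENE_RULES:
            problems.append(f"{field}: unknown field")
            continue
        is_valid, rule = SCENE_RULES[field]
        if not is_valid(value):
            problems.append(f"{field}: {rule}")
    return problems


# Zoom in slightly, shift left, and turn the avatar a quarter of a radian.
print(check_avatar_scene({"zoom": 1.5, "position_x": -0.2, "rotation_y": 0.25}))  # []
```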

Animation Configuration

RealtimeAnimation

Configuration for animation outputs including blendshapes and visemes.

Field	Type	Description
model_name	string	Optional. Animation model name (default: `"default"`)
outputs	RealtimeAnimationOutputType[]	Optional. Output types (default: `["blendshapes"]`)

RealtimeAnimationOutputType

Types of animation data to output.

Allowed Values:

blendshapes - Facial blendshapes data
viseme_id - Viseme identifier data

Session Configuration

RealtimeRequestSession

Session configuration object used in session.update events.

Field	Type	Description
model	string	Optional. Model name to use
modalities	RealtimeModality[]	Optional. The supported output modalities for the session. For example, `"modalities": ["text", "audio"]` is the default setting that enables both text and audio output modalities. To enable only text output, set `"modalities": ["text"]`. To enable avatar output, set `"modalities": ["text", "audio", "avatar"]`. You can't enable only audio.
animation	RealtimeAnimation	Optional. Animation configuration
voice	RealtimeVoice	Optional. Voice configuration
instructions	string	Optional. System instructions for the model. The instructions can guide the output audio if OpenAI voices are used, but might not apply to Azure voices.
input_audio_sampling_rate	integer	Optional. Input audio sampling rate in Hz (default: 24000 for `pcm16`, 8000 for `g711_ulaw` and `g711_alaw`)
input_audio_format	RealtimeAudioFormat	Optional. Input audio format (default: `pcm16`)
output_audio_format	RealtimeOutputAudioFormat	Optional. Output audio format (default: `pcm16`)
input_audio_noise_reduction	RealtimeInputAudioNoiseReductionSettings	Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio. This property is nullable.
input_audio_echo_cancellation	RealtimeInputAudioEchoCancellationSettings	Configuration for input audio echo cancellation. This can be set to null to turn off. This service side echo cancellation can help improve the quality of the input audio by reducing the impact of echo and reverberation. This property is nullable.
input_audio_transcription	RealtimeAudioInputTranscriptionSettings	The configuration for input audio transcription. The configuration is null (off) by default. Input audio transcription isn't native to the model, since the model consumes audio directly. Transcription runs asynchronously through the `/audio/transcriptions` endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. For additional guidance to the transcription service, the client can optionally set the language and prompt for transcription. This property is nullable.
turn_detection	RealtimeTurnDetection	The turn detection settings for the session. This can be set to null to turn off.
tools	array of RealtimeTool	The tools available to the model for the session.
tool_choice	RealtimeToolChoice	The tool choice for the session. Allowed values: `auto`, `none`, and `required`. Otherwise, you can specify the name of the function to use.
parallel_tool_calls	boolean	Optional. Whether the model may issue tool calls in parallel. Defaults to `true`. Set to `false` to require tool calls to be issued sequentially.
temperature	number	The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8.
max_response_output_tokens	integer or "inf"	The maximum number of output tokens per assistant response, inclusive of tool calls. Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens. For example, to limit the output tokens to 1000, set `"max_response_output_tokens": 1000`. To allow the maximum number of tokens, set `"max_response_output_tokens": "inf"`. Defaults to `"inf"`.
interim-response	InterimResponseConfig	Optional. Configuration for interim response generation during latency or tool calls.
reasoning_effort	ReasoningEffort	Optional. Constrains effort on reasoning for reasoning models. See the Azure Foundry documentation for details. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.
avatar	RealtimeAvatarConfig	Optional. Avatar configuration
output_audio_timestamp_types	RealtimeAudioTimestampType[]	Optional. Timestamp types for output audio
metadata	map	Optional. Set of up to 16 key-value pairs that can be attached to the session. This is useful for storing additional information about the session in a structured format, such as tracking IDs, user context, or application-specific labels. These key-value pairs are also included in Microsoft Foundry resource logs for tracing and diagnostics. Keys can be a maximum of 64 characters long and values can be a maximum of 512 characters long.
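
Several limits in this table lend themselves to a client-side check before a `session.update` event is sent. The sketch below validates a few of them and wraps the configuration in an event. `check_session` and `build_session_update` are hypothetical helpers, and nesting the configuration under a `session` key is an assumption based on the `session.update` event definition elsewhere in this reference.

```python
import json


def check_session(session: dict) -> list[str]:
    """Return a list of problems found against the limits documented above."""
    problems = []

    modalities = session.get("modalities")
    if modalities is not None and set(modalities) == {"audio"}:
        problems.append("modalities: audio can't be the only modality")

    temperature = session.get("temperature")
    if temperature is not None and not 0.6 <= temperature <= 1.2:
        problems.append("temperature: must be in [0.6, 1.2]")

    max_tokens = session.get("max_response_output_tokens")
    if max_tokens is not None and max_tokens != "inf":
        if not isinstance(max_tokens, int) or not 1 <= max_tokens <= 4096:
            problems.append('max_response_output_tokens: must be 1-4096 or "inf"')

    metadata = session.get("metadata", {})
    if len(metadata) > 16:
        problems.append("metadata: at most 16 key-value pairs")
    for key, value in metadata.items():
        if len(key) > 64 or len(str(value)) > 512:
            problems.append(f"metadata[{key!r}]: key or value too long")
    return problems


def build_session_update(session: dict) -> str:
    """Serialize a session.update event, raising if a documented limit is broken."""
    problems = check_session(session)
    if problems:
        raise ValueError("; ".join(problems))
    # ASSUMPTION: the configuration is nested under a "session" key.
    return json.dumps({"type": "session.update", "session": session})


message = build_session_update({
    "modalities": ["text", "audio"],
    "temperature": 0.8,
    "parallel_tool_calls": False,
    "metadata": {"tracking_id": "demo-123"},
})
```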

RealtimeModality

Supported session output modalities.

Allowed Values:

text - Text output
audio - Audio output
animation - Animation output
avatar - Avatar video output

RealtimeAudioTimestampType

Output timestamp types supported in audio response content.

Allowed Values:

word - Timestamps per word in the output audio

ReasoningEffort

Constrains effort on reasoning for reasoning models. Check model documentation for supported values for each model. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.

Allowed Values:

none - No reasoning effort
minimal - Minimal reasoning effort
low - Low reasoning effort - faster responses with less reasoning
medium - Medium reasoning effort - balanced between speed and reasoning depth
high - High reasoning effort - more thorough reasoning, may take longer
xhigh - Extra high reasoning effort - maximum reasoning depth

Tool Configuration

We support two types of tools: function calling tools and MCP tools, which allow you to connect to an MCP server.

RealtimeTool

Tool definition for function calling.

Field	Type	Description
type	string	Must be `"function"`
name	string	Function name
description	string	Function description and usage guidelines
parameters	object	Function parameters as JSON schema object

RealtimeToolChoice

Tool selection strategy.

This can be:

"auto" - Let the model choose
"none" - Don't use tools
"required" - Must use a tool
{ "type": "function", "name": "function_name" } - Use specific function
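
The following sketch pairs a function tool definition with a check on `tool_choice`. The tool name `get_weather` and the helper `is_valid_tool_choice` are hypothetical, and requiring a named function to appear in the session's tools is a sanity check assumed for this example rather than a documented rule.

```python
# A function tool in the RealtimeTool shape; the tool itself is made up.
weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def is_valid_tool_choice(tool_choice, tools: list) -> bool:
    """Accept the three string strategies or a reference to a defined function."""
    if isinstance(tool_choice, str):
        return tool_choice in ("auto", "none", "required")
    function_names = {t["name"] for t in tools if t.get("type") == "function"}
    return (
        tool_choice.get("type") == "function"
        and tool_choice.get("name") in function_names
    )


print(is_valid_tool_choice({"type": "function", "name": "get_weather"}, [weather_tool]))  # True
```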

MCPTool

MCP tool configuration.

Field	Type	Description
type	string	Must be `"mcp"`
server_label	string	Required. The label of the MCP server.
server_url	string	Required. The server URL of the MCP server.
allowed_tools	string[]	Optional. The list of allowed tool names. If not specified, all tools are allowed.
headers	object	Optional. Additional headers to include in MCP requests.
authorization	string	Optional. Authorization token for MCP requests.
require_approval	string or dictionary	Optional. If set to a string, the value must be `never` or `always`. If set to a dictionary, it must be in the format `{"never": ["<tool_name_1>", "<tool_name_2>"], "always": ["<tool_name_3>"]}`. The default value is `always`. When set to `always`, tool execution requires approval: an `mcp_approval_request` is sent to the client when the MCP call arguments are done, and the tool is executed only after an `mcp_approval_response` with `approve=true` is received. When set to `never`, the tool is executed automatically without approval.
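
Because `require_approval` accepts either a string or a dictionary, client code that predicts whether an approval request will arrive has to handle both forms. The sketch below is one way to do that. `needs_approval` is a hypothetical helper, and treating tools that the dictionary doesn't list as `always` is an assumption based on the documented default.

```python
def needs_approval(require_approval, tool_name: str) -> bool:
    """Return True if calling tool_name is expected to require client approval."""
    if require_approval is None:
        return True  # documented default is "always"
    if isinstance(require_approval, str):
        if require_approval not in ("always", "never"):
            raise ValueError("require_approval must be 'always' or 'never'")
        return require_approval == "always"
    if tool_name in require_approval.get("never", []):
        return False
    # Listed under "always", or not listed at all (ASSUMED to follow the default).
    return True


policy = {"never": ["search_docs"], "always": ["delete_record"]}
print(needs_approval(policy, "search_docs"))    # False
print(needs_approval(policy, "delete_record"))  # True
```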

FoundryAgentTool

Tool definition for integrating a Foundry agent as a tool. This enables a chat-supervisor pattern where a realtime-based chat agent handles basic interactions while delegating complex tasks to a more intelligent Foundry agent.

Field	Type	Description
type	string	Must be `"foundry_agent"`
agent_name	string	Required. The name of the Foundry agent to call.
agent_version	string	Optional. The version of the Foundry agent to call.
project_name	string	Required. The name of the Foundry project containing the agent.
client_id	string	Optional. The client ID associated with the Foundry agent.
description	string	Optional. A description for the Foundry agent tool. If provided, it's used instead of the agent's description in the Foundry portal.
foundry_resource_override	string	Optional. Override for the Foundry resource used to execute the agent.
agent_context_type	string	Optional. The context type to use when invoking the Foundry agent. Possible values: `no_context`, `agent_context`. Default is `agent_context`. `no_context`: only the current user input is sent, and no context is maintained. `agent_context`: the agent maintains its own context (thread), and only the current input is sent per call.
return_agent_response_directly	boolean	Optional. Whether to return the agent's response directly in the Voice Live response. Default is `true`. When set to `false`, the response is sent to the chat agent to rephrase.

Example:

{
  "instructions": "You are a helpful assistant. Please respond with a short message like 'working on this' before calling the agent tool.",
  "tools": [
    {
      "type": "foundry_agent",
      "agent_name": "customer-service-agent",
      "agent_version": "2",
      "project_name": "my-foundry-project",
      "description": "A helpful agent that can search online information and handle complex customer requests"
    }
  ]
}

Interim response configuration

Interim responses allow the system to generate placeholder audio responses while tools are being executed, improving user experience by avoiding silence.

InterimResponseConfig

Configuration for interim response generation. This is a union type that can be one of the following:

StaticInterimResponseConfig - Pre-generated interim responses selected from a predefined list.
LlmInterimResponseConfig - LLM-generated interim responses.

StaticInterimResponseConfig

Configuration for static interim response generation. Randomly selects from configured texts when any trigger condition is met.

Field	Type	Description
type	string	Must be `"static-interim-response"`.
triggers	InterimResponseTrigger[]	Optional. List of triggers that can fire the interim response. Any trigger can activate the interim response (OR logic). Supported values: `latency`, `tool`. Default is `["latency"]`.
latency_threshold_ms	integer	Optional. Latency threshold in milliseconds before triggering interim response. Default is 2000ms. Minimum value is 0.
texts	string[]	Optional. List of interim response text options to randomly select from.

Example:

{
  "session": {
    "interim-response": {
      "type": "static-interim-response",
      "triggers": ["latency", "tool"],
      "latency_threshold_ms": 1500,
      "texts": [
        "Let me think about that...",
        "One moment please...",
        "Working on that for you..."
      ]
    }
  }
}

LlmInterimResponseConfig

Configuration for LLM-based interim response generation. Uses LLM to generate context-aware interim responses when any trigger condition is met.

Field	Type	Description
type	string	Must be `"llm-interim-response"`.
triggers	InterimResponseTrigger[]	Optional. List of triggers that can fire the interim response. Any trigger can activate the interim response (OR logic). Supported values: `latency`, `tool`. Default is `["latency"]`.
latency_threshold_ms	integer	Optional. Latency threshold in milliseconds before triggering interim response. Default is 2000ms. Minimum value is 0.
model	string	Optional. The model to use for LLM-based interim response generation. Default is `gpt-4.1-mini`. The default model might change without a new API version.
instructions	string	Optional. Custom instructions for generating interim responses. If not provided, a default prompt is used.
max_completion_tokens	integer	Optional. Maximum number of tokens to generate for the interim response. Default is 50. Minimum value is 1.

Example:

{
  "session": {
    "interim-response": {
      "type": "llm-interim-response",
      "triggers": ["tool"],
      "latency_threshold_ms": 2000,
      "model": "gpt-4.1-mini",
      "instructions": "Generate a brief, friendly acknowledgment that you're working on the user's request.",
      "max_completion_tokens": 30
    }
  }
}

InterimResponseTrigger

Triggers that can activate interim response generation.

Allowed Values:

latency - Trigger interim response when response latency exceeds threshold
tool - Trigger interim response when a tool call is being executed

RealtimeConversationResponseItem

This is a union type that can be one of the following:

RealtimeConversationUserMessageItem

User message item.

Field	Type	Description
id	string	The unique ID of the item.
type	string	Must be `"message"`
object	string	Must be `"conversation.item"`
role	string	Must be `"user"`
content	RealtimeInputTextContentPart	The content of the message.
status	RealtimeItemStatus	The status of the item.

RealtimeConversationAssistantMessageItem

Assistant message item.

Field	Type	Description
id	string	The unique ID of the item.
type	string	Must be `"message"`
object	string	Must be `"conversation.item"`
role	string	Must be `"assistant"`
content	RealtimeOutputTextContentPart[] or RealtimeOutputAudioContentPart[]	The content of the message.
status	RealtimeItemStatus	The status of the item.

RealtimeConversationSystemMessageItem

System message item.

Field	Type	Description
id	string	The unique ID of the item.
type	string	Must be `"message"`
object	string	Must be `"conversation.item"`
role	string	Must be `"system"`
content	RealtimeInputTextContentPart[]	The content of the message.
status	RealtimeItemStatus	The status of the item.

RealtimeConversationFunctionCallItem

Function call request item.

Field	Type	Description
id	string	The unique ID of the item.
type	string	Must be `"function_call"`
object	string	Must be `"conversation.item"`
name	string	The name of the function to call.
arguments	string	The arguments for the function call as a JSON string.
call_id	string	The unique ID of the function call.
status	RealtimeItemStatus	The status of the item.

RealtimeConversationFunctionCallOutputItem

Function call response item.

Field	Type	Description
id	string	The unique ID of the item.
type	string	Must be `"function_call_output"`
object	string	Must be `"conversation.item"`
name	string	The name of the function that was called.
output	string	The output of the function call.
call_id	string	The unique ID of the function call.
status	RealtimeItemStatus	The status of the item.

RealtimeConversationMCPListToolsItem

MCP list tools response item.

Field	Type	Description
id	string	The unique ID of the item.
type	string	Must be `"mcp_list_tools"`
server_label	string	The label of the MCP server.

RealtimeConversationMCPCallItem

MCP call response item.

Field	Type	Description
id	string	The unique ID of the item.
type	string	Must be `"mcp_call"`
server_label	string	The label of the MCP server.
name	string	The name of the tool to call.
approval_request_id	string	The approval request ID for the MCP call.
arguments	string	The arguments for the MCP call.
output	string	The output of the MCP call.
error	object	The error details if the MCP call failed.

RealtimeConversationMCPApprovalRequestItem

MCP approval request item.

Field	Type	Description
id	string	The unique ID of the item.
type	string	Must be `"mcp_approval_request"`
server_label	string	The label of the MCP server.
name	string	The name of the tool to call.
arguments	string	The arguments for the MCP call.

RealtimeConversationFoundryAgentCallItem

Foundry agent call response item.

Field	Type	Description
id	string	The unique ID of the item.
type	string	Must be `"foundry_agent_call"`
name	string	The name of the Foundry agent.
call_id	string	The ID of the call.
arguments	string	The arguments for the foundry agent call.
agent_response_id	string	Optional. The response ID from the foundry agent.
output	string	Optional. The output of the foundry agent call.
error	object	Optional. The error details if the foundry agent call failed.

RealtimeConversationWebSearchCallItem

Web search call response item.

Field	Type	Description
id	string	The unique ID of the web search tool call.
type	string	Must be `"web_search_call"`
status	string	The status of the web search tool call. One of `in_progress`, `searching`, `completed`, `failed`.

RealtimeConversationFileSearchCallItem

File search call response item.

Field	Type	Description
id	string	The unique ID of the file search tool call.
type	string	Must be `"file_search_call"`
queries	string[]	Optional. The queries used for the file search.
status	string	The status of the file search tool call. One of `in_progress`, `searching`, `completed`, `incomplete`, `failed`.
results	array of FileSearchResult	Optional. The results of the file search.

FileSearchResult

A single file search result entry.

Field	Type	Description
file_id	string	Optional. The unique ID of the file.
filename	string	Optional. The name of the file.
score	number	Optional. The relevance score of the file search result.
text	string	Optional. The text content of the file that matched the query.
attributes	map	Optional. Key-value pairs for filtering file search results.

ActionSearch

A web search action recorded as part of a web search call.

Field	Type	Description
type	string	Must be `"search"`.
query	string	Optional. The search query.
sources	array of ActionSearchSource	Optional. The sources used in the search.

ActionSearchSource

A source URL referenced by a web search action.

Field	Type	Description
type	string	Must be `"url"`.
url	string	The URL of the source.

ActionOpenPage

An open-page action performed by the model during a web search.

Field	Type	Description
type	string	Must be `"open_page"`.
url	string	The URL opened by the model.

ActionFind

A find-in-page action performed by the model during a web search.

Field	Type	Description
type	string	Must be `"find"`.
pattern	string	The pattern or text to search for within the page.
url	string	The URL of the page searched for the pattern.

TranscriptionPhrase

A transcribed phrase with timing information, returned in conversation.item.input_audio_transcription.completed.

Field	Type	Description
offset_milliseconds	integer	Offset from the start of the audio in milliseconds.
duration_milliseconds	integer	Duration of the phrase in milliseconds.
text	string	The transcribed text of the phrase.
words	array of TranscriptionWord	Optional. The individual words in the phrase with timing information.
locale	string	Optional. The locale of the transcription (for example, `en-US`).
confidence	number	Optional. The confidence score of the transcription.

TranscriptionWord

A time-stamped word in a transcription.

Field	Type	Description
text	string	The transcribed word text.
offset_milliseconds	integer	Offset from the start of the audio in milliseconds.
duration_milliseconds	integer	Duration of the word in milliseconds.
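
Because both phrases and words carry an offset and a duration, end times can be derived by simple addition. The following minimal sketch shows this; the helper names and the sample phrase are hypothetical and only illustrate the field layout described above.

```python
def phrase_span_ms(phrase: dict) -> tuple:
    """Return (start_ms, end_ms) of a TranscriptionPhrase."""
    start = phrase["offset_milliseconds"]
    return start, start + phrase["duration_milliseconds"]


def word_timings(phrase: dict) -> list:
    """Return (text, start_ms, end_ms) for each word; `words` is optional."""
    return [
        (
            word["text"],
            word["offset_milliseconds"],
            word["offset_milliseconds"] + word["duration_milliseconds"],
        )
        for word in phrase.get("words", [])
    ]


# Hypothetical sample data shaped like a TranscriptionPhrase.
sample_phrase = {
    "offset_milliseconds": 1200,
    "duration_milliseconds": 800,
    "text": "hello world",
    "words": [
        {"text": "hello", "offset_milliseconds": 1200, "duration_milliseconds": 350},
        {"text": "world", "offset_milliseconds": 1600, "duration_milliseconds": 400},
    ],
}

print(phrase_span_ms(sample_phrase))  # (1200, 2000)
print(word_timings(sample_phrase))
```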

LogProbProperties

Log-probability information for a transcription token.

Field	Type	Description
token	string	The token text.
logprob	number	The natural-log probability of the token.
bytes	integer[]	Optional. The UTF-8 byte representation of the token.
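
Since `logprob` is a natural logarithm, the token probability is recovered with `exp`, and the optional `bytes` field can be decoded as UTF-8. This is a minimal sketch with hypothetical helper names and sample values.

```python
import math


def token_probability(entry: dict) -> float:
    """Convert the natural-log probability back to a probability in [0, 1]."""
    return math.exp(entry["logprob"])


def token_text_from_bytes(entry: dict) -> str:
    """Decode the UTF-8 bytes if present; otherwise fall back to `token`."""
    raw = entry.get("bytes")
    if raw is None:
        return entry["token"]
    return bytes(raw).decode("utf-8", errors="replace")


# Hypothetical sample entry; ln(0.5) is about -0.6931.
sample_entry = {"token": "hi", "logprob": -0.6931471805599453, "bytes": [104, 105]}

print(round(token_probability(sample_entry), 3))  # 0.5
print(token_text_from_bytes(sample_entry))        # hi
```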

RealtimeItemStatus

Status of conversation items.

Allowed Values:

in_progress - Currently being processed
completed - Successfully completed
incomplete - Incomplete (interrupted or failed)

RealtimeContentPart

Content part within a message.

RealtimeInputTextContentPart

Text content part.

Field	Type	Description
type	string	Must be `"input_text"`
text	string	The text content

RealtimeOutputTextContentPart

Text content part.

Field	Type	Description
type	string	Must be `"text"`
text	string	The text content

RealtimeInputAudioContentPart

Audio content part.

Field	Type	Description
type	string	Must be `"input_audio"`
audio	string	Optional. Base64-encoded audio data
transcript	string	Optional. Audio transcript

RealtimeOutputAudioContentPart

Audio content part.

Field	Type	Description
type	string	Must be `"audio"`
audio	string	Base64-encoded audio data
transcript	string	Optional. Audio transcript

RealtimeRequestImageContentPart

Input image content part. Use it in a user message to attach an image alongside text or audio.

Field	Type	Description
type	string	Must be `"input_image"`
image_url	string (URI)	Optional. URL of the image. Starting in `2026-06-01-preview`, this field is named `image_url`. Earlier API versions expose the same field as `url`.
detail	string	Optional. Image detail level.

Response Objects

RealtimeResponse

Response object representing a model inference response.

Field	Type	Description
id	string	Optional. Response ID
object	string	Optional. Always `"realtime.response"`
status	RealtimeResponseStatus	Optional. Response status
status_details	RealtimeResponseStatusDetails	Optional. Status details
output	RealtimeConversationResponseItem[]	Optional. Output items
usage	RealtimeUsage	Optional. Token usage statistics
conversation_id	string	Optional. Associated conversation ID
voice	RealtimeVoice	Optional. Voice used for response
modalities	string[]	Optional. Output modalities used
output_audio_format	RealtimeOutputAudioFormat	Optional. Audio format used
temperature	number	Optional. Temperature used
max_response_output_tokens	integer or "inf"	Optional. Max tokens used

RealtimeResponseStatus

Response status values.

Allowed Values:

in_progress - Response is being generated
completed - Response completed successfully
cancelled - Response was cancelled
incomplete - Response incomplete (interrupted)
failed - Response failed with error

RealtimeUsage

Token usage statistics.

Field	Type	Description
total_tokens	integer	Total tokens used
input_tokens	integer	Input tokens used
output_tokens	integer	Output tokens generated
input_token_details	TokenDetails	Breakdown of input tokens
output_token_details	TokenDetails	Breakdown of output tokens

TokenDetails

Detailed token usage breakdown.

Field	Type	Description
cached_tokens	integer	Optional. Cached tokens used
text_tokens	integer	Optional. Text tokens used
audio_tokens	integer	Optional. Audio tokens used
reasoning_tokens	integer	Optional. Reasoning tokens generated in the output. Applies to output token details only.
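
The usage object nests optional per-modality counts under `input_token_details` and `output_token_details`. A minimal sketch for flattening it follows; the helper `summarize_usage` and the sample numbers are hypothetical, and missing optional fields are treated as 0 by assumption.

```python
def summarize_usage(usage: dict) -> dict:
    """Flatten a RealtimeUsage dict, treating missing optional fields as 0."""

    def detail(section: str, key: str) -> int:
        return (usage.get(section) or {}).get(key) or 0

    return {
        "total": usage["total_tokens"],
        "input": usage["input_tokens"],
        "output": usage["output_tokens"],
        "input_cached": detail("input_token_details", "cached_tokens"),
        "input_audio": detail("input_token_details", "audio_tokens"),
        "output_audio": detail("output_token_details", "audio_tokens"),
        "output_reasoning": detail("output_token_details", "reasoning_tokens"),
    }


# Hypothetical sample values.
sample_usage = {
    "total_tokens": 150,
    "input_tokens": 100,
    "output_tokens": 50,
    "input_token_details": {"cached_tokens": 20, "text_tokens": 30, "audio_tokens": 70},
    "output_token_details": {"text_tokens": 10, "audio_tokens": 40},
}

print(summarize_usage(sample_usage))
```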

Error Handling

RealtimeErrorDetails

Error information object.

Field	Type	Description
type	string	Error type (e.g., `"invalid_request_error"`, `"server_error"`)
code	string	Optional. Specific error code
message	string	Human-readable error description
param	string	Optional. Parameter related to the error
event_id	string	Optional. ID of the client event that caused the error
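
When logging errors, the optional fields may be absent. The following minimal sketch formats a RealtimeErrorDetails object into a single log line; the helper `format_error` and the sample values are hypothetical.

```python
def format_error(details: dict) -> str:
    """Render required fields first, then any optional fields that are present."""
    parts = [f'{details["type"]}: {details["message"]}']
    for key in ("code", "param", "event_id"):
        value = details.get(key)
        if value is not None:
            parts.append(f"{key}={value}")
    return " | ".join(parts)


# Hypothetical sample error.
sample_error = {
    "type": "invalid_request_error",
    "message": "Unknown field.",
    "param": "session.foo",
}

print(format_error(sample_error))
# invalid_request_error: Unknown field. | param=session.foo
```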

RealtimeConversationRequestItem

You use the RealtimeConversationRequestItem object to create a new item in the conversation via the conversation.item.create event.

This is a union type that can be one of the following:

RealtimeSystemMessageItem

A system message item.

Field	Type	Description
type	string	The type of the item. Allowed values: `message`
role	string	The role of the message. Allowed values: `system`
content	array of RealtimeInputTextContentPart	The content of the message.
id	string	The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one.

RealtimeUserMessageItem

A user message item.

Field	Type	Description
type	string	The type of the item. Allowed values: `message`
role	string	The role of the message. Allowed values: `user`
content	array of RealtimeInputTextContentPart or RealtimeInputAudioContentPart	The content of the message.
id	string	The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one.

RealtimeAssistantMessageItem

An assistant message item.

Field	Type	Description
type	string	The type of the item. Allowed values: `message`
role	string	The role of the message. Allowed values: `assistant`
content	array of RealtimeOutputTextContentPart	The content of the message.

RealtimeFunctionCallItem

A function call item.

Field	Type	Description
type	string	The type of the item. Allowed values: `function_call`
name	string	The name of the function to call.
arguments	string	The arguments of the function call as a JSON string.
call_id	string	The ID of the function call item.
id	string	The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one.

RealtimeFunctionCallOutputItem

A function call output item.

Field	Type	Description
type	string	The type of the item. Allowed values: `function_call_output`
call_id	string	The ID of the function call item.
output	string	The output of the function call. This is a free-form string containing the function result, and it can be empty.
id	string	The unique ID of the item. If the client doesn't provide an ID, the server generates one.
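
A typical flow is to receive a function call item, run the function locally, and send the result back as a function call output item. The sketch below illustrates this under stated assumptions: the helper `build_function_output_event`, the `handlers` registry, and the `add` function are hypothetical, and the event envelope with an `item` property is assumed from the conversation.item.create event rather than defined in this section.

```python
import json


def build_function_output_event(call_item: dict, handlers: dict) -> dict:
    """Run the local handler for a function_call item and wrap the result."""
    # `arguments` is a JSON string; treat an empty string as no arguments.
    args = json.loads(call_item["arguments"] or "{}")
    result = handlers[call_item["name"]](**args)
    return {
        "type": "conversation.item.create",  # assumed envelope shape
        "item": {
            "type": "function_call_output",
            "call_id": call_item["call_id"],  # must match the originating call
            "output": json.dumps(result),     # output is a free-form string
        },
    }


# Hypothetical local function and incoming item.
handlers = {"add": lambda a, b: a + b}
incoming = {
    "type": "function_call",
    "name": "add",
    "arguments": '{"a": 2, "b": 3}',
    "call_id": "call_123",
}

event = build_function_output_event(incoming, handlers)
print(json.dumps(event))
```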

RealtimeMCPApprovalResponseItem

An MCP approval response item.

Field	Type	Description
type	string	The type of the item. Allowed values: `mcp_approval_response`
approve	boolean	Whether the MCP request is approved.
approval_request_id	string	The ID of the MCP approval request.
id	string	The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one.
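
The approval response can be built directly from the approval request item. This minimal sketch assumes that `approval_request_id` should carry the `id` of the received `mcp_approval_request` item and that the item is sent inside a conversation.item.create event with an `item` property; the helper name and sample values are hypothetical.

```python
def build_mcp_approval_response(approval_request: dict, approve: bool) -> dict:
    """Wrap an approve/deny decision for a received mcp_approval_request item."""
    return {
        "type": "conversation.item.create",  # assumed envelope shape
        "item": {
            "type": "mcp_approval_response",
            "approve": approve,
            # Assumption: the request item's id is the approval request ID.
            "approval_request_id": approval_request["id"],
        },
    }


# Hypothetical approval request as received from the server.
request_item = {
    "id": "mcpr_001",
    "type": "mcp_approval_request",
    "server_label": "docs",
    "name": "search_docs",
    "arguments": "{}",
}

approval_event = build_mcp_approval_response(request_item, approve=True)
print(approval_event)
```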

RealtimeFunctionTool

The definition of a function tool as used by the realtime endpoint.

Field	Type	Description
type	string	The type of the tool. Allowed values: `function`
name	string	The name of the function.
description	string	The description of the function, including usage guidelines. For example, "Use this function to get the current time."
parameters	object	The parameters of the function in the form of a JSON object.

RealtimeItemStatus

Allowed Values:

in_progress
completed
incomplete

RealtimeResponseAudioContentPart

Field	Type	Description
type	string	The type of the content part. Allowed values: `audio`
transcript	string	The transcript of the audio. This property is nullable.

RealtimeResponseFunctionCallItem

Field	Type	Description
type	string	The type of the item. Allowed values: `function_call`
name	string	The name of the function call item.
call_id	string	The ID of the function call item.
arguments	string	The arguments of the function call item.
status	RealtimeItemStatus	The status of the item.

RealtimeResponseFunctionCallOutputItem

Field	Type	Description
type	string	The type of the item. Allowed values: `function_call_output`
call_id	string	The ID of the function call item.
output	string	The output of the function call item.

RealtimeResponseOptions

Field	Type	Description
modalities	array	The output modalities for the response. Allowed values: `text`, `audio`. For example, `"modalities": ["text", "audio"]` is the default setting, which enables both text and audio output. To enable only text output, set `"modalities": ["text"]`. You can't enable only audio.
instructions	string	The instructions (the system message) to guide the model's responses.
voice	RealtimeVoice	The voice used for the model response for the session. Once the voice is used in the session for the model's audio response, it can't be changed.
tools	array of RealtimeTool	The tools available to the model for the session.
tool_choice	RealtimeToolChoice	The tool choice for the session.
temperature	number	The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8.
max_response_output_tokens	integer or "inf"	The maximum number of output tokens per assistant response, inclusive of tool calls. Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens. For example, to limit the output tokens to 1000, set `"max_response_output_tokens": 1000`. To allow the maximum number of tokens, set `"max_response_output_tokens": "inf"`. Defaults to `"inf"`.
interim-response	InterimResponseConfig	Optional. Configuration for interim response generation during latency or tool calls.
reasoning_effort	ReasoningEffort	Optional. Constrains effort on reasoning for reasoning models. Check model documentation for supported values for each model. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.
conversation	string	Controls which conversation the response is added to. The supported values are `auto` and `none`. The `auto` value (or not setting this property) ensures that the contents of the response are added to the session's default conversation. Set this property to `none` to create an out-of-band response whose items aren't added to the default conversation. Defaults to `"auto"`.
metadata	map	Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format. Keys can be a maximum of 64 characters long and values can be a maximum of 512 characters long. For example: `metadata: { topic: "classification" }`
interim_response	InterimResponseConfig	Optional. Configuration for interim response generation during latency or tool calls. Overrides the session-level setting for this response.
pre_generated_assistant_message	RealtimeAssistantMessageItem	Optional. A pre-generated assistant message to use for generating the audio response instead of having the model generate the text. When provided, the server generates an audio response for the predefined text, bypassing model inference for text generation. The message is added to the conversation context history. The message must have the `role` set to `"assistant"` and include `content` with a single text content part.
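
Several of the constraints in the table above can be checked on the client before sending a request. The following is a minimal sketch, not an exhaustive validator: the helper `validate_response_options` is hypothetical, it covers only the limits stated in the table, and the service remains the authority on what is accepted.

```python
def validate_response_options(options: dict) -> list:
    """Return a list of problems found; an empty list means no issue detected."""
    problems = []

    modalities = options.get("modalities", ["text", "audio"])
    if not set(modalities) <= {"text", "audio"}:
        problems.append("modalities may only contain 'text' and 'audio'")
    elif "text" not in modalities:
        problems.append("audio-only output is not allowed")

    temperature = options.get("temperature", 0.8)
    if not 0.6 <= temperature <= 1.2:
        problems.append("temperature must be within [0.6, 1.2]")

    max_tokens = options.get("max_response_output_tokens", "inf")
    if max_tokens != "inf" and not (isinstance(max_tokens, int) and 1 <= max_tokens <= 4096):
        problems.append("max_response_output_tokens must be 1..4096 or 'inf'")

    if options.get("conversation", "auto") not in ("auto", "none"):
        problems.append("conversation must be 'auto' or 'none'")

    metadata = options.get("metadata", {})
    if len(metadata) > 16:
        problems.append("metadata allows at most 16 key-value pairs")
    for key, value in metadata.items():
        if len(key) > 64 or len(str(value)) > 512:
            problems.append(f"metadata entry {key!r} exceeds the key or value length limit")

    return problems


print(validate_response_options({"modalities": ["text"], "temperature": 0.7}))
print(validate_response_options({"modalities": ["audio"]}))
```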

RealtimeResponseSession

The RealtimeResponseSession object represents a session in the Realtime API. It's used in some of the server events, such as:

session.created
session.updated

Field	Type	Description
object	string	The session object. Allowed values: `realtime.session`
id	string	The unique ID of the session.
model	string	The model used for the session.
modalities	array	The output modalities for the session. Allowed values: `text`, `audio`. For example, `"modalities": ["text", "audio"]` is the default setting, which enables both text and audio output. To enable only text output, set `"modalities": ["text"]`. You can't enable only audio.
instructions	string	The instructions (the system message) to guide the model's text and audio responses. Here are some example instructions to help guide content and format of text and audio responses: `"instructions": "be succinct"` `"instructions": "act friendly"` `"instructions": "here are examples of good responses"` Here are some example instructions to help guide audio behavior: `"instructions": "talk quickly"` `"instructions": "inject emotion into your voice"` `"instructions": "laugh frequently"` While the model might not always follow these instructions, they provide guidance on the desired behavior.
voice	RealtimeVoice	The voice used for the model response for the session. Once the voice is used in the session for the model's audio response, it can't be changed.
input_audio_sampling_rate	integer	The sampling rate for the input audio.
input_audio_format	RealtimeAudioFormat	The format for the input audio.
output_audio_format	RealtimeAudioFormat	The format for the output audio.
input_audio_transcription	RealtimeAudioInputTranscriptionSettings	The settings for audio input transcription. This property is nullable.
turn_detection	RealtimeTurnDetection	The turn detection settings for the session. This property is nullable.
tools	array of RealtimeTool	The tools available to the model for the session.
tool_choice	RealtimeToolChoice	The tool choice for the session.
temperature	number	The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8.
max_response_output_tokens	integer or "inf"	The maximum number of output tokens per assistant response, inclusive of tool calls. Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens. For example, to limit the output tokens to 1000, set `"max_response_output_tokens": 1000`. To allow the maximum number of tokens, set `"max_response_output_tokens": "inf"`.
interim-response	InterimResponseConfig	Configuration for interim response generation during latency or tool calls.

RealtimeResponseStatusDetails

Field	Type	Description
type	RealtimeResponseStatus	The status of the response.

RealtimeRateLimitsItem

Field	Type	Description
name	string	The rate limit property name that this item includes information about.
limit	integer	The maximum configured limit for this rate limit property.
remaining	integer	The remaining quota available against the configured limit for this rate limit property.
reset_seconds	number	The remaining time, in seconds, until this rate limit property is reset.
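
A client can use these fields to decide how long to pause before sending more requests. The sketch below is a minimal illustration; the helper `seconds_until_available` and the rate limit names `requests` and `tokens` in the sample are hypothetical.

```python
def seconds_until_available(rate_limits: list) -> float:
    """Return the longest reset time among exhausted limits, or 0 if none."""
    waits = [item["reset_seconds"] for item in rate_limits if item["remaining"] <= 0]
    return max(waits, default=0.0)


# Hypothetical sample; the limit names are illustrative only.
sample_limits = [
    {"name": "requests", "limit": 100, "remaining": 0, "reset_seconds": 12.5},
    {"name": "tokens", "limit": 20000, "remaining": 1500, "reset_seconds": 30.0},
]

print(seconds_until_available(sample_limits))  # 12.5
```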

Try the Voice Live quickstart
Try the Voice Live agents quickstart
Learn more about How to use the Voice Live API

Last updated on 2026-06-02
