Note
This feature is currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
The Realtime API is a WebSocket-based API that allows you to interact with the Azure OpenAI service in real time.
The Realtime API (via /realtime) is built on the WebSockets API to facilitate fully asynchronous streaming communication between the end user and model. Device details like capturing and rendering audio data are outside the scope of the Realtime API. It should be used in the context of a trusted, intermediate service that manages both connections to end users and model endpoint connections. Don't use it directly from untrusted end user devices.
Tip
To get started with the Realtime API, see the quickstart and how-to guide.
The Realtime API requires an existing Azure OpenAI resource endpoint in a supported region. The API is accessed via a secure WebSocket connection to the /realtime endpoint of your Azure OpenAI resource.

You can construct a full request URI by concatenating:

- The secure WebSocket (wss://) protocol.
- Your Azure OpenAI resource endpoint hostname, for example my-aoai-resource.openai.azure.com.
- The openai/realtime API path.
- An api-version query string parameter for a supported API version such as 2024-12-17.
- A deployment query string parameter with the name of your gpt-4o-realtime-preview or gpt-4o-mini-realtime-preview model deployment.

The following example is a well-constructed /realtime request URI:

wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime?api-version=2024-12-17&deployment=gpt-4o-realtime-preview
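The concatenation above can be sketched as a small helper. The resource hostname and deployment name below are placeholders; substitute your own values.

```python
from urllib.parse import urlencode

def realtime_uri(resource_host: str, api_version: str, deployment: str) -> str:
    """Build a /realtime request URI from its parts."""
    query = urlencode({"api-version": api_version, "deployment": deployment})
    return f"wss://{resource_host}/openai/realtime?{query}"

uri = realtime_uri(
    "my-eastus2-openai-resource.openai.azure.com",
    "2024-12-17",
    "gpt-4o-realtime-preview",
)
print(uri)
```

Using urlencode keeps the query string well-formed if a deployment name ever contains characters that need escaping.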
To authenticate:

- Microsoft Entra (recommended): Use token-based authentication with the /realtime API for an Azure OpenAI Service resource with managed identity enabled. Apply a retrieved authentication token using a Bearer token with the Authorization header.
- API key: An api-key can be provided in one of two ways:
  - Using an api-key connection header on the prehandshake connection. This option isn't available in a browser environment.
  - Using an api-key query string parameter on the request URI. Query string parameters are encrypted when using https/wss.

There are nine client events that can be sent from the client to the server:
Event | Description |
---|---|
RealtimeClientEventConversationItemCreate | The client conversation.item.create event is used to add a new item to the conversation's context, including messages, function calls, and function call responses. |
RealtimeClientEventConversationItemDelete | The client conversation.item.delete event is used to remove an item from the conversation history. |
RealtimeClientEventConversationItemTruncate | The client conversation.item.truncate event is used to truncate a previous assistant message's audio. |
RealtimeClientEventInputAudioBufferAppend | The client input_audio_buffer.append event is used to append audio bytes to the input audio buffer. |
RealtimeClientEventInputAudioBufferClear | The client input_audio_buffer.clear event is used to clear the audio bytes in the buffer. |
RealtimeClientEventInputAudioBufferCommit | The client input_audio_buffer.commit event is used to commit the user input audio buffer. |
RealtimeClientEventResponseCancel | The client response.cancel event is used to cancel an in-progress response. |
RealtimeClientEventResponseCreate | The client response.create event is used to instruct the server to create a response via model inferencing. |
RealtimeClientEventSessionUpdate | The client session.update event is used to update the session's default configuration. |
The client conversation.item.create
event is used to add a new item to the conversation's context, including messages, function calls, and function call responses. This event can be used to populate a history of the conversation and to add new items mid-stream. Currently this event can't populate assistant audio messages.
If successful, the server responds with a conversation.item.created
event, otherwise an error
event is sent.
{
"type": "conversation.item.create",
"previous_item_id": "<previous_item_id>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be conversation.item.create . |
previous_item_id | string | The ID of the preceding item after which the new item is inserted. If not set, the new item is appended to the end of the conversation. If set, it allows an item to be inserted mid-conversation. If the ID can't be found, then an error is returned and the item isn't added. |
item | RealtimeConversationRequestItem | The item to add to the conversation. |
The client conversation.item.delete
event is used to remove an item from the conversation history.
The server responds with a conversation.item.deleted
event, unless the item doesn't exist in the conversation history, in which case the server responds with an error.
{
"type": "conversation.item.delete",
"item_id": "<item_id>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be conversation.item.delete . |
item_id | string | The ID of the item to delete. |
The client conversation.item.truncate
event is used to truncate a previous assistant message's audio. The server produces audio faster than realtime, so this event is useful when the user interrupts to truncate audio that was sent to the client but not yet played. This event synchronizes the server's understanding of the audio with the client's playback.
Truncating audio deletes the server-side text transcript to ensure there isn't text in the context that the user doesn't know about.
If the client event is successful, the server responds with a conversation.item.truncated
event.
{
"type": "conversation.item.truncate",
"item_id": "<item_id>",
"content_index": 0,
"audio_end_ms": 0
}
Field | Type | Description |
---|---|---|
type | string | The event type must be conversation.item.truncate . |
item_id | string | The ID of the assistant message item to truncate. Only assistant message items can be truncated. |
content_index | integer | The index of the content part to truncate. Set this property to "0". |
audio_end_ms | integer | Inclusive duration up to which audio is truncated, in milliseconds. If the audio_end_ms is greater than the actual audio duration, the server responds with an error. |
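The client typically knows its playback position in samples, not milliseconds. A sketch of converting that position into an audio_end_ms value, assuming 24 kHz PCM16 output (the sample rate here is an illustrative assumption; use your session's actual output format).

```python
def truncate_event(item_id: str, samples_played: int, sample_rate_hz: int = 24_000) -> dict:
    """Build a conversation.item.truncate event from the client's playback position."""
    audio_end_ms = (samples_played * 1000) // sample_rate_hz
    return {
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": 0,  # the table above says to set this to 0
        "audio_end_ms": audio_end_ms,
    }

# 12,000 samples at 24 kHz is 500 ms of audio played before the interruption.
print(truncate_event("item_123", 12_000)["audio_end_ms"])  # 500
```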
The client input_audio_buffer.append
event is used to append audio bytes to the input audio buffer. The audio buffer is temporary storage you can write to and later commit.
In Server VAD (Voice Activity Detection) mode, the audio buffer is used to detect speech and the server decides when to commit. When server VAD is disabled, the client can choose how much audio to place in each event up to a maximum of 15 MiB. For example, streaming smaller chunks from the client can allow the VAD to be more responsive.
Unlike most other client events, the server doesn't send a confirmation response to client input_audio_buffer.append
event.
{
"type": "input_audio_buffer.append",
"audio": "<audio>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be input_audio_buffer.append . |
audio | string | Base64-encoded audio bytes. This value must be in the format specified by the input_audio_format field in the session configuration. |
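A sketch of packing raw PCM16 samples into an append event, assuming the session's input_audio_format is pcm16 (little-endian 16-bit). The 440 Hz tone is only illustrative test data.

```python
import base64
import math
import struct

def append_event(pcm16_samples: list[int]) -> dict:
    """Build an input_audio_buffer.append event from 16-bit PCM samples.

    Assumes the session's input_audio_format is pcm16 (little-endian).
    """
    raw = struct.pack(f"<{len(pcm16_samples)}h", *pcm16_samples)
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(raw).decode("ascii"),
    }

# A short 440 Hz test tone (100 ms at 24 kHz), small enough to stream as one chunk.
tone = [int(8000 * math.sin(2 * math.pi * 440 * n / 24_000)) for n in range(2400)]
event = append_event(tone)
print(event["type"])  # input_audio_buffer.append
```

Streaming many small events like this, rather than one large one, tends to keep server VAD responsive, as noted above.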
The client input_audio_buffer.clear
event is used to clear the audio bytes in the buffer.
The server responds with an input_audio_buffer.cleared
event.
{
"type": "input_audio_buffer.clear"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be input_audio_buffer.clear . |
The client input_audio_buffer.commit
event is used to commit the user input audio buffer, which creates a new user message item in the conversation. Audio is transcribed if input_audio_transcription
is configured for the session.
When in server VAD mode, the client doesn't need to send this event because the server commits the audio buffer automatically. Without server VAD, the client must commit the audio buffer to create a user message item. This client event produces an error if the input audio buffer is empty.
Committing the input audio buffer doesn't create a response from the model.
The server responds with an input_audio_buffer.committed
event.
{
"type": "input_audio_buffer.commit"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be input_audio_buffer.commit . |
The client response.cancel
event is used to cancel an in-progress response.
The server responds with a response.cancelled
event or an error if there's no response to cancel.
{
"type": "response.cancel"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.cancel . |
The client response.create
event is used to instruct the server to create a response via model inferencing. When the session is configured in server VAD mode, the server creates responses automatically.
A response includes at least one item
, and can have two, in which case the second is a function call. These items are appended to the conversation history.
The server responds with a response.created
event, one or more item and content events (such as conversation.item.created
and response.content_part.added
), and finally a response.done
event to indicate the response is complete.
Note
The client response.create event includes inference configuration such as instructions and temperature. These fields can override the session's configuration for this response only.
{
"type": "response.create"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.create . |
response | RealtimeResponseOptions | The response options. |
The client session.update
event is used to update the session's default configuration. The client can send this event at any time to update the session configuration, and any field can be updated at any time, except for voice.
Only fields that are present are updated. To clear a field (such as instructions
), pass an empty string.
The server responds with a session.updated
event that contains the full effective configuration.
{
"type": "session.update"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be session.update . |
session | RealtimeRequestSession | The session configuration. |
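Because only the fields present in the event are updated, a thin helper that forwards keyword arguments is enough for a sketch. Passing an empty string clears a field such as instructions, per the text above; temperature as a session field is an assumption drawn from the inference configuration mentioned earlier.

```python
import json

def session_update(**fields) -> str:
    """Serialize a session.update event. Only the fields you pass are changed;
    pass instructions="" to clear the instructions field."""
    return json.dumps({"type": "session.update", "session": fields})

event = json.loads(session_update(instructions="", temperature=0.8))
print(event["session"]["instructions"] == "")  # True
```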
There are 28 server events that can be received from the server:
Event | Description |
---|---|
RealtimeServerEventConversationCreated | The server conversation.created event is returned right after session creation. One conversation is created per session. |
RealtimeServerEventConversationItemCreated | The server conversation.item.created event is returned when a conversation item is created. |
RealtimeServerEventConversationItemDeleted | The server conversation.item.deleted event is returned when the client deletes an item in the conversation with a conversation.item.delete event. |
RealtimeServerEventConversationItemInputAudioTranscriptionCompleted | The server conversation.item.input_audio_transcription.completed event is the result of audio transcription for speech written to the audio buffer. |
RealtimeServerEventConversationItemInputAudioTranscriptionFailed | The server conversation.item.input_audio_transcription.failed event is returned when input audio transcription is configured, and a transcription request for a user message failed. |
RealtimeServerEventConversationItemTruncated | The server conversation.item.truncated event is returned when the client truncates an earlier assistant audio message item with a conversation.item.truncate event. |
RealtimeServerEventError | The server error event is returned when an error occurs, which could be a client problem or a server problem. |
RealtimeServerEventInputAudioBufferCleared | The server input_audio_buffer.cleared event is returned when the client clears the input audio buffer with an input_audio_buffer.clear event. |
RealtimeServerEventInputAudioBufferCommitted | The server input_audio_buffer.committed event is returned when an input audio buffer is committed, either by the client or automatically in server VAD mode. |
RealtimeServerEventInputAudioBufferSpeechStarted | The server input_audio_buffer.speech_started event is returned in server_vad mode when speech is detected in the audio buffer. |
RealtimeServerEventInputAudioBufferSpeechStopped | The server input_audio_buffer.speech_stopped event is returned in server_vad mode when the server detects the end of speech in the audio buffer. |
RealtimeServerEventRateLimitsUpdated | The server rate_limits.updated event is emitted at the beginning of a response to indicate the updated rate limits. |
RealtimeServerEventResponseAudioDelta | The server response.audio.delta event is returned when the model-generated audio is updated. |
RealtimeServerEventResponseAudioDone | The server response.audio.done event is returned when the model-generated audio is done. |
RealtimeServerEventResponseAudioTranscriptDelta | The server response.audio_transcript.delta event is returned when the model-generated transcription of audio output is updated. |
RealtimeServerEventResponseAudioTranscriptDone | The server response.audio_transcript.done event is returned when the model-generated transcription of audio output is done streaming. |
RealtimeServerEventResponseContentPartAdded | The server response.content_part.added event is returned when a new content part is added to an assistant message item. |
RealtimeServerEventResponseContentPartDone | The server response.content_part.done event is returned when a content part is done streaming. |
RealtimeServerEventResponseCreated | The server response.created event is returned when a new response is created. This is the first event of response creation, where the response is in an initial state of in_progress . |
RealtimeServerEventResponseDone | The server response.done event is returned when a response is done streaming. |
RealtimeServerEventResponseFunctionCallArgumentsDelta | The server response.function_call_arguments.delta event is returned when the model-generated function call arguments are updated. |
RealtimeServerEventResponseFunctionCallArgumentsDone | The server response.function_call_arguments.done event is returned when the model-generated function call arguments are done streaming. |
RealtimeServerEventResponseOutputItemAdded | The server response.output_item.added event is returned when a new item is created during response generation. |
RealtimeServerEventResponseOutputItemDone | The server response.output_item.done event is returned when an item is done streaming. |
RealtimeServerEventResponseTextDelta | The server response.text.delta event is returned when the model-generated text is updated. |
RealtimeServerEventResponseTextDone | The server response.text.done event is returned when the model-generated text is done streaming. |
RealtimeServerEventSessionCreated | The server session.created event is the first server event when you establish a new connection to the Realtime API. This event creates and returns a new session with the default session configuration. |
RealtimeServerEventSessionUpdated | The server session.updated event is returned when a session is updated by the client. If there's an error, the server sends an error event instead. |
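Since every server event carries a type field, a common client pattern is a small dispatcher that routes each decoded event to a handler. A minimal sketch:

```python
def dispatch(event: dict, handlers: dict) -> bool:
    """Route a decoded server event to a handler keyed by its type field.
    Returns True if a handler ran, False for unhandled event types."""
    handler = handlers.get(event.get("type"))
    if handler is None:
        return False
    handler(event)
    return True

transcript = []
handlers = {
    "response.text.delta": lambda e: transcript.append(e["delta"]),
    "error": lambda e: print("error:", e["error"]["message"]),
}
dispatch({"type": "response.text.delta", "delta": "Hel"}, handlers)
dispatch({"type": "response.text.delta", "delta": "lo"}, handlers)
print("".join(transcript))  # Hello
```

Returning False for unknown types lets the client ignore new event types gracefully instead of failing.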
The server conversation.created
event is returned right after session creation. One conversation is created per session.
{
"type": "conversation.created",
"conversation": {
"id": "<id>",
"object": "<object>"
}
}
Field | Type | Description |
---|---|---|
type | string | The event type must be conversation.created . |
conversation | object | The conversation resource. |
Field | Type | Description |
---|---|---|
id | string | The unique ID of the conversation. |
object | string | The object type must be realtime.conversation . |
The server conversation.item.created event is returned when a conversation item is created. There are several scenarios that produce this event:

- The server is generating a response, which if successful produces either one or two items of type message (role assistant) or type function_call.
- The input audio buffer is committed, either by the client or the server (in server_vad mode). The server takes the content of the input audio buffer and adds it to a new user message item.
- The client sent a conversation.item.create event to add a new item to the conversation.

{
"type": "conversation.item.created",
"previous_item_id": "<previous_item_id>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be conversation.item.created . |
previous_item_id | string | The ID of the preceding item in the conversation context, allows the client to understand the order of the conversation. |
item | RealtimeConversationResponseItem | The item that was created. |
The server conversation.item.deleted
event is returned when the client deletes an item in the conversation with a conversation.item.delete
event. This event is used to synchronize the server's understanding of the conversation history with the client's view.
{
"type": "conversation.item.deleted",
"item_id": "<item_id>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be conversation.item.deleted . |
item_id | string | The ID of the item that was deleted. |
The server conversation.item.input_audio_transcription.completed
event is the result of audio transcription for speech written to the audio buffer.
Transcription begins when the input audio buffer is committed by the client or server (in server_vad
mode). Transcription runs asynchronously with response creation, so this event can come before or after the response events.
Realtime API models accept audio natively, so input transcription is a separate process run on a separate speech recognition model, currently always whisper-1
. The transcript can therefore diverge somewhat from the model's interpretation, and should be treated as a rough guide.
{
"type": "conversation.item.input_audio_transcription.completed",
"item_id": "<item_id>",
"content_index": 0,
"transcript": "<transcript>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be conversation.item.input_audio_transcription.completed . |
item_id | string | The ID of the user message item containing the audio. |
content_index | integer | The index of the content part containing the audio. |
transcript | string | The transcribed text. |
The server conversation.item.input_audio_transcription.failed
event is returned when input audio transcription is configured, and a transcription request for a user message failed. This event is separate from other error
events so that the client can identify the related item.
{
"type": "conversation.item.input_audio_transcription.failed",
"item_id": "<item_id>",
"content_index": 0,
"error": {
"code": "<code>",
"message": "<message>",
"param": "<param>"
}
}
Field | Type | Description |
---|---|---|
type | string | The event type must be conversation.item.input_audio_transcription.failed . |
item_id | string | The ID of the user message item. |
content_index | integer | The index of the content part containing the audio. |
error | object | Details of the transcription error. See nested properties in the next table. |
Field | Type | Description |
---|---|---|
type | string | The type of error. |
code | string | Error code, if any. |
message | string | A human-readable error message. |
param | string | Parameter related to the error, if any. |
The server conversation.item.truncated
event is returned when the client truncates an earlier assistant audio message item with a conversation.item.truncate
event. This event is used to synchronize the server's understanding of the audio with the client's playback.
This event truncates the audio and removes the server-side text transcript to ensure there's no text in the context that the user doesn't know about.
{
"type": "conversation.item.truncated",
"item_id": "<item_id>",
"content_index": 0,
"audio_end_ms": 0
}
Field | Type | Description |
---|---|---|
type | string | The event type must be conversation.item.truncated . |
item_id | string | The ID of the assistant message item that was truncated. |
content_index | integer | The index of the content part that was truncated. |
audio_end_ms | integer | The duration up to which the audio was truncated, in milliseconds. |
The server error
event is returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session stays open.
{
"type": "error",
"error": {
"code": "<code>",
"message": "<message>",
"param": "<param>",
"event_id": "<event_id>"
}
}
Field | Type | Description |
---|---|---|
type | string | The event type must be error . |
error | object | Details of the error. See nested properties in the next table. |
Field | Type | Description |
---|---|---|
type | string | The type of error. For example, "invalid_request_error" and "server_error" are error types. |
code | string | Error code, if any. |
message | string | A human-readable error message. |
param | string | Parameter related to the error, if any. |
event_id | string | The ID of the client event that caused the error, if applicable. |
The server input_audio_buffer.cleared
event is returned when the client clears the input audio buffer with an input_audio_buffer.clear
event.
{
"type": "input_audio_buffer.cleared"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be input_audio_buffer.cleared . |
The server input_audio_buffer.committed
event is returned when an input audio buffer is committed, either by the client or automatically in server VAD mode. The item_id
property is the ID of the user message item created. Thus a conversation.item.created
event is also sent to the client.
{
"type": "input_audio_buffer.committed",
"previous_item_id": "<previous_item_id>",
"item_id": "<item_id>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be input_audio_buffer.committed . |
previous_item_id | string | The ID of the preceding item after which the new item is inserted. |
item_id | string | The ID of the user message item created. |
The server input_audio_buffer.speech_started
event is returned in server_vad
mode when speech is detected in the audio buffer. This event can happen any time audio is added to the buffer (unless speech is already detected).
Note
The client might want to use this event to interrupt audio playback or provide visual feedback to the user.
The client should expect to receive an input_audio_buffer.speech_stopped
event when speech stops. The item_id
property is the ID of the user message item that's created when speech stops. The item_id
is also included in the input_audio_buffer.speech_stopped
event, unless the client manually commits the audio buffer during VAD activation.
{
"type": "input_audio_buffer.speech_started",
"audio_start_ms": 0,
"item_id": "<item_id>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be input_audio_buffer.speech_started . |
audio_start_ms | integer | Milliseconds from the start of all audio written to the buffer during the session when speech was first detected. This property corresponds to the beginning of audio sent to the model, and thus includes the prefix_padding_ms configured in the session. |
item_id | string | The ID of the user message item created when speech stops. |
The server input_audio_buffer.speech_stopped
event is returned in server_vad
mode when the server detects the end of speech in the audio buffer.
The server also sends a conversation.item.created
event with the user message item created from the audio buffer.
{
"type": "input_audio_buffer.speech_stopped",
"audio_end_ms": 0,
"item_id": "<item_id>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be input_audio_buffer.speech_stopped . |
audio_end_ms | integer | Milliseconds since the session started when speech stopped. This property corresponds to the end of audio sent to the model, and thus includes the min_silence_duration_ms configured in the session. |
item_id | string | The ID of the user message item created. |
The server rate_limits.updated
event is emitted at the beginning of a response to indicate the updated rate limits.
When a response is created, some tokens are reserved for the output tokens. The rate limits shown here reflect that reservation, which is then adjusted accordingly once the response is completed.
{
"type": "rate_limits.updated",
"rate_limits": [
{
"name": "<name>",
"limit": 0,
"remaining": 0,
"reset_seconds": 0
}
]
}
Field | Type | Description |
---|---|---|
type | string | The event type must be rate_limits.updated . |
rate_limits | array of RealtimeServerEventRateLimitsUpdatedRateLimitsItem | The list of rate limit information. |
The server response.audio.delta
event is returned when the model-generated audio is updated.
{
"type": "response.audio.delta",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0,
"delta": "<delta>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.audio.delta . |
response_id | string | The ID of the response. |
item_id | string | The ID of the item. |
output_index | integer | The index of the output item in the response. |
content_index | integer | The index of the content part in the item's content array. |
delta | string | Base64-encoded audio data delta. |
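Since each delta carries a base64-encoded audio fragment, a client typically decodes and concatenates them as they arrive. A minimal sketch (playback itself is out of scope, as noted earlier):

```python
import base64

class AudioAccumulator:
    """Collect base64 audio deltas and expose the decoded PCM bytes."""

    def __init__(self):
        self._chunks = bytearray()

    def on_delta(self, event: dict) -> None:
        # Each delta is an independent base64 string; decode before appending.
        self._chunks += base64.b64decode(event["delta"])

    @property
    def pcm(self) -> bytes:
        return bytes(self._chunks)

acc = AudioAccumulator()
acc.on_delta({"type": "response.audio.delta", "delta": base64.b64encode(b"\x01\x02").decode()})
acc.on_delta({"type": "response.audio.delta", "delta": base64.b64encode(b"\x03\x04").decode()})
print(acc.pcm)  # b'\x01\x02\x03\x04'
```

In a real client you would key accumulators by item_id and content_index, since a response can contain multiple items and content parts.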
The server response.audio.done
event is returned when the model-generated audio is done.
This event is also returned when a response is interrupted, incomplete, or canceled.
{
"type": "response.audio.done",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.audio.done . |
response_id | string | The ID of the response. |
item_id | string | The ID of the item. |
output_index | integer | The index of the output item in the response. |
content_index | integer | The index of the content part in the item's content array. |
The server response.audio_transcript.delta
event is returned when the model-generated transcription of audio output is updated.
{
"type": "response.audio_transcript.delta",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0,
"delta": "<delta>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.audio_transcript.delta . |
response_id | string | The ID of the response. |
item_id | string | The ID of the item. |
output_index | integer | The index of the output item in the response. |
content_index | integer | The index of the content part in the item's content array. |
delta | string | The transcript delta. |
The server response.audio_transcript.done
event is returned when the model-generated transcription of audio output is done streaming.
This event is also returned when a response is interrupted, incomplete, or canceled.
{
"type": "response.audio_transcript.done",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0,
"transcript": "<transcript>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.audio_transcript.done . |
response_id | string | The ID of the response. |
item_id | string | The ID of the item. |
output_index | integer | The index of the output item in the response. |
content_index | integer | The index of the content part in the item's content array. |
transcript | string | The final transcript of the audio. |
The server response.content_part.added
event is returned when a new content part is added to an assistant message item during response generation.
{
"type": "response.content_part.added",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.content_part.added . |
response_id | string | The ID of the response. |
item_id | string | The ID of the item to which the content part was added. |
output_index | integer | The index of the output item in the response. |
content_index | integer | The index of the content part in the item's content array. |
part | RealtimeContentPart | The content part that was added. |
Field | Type | Description |
---|---|---|
type | RealtimeContentPartType |
The server response.content_part.done
event is returned when a content part is done streaming in an assistant message item.
This event is also returned when a response is interrupted, incomplete, or canceled.
{
"type": "response.content_part.done",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.content_part.done . |
response_id | string | The ID of the response. |
item_id | string | The ID of the item. |
output_index | integer | The index of the output item in the response. |
content_index | integer | The index of the content part in the item's content array. |
part | RealtimeContentPart | The content part that is done. |
Field | Type | Description |
---|---|---|
type | RealtimeContentPartType |
The server response.created
event is returned when a new response is created. This is the first event of response creation, where the response is in an initial state of in_progress
.
{
"type": "response.created"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.created . |
response | RealtimeResponse | The response object. |
The server response.done
event is returned when a response is done streaming. This event is always emitted, no matter the final state. The response object included in the response.done
event includes all output items in the response, but omits the raw audio data.
{
"type": "response.done"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.done . |
response | RealtimeResponse | The response object. |
The server response.function_call_arguments.delta
event is returned when the model-generated function call arguments are updated.
{
"type": "response.function_call_arguments.delta",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"call_id": "<call_id>",
"delta": "<delta>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.function_call_arguments.delta . |
response_id | string | The ID of the response. |
item_id | string | The ID of the function call item. |
output_index | integer | The index of the output item in the response. |
call_id | string | The ID of the function call. |
delta | string | The arguments delta as a JSON string. |
The server response.function_call_arguments.done
event is returned when the model-generated function call arguments are done streaming.
This event is also returned when a response is interrupted, incomplete, or canceled.
{
"type": "response.function_call_arguments.done",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"call_id": "<call_id>",
"arguments": "<arguments>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.function_call_arguments.done . |
response_id | string | The ID of the response. |
item_id | string | The ID of the function call item. |
output_index | integer | The index of the output item in the response. |
call_id | string | The ID of the function call. |
arguments | string | The final arguments as a JSON string. |
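Because arguments stream as JSON-string fragments, a client can buffer the deltas per call_id and parse only once the .done event arrives. A sketch under that assumption:

```python
import json
from collections import defaultdict

class CallArguments:
    """Accumulate response.function_call_arguments.delta events per call_id,
    then parse the completed JSON on the .done event."""

    def __init__(self):
        self._buffers = defaultdict(str)

    def on_delta(self, event: dict) -> None:
        self._buffers[event["call_id"]] += event["delta"]

    def on_done(self, event: dict) -> dict:
        # The .done event carries the final string; fall back to the buffer if absent.
        return json.loads(event.get("arguments") or self._buffers[event["call_id"]])

calls = CallArguments()
calls.on_delta({"call_id": "c1", "delta": '{"city": '})
calls.on_delta({"call_id": "c1", "delta": '"Paris"}'})
print(calls.on_done({"call_id": "c1", "arguments": ""}))  # {'city': 'Paris'}
```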
The server response.output_item.added
event is returned when a new item is created during response generation.
{
"type": "response.output_item.added",
"response_id": "<response_id>",
"output_index": 0
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.output_item.added . |
response_id | string | The ID of the response to which the item belongs. |
output_index | integer | The index of the output item in the response. |
item | RealtimeConversationResponseItem | The item that was added. |
The server response.output_item.done
event is returned when an item is done streaming.
This event is also returned when a response is interrupted, incomplete, or canceled.
{
"type": "response.output_item.done",
"response_id": "<response_id>",
"output_index": 0
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.output_item.done . |
response_id | string | The ID of the response to which the item belongs. |
output_index | integer | The index of the output item in the response. |
item | RealtimeConversationResponseItem | The item that is done streaming. |
The server response.text.delta
event is returned when the model-generated text is updated. The text corresponds to the text
content part of an assistant message item.
{
"type": "response.text.delta",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0,
"delta": "<delta>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.text.delta . |
response_id | string | The ID of the response. |
item_id | string | The ID of the item. |
output_index | integer | The index of the output item in the response. |
content_index | integer | The index of the content part in the item's content array. |
delta | string | The text delta. |
The server response.text.done
event is returned when the model-generated text is done streaming. The text corresponds to the text
content part of an assistant message item.
This event is also returned when a response is interrupted, incomplete, or canceled.
{
"type": "response.text.done",
"response_id": "<response_id>",
"item_id": "<item_id>",
"output_index": 0,
"content_index": 0,
"text": "<text>"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be response.text.done . |
response_id | string | The ID of the response. |
item_id | string | The ID of the item. |
output_index | integer | The index of the output item in the response. |
content_index | integer | The index of the content part in the item's content array. |
text | string | The final text content. |
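A client can reassemble the streamed text deltas and verify them against the final text carried by the done event. A minimal sketch, with abbreviated event shapes and illustrative IDs:

```python
# Hypothetical sketch: reassemble response.text.delta fragments per
# (item_id, content_index) and compare with the final response.text.done text.
def assemble_text(events):
    parts = {}
    final = {}
    for event in events:
        key = (event["item_id"], event["content_index"])
        if event["type"] == "response.text.delta":
            parts.setdefault(key, []).append(event["delta"])
        elif event["type"] == "response.text.done":
            final[key] = event["text"]
    assembled = {key: "".join(chunks) for key, chunks in parts.items()}
    return assembled, final

# Illustrative events; other fields (response_id, output_index) omitted for brevity.
events = [
    {"type": "response.text.delta", "item_id": "i1", "content_index": 0, "delta": "Hello, "},
    {"type": "response.text.delta", "item_id": "i1", "content_index": 0, "delta": "world!"},
    {"type": "response.text.done", "item_id": "i1", "content_index": 0, "text": "Hello, world!"},
]
assembled, final = assemble_text(events)
```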
The server session.created
event is the first server event when you establish a new connection to the Realtime API. This event creates and returns a new session with the default session configuration.
{
"type": "session.created"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be session.created . |
session | RealtimeResponseSession | The session object. |
The server session.updated
event is returned when a session is updated by the client. If there's an error, the server sends an error
event instead.
{
"type": "session.updated"
}
Field | Type | Description |
---|---|---|
type | string | The event type must be session.updated . |
session | RealtimeResponseSession | The session object. |
Allowed Values:
pcm16
g711_ulaw
g711_alaw
Allowed Values:
whisper-1
Field | Type | Description |
---|---|---|
model | RealtimeAudioInputTranscriptionModel | The default whisper-1 model is currently the only model supported for audio input transcription. |
Field | Type | Description |
---|---|---|
type | RealtimeClientEventType | The type of the client event. |
event_id | string | The unique ID of the event. The client can specify the ID to help identify the event. |
Allowed Values:
session.update
input_audio_buffer.append
input_audio_buffer.commit
input_audio_buffer.clear
conversation.item.create
conversation.item.delete
conversation.item.truncate
response.create
response.cancel
Field | Type | Description |
---|---|---|
type | RealtimeContentPartType | The content type. Allowed values: input_text , input_audio , item_reference , text . |
text | string | The text content. This property is applicable for the input_text and text content types. |
id | string | ID of a previous conversation item to reference in both client and server created items. This property is applicable for the item_reference content type in response.create events. |
audio | string | The base64-encoded audio bytes. This property is applicable for the input_audio content type. |
transcript | string | The transcript of the audio. This property is applicable for the input_audio content type. |
Allowed Values:
input_text
input_audio
text
audio
The item to add to the conversation.
This table describes all RealtimeConversationItem
properties. The properties that are applicable per event depend on the RealtimeItemType.
Field | Type | Description |
---|---|---|
id | string | The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one. |
type | RealtimeItemType | The type of the item. Allowed values: message , function_call , function_call_output |
object | string | The identifier for the API object being returned. The value will always be realtime.item . |
status | RealtimeItemStatus | The status of the item. This field doesn't affect the conversation, but it's accepted for consistency with the conversation.item.created event.Allowed values: completed , incomplete |
role | RealtimeMessageRole | The role of the message sender. This property is only applicable for message items. Allowed values: system , user , assistant |
content | array of RealtimeContentPart | The content of the message. This property is only applicable for message items.- Message items of role system support only input_text content.- Message items of role user support input_text and input_audio content.- Message items of role assistant support text content. |
call_id | string | The ID of the function call (for function_call and function_call_output items). If passed on a function_call_output item, the server will check that a function_call item with the same ID exists in the conversation history. |
name | string | The name of the function being called (for function_call items). |
arguments | string | The arguments of the function call (for function_call items). |
output | string | The output of the function call (for function_call_output items). |
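The properties above combine differently per item type. A sketch of two conversation.item.create client events, one adding a user message and one returning a function_call_output (the call_id, message text, and output values are illustrative):

```python
import json

# Hypothetical sketch: conversation.item.create events built from the table above.
user_message_event = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": "What's the weather in Paris?"}],
    },
}

function_output_event = {
    "type": "conversation.item.create",
    "item": {
        "type": "function_call_output",
        # Must reference a function_call item that already exists in the
        # conversation history; the server checks this.
        "call_id": "call_abc123",
        "output": json.dumps({"temperature_c": 18, "condition": "cloudy"}),
    },
}

# Each event would be sent as a single text frame over the /realtime WebSocket.
frame = json.dumps(user_message_event)
```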
You use the RealtimeConversationRequestItem
object to create a new item in the conversation via the conversation.item.create event.
Field | Type | Description |
---|---|---|
type | RealtimeItemType | The type of the item. |
id | string | The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one. |
The RealtimeConversationResponseItem
object represents an item in the conversation. It's used in some of the server events, such as:
- response.created (via the response property of type RealtimeResponse)
- response.done (via the response property of type RealtimeResponse)
Field | Type | Description |
---|---|---|
object | string | The identifier for the returned API object. Allowed values: realtime.item |
type | RealtimeItemType | The type of the item. Allowed values: message , function_call , function_call_output |
id | string | The unique ID of the item. The client can specify the ID to help manage server-side context. If the client doesn't provide an ID, the server generates one. This property is nullable. |
The definition of a function tool as used by the realtime endpoint.
Field | Type | Description |
---|---|---|
type | string | The type of the tool. Allowed values: function |
name | string | The name of the function. |
description | string | The description of the function, including usage guidelines. For example, "Use this function to get the current time." |
parameters | object | The parameters of the function in the form of a JSON object. |
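The parameters object follows JSON Schema conventions. A sketch of a complete tool definition, using a hypothetical get_current_time function and parameter schema:

```python
# Hypothetical sketch: a function tool definition matching the table above.
# The function name and parameter schema are illustrative.
get_time_tool = {
    "type": "function",
    "name": "get_current_time",
    "description": "Use this function to get the current time.",
    "parameters": {
        "type": "object",
        "properties": {
            "timezone": {
                "type": "string",
                "description": "An IANA time zone name, for example Europe/Paris.",
            }
        },
        "required": ["timezone"],
    },
}
```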
Allowed Values:
in_progress
completed
incomplete
Allowed Values:
message
function_call
function_call_output
Allowed Values:
system
user
assistant
Field | Type | Description |
---|---|---|
role | string | The role of the message. Allowed values: assistant |
content | array of RealtimeRequestTextContentPart | The content of the message. |
Field | Type | Description |
---|---|---|
type | string | The type of the content part. Allowed values: input_audio |
transcript | string | The transcript of the audio. |
Field | Type | Description |
---|---|---|
type | string | The type of the item. Allowed values: function_call |
name | string | The name of the function call item. |
call_id | string | The ID of the function call item. |
arguments | string | The arguments of the function call item. |
status | RealtimeItemStatus | The status of the item. |
Field | Type | Description |
---|---|---|
type | string | The type of the item. Allowed values: function_call_output |
call_id | string | The ID of the function call item. |
output | string | The output of the function call item. |
Field | Type | Description |
---|---|---|
type | string | The type of the item. Allowed values: message |
role | RealtimeMessageRole | The role of the message. |
status | RealtimeItemStatus | The status of the item. |
Field | Type | Description |
---|---|---|
type | string | The type of the item. Allowed values: message |
id | string | The ID of the message item. |
You use the RealtimeRequestSession
object when you want to update the session configuration via the session.update event.
Field | Type | Description |
---|---|---|
modalities | array | The modalities that the session supports. Allowed values: text , audio For example, "modalities": ["text", "audio"] is the default setting that enables both text and audio modalities. To enable only text, set "modalities": ["text"] . You can't enable only audio. |
instructions | string | The instructions (the system message) to guide the model's text and audio responses. Here are some example instructions to help guide content and format of text and audio responses: "instructions": "be succinct" "instructions": "act friendly" "instructions": "here are examples of good responses" Here are some example instructions to help guide audio behavior: "instructions": "talk quickly" "instructions": "inject emotion into your voice" "instructions": "laugh frequently" While the model might not always follow these instructions, they provide guidance on the desired behavior. |
voice | RealtimeVoice | The voice used for the model response for the session. Once the voice is used in the session for the model's audio response, it can't be changed. |
input_audio_format | RealtimeAudioFormat | The format for the input audio. |
output_audio_format | RealtimeAudioFormat | The format for the output audio. |
input_audio_transcription | RealtimeAudioInputTranscriptionSettings | The settings for audio input transcription. This property is nullable. |
turn_detection | RealtimeTurnDetection | The turn detection settings for the session. This property is nullable. |
tools | array of RealtimeTool | The tools available to the model for the session. |
tool_choice | RealtimeToolChoice | The tool choice for the session. Allowed values: auto , none , and required . Otherwise, you can specify the name of the function to use. |
temperature | number | The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8. |
max_response_output_tokens | integer or "inf" | The maximum number of output tokens per assistant response, inclusive of tool calls. Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens. For example, to limit the output tokens to 1000, set "max_response_output_tokens": 1000 . To allow the maximum number of tokens, set "max_response_output_tokens": "inf" .Defaults to "inf" . |
Field | Type | Description |
---|---|---|
role | string | The role of the message. Allowed values: system |
content | array of RealtimeRequestTextContentPart | The content of the message. |
Field | Type | Description |
---|---|---|
type | string | The type of the content part. Allowed values: input_text |
text | string | The text content. |
Field | Type | Description |
---|---|---|
role | string | The role of the message. Allowed values: user |
content | array of RealtimeRequestTextContentPart or RealtimeRequestAudioContentPart | The content of the message. |
Field | Type | Description |
---|---|---|
object | string | The response object. Allowed values: realtime.response |
id | string | The unique ID of the response. |
status | RealtimeResponseStatus | The status of the response. The default status value is in_progress . |
status_details | RealtimeResponseStatusDetails | The details of the response status. This property is nullable. |
output | array of RealtimeConversationResponseItem | The output items of the response. |
usage | object | Usage statistics for the response. Each Realtime API session maintains a conversation context and appends new items to the conversation. Output from previous turns (text and audio tokens) is input for later turns. See nested properties next. |
+ total_tokens | integer | The total number of tokens in the response, including input and output text and audio tokens. A property of the usage object. |
+ input_tokens | integer | The number of input tokens used in the response, including text and audio tokens. A property of the usage object. |
+ output_tokens | integer | The number of output tokens sent in the response, including text and audio tokens. A property of the usage object. |
+ input_token_details | object | Details about the input tokens used in the response. A property of the usage object. See nested properties next. |
+ cached_tokens | integer | The number of cached tokens used in the response. A property of the input_token_details object. |
+ text_tokens | integer | The number of text tokens used in the response. A property of the input_token_details object. |
+ audio_tokens | integer | The number of audio tokens used in the response. A property of the input_token_details object. |
+ output_token_details | object | Details about the output tokens used in the response. A property of the usage object.See nested properties next. |
+ text_tokens | integer | The number of text tokens used in the response. A property of the output_token_details object. |
+ audio_tokens | integer | The number of audio tokens used in the response. A property of the output_token_details object. |
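The nested token counts are additive, assuming the totals equal the sum of their text and audio parts (cached_tokens is reported separately). A sketch of a consistency check over a usage object shaped like the table above, with illustrative counts:

```python
# Hypothetical sketch: a usage object with the nested properties above and a
# check that each total equals the sum of its text and audio parts.
usage = {
    "total_tokens": 120,
    "input_tokens": 80,
    "output_tokens": 40,
    "input_token_details": {"cached_tokens": 10, "text_tokens": 50, "audio_tokens": 30},
    "output_token_details": {"text_tokens": 25, "audio_tokens": 15},
}

def usage_is_consistent(u):
    in_details = u["input_token_details"]
    out_details = u["output_token_details"]
    return (
        u["total_tokens"] == u["input_tokens"] + u["output_tokens"]
        and u["input_tokens"] == in_details["text_tokens"] + in_details["audio_tokens"]
        and u["output_tokens"] == out_details["text_tokens"] + out_details["audio_tokens"]
    )
```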
Field | Type | Description |
---|---|---|
type | string | The type of the content part. Allowed values: audio |
transcript | string | The transcript of the audio. This property is nullable. |
The response resource.
Field | Type | Description |
---|---|---|
type | string | The type of the item. Allowed values: function_call |
name | string | The name of the function call item. |
call_id | string | The ID of the function call item. |
arguments | string | The arguments of the function call item. |
status | RealtimeItemStatus | The status of the item. |
Field | Type | Description |
---|---|---|
type | string | The type of the item. Allowed values: function_call_output |
call_id | string | The ID of the function call item. |
output | string | The output of the function call item. |
Field | Type | Description |
---|---|---|
type | string | The type of the item. Allowed values: message |
role | RealtimeMessageRole | The role of the message. |
content | array | The content of the message. Array items: RealtimeResponseTextContentPart |
status | RealtimeItemStatus | The status of the item. |
Field | Type | Description |
---|---|---|
modalities | array | The modalities that the session supports. Allowed values: text , audio For example, "modalities": ["text", "audio"] is the default setting that enables both text and audio modalities. To enable only text, set "modalities": ["text"] . You can't enable only audio. |
instructions | string | The instructions (the system message) to guide the model's text and audio responses. Here are some example instructions to help guide content and format of text and audio responses: "instructions": "be succinct" "instructions": "act friendly" "instructions": "here are examples of good responses" Here are some example instructions to help guide audio behavior: "instructions": "talk quickly" "instructions": "inject emotion into your voice" "instructions": "laugh frequently" While the model might not always follow these instructions, they provide guidance on the desired behavior. |
voice | RealtimeVoice | The voice used for the model response for the session. Once the voice is used in the session for the model's audio response, it can't be changed. |
output_audio_format | RealtimeAudioFormat | The format for the output audio. |
tools | array of RealtimeTool | The tools available to the model for the session. |
tool_choice | RealtimeToolChoice | The tool choice for the session. |
temperature | number | The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8. |
max_output_tokens | integer or "inf" | The maximum number of output tokens per assistant response, inclusive of tool calls. Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens. For example, to limit the output tokens to 1000, set "max_output_tokens": 1000 . To allow the maximum number of tokens, set "max_output_tokens": "inf" . Defaults to "inf" . |
conversation | string | Controls which conversation the response is added to. The supported values are auto and none .The auto value (or not setting this property) ensures that the contents of the response are added to the session's default conversation.Set this property to none to create an out-of-band response where items won't be added to the default conversation. For more information, see the how-to guide.Defaults to "auto" |
metadata | map | Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format. Keys can be a maximum of 64 characters long and values can be a maximum of 512 characters long. For example: metadata: { topic: "classification" } |
input | array | Input items to include in the prompt for the model. Creates a new context for this response, without including the default conversation. Can include references to items from the default conversation. Array items: RealtimeConversationItemBase |
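The conversation, metadata, and input fields combine to request an out-of-band response that doesn't touch the default conversation. A sketch of such a response.create client event (instructions and input text are illustrative):

```python
# Hypothetical sketch: a response.create client event requesting an
# out-of-band response (conversation: "none") with metadata and its own input.
response_create = {
    "type": "response.create",
    "response": {
        "conversation": "none",                   # keep items out of the default conversation
        "metadata": {"topic": "classification"},  # up to 16 key-value pairs
        "modalities": ["text"],
        "instructions": "Classify the user's sentiment as positive or negative.",
        "input": [
            {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "I love this product!"}],
            }
        ],
    },
}
```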
The RealtimeResponseSession
object represents a session in the Realtime API. It's used in some of the server events, such as:
- session.created
- session.updated
Field | Type | Description |
---|---|---|
object | string | The session object. Allowed values: realtime.session |
id | string | The unique ID of the session. |
model | string | The model used for the session. |
modalities | array | The modalities that the session supports. Allowed values: text , audio For example, "modalities": ["text", "audio"] is the default setting that enables both text and audio modalities. To enable only text, set "modalities": ["text"] . You can't enable only audio. |
instructions | string | The instructions (the system message) to guide the model's text and audio responses. Here are some example instructions to help guide content and format of text and audio responses: "instructions": "be succinct" "instructions": "act friendly" "instructions": "here are examples of good responses" Here are some example instructions to help guide audio behavior: "instructions": "talk quickly" "instructions": "inject emotion into your voice" "instructions": "laugh frequently" While the model might not always follow these instructions, they provide guidance on the desired behavior. |
voice | RealtimeVoice | The voice used for the model response for the session. Once the voice is used in the session for the model's audio response, it can't be changed. |
input_audio_format | RealtimeAudioFormat | The format for the input audio. |
output_audio_format | RealtimeAudioFormat | The format for the output audio. |
input_audio_transcription | RealtimeAudioInputTranscriptionSettings | The settings for audio input transcription. This property is nullable. |
turn_detection | RealtimeTurnDetection | The turn detection settings for the session. This property is nullable. |
tools | array of RealtimeTool | The tools available to the model for the session. |
tool_choice | RealtimeToolChoice | The tool choice for the session. |
temperature | number | The sampling temperature for the model. The allowed temperature values are limited to [0.6, 1.2]. Defaults to 0.8. |
max_response_output_tokens | integer or "inf" | The maximum number of output tokens per assistant response, inclusive of tool calls. Specify an integer between 1 and 4096 to limit the output tokens. Otherwise, set the value to "inf" to allow the maximum number of tokens. For example, to limit the output tokens to 1000, set "max_response_output_tokens": 1000 . To allow the maximum number of tokens, set "max_response_output_tokens": "inf" . |
Allowed Values:
in_progress
completed
cancelled
incomplete
failed
Field | Type | Description |
---|---|---|
type | RealtimeResponseStatus | The status of the response. |
Field | Type | Description |
---|---|---|
type | string | The type of the content part. Allowed values: text |
text | string | The text content. |
Field | Type | Description |
---|---|---|
type | RealtimeServerEventType | The type of the server event. |
event_id | string | The unique ID of the server event. |
Field | Type | Description |
---|---|---|
name | string | The rate limit property name that this item includes information about. |
limit | integer | The maximum configured limit for this rate limit property. |
remaining | integer | The remaining quota available against the configured limit for this rate limit property. |
reset_seconds | number | The remaining time, in seconds, until this rate limit property is reset. |
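A client receiving a rate_limits.updated event can use these fields to decide how long to back off. A sketch, assuming the event payload carries the entries in a rate_limits array (illustrative values):

```python
# Hypothetical sketch: given rate-limit entries shaped like the table above,
# find how long to wait before any exhausted limit resets.
def seconds_until_capacity(rate_limits):
    exhausted = [r for r in rate_limits if r["remaining"] <= 0]
    if not exhausted:
        return 0.0  # capacity available now
    return max(r["reset_seconds"] for r in exhausted)

limits = [
    {"name": "requests", "limit": 100, "remaining": 0, "reset_seconds": 12.5},
    {"name": "tokens", "limit": 50000, "remaining": 2000, "reset_seconds": 30.0},
]
wait = seconds_until_capacity(limits)
```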
Allowed Values:
session.created
session.updated
conversation.created
conversation.item.created
conversation.item.deleted
conversation.item.truncated
response.created
response.done
rate_limits.updated
response.output_item.added
response.output_item.done
response.content_part.added
response.content_part.done
response.audio.delta
response.audio.done
response.audio_transcript.delta
response.audio_transcript.done
response.text.delta
response.text.done
response.function_call_arguments.delta
response.function_call_arguments.done
input_audio_buffer.speech_started
input_audio_buffer.speech_stopped
conversation.item.input_audio_transcription.completed
conversation.item.input_audio_transcription.failed
input_audio_buffer.committed
input_audio_buffer.cleared
error
Field | Type | Description |
---|---|---|
type | string | The type of turn detection. Allowed values: server_vad |
threshold | number | The activation threshold for the server VAD turn detection. In noisy environments, you might need to increase the threshold to avoid false positives. In quiet environments, you might need to decrease the threshold to avoid false negatives. Defaults to 0.5 . You can set the threshold to a value between 0.0 and 1.0 . |
prefix_padding_ms | string | The duration of speech audio (in milliseconds) to include before the start of detected speech. Defaults to 300 . |
silence_duration_ms | string | The duration of silence (in milliseconds) to detect the end of speech. You want to detect the end of speech as soon as possible, but not too soon to avoid cutting off the last part of the speech. The model will respond more quickly if you set this value to a lower number, but it might cut off the last part of the speech. If you set this value to a higher number, the model will wait longer to detect the end of speech, but it might take longer to respond. Defaults to 500 milliseconds. |
Realtime session object configuration.
The base representation of a realtime tool definition.
Field | Type | Description |
---|---|---|
type | RealtimeToolType | The type of the tool. |
The combined set of available representations for a realtime tool_choice
parameter, encompassing both string literal options like 'auto' and structured references to defined tools.
The representation of a realtime tool_choice
selecting a named function tool.
Field | Type | Description |
---|---|---|
type | string | The type of the tool_choice .Allowed values: function |
function | object | The function tool to select. See nested properties next. |
+ name | string | The name of the function tool. A property of the function object. |
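The tool_choice parameter therefore takes one of two shapes: a string literal or a structured reference to a named function tool. A sketch (the function name is illustrative):

```python
# Hypothetical sketch: the two shapes a tool_choice value can take.
tool_choice_auto = "auto"  # string literal: auto, none, or required
tool_choice_named = {
    "type": "function",
    "function": {"name": "get_current_time"},  # illustrative function name
}
```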
The available set of model-level, string literal tool_choice
options for the realtime endpoint.
Allowed Values:
auto
none
required
A base representation for a realtime tool_choice
selecting a named tool.
Field | Type | Description |
---|---|---|
type | RealtimeToolType | The type of the tool_choice . |
The supported tool type discriminators for realtime tools. Currently, only 'function' tools are supported.
Allowed Values:
function
Field | Type | Description |
---|---|---|
type | RealtimeTurnDetectionType | The type of turn detection. Allowed values: server_vad |
threshold | number | The activation threshold for the server VAD turn detection. In noisy environments, you might need to increase the threshold to avoid false positives. In quiet environments, you might need to decrease the threshold to avoid false negatives. Defaults to 0.5 . You can set the threshold to a value between 0.0 and 1.0 . |
prefix_padding_ms | string | The duration of speech audio (in milliseconds) to include before the start of detected speech. Defaults to 300 milliseconds. |
silence_duration_ms | string | The duration of silence (in milliseconds) to detect the end of speech. You want to detect the end of speech as soon as possible, but not too soon to avoid cutting off the last part of the speech. The model will respond more quickly if you set this value to a lower number, but it might cut off the last part of the speech. If you set this value to a higher number, the model will wait longer to detect the end of speech, but it might take longer to respond. Defaults to 500 milliseconds. |
create_response | boolean | Indicates whether the server will automatically create a response when VAD is enabled and speech stops. Defaults to true . |
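The fields above combine into a single turn_detection object in the session configuration. A sketch of a server_vad configuration (the values are illustrative tweaks for a somewhat noisy room):

```python
# Hypothetical sketch: a server_vad turn-detection configuration using the
# fields described above; all values are illustrative.
turn_detection = {
    "type": "server_vad",
    "threshold": 0.6,            # above the 0.5 default to reduce false positives
    "prefix_padding_ms": 300,    # audio retained before detected speech start
    "silence_duration_ms": 500,  # silence required to mark the end of speech
    "create_response": True,     # auto-create a response when speech stops
}
```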
Allowed Values:
server_vad
Allowed Values:
alloy
ash
ballad
coral
echo
sage
shimmer
verse