The Responses API supports a WebSocket mode for long-running, tool-heavy workflows. In WebSocket mode, you keep a persistent connection to /v1/responses and continue each turn by sending only new input items together with a previous_response_id. This approach reduces per-turn overhead and improves end-to-end latency across long chains.
WebSocket mode works with store=false.
Prerequisites
- An Azure OpenAI model deployed.
- An authentication method:
  - API key, or
  - Microsoft Entra ID.
- For Python examples:
  - Install the websocket-client package.
  - Install azure-identity for Microsoft Entra ID authentication.
When to use WebSocket mode
Use WebSocket mode when a workflow involves many model-tool round trips, such as agentic coding or orchestration loops with repeated tool calls. Because the connection stays open and each turn sends only incremental input, continuation latency is lower than with repeated HTTP requests.
For single-shot requests or short conversations, keep using the standard HTTP Responses API.
How it works
You open one WebSocket connection to /v1/responses and drive it with response.create events:
- The first response.create starts a new turn. The payload mirrors the HTTP create body, except that transport-specific fields like stream and background don't apply.
- Follow-up response.create messages chain from the prior response using previous_response_id and include only new input items.
Server events and ordering match the existing Responses streaming event model.
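The two message shapes described above can be sketched with a small helper. This is a minimal sketch rather than part of any SDK: `user_message` and `build_response_create` are hypothetical names, and the fields simply mirror the examples on this page.

```python
import json
from typing import Any, Optional


def user_message(text: str) -> dict[str, Any]:
    """Build a user message input item in the shape used on this page."""
    return {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": text}],
    }


def build_response_create(
    model: str,
    new_items: list[dict[str, Any]],
    previous_response_id: Optional[str] = None,
    store: bool = False,
) -> str:
    """Serialize a response.create event (hypothetical helper)."""
    event: dict[str, Any] = {
        "type": "response.create",
        "model": model,
        "store": store,
        "input": new_items,  # only the items that are new for this turn
    }
    # Follow-up turns chain from the prior response; the first turn omits the ID.
    if previous_response_id is not None:
        event["previous_response_id"] = previous_response_id
    return json.dumps(event)
```

The returned string can be passed directly to `ws.send(...)` on the open socket.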
Start a turn
Send a response.create event on the open socket. WebSocket mode supports both API key and Microsoft Entra ID authentication. The following example uses an API key to connect to the WebSocket endpoint and ask the model a question.
import json

from websocket import create_connection

# Placeholders: replace with your resource name and API key.
YOUR_RESOURCE_NAME = "<your-resource-name>"
YOUR_AOAI_API_KEY = "<your-api-key>"

# Open one persistent connection to the Responses endpoint.
ws = create_connection(
    f"wss://{YOUR_RESOURCE_NAME}.openai.azure.com/openai/v1/responses",
    header=[f"Authorization: Bearer {YOUR_AOAI_API_KEY}"],
)

# Start the first turn.
ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-4.1",  # Replace with your model deployment name
    "store": False,
    "input": [
        {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Find fizz_buzz()"}],
        }
    ],
    "tools": [],
}))
Tip
You can optionally warm up request state by sending response.create with generate: false. Use this option when you already know the tools, instructions, or messages you plan to send with an upcoming turn. A warmup doesn't return model output but prepares request state so the next generated turn can start faster. The warmup request returns a response ID that you can chain from by using previous_response_id.
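As a rough illustration of the warmup described in this tip, the helper below builds a response.create event with generate set to false. `build_warmup` is a hypothetical name, and whether the service accepts a warmup with some of these fields omitted is an assumption; treat it as a sketch of the payload shape only.

```python
import json
from typing import Any, Optional


def build_warmup(
    model: str,
    instructions: Optional[str] = None,
    tools: Optional[list[dict[str, Any]]] = None,
    input_items: Optional[list[dict[str, Any]]] = None,
) -> str:
    """Serialize a warmup response.create event (hypothetical helper)."""
    event: dict[str, Any] = {
        "type": "response.create",
        "model": model,
        "store": False,
        "generate": False,  # prepare request state only; no model output
    }
    # Include only the pieces already known for the upcoming turn.
    if instructions is not None:
        event["instructions"] = instructions
    if tools is not None:
        event["tools"] = tools
    if input_items is not None:
        event["input"] = input_items
    return json.dumps(event)
```

After the warmup completes, the next generated turn would reference the warmup's response ID through previous_response_id.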
Stream the response
Read events from the WebSocket, print text as it streams in, and stop when the response is done.
while True:
    event = json.loads(ws.recv())
    if event["type"] == "response.output_text.delta":
        # Print each text fragment as it arrives.
        print(event["delta"], end="", flush=True)
    elif event["type"] == "response.completed":
        # Keep the response ID so the next turn can chain from it.
        response_id = event.get("response", {}).get("id")
        print(f"\nResponse ID: {response_id}")
        break

# Close the socket only when you are done with all turns.
# ws.close()
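The read loop above waits indefinitely if the service reports an error instead of completing the response. The sketch below is one possible variant that also stops on the error events shown in the Troubleshooting section. `collect_turn` and `socket_events` are hypothetical helpers, not part of websocket-client.

```python
import json
from typing import Iterable, Iterator, Optional


def collect_turn(raw_events: Iterable[str]) -> tuple[str, Optional[str]]:
    """Consume the events of one turn and return (text, response_id).

    Raises RuntimeError if an error event arrives or the stream ends early.
    """
    parts: list[str] = []
    for raw in raw_events:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "response.output_text.delta":
            parts.append(event["delta"])
        elif kind == "response.completed":
            return "".join(parts), event.get("response", {}).get("id")
        elif kind == "error":
            err = event.get("error", {})
            raise RuntimeError(f"{err.get('code')}: {err.get('message')}")
    raise RuntimeError("stream ended before response.completed")


def socket_events(ws) -> Iterator[str]:
    """Yield raw frames from an open websocket-client connection."""
    while True:
        yield ws.recv()
```

With a live connection, `text, response_id = collect_turn(socket_events(ws))` reads one full turn. Because `collect_turn` accepts any iterable of JSON strings, it can also be exercised with recorded events.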
Continue with incremental inputs
To continue the same chain, send another response.create on the same socket with:
- previous_response_id set to the prior response ID.
- input containing only new items, such as tool outputs and the next user message.
ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-4.1",
    "store": False,
    # ID captured from the prior response.completed event.
    "previous_response_id": response_id,
    "input": [
        {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Now optimize it."}],
        },
    ],
    "tools": [],
}))
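The example above continues with a user message. When the new items are tool outputs instead, the continuation might look like the following sketch. It assumes function tools and the function_call_output item shape of the Responses API. `function_call_output` and `build_tool_continuation` are hypothetical helper names, and the call IDs come from the function calls in the prior response.

```python
import json
from typing import Any


def function_call_output(call_id: str, result: Any) -> dict[str, Any]:
    """Wrap one tool result as a function_call_output input item."""
    return {
        "type": "function_call_output",
        "call_id": call_id,  # must match the call_id of the model's function call
        "output": json.dumps(result),  # the output is sent as a string
    }


def build_tool_continuation(
    model: str,
    previous_response_id: str,
    results: dict[str, Any],
    tools: list[dict[str, Any]],
) -> str:
    """Serialize a follow-up response.create that carries only tool outputs."""
    event = {
        "type": "response.create",
        "model": model,
        "store": False,
        "previous_response_id": previous_response_id,
        "input": [function_call_output(cid, res) for cid, res in results.items()],
        "tools": tools,
    }
    return json.dumps(event)
```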
Continuation semantics
WebSocket mode uses the same previous_response_id chaining as HTTP mode, but adds a lower-latency continuation path on the active socket.
On an active WebSocket connection, the service keeps one previous-response state in a connection-local in-memory cache (the most recent response). Continuing from that response is fast because the service reuses connection-local state. Because this state is retained only in memory and isn't written to disk, WebSocket mode is compatible with store=false.
If a previous_response_id isn't in the in-memory cache, behavior depends on whether you store responses:
- With store=true, the service might hydrate older response IDs from persisted state. Continuation still works but usually loses the in-memory latency benefit.
- With store=false, there's no persisted fallback. If the ID is uncached, the request returns previous_response_not_found.
If a turn fails (4xx or 5xx), the service evicts the referenced previous_response_id from the connection-local cache. This prevents reusing stale cached state for that failed continuation.
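To reason about these rules, it can help to simulate them. The class below is a toy model written for illustration only; it is not how the service is implemented. `ConnectionModel`, the `path` labels, and the `turn_failed` marker are all hypothetical.

```python
from typing import Optional


class ConnectionModel:
    """Toy model of the continuation rules on one connection (illustrative only)."""

    def __init__(self) -> None:
        self.cached_id: Optional[str] = None  # one slot: the most recent response
        self.persisted: set[str] = set()      # IDs created with store=True
        self._counter = 0

    def create(
        self,
        previous_response_id: Optional[str] = None,
        store: bool = False,
        fail: bool = False,
    ) -> dict:
        if previous_response_id is None:
            path = "new_chain"
        elif previous_response_id == self.cached_id:
            path = "in_memory"   # fast path: connection-local state is reused
        elif previous_response_id in self.persisted:
            path = "hydrated"    # works, but without the in-memory benefit
        else:
            return {"error": "previous_response_not_found"}

        if fail:
            # A failed turn evicts the referenced ID from the connection-local cache.
            if previous_response_id is not None and previous_response_id == self.cached_id:
                self.cached_id = None
            return {"error": "turn_failed"}

        self._counter += 1
        new_id = f"resp_{self._counter}"
        self.cached_id = new_id
        if store:
            self.persisted.add(new_id)
        return {"id": new_id, "path": path}
```

In this model, continuing from anything other than the most recent response with store=false yields previous_response_not_found, which matches the behavior described above.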
Compaction
If you use context compaction, there are two different continuation patterns.
Server-side compaction
When you enable server-side compaction (context_management with compact_threshold), compaction happens during normal /responses generation. In WebSocket mode, you continue the same way you normally do: send the next response.create with the latest previous_response_id and only new input items.
Standalone /responses/compact
The standalone /responses/compact endpoint returns a new compacted input window, not a response ID. After compaction, start a new response on your WebSocket connection by omitting previous_response_id (or setting it to null) and passing the compacted output as input, plus the next user or tool items. Pass the compacted output as-is; don't prune the returned window.
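# Compact your current window (HTTP call).
# `client` is assumed to be an OpenAI SDK client that is already configured
# for your Azure OpenAI resource.
compacted = client.responses.compact(
    model="gpt-4.1",
    input=long_input_items_array,
)

# Start a new response on the WebSocket using the compacted window.
# If compacted.output holds SDK model objects rather than plain dicts,
# convert each item to a dict before calling json.dumps.
ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-4.1",
    "store": False,
    "input": [
        *compacted.output,
        {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Continue from here."}],
        },
    ],
    "tools": [],
}))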
Connection behavior and limits
- A single WebSocket connection can receive multiple response.create messages, but it runs them sequentially (one in-flight response at a time).
- The connection doesn't support multiplexing. Use multiple connections if you need parallel runs.
- Connection duration is limited to 60 minutes. Reconnect when you reach the limit.
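One way to avoid hitting the 60-minute limit in the middle of a turn is to track the connection's age on the client and reconnect between turns. The sketch below assumes a client-side safety margin. `ConnectionClock` and the default margin are hypothetical choices, not service settings.

```python
import time
from typing import Callable

MAX_CONNECTION_SECONDS = 60 * 60  # connection duration limit stated above


class ConnectionClock:
    """Track how long a connection has been open (client-side sketch)."""

    def __init__(
        self,
        now: Callable[[], float] = time.monotonic,
        safety_margin: float = 120.0,  # hypothetical buffer, in seconds
    ) -> None:
        self._now = now
        self._opened_at = now()
        self._margin = safety_margin

    def age(self) -> float:
        """Seconds elapsed since the connection was opened."""
        return self._now() - self._opened_at

    def should_reconnect(self) -> bool:
        """True once the connection is within the safety margin of the limit."""
        return self.age() >= MAX_CONNECTION_SECONDS - self._margin
```

Create the clock right after opening the socket and check `should_reconnect()` before sending each response.create.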
Reconnect and recover
When a connection closes or hits the 60-minute limit, open a new WebSocket connection and continue with one of these patterns:
- If your prior response is persisted (store=true) and you have a valid response ID, continue with previous_response_id and new input items.
- If you can't continue the chain (for example, store=false or previous_response_not_found), start a new response by omitting previous_response_id (or setting it to null) and send the full input context for the next turn.
- If you compacted context with /responses/compact, use the returned compacted window as the base input for the new response, then append the latest user or tool items.
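The three recovery patterns above can be combined into one decision function. This is a minimal sketch under stated assumptions: `build_recovery_event` is a hypothetical helper, and the order in which it prefers the patterns (persisted chain, then compacted window, then full context) is a design choice made for this example.

```python
from typing import Any, Optional

Item = dict[str, Any]


def build_recovery_event(
    model: str,
    new_items: list[Item],
    *,
    store: bool,
    last_response_id: Optional[str] = None,
    compacted_window: Optional[list[Item]] = None,
    full_context: Optional[list[Item]] = None,
) -> Item:
    """Build the first response.create event to send on a new connection."""
    event: Item = {"type": "response.create", "model": model, "store": store}
    if store and last_response_id:
        # Persisted chain: continue with the prior ID and only new items.
        event["previous_response_id"] = last_response_id
        event["input"] = list(new_items)
    elif compacted_window is not None:
        # Compacted window: use it as-is as the base input, then append new items.
        event["input"] = [*compacted_window, *new_items]
    elif full_context is not None:
        # No usable chain: resend the full context and omit previous_response_id.
        event["input"] = [*full_context, *new_items]
    else:
        raise ValueError("nothing to recover from: no response ID, window, or context")
    return event
```

If the persisted path still returns previous_response_not_found, call the function again without `last_response_id` so that it falls back to the compacted window or the full context.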
Troubleshooting
- previous_response_not_found: The referenced response ID isn't in the connection-local cache and there's no persisted state to hydrate from. Start a new chain, or enable store=true if your scenario allows it.
  {
    "type": "error",
    "status": 400,
    "error": {
      "code": "previous_response_not_found",
      "message": "Previous response with id 'resp_abc' not found.",
      "param": "previous_response_id"
    }
  }
- websocket_connection_limit_reached: The connection has been open for 60 minutes. Open a new WebSocket connection and continue using one of the Reconnect and recover patterns.
  {
    "type": "error",
    "status": 400,
    "error": {
      "type": "invalid_request_error",
      "code": "websocket_connection_limit_reached",
      "message": "Responses websocket connection limit reached (60 minutes). Create a new websocket connection to continue."
    }
  }
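A client can map these two error codes to the remedies described above. The following sketch reads only the type and error.code fields shown in the sample payloads. `recommended_action` and its return labels are hypothetical.

```python
import json

# Hypothetical action labels keyed by the error codes documented above.
ACTIONS = {
    "previous_response_not_found": "start_new_chain",
    "websocket_connection_limit_reached": "reconnect",
}


def recommended_action(raw_event: str) -> str:
    """Return a suggested next step for one raw server event."""
    event = json.loads(raw_event)
    if event.get("type") != "error":
        return "none"  # not an error event; nothing to recover from
    code = event.get("error", {}).get("code")
    # Unknown codes are surfaced to the caller rather than handled silently.
    return ACTIONS.get(code, "raise")
```

For example, passing the websocket_connection_limit_reached payload shown above returns "reconnect", after which one of the Reconnect and recover patterns applies.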