Hi Vivek Anand
This is a known and reproducible issue pattern with Azure AI Foundry Voice Live Agents over ACS WebSocket. Below is why this happens, what causes the disconnects and the 5+ second delay, and what you should change immediately to stabilize the agent.
What the error actually means:
{
"type": "failed",
"error": {
"message": "Failed to create message",
"type": "invalid_request_error",
"code": "message_creation_failed"
}
}
Key point: This happens after the call is connected — during message creation, not call setup.
Voice Live only accepts plain-text chat messages (and raw audio frames). If your agent tries to send:
- Structured JSON
- Tool output objects
- Non-string values (PortableValue, arrays, dicts)
- Extremely large text blobs
- Rapid back-to-back response.create events
the runtime cannot serialize the message, and it closes the WebSocket, causing the disconnect.
The same behavior is seen in other Foundry agent failures when non-text payloads are returned at the final boundary.
Every message sent must be:
{
"type": "response.create",
"response": {
"modalities": ["text", "audio"],
"instructions": "This is a plain text string"
}
}
Common mistake: sending tool output, parsed values, or objects directly.
Why you see >5 seconds of latency before the disconnect:
This delay is not network latency — it’s server-side buffering.
What happens internally:
- ACS streams audio → Voice Live
- Voice Live buffers audio awaiting:
  - turn detection
  - valid message creation
- An invalid message event arrives
- Runtime retries creation
- Retry timeout expires (~5–6 seconds)
- WebSocket is forcibly closed
This aligns with the delay behavior reported in the same Q&A thread and in Voice Live ACS bridge scenarios.
Fix 1: Ensure only plain text reaches Voice Live
Before sending any message with response.create:
- Convert content explicitly:
  - str(value)
  - JSON.stringify(value)
- Remove tool metadata
- Truncate large responses (recommended <8–10 KB)
If you are using Foundry workflows or agents: add a “Set Variable” or stringify step before the Voice Live output.
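As a rough illustration of Fix 1, here is a minimal Python sketch of a sanitizing step. The names to_plain_text, build_response_create and TOOL_METADATA_KEYS are hypothetical and not part of any SDK, and the metadata keys are placeholders for whatever your tool outputs actually carry. The payload shape is the one shown earlier in this answer.

```python
import json

MAX_BYTES = 8 * 1024  # lower end of the 8-10 KB guideline above

# Hypothetical metadata keys; replace with whatever your tool outputs carry.
TOOL_METADATA_KEYS = {"tool_call_id", "tool_calls", "metadata"}


def to_plain_text(value, max_bytes=MAX_BYTES):
    """Coerce an arbitrary agent output into a bounded plain-text string."""
    if isinstance(value, dict):
        # Drop tool metadata before serializing the rest.
        value = {k: v for k, v in value.items() if k not in TOOL_METADATA_KEYS}
    if isinstance(value, str):
        text = value
    elif isinstance(value, (dict, list, tuple)):
        text = json.dumps(value, ensure_ascii=False, default=str)
    else:
        text = str(value)
    encoded = text.encode("utf-8")
    if len(encoded) > max_bytes:
        # Cut on a byte budget, dropping any partial trailing character.
        text = encoded[:max_bytes].decode("utf-8", errors="ignore")
    return text


def build_response_create(value):
    """Wrap the sanitized text in the response.create shape shown above."""
    return {
        "type": "response.create",
        "response": {
            "modalities": ["text", "audio"],
            "instructions": to_plain_text(value),
        },
    }
```

Calling build_response_create on the agent's final output, whatever its type, means only a string ever lands in the instructions field.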
Fix 2: Send exactly one response per user turn
Voice Live does not handle multiple overlapping responses well.
Ensure:
- One response.create per detected user turn
- No parallel sends
- No “thinking” or intermediate responses
Many disconnects happen due to rapid, concurrent message creation.
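One way to enforce this on the client side is a small gate in front of your send function. The sketch below assumes an asyncio-based bridge. TurnGate, the send callable and the turn_id values are all hypothetical names for illustration, so map them to whatever your turn-detection handler provides.

```python
import asyncio


class TurnGate:
    """Allow at most one response.create per detected user turn."""

    def __init__(self, send):
        self._send = send            # hypothetical async callable: send(event: dict)
        self._lock = asyncio.Lock()  # serializes sends, so nothing goes out in parallel
        self._answered = set()       # turn ids that already received a response

    async def respond(self, turn_id, text):
        async with self._lock:
            if turn_id in self._answered:
                return False         # drop duplicate or intermediate responses
            await self._send({
                "type": "response.create",
                "response": {
                    "modalities": ["text", "audio"],
                    "instructions": text,
                },
            })
            self._answered.add(turn_id)
            return True
```

A second respond call for the same turn returns False without sending anything, which is usually enough to stop "thinking" messages from racing the final answer.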
Microsoft explicitly recommends that the first event after WebSocket connect is:
{
"type": "session.update"
}
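If you want to guard against ordering mistakes in your own bridge code, a thin wrapper like the following can help. This is only a sketch: OrderedSender and send_text are hypothetical names, and the default event carries only the type field shown above, so pass your real session.update payload in if you have one.

```python
import json


class OrderedSender:
    """Make sure session.update is the first event on a new connection."""

    def __init__(self, send_text, session_update=None):
        self._send_text = send_text  # hypothetical: writes one text frame to the socket
        # Assumption: a bare event is used unless you supply your own payload.
        self._session_update = session_update or {"type": "session.update"}
        self._session_sent = False

    def send(self, event):
        if not self._session_sent:
            if event.get("type") != "session.update":
                # Something else is about to go first; send session.update ahead of it.
                self._send_text(json.dumps(self._session_update))
            self._session_sent = True
        self._send_text(json.dumps(event))
```

Create one OrderedSender per WebSocket connection so the flag resets on reconnect.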
What I recommend you do next (practical steps):
- Log every outbound Voice Live message
- Verify it is:
  - string
  - non-empty
  - <10 KB
- Add a final stringify step before sending
- Ensure single response per turn
- Confirm region alignment
- Test with a text-only response (no audio) once to validate stability
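The logging and verification steps above could look something like this in Python. check_outbound and the logger name are hypothetical, and the check assumes your text travels in the instructions field as in the payload shown earlier.

```python
import logging

logger = logging.getLogger("voice_live.outbound")  # hypothetical logger name
MAX_BYTES = 10 * 1024


def check_outbound(event):
    """Log an outbound event and return a list of problems found in it."""
    problems = []
    response = event.get("response")
    text = response.get("instructions") if isinstance(response, dict) else None
    if not isinstance(text, str):
        problems.append("instructions is not a string")
    elif not text.strip():
        problems.append("instructions is empty")
    elif len(text.encode("utf-8")) >= MAX_BYTES:
        problems.append("instructions is 10 KB or larger")
    logger.info("outbound event=%r problems=%s", event, problems)
    return problems
```

Run it right before every send and skip (or repair) any message for which the returned list is not empty.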
If you want, you can paste:
- your response.create payload
- or your agent’s final output structure
You’re hitting a real product edge, not a mistake on your side, but it is resolvable with the above fixes.
I hope this helps. Do let me know if you have any further queries.
Thank you!