Hello Tim,
I understand your question: you are intermittently not receiving any audio output from your Azure OpenAI Realtime WebSocket session, even though you set both the session and response modalities to ["text", "audio"].
The intermittent failure to receive an audio response is likely due to a subtle race condition within the Azure OpenAI real-time model's generation process, not an error in your code's logic. When you request both "text" and "audio", the model prioritizes starting the response with the modality that is ready first to minimize latency.
In most cases, the Text-to-Speech (TTS) engine starts promptly, and the first event you see is response.content_part.added with part: { type: 'audio' }. This locks in the response as multimodal. However, occasionally, the core language model might generate the initial text chunk before the TTS engine is fully primed.
In these instances, the first event is response.content_part.added with part: { type: 'text' }. When this happens, the service appears to commit to a text-only stream for that specific response, ignoring the audio modality you requested.
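You can confirm whether this is what you are hitting by logging which content part type arrives first for each response. Below is a minimal sketch; it assumes your translationWs wrapper exposes a Node-style "message" event that delivers the raw JSON server events (adapt the handler to however your wrapper actually surfaces messages):

this.translationWs.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.content_part.added") {
    // part.type === 'audio' means the response is multimodal;
    // part.type === 'text' is the text-only fallback described above.
    console.log(`response ${event.response_id} opened with part type: ${event.part.type}`);
  }
});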
Workaround:
To guarantee an audio response every time, you should explicitly request only the audio modality in your response.create call.
this.translationWs.send({
  type: "response.create",
  // By requesting only 'audio', you force the model to generate the audio stream.
  // The text transcript will be sent alongside the audio chunks.
  response: { modalities: ["audio"], conversation: "none" },
});
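The transcript of the spoken output still arrives via response.audio_transcript.delta events, so you do not lose the text. If you are talking to the service over a raw WebSocket rather than a wrapper, note that the payload has to be JSON-serialized before sending; the sketch below uses the 'ws' package, and the resource name, deployment, api-version, and environment variable are placeholders you would replace with your own values:

import WebSocket from "ws";

// Placeholders: substitute your own resource name, deployment, and key.
const ws = new WebSocket(
  "wss://<your-resource>.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=<your-deployment>",
  { headers: { "api-key": process.env.AZURE_OPENAI_API_KEY ?? "" } }
);

ws.on("open", () => {
  // A raw WebSocket expects a string, so the client event is serialized explicitly.
  ws.send(JSON.stringify({
    type: "response.create",
    response: { modalities: ["audio"], conversation: "none" },
  }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.audio.delta") {
    // event.delta is a base64-encoded audio chunk; decode it and feed your playback pipeline.
  } else if (event.type === "response.audio_transcript.delta") {
    // The transcript text still streams in alongside the audio-only response.
  }
});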
For Reference:
1) https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/realtime-audio
2) https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-quickstart?tabs=keyless%2Cwindows&pivots=ai-foundry-portal
Please let us know if this helps. If yes, kindly "Accept the answer" and/or upvote, so it will be beneficial to others in the community as well. 😊