Using the Azure OpenAI Realtime API, I sometimes get no audio response even though I specify multimodal output

Tim 0 Reputation points
2025-09-21T05:35:39.9866667+00:00

I set up a text-input, audio-output Realtime API session as follows:

    // Imports (at module top):
    // import { AzureOpenAI } from "openai";
    // import { OpenAIRealtimeWS } from "openai/beta/realtime/ws";
    this.translationWs = await OpenAIRealtimeWS.azure(
      new AzureOpenAI({
        apiKey: process.env.AZURE_OPENAI_API_KEY,
        endpoint: process.env.AZURE_OPENAI_ENDPOINT,
        apiVersion: "2024-10-01-preview",
        deployment: "gpt-realtime",
      })
    );

    this.translationWs.socket.on("open", () => {
      this.sendTranslationMessage({
        type: "session.update",
        session: {
          modalities: ["text", "audio"],
          instructions: myCustomInstructions,
          output_audio_format: "g711_ulaw",
          temperature: 0.6,
        },
      });
    });

Then, after making sure I receive the session.updated event, I start passing input text in:

      this.translationWs.send({
        type: "conversation.item.create",
        item: {
          type: "message",
          role: "user",
          content: [
            {
              type: "input_text",
              text: myCustomText,
            },
          ],
        },
      });
      this.translationWs.send({
        type: "response.create",
        response: { modalities: ["text", "audio"], conversation: "none" },
      });
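
For completeness, the session.updated gate mentioned above looks roughly like this (a simplified sketch; sessionReady is my own illustrative name, not an SDK field):

      // Simplified sketch: resolve once the server acknowledges the config,
      // so input is only sent after session.updated has arrived.
      this.sessionReady = new Promise<void>((resolve) => {
        this.translationWs.on("session.updated", () => resolve());
      });

      // Before sending any conversation items:
      await this.sessionReady;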

I expect to always receive an audio response, and this happens most of the time. However, on some occasions I only receive response.text.delta events, with no response.audio.delta or response.audio_transcript.delta events at all.

The other difference I see is that when it's working as expected, I receive the following event:

{
  type: 'response.content_part.added',
  event_id: 'event_xxx',
  response_id: 'resp_xxx',
  item_id: 'item_xxx',
  output_index: 0,
  content_index: 0,
  part: { type: 'audio', transcript: '' },
  content: { type: 'audio', transcript: '' }
}

When the text-only bug happens, the same event looks like this instead:

{
  type: 'response.content_part.added',
  event_id: 'event_xxx',
  response_id: 'resp_xxx',
  item_id: 'item_xxx',
  output_index: 0,
  content_index: 0,
  part: { type: 'text', text: '' },
  content: { type: 'text', text: '' }
}
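
Right now I can only detect the bad case by inspecting part.type on that event. A rough sketch of the handlers I use (handleAudioChunk is an illustrative placeholder for my own playback code):

      // The first content part tells you whether audio will follow.
      this.translationWs.on("response.content_part.added", (event) => {
        if (event.part?.type === "text") {
          console.warn("Text-only response; no audio deltas expected:", event.response_id);
        }
      });

      this.translationWs.on("response.audio.delta", (event) => {
        // event.delta is base64-encoded audio in the session's output format.
        this.handleAudioChunk(event.delta); // illustrative placeholder
      });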

1 answer

  1. Nikhil Jha (Accenture International Limited), 4,230 Reputation points, Microsoft External Staff Moderator
    2025-09-24T04:40:05.71+00:00

    Hello Tim,

    I understand your question regarding intermittently missing audio output from the Azure OpenAI Realtime WebSocket session, even though you set both the session and response modalities to ["text","audio"].

    The intermittent failure to receive an audio response is likely due to a subtle race condition within the Azure OpenAI real-time model's generation process, not an error in your code's logic. When you request both "text" and "audio", the model prioritizes starting the response with the modality that is ready first to minimize latency.

    In most cases, the text-to-speech (TTS) engine starts promptly, and the first event you see is response.content_part.added with part: { type: 'audio' }, which locks the response in as multimodal. Occasionally, however, the core language model generates its initial text chunk before the TTS engine is fully primed, and the server instead sends response.content_part.added with part: { type: 'text' }. When that happens, the system appears to commit to a text-only stream for that specific response, ignoring the audio modality you requested.

    Workaround:
    To guarantee an audio response every time, you should explicitly request only the audio modality in your response.create call.

    this.translationWs.send({
      type: "response.create",
      // By requesting only 'audio', you force the model to generate the audio stream.
      // The text transcript will be sent alongside the audio chunks.
      response: { modalities: ["audio"], conversation: "none" }, 
    });
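
    If you want a defensive fallback on top of this, you could also cancel and re-request when a text-only part slips through. This is only a sketch (response.cancel is a standard client event, but the retry policy itself is illustrative):

    this.translationWs.on("response.content_part.added", (event) => {
      if (event.part?.type === "text") {
        // A text part means no audio will follow for this response:
        // cancel it and re-request with audio only (illustrative retry).
        this.translationWs.send({ type: "response.cancel" });
        this.translationWs.send({
          type: "response.create",
          response: { modalities: ["audio"], conversation: "none" },
        });
      }
    });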
    

    For reference:

    1) https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/realtime-audio
    2) https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-quickstart?tabs=keyless%2Cwindows&pivots=ai-foundry-portal


    Please let us know if this helps. If yes, kindly "Accept the answer" and/or upvote, so it will be beneficial to others in the community as well. 😊

