Hello Tim,
I understand your question: you are intermittently not receiving any audio output from your Azure OpenAI Realtime WebSocket session, even though you set both the session and response modalities to ["text", "audio"].
The intermittent failure to receive an audio response is likely due to a subtle race condition within the Azure OpenAI real-time model's generation process, not an error in your code's logic. When you request both "text" and "audio", the model prioritizes starting the response with the modality that is ready first to minimize latency.
In most cases, the Text-to-Speech (TTS) engine starts promptly, and the first event you see is response.content_part.added with part: { type: 'audio' }. This locks in the response as multimodal. However, occasionally, the core language model might generate the initial text chunk before the TTS engine is fully primed.
In these instances, the first event is response.content_part.added with part: { type: 'text' }. When this happens, the service appears to commit to a text-only stream for that specific response, ignoring the audio modality you requested.
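You can confirm whether this is what you are hitting by logging which content part type arrives first for each response. Below is a minimal sketch; it assumes your translationWs wrapper exposes a Node-style "message" event that delivers the raw JSON server events (adapt the handler to however your wrapper actually surfaces messages):

this.translationWs.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.content_part.added") {
    // part.type === 'audio' means the response is multimodal;
    // part.type === 'text' is the text-only fallback described above.
    console.log(`response ${event.response_id} opened with part type: ${event.part.type}`);
  }
});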
Workaround:
To guarantee an audio response every time, you should explicitly request only the audio modality in your response.create call.
this.translationWs.send({
  type: "response.create",
  // By requesting only 'audio', you force the model to generate the audio stream.
  // The text transcript will be sent alongside the audio chunks.
  response: { modalities: ["audio"], conversation: "none" },
});
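The transcript of the spoken output still arrives via response.audio_transcript.delta events, so you do not lose the text. If you are talking to the service over a raw WebSocket rather than a wrapper, note that the payload has to be JSON-serialized before sending; the sketch below uses the 'ws' package, and the resource name, deployment, api-version, and environment variable are placeholders you would replace with your own values:

import WebSocket from "ws";

// Placeholders: substitute your own resource name, deployment, and key.
const ws = new WebSocket(
  "wss://<your-resource>.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=<your-deployment>",
  { headers: { "api-key": process.env.AZURE_OPENAI_API_KEY ?? "" } }
);

ws.on("open", () => {
  // A raw WebSocket expects a string, so the client event is serialized explicitly.
  ws.send(JSON.stringify({
    type: "response.create",
    response: { modalities: ["audio"], conversation: "none" },
  }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.audio.delta") {
    // event.delta is a base64-encoded audio chunk; decode it and feed your playback pipeline.
  } else if (event.type === "response.audio_transcript.delta") {
    // The transcript text still streams in alongside the audio-only response.
  }
});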
For Reference:
1) https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/realtime-audio
2) https://learn.microsoft.com/en-us/azure/ai-foundry/openai/realtime-audio-quickstart?tabs=keyless%2Cwindows&pivots=ai-foundry-portal
Please let us know if this helps. If yes, kindly "Accept the answer" and/or upvote, so it will be beneficial to others in the community as well. 😊