Azure OpenAI gpt-realtime generating voice response despite text-only setting

Question

Saurabh M 0

The Azure OpenAI gpt-realtime service is sometimes generating voice responses and transcriptions even when the modalities are set to ["text"].

In the responses, text such as <|vq_hbr_audio_8233|> is being returned, despite the request for text-only modalities. Additionally, there is a new parameter called output_modalities in the OpenAI gpt-realtime API, but using this parameter results in an error.

Assistance is requested to resolve this issue.

Thanks

Nikhil Jha (Accenture International Limited) 4,230 Reputation points Microsoft External Staff Moderator

2025-09-22T04:19:24.6733333+00:00

Good day.
Could you please share the model and region you are using, along with the JSON you send to the endpoint, especially where you set modalities (or output_modalities) and any other relevant parameters?

Saurabh M 0 Reputation points

2025-09-22T05:20:07.25+00:00

At the start of the session we send:

{
  type: "session.update",
  session: {
    "modalities": ["text"],
    "instructions": prompt,
    "voice": "shimmer",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "whisper-1"
    },
    "turn_detection": null,
    "tools": tools,
    "tool_choice": "auto",
    "temperature": 0.8,
    "max_response_output_tokens": 250
  }
}

To generate a response we send:

{
  type: "response.create",
  response: {
    modalities: ["text"]
  }
}

This works for most of the conversation, but every once in a while the model responds with text like <|vq_hbr_audio_9508|> in response.text.delta.
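
Until the cause is pinned down, one defensive option is to detect these markers on the client. The sketch below is only an illustration: it assumes every stray marker has the shape quoted in this thread (<|vq_hbr_audio_NNNN|>), and the helper name stripAudioMarkers is hypothetical, not part of the Realtime API.

```javascript
// Assumption: stray markers look like <|vq_hbr_audio_1234|>, as in the
// examples quoted in this thread. Other marker shapes are not covered.
const AUDIO_MARKER = /<\|vq_hbr_audio_\d+\|>/g;

// Hypothetical helper: removes markers from a piece of response text and
// reports how many were found, so the incident can be logged.
function stripAudioMarkers(text) {
  const matches = text.match(AUDIO_MARKER) ?? [];
  return {
    text: text.replace(AUDIO_MARKER, ""),
    markerCount: matches.length,
  };
}
```

Because a marker could in principle be split across two deltas, it is probably safer to run this over the accumulated response text than over each response.text.delta on its own.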

Answer 1

Nikhil Jha (Accenture International Limited) 4,230 Microsoft External Staff Moderator

Hello Saurabh M,

Your JSON clearly requests text-only output, yet audio markers (<|…|>) occasionally appear because the session configuration still includes audio-related settings.

Workaround:
Although you set "modalities": ["text"], your session.update also includes input_audio_format and output_audio_format. The Realtime API treats the presence of output_audio_format as an implicit request for audio capabilities, which causes the service to insert audio tokens.

To enforce text-only behavior, try removing the audio-related parameters from your session configuration:

{
  type: "session.update",
  session: {
    "modalities": ["text"],
    // Remove these two fields entirely:
    // "input_audio_format": "pcm16",
    // "output_audio_format": "pcm16",
    "voice": null,                          // Optional: clear voice setting
    …
  }
}
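
If the session object is assembled elsewhere in your code, the same suggestion can be applied programmatically. This is a minimal sketch, assuming the field names from the payload posted above; buildTextOnlySessionUpdate is a hypothetical helper name, and the instructions string is a placeholder.

```javascript
// Field names come from the session.update payload posted in this thread.
const AUDIO_FORMAT_FIELDS = ["input_audio_format", "output_audio_format"];

// Hypothetical helper: builds a text-only session.update event from an
// existing session config without mutating the original object.
function buildTextOnlySessionUpdate(session) {
  const cleaned = { ...session, modalities: ["text"] };
  for (const field of AUDIO_FORMAT_FIELDS) {
    delete cleaned[field];
  }
  return { type: "session.update", session: cleaned };
}

const event = buildTextOnlySessionUpdate({
  modalities: ["text"],
  instructions: "You are a helpful assistant.", // placeholder prompt
  voice: "shimmer",
  input_audio_format: "pcm16",
  output_audio_format: "pcm16",
  turn_detection: null,
  temperature: 0.8,
  max_response_output_tokens: 250,
});

// The serialized event no longer contains either audio format field.
console.log(JSON.stringify(event, null, 2));
```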

Don’t use the output_modalities parameter:
As you noted, the new output_modalities field is not yet supported in this preview and might return an error. Continue using modalities only.

Please let us know if this helps. If yes, kindly "Accept the answer" and/or upvote, so it will be beneficial to others in the community as well. 😊

Nikhil Jha (Accenture International Limited) 4,230 Reputation points Microsoft External Staff Moderator

2025-09-26T05:31:53.1466667+00:00

Hello Saurabh M,
I hope this has been helpful! We appreciate hearing from you and would love to help others who may have the same question. Accepting answers helps increase visibility of this question for other members of the Microsoft Q&A community. Thank you for helping to improve Microsoft Q&A!
Saurabh M 0 Reputation points

2025-09-26T05:41:21.8633333+00:00

Hi Team,

This seems to have helped. We have removed the input and output audio formats, but setting voice to null is currently not possible. (Error: Session update failed: Invalid type for 'session.voice': expected one of one of 'alloy', 'ash', 'ballad', 'coral', 'echo', 'sage', 'shimmer', 'verse', 'marin', or 'cedar' or object, but got null instead.)

We have seen just one stray incident of this after making these changes. We will keep observing and let you know.
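
Since null is rejected, one way to avoid that error is to leave the voice field out when no voice is set, and to check string values against the names listed in the error message before sending. This is a sketch under assumptions: normalizeVoice is a hypothetical helper, the voice list is copied from the error above and may differ in other API versions, and whether the service accepts a session with no voice field at all is not confirmed in this thread (keeping a valid name such as "shimmer" is the known-good option).

```javascript
// Voice names copied from the error message quoted above; the accepted set
// may differ in other API versions.
const KNOWN_VOICES = [
  "alloy", "ash", "ballad", "coral", "echo",
  "sage", "shimmer", "verse", "marin", "cedar",
];

// Hypothetical helper: omits `voice` when it is null or undefined instead of
// sending null, and rejects unknown voice names before they reach the service.
// Object values are passed through, since the error message says an object is
// also accepted.
function normalizeVoice(session) {
  const { voice, ...rest } = session;
  if (voice === null || voice === undefined) {
    return rest;
  }
  if (typeof voice === "string" && !KNOWN_VOICES.includes(voice)) {
    throw new Error(`Unknown voice: ${voice}`);
  }
  return { ...rest, voice };
}
```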
Nikhil Jha (Accenture International Limited) 4,230 Reputation points Microsoft External Staff Moderator

2025-10-03T10:25:21.8033333+00:00

Hi Saurabh M,
I will get in touch with you shortly to find a solution.
Nikhil Jha (Accenture International Limited) 4,230 Reputation points Microsoft External Staff Moderator

2025-10-07T04:12:06.79+00:00

Hi Saurabh M,
Could you please follow the documentation for correct whisper model usage:
https://learn.microsoft.com/en-us/azure/ai-foundry/openai/whisper-quickstart?tabs=command-line%2Cpython-new%2Ckeyless%2Ctypescript-keyless&pivots=programming-language-python
