Azure OpenAI – Does gpt-4o-realtime-whisper support live transcription during realtime voice-to-voice sessions?

Muhammad Umar 0 Reputation points
2026-05-12T08:55:53.58+00:00

Hello,

We are currently using Azure OpenAI realtime voice-to-voice sessions together with whisper-1 / gpt-4o-transcribe for live transcription.

We would like to clarify whether Azure OpenAI currently supports using gpt-4o-realtime-whisper as the transcription model during active realtime sessions.

Specifically:

  • Can gpt-4o-realtime-whisper transcribe an ongoing realtime voice-to-voice session simultaneously?
  • Or is it currently limited to classic Speech-to-Text streaming scenarios only?

We tested this successfully on the OpenAI platform directly, but on Azure OpenAI it does not appear to work.

Could you clarify whether this feature:

  • is already supported on Azure,
  • is region/API-version dependent,
  • or is not yet available?

Thank you.

Azure OpenAI in Foundry Models

3 answers

  1. SRILAKSHMI C 18,745 Reputation points Microsoft External Staff Moderator
    2026-05-13T07:57:01.9666667+00:00

    Hello @Muhammad Umar

    Thank you for reaching out to Microsoft Q&A.

    Based on the current Azure OpenAI implementation, the behavior you are seeing is expected.

    At this time, Azure OpenAI supports:

    • realtime voice-to-voice interactions through GPT-4o realtime models (such as gpt-4o-realtime-preview)
    • Whisper-based transcription using models like whisper-1 and gpt-4o-transcribe
    • streaming Speech-to-Text scenarios through Azure Speech services

    However, Azure OpenAI does not currently expose a dedicated gpt-4o-realtime-whisper model within the Azure realtime endpoint (/openai/v1/realtime) for simultaneous live transcription during active voice-to-voice sessions.

    This means that running a realtime voice conversation with gpt-4o-realtime-whisper as an integrated, parallel transcription model is not currently supported in Azure OpenAI in the way it may work on the public OpenAI platform.

    At present, if live transcription is required alongside a realtime voice session in Azure, the recommended approaches are:

    1. Run a parallel Speech-to-Text streaming connection. You can use Azure Speech SDK Conversation Transcription, the standard Speech-to-Text streaming APIs, or whisper-1 / gpt-4o-transcribe alongside the realtime GPT-4o voice session.
    2. Monitor model availability in your Azure region. Realtime model support in Azure OpenAI depends on region availability, API version, deployment type, and staged rollout status.

    When/if gpt-4o-realtime-whisper becomes available in Azure OpenAI, it would appear in the supported model list for your region and API version.
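One way to act on this is to choose the transcription model at startup from whatever your resource actually offers, so the session falls back cleanly until the newer model appears. The following is a minimal sketch only: the `available` list is assumed to come from your own deployment inventory (how you obtain it is not shown), and both the function name and the preference order are hypothetical choices for illustration.

```python
from typing import Iterable, Optional, Sequence

# Assumed preference order: try the newer model first, then the two models
# that this thread reports as working on Azure today.
PREFERRED_MODELS = ("gpt-4o-realtime-whisper", "gpt-4o-transcribe", "whisper-1")


def pick_transcription_model(
    available: Iterable[str],
    preferred: Sequence[str] = PREFERRED_MODELS,
) -> Optional[str]:
    """Return the first preferred model present in `available`, else None."""
    available_set = set(available)
    for name in preferred:
        if name in available_set:
            return name
    return None


# Example: a resource that only has the two currently working models deployed.
print(pick_transcription_model(["whisper-1", "gpt-4o-transcribe"]))
```

With this shape, nothing in the session code has to change once gpt-4o-realtime-whisper shows up in the list for your region and API version.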

    Based on current availability, the integrated realtime Whisper functionality appears to be available today on the public OpenAI platform and in certain Microsoft Foundry scenarios, but it is not yet fully exposed through the Azure OpenAI realtime APIs.

    Thank you!


  2. kagiyama yutaka 3,415 Reputation points
    2026-05-12T10:25:57.3366667+00:00

    I think Azure doesn’t list gpt‑4o‑realtime‑whisper as a model for live input_audio_transcription, so realtime voice‑to‑voice transcription isn’t available with that model right now. whisper‑1 and gpt‑4o‑transcribe work reliably today, and swapping one of them in is all you need to keep the session stable.


  3. AI answer

    2026-05-12T08:56:13.9066667+00:00

    GPT Realtime Whisper in Azure OpenAI is designed specifically for realtime, streaming transcription of live audio and is available through the Realtime API, not only as a classic Speech-to-Text file API.

    From the provided information:

    • GPT Realtime Whisper is a streaming transcription model that “converts live audio to text in real time” and is intended to run “alongside other realtime models (such as GPT Realtime Translate) to provide source-language transcription in parallel with translation.” This explicitly covers ongoing, low‑latency, realtime scenarios such as live captions and monitoring.
    • Realtime sessions opt into transcription via the input_audio_transcription property in a session.update event. When a transcription model such as whisper-1 is configured there, the service emits conversation.item.audio_transcription.completed events during the session.
    • The documentation shows GPT Realtime Whisper as part of the same Realtime API surface as other GPT realtime models, and it is deployed as a Global Standard (pay‑as‑you‑go) model from the model catalog in Microsoft Foundry.
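The session configuration described in the points above can be sketched as plain JSON handling, independent of any particular WebSocket client. This is an illustrative sketch, not the service's reference code: the helper names are hypothetical, the payload contains only the `input_audio_transcription` property mentioned in this answer, and the `transcript` field name is an assumption taken from the OpenAI Realtime API event shape. The answer names the event `conversation.item.audio_transcription.completed`, while the OpenAI Realtime API reference uses `conversation.item.input_audio_transcription.completed`, so the sketch accepts both.

```python
import json
from typing import Optional

# Both spellings are accepted because the thread and the OpenAI reference
# name this server event differently (see the note above).
TRANSCRIPT_EVENT_TYPES = {
    "conversation.item.input_audio_transcription.completed",
    "conversation.item.audio_transcription.completed",
}


def build_session_update(transcription_model: str = "whisper-1") -> str:
    """Build a session.update client event that opts into transcription."""
    event = {
        "type": "session.update",
        "session": {
            "input_audio_transcription": {"model": transcription_model},
        },
    }
    return json.dumps(event)


def extract_transcript(raw_event: str) -> Optional[str]:
    """Return the transcript text if the event is a completed transcription."""
    event = json.loads(raw_event)
    if event.get("type") in TRANSCRIPT_EVENT_TYPES:
        return event.get("transcript")  # field name is an assumption
    return None
```

In use, the string from `build_session_update("gpt-4o-transcribe")` would be sent over the already open realtime connection, and each incoming message would be passed through `extract_transcript`, which returns None for every other event type.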

    However, the context provided does not explicitly state that gpt-4o-realtime-whisper can be selected as the input_audio_transcription model for an existing GPT realtime voice‑to‑voice session in Azure OpenAI, nor does it list that exact model name as a supported value in the Realtime session configuration. The context only names whisper-1 and the gpt-4o-*-transcribe family in that role.

    Therefore, based on the available documentation:

    • Azure OpenAI supports realtime streaming transcription via GPT Realtime Whisper through the Realtime API.
    • Realtime sessions can be configured for live transcription using input_audio_transcription with supported models such as whisper-1.
    • The documentation provided does not confirm that gpt-4o-realtime-whisper is currently a supported input_audio_transcription model for Azure GPT realtime voice‑to‑voice sessions, nor whether this is region or API‑version dependent.

    If gpt-4o-realtime-whisper is not working as an input_audio_transcription model in a Realtime session while whisper-1 works, this is not clarified in the supplied documentation and may indicate a current limitation or a configuration/availability issue that is not documented here.
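To tell a limitation apart from a configuration problem, it can help to log whatever the service sends back right after the session.update. The sketch below assumes the server reports problems as an event with `"type": "error"` and a nested `error` object carrying `code` and `message`, as in the OpenAI Realtime API; whether Azure returns such an event for an unsupported transcription model is not confirmed by this thread, and the helper name is hypothetical.

```python
import json
from typing import Optional


def describe_error(raw_event: str) -> Optional[str]:
    """Summarize a server error event, or return None for any other event."""
    event = json.loads(raw_event)
    if event.get("type") != "error":
        return None
    error = event.get("error") or {}  # assumed nested error object
    code = error.get("code") or "unknown"
    message = error.get("message") or "no message"
    return f"{code}: {message}"
```

Comparing the logged output for gpt-4o-realtime-whisper against a run with whisper-1 on the same deployment shows whether the model name itself is what the service rejects.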


