GPT Realtime Whisper in Azure OpenAI is designed specifically for realtime, streaming transcription of live audio and is available through the Realtime API, not only as a classic Speech-to-Text file API.
From the provided information:
- GPT Realtime Whisper is a streaming transcription model that “converts live audio to text in real time” and is intended to run “alongside other realtime models (such as GPT Realtime Translate) to provide source-language transcription in parallel with translation.” This explicitly covers ongoing, low‑latency, realtime scenarios such as live captions and monitoring.
- Realtime sessions opt into transcription via the input_audio_transcription property in a session.update event. When a transcription model such as whisper-1 is configured there, the service emits conversation.item.audio_transcription.completed events during the session.
- The documentation shows GPT Realtime Whisper as part of the same Realtime API surface as other GPT realtime models, and it is deployed as a Global Standard (pay‑as‑you‑go) model from the model catalog in Microsoft Foundry.
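As an illustration of the session.update opt-in described above, the payload could be built roughly as follows. This is a minimal sketch, not a confirmed Azure sample: the helper names build_transcription_update and is_transcription_completed are hypothetical, the payload shape is inferred from the property and event names quoted above, and opening the connection and sending the event are left out.

```python
import json

# Event type name as quoted in this answer; not verified against a live service.
TRANSCRIPTION_COMPLETED = "conversation.item.audio_transcription.completed"


def build_transcription_update(model: str = "whisper-1") -> str:
    """Serialize a session.update event that enables input audio transcription.

    The nesting under "session" is an assumption based on the property name
    input_audio_transcription mentioned above.
    """
    event = {
        "type": "session.update",
        "session": {
            "input_audio_transcription": {"model": model},
        },
    }
    return json.dumps(event)


def is_transcription_completed(raw_event: str) -> bool:
    """Return True if a server event (JSON text) reports a completed transcription."""
    return json.loads(raw_event).get("type") == TRANSCRIPTION_COMPLETED
```

Passing a different model name to build_transcription_update only changes the string in the payload. Whether the service accepts a given value, such as gpt-4o-realtime-whisper, is exactly what the documentation leaves open.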
However, the context provided does not explicitly state that gpt-4o-realtime-whisper can be selected as the input_audio_transcription model for an existing GPT realtime voice‑to‑voice session in Azure OpenAI, nor does it list that exact model name as a supported value in the Realtime session configuration. The context only names whisper-1 and the gpt-4o-*-transcribe family in that role.
Therefore, based on the available documentation:
- Azure OpenAI supports realtime streaming transcription via GPT Realtime Whisper through the Realtime API.
- Realtime sessions can be configured for live transcription using input_audio_transcription with supported models such as whisper-1.
- The documentation provided does not confirm that gpt-4o-realtime-whisper is currently a supported input_audio_transcription model for Azure GPT realtime voice‑to‑voice sessions, nor whether this is region or API‑version dependent.
If gpt-4o-realtime-whisper fails as an input_audio_transcription model in a Realtime session while whisper-1 works, the supplied documentation does not explain why. The behavior may reflect a current limitation, or a configuration or availability issue that is not documented here.