GPT Realtime Whisper overview

GPT Realtime Whisper is a streaming transcription model that converts live audio to text in real time. You can use it alongside speech-to-speech and translation models to provide continuous input transcription for audio streams.

Key capabilities

Streaming transcription: Transcribes live audio as it arrives, without waiting for the utterance to complete.
Low latency: Designed for real-time scenarios where delays aren't acceptable, such as live captions or quality monitoring.
Parallel operation: Runs alongside other realtime models (such as GPT Realtime Translate) to provide source-language transcription in parallel with translation.
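
To make the streaming behavior concrete, the following is a minimal sketch of how a client might assemble partial transcripts into live caption text. It assumes the service emits events shaped like the Realtime API's `conversation.item.input_audio_transcription.delta` and `conversation.item.input_audio_transcription.completed` events; whether this model uses exactly those event names and fields is an assumption, and `CaptionBuffer` is a hypothetical helper, not part of any SDK.

```python
from collections import defaultdict
from typing import Optional

# Event type names follow the Realtime API's input-transcription events.
# Treating them as what this model emits is an assumption of this sketch.
DELTA = "conversation.item.input_audio_transcription.delta"
COMPLETED = "conversation.item.input_audio_transcription.completed"


class CaptionBuffer:
    """Hypothetical helper that accumulates partial transcripts per item."""

    def __init__(self) -> None:
        self.partial = defaultdict(str)  # item_id -> text received so far
        self.final = {}                  # item_id -> finished transcript

    def handle(self, event: dict) -> Optional[str]:
        """Apply one server event and return the text to display, if any."""
        item_id = event.get("item_id")
        if event.get("type") == DELTA:
            self.partial[item_id] += event.get("delta", "")
            return self.partial[item_id]
        if event.get("type") == COMPLETED:
            # Prefer the server's final transcript over the joined deltas.
            text = event.get("transcript", self.partial.get(item_id, ""))
            self.final[item_id] = text
            self.partial.pop(item_id, None)
            return text
        return None  # unrelated event types are ignored
```

A caption display would call `handle` for each incoming event and redraw the current line whenever it returns text, so viewers see words appear before the utterance is complete.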

When to use GPT Realtime Whisper

Use GPT Realtime Whisper when you need:

Live captions and subtitles for ongoing audio streams.
Transcription for monitoring, moderation, or analytics workflows.
A transcript of the original-language speech alongside a live translation experience.
Text visibility into spoken input while other models process the audio.

Example use cases

Live event captioning: Provide real-time captions in the speaker's original language during conferences, webinars, or broadcasts.
Compliance and quality review: Capture the original conversation as text for regulatory compliance, quality assurance, or analytics.
Multilingual pipelines: Pair with GPT Realtime Translate to deliver both translated output and a source-language transcript in a single workflow.

Get started

GPT Realtime Whisper is available through the Realtime API. The connection and usage patterns are the same as for other realtime models.
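
As an illustration of that shared pattern, the sketch below prepares audio for streaming by splitting raw PCM into base64-encoded `input_audio_buffer.append` events, the client event the Realtime API uses for audio input. The helper name `audio_append_events`, the 24 kHz mono 16-bit audio format, and the 100 ms chunk size are assumptions for this example; check the Realtime API reference for the formats this model accepts.

```python
import base64
import json
from typing import Iterator


def audio_append_events(pcm: bytes, chunk_bytes: int = 4800) -> Iterator[str]:
    """Yield JSON `input_audio_buffer.append` events for raw PCM16 audio.

    chunk_bytes=4800 is 100 ms of 24 kHz mono 16-bit audio. Both the chunk
    size and the audio format are assumptions made for this sketch.
    """
    if chunk_bytes <= 0 or chunk_bytes % 2:
        # An odd size would split a 16-bit sample across two events.
        raise ValueError("chunk_bytes must be a positive even number")
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Each yielded string would be sent as one text message over an open Realtime API connection. The endpoint URL, authentication, and session configuration are left out because this page does not specify them.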

Deployment and availability

GPT Realtime Whisper is available as a Global Standard (pay-as-you-go) deployment in Microsoft Foundry. Deploy the model from the model catalog.

Last updated on 2026-05-08