GPT Realtime Whisper overview

GPT Realtime Whisper is a streaming transcription model that converts live audio to text in real time. You can use it alongside speech-to-speech and translation models to provide continuous input transcription for audio streams.

Key capabilities

  • Streaming transcription: Transcribes live audio as it arrives, without waiting for the utterance to complete.
  • Low latency: Designed for real-time scenarios where delays aren't acceptable, such as live captions or quality monitoring.
  • Parallel operation: Runs alongside other realtime models (such as GPT Realtime Translate) to provide source-language transcription in parallel with translation.
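As a rough illustration of streaming output, the sketch below builds a caption line from partial transcript events as they arrive, then replaces it with the final text when the utterance completes. The event type names and fields (`item_id`, `delta`, `transcript`) follow the input-transcription events used elsewhere in the Realtime API. This article doesn't confirm them for GPT Realtime Whisper, so treat them as assumptions. `CaptionBuffer` is a hypothetical helper, not part of any SDK.

```python
from collections import defaultdict

# Assumed event type names, borrowed from the Realtime API's
# input-transcription events; verify them against the API reference.
DELTA = "conversation.item.input_audio_transcription.delta"
COMPLETED = "conversation.item.input_audio_transcription.completed"


class CaptionBuffer:
    """Accumulates partial transcript text per conversation item."""

    def __init__(self):
        self._partial = defaultdict(str)  # item_id -> text received so far
        self.final = []                   # completed utterances, in order

    def handle(self, event):
        """Apply one server event; return the text to display, or None."""
        kind = event.get("type")
        item_id = event.get("item_id")
        if kind == DELTA:
            # Partial text: append and show the running caption.
            self._partial[item_id] += event.get("delta", "")
            return self._partial[item_id]
        if kind == COMPLETED:
            # Prefer the final transcript; fall back to accumulated deltas.
            text = event.get("transcript") or self._partial.get(item_id, "")
            self._partial.pop(item_id, None)
            self.final.append(text)
            return text
        return None  # Unrelated event types are ignored.
```

A caption UI would call `handle` for every decoded server message and redraw the current line whenever it returns text.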

When to use GPT Realtime Whisper

Use GPT Realtime Whisper when you need:

  • Live captions and subtitles for ongoing audio streams.
  • Transcription for monitoring, moderation, or analytics workflows.
  • A transcript of the original-language speech alongside live translation.
  • A text view of spoken input while other models process the same audio.

Example use cases

  • Live event captioning: Provide real-time captions in the speaker's original language during conferences, webinars, or broadcasts.
  • Compliance and quality review: Capture the original conversation as text for regulatory compliance, quality assurance, or analytics.
  • Multilingual pipelines: Pair with GPT Realtime Translate to deliver both translated output and a source-language transcript in a single workflow.
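One way to picture the multilingual pipeline is as a fan-out: each audio chunk is copied to two consumers, one for transcription and one for translation. The sketch below shows only that plumbing with `asyncio` queues. The `consume` coroutine is a hypothetical stand-in that counts bytes. In a real application, each consumer would forward chunks to its own realtime session. None of the function names come from an SDK.

```python
import asyncio


async def fan_out(chunks, *queues):
    """Copy every audio chunk to each queue, then send None as an end marker."""
    for chunk in chunks:
        for q in queues:
            await q.put(chunk)
    for q in queues:
        await q.put(None)


async def consume(name, queue, results):
    """Stand-in for a model session: records how many bytes it received."""
    total = 0
    while (chunk := await queue.get()) is not None:
        total += len(chunk)
    results[name] = total


async def run_pipeline(chunks):
    """Feed the same audio to a transcription and a translation consumer."""
    transcribe_q, translate_q = asyncio.Queue(), asyncio.Queue()
    results = {}
    await asyncio.gather(
        fan_out(chunks, transcribe_q, translate_q),
        consume("transcript", transcribe_q, results),
        consume("translation", translate_q, results),
    )
    return results


if __name__ == "__main__":
    print(asyncio.run(run_pipeline([b"ab", b"cde"])))
```

Using separate queues keeps the two consumers independent, so a slow translation session doesn't block transcription.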

Get started

GPT Realtime Whisper is available through the Realtime API. The connection and usage patterns are the same as for other realtime models.
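As a minimal sketch of the client side, the function below splits raw audio into short chunks and wraps each one in an `input_audio_buffer.append` message, the client event that other realtime models use to receive audio. It assumes 16-bit mono PCM at 24 kHz and 100 ms chunks. These values, and the assumption that GPT Realtime Whisper accepts the same event, aren't stated in this article, so check the Realtime API reference before relying on them. Opening the connection and authenticating are omitted.

```python
import base64
import json


def append_events(pcm: bytes, sample_rate: int = 24000, chunk_ms: int = 100):
    """Yield JSON `input_audio_buffer.append` messages for 16-bit mono PCM.

    sample_rate and chunk_ms are illustrative defaults, not documented values.
    """
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000  # 2 bytes per sample
    for start in range(0, len(pcm), bytes_per_chunk):
        chunk = pcm[start:start + bytes_per_chunk]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Each yielded string would be sent as one message over the open Realtime API connection, in order, as the audio is captured.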

Deployment and availability

GPT Realtime Whisper is available as a Global Standard (pay-as-you-go) deployment in Microsoft Foundry. Deploy the model from the model catalog.