GPT Realtime Whisper is a streaming transcription model that converts live audio to text in real time. You can use it alongside speech-to-speech and translation models to provide continuous input transcription for audio streams.
Key capabilities
- Streaming transcription: Transcribes live audio as it arrives, without waiting for the utterance to complete.
- Low latency: Designed for real-time scenarios where delays aren't acceptable, such as live captions or quality monitoring.
- Parallel operation: Runs alongside other realtime models (such as GPT Realtime Translate) to provide source-language transcription in parallel with translation.
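Because the transcript arrives incrementally, a client typically accumulates partial text into captions as it goes. The sketch below shows one way to do that. It assumes the model emits events shaped like the Realtime API's input audio transcription events: `conversation.item.input_audio_transcription.delta` carries a `delta` string, `conversation.item.input_audio_transcription.completed` carries a `transcript` string, and both carry an `item_id`. Verify these names against the events your session actually emits. `CaptionAssembler` is a hypothetical helper, not part of any SDK.

```python
# Assumed event type names, modeled on the Realtime API's input audio
# transcription events; verify them against what your session emits.
DELTA = "conversation.item.input_audio_transcription.delta"
COMPLETED = "conversation.item.input_audio_transcription.completed"


class CaptionAssembler:
    """Hypothetical helper that turns streaming transcription events into caption lines."""

    def __init__(self) -> None:
        # item_id -> text received so far; dicts keep insertion order,
        # so captions stay in the order utterances started.
        self._lines: dict[str, str] = {}
        self._final: set[str] = set()

    def handle(self, event: dict) -> None:
        item_id = event.get("item_id")
        if item_id is None:
            return  # Not a transcription event for a specific item; ignore it.
        if event.get("type") == DELTA:
            # Partial text: append it so the caption grows while the speaker is still talking.
            self._lines[item_id] = self._lines.get(item_id, "") + event.get("delta", "")
        elif event.get("type") == COMPLETED:
            # Final text replaces the accumulated partials for this item.
            self._lines[item_id] = event.get("transcript", "")
            self._final.add(item_id)

    def captions(self) -> list[tuple[str, bool]]:
        """Return (text, is_final) pairs in the order items were first seen."""
        return [(text, item_id in self._final) for item_id, text in self._lines.items()]


asm = CaptionAssembler()
for ev in [
    {"type": DELTA, "item_id": "item_1", "delta": "Hello "},
    {"type": DELTA, "item_id": "item_1", "delta": "every"},
    {"type": COMPLETED, "item_id": "item_1", "transcript": "Hello everyone."},
    {"type": DELTA, "item_id": "item_2", "delta": "Welcome"},
]:
    asm.handle(ev)
print(asm.captions())  # [('Hello everyone.', True), ('Welcome', False)]
```

A live-caption UI could render non-final lines in a lighter style and then replace them once the completed text arrives.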
When to use GPT Realtime Whisper
Use GPT Realtime Whisper when you need:
- Live captions and subtitles for ongoing audio streams.
- Transcription for monitoring, moderation, or analytics workflows.
- Original-language speech captured alongside live translation experiences.
- Text visibility into spoken input while other models process the audio.
Example use cases
- Live event captioning: Provide real-time captions in the speaker's original language during conferences, webinars, or broadcasts.
- Compliance and quality review: Capture the original conversation as text for regulatory compliance, quality assurance, or analytics.
- Multilingual pipelines: Pair with GPT Realtime Translate to deliver both translated output and a source-language transcript in a single workflow.
Get started
GPT Realtime Whisper is available through the Realtime API. You connect to it and use it the same way as other realtime models.
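As with other realtime models, audio is streamed to the session as a series of small `input_audio_buffer.append` events, each carrying base64-encoded audio. The sketch below shows only that chunking step. It assumes 16-bit little-endian mono PCM at 24 kHz, the usual default input format for Realtime API sessions, and picks a chunk length of 100 ms. Both values are assumptions, so confirm them against your session configuration. The endpoint, authentication, and session setup are omitted because they aren't described here.

```python
import base64
import json
from typing import Iterator

# Assumption: 16-bit little-endian mono PCM at 24 kHz; confirm the input
# audio format your session is configured for.
SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2


def append_events(pcm: bytes, chunk_ms: int = 100) -> Iterator[str]:
    """Yield JSON `input_audio_buffer.append` events, one per chunk of audio."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            # Audio travels as base64 text inside the JSON event.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
events = list(append_events(one_second_of_silence))
print(len(events))  # 10
```

In a live client, each event would be sent over the open connection as soon as the corresponding audio is captured, rather than collected into a list. Smaller chunks lower latency at the cost of more messages.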
Deployment and availability
GPT Realtime Whisper is available as a Global Standard (pay-as-you-go) deployment in Microsoft Foundry. Deploy the model from the model catalog.