Hello Abdul Rehman,
Welcome to Microsoft Q&A.
Thank you for providing the detailed breakdown of the issue.
Based on your description, the behavior you observed aligns with known limitations in how Azure OpenAI currently handles transcription within the Realtime API, especially when compared to the standalone transcription endpoint.
From your testing, whisper-1 continued working normally, while gpt-4o-transcribe failed only during realtime usage in the window between 08:30 and 15:30. The repeated conversation.item.input_audio_transcription.failed events indicate a server-side failure in the realtime transcription pipeline specific to this model.
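To make the failure visible in your own logs, you can classify incoming realtime frames and pull the error detail out of the `.failed` event. This is a minimal sketch: the event and field names follow the realtime event schema as I understand it, and the exact shape of the `error` payload is an assumption to verify against your captured events.

```python
import json

def classify_realtime_event(raw: str) -> str:
    """Inspect one incoming realtime event frame.

    Returns the transcript on success, the server-side error message on
    the `.failed` variant you observed, and "" for unrelated events.
    """
    event = json.loads(raw)
    etype = event.get("type", "")
    if etype == "conversation.item.input_audio_transcription.failed":
        # Server-side failure in the realtime transcription pipeline;
        # the nested error payload (an assumption here) says why.
        return event.get("error", {}).get("message", "transcription failed")
    if etype == "conversation.item.input_audio_transcription.completed":
        return event.get("transcript", "")
    return ""
```

Logging the extracted message alongside a timestamp would let you correlate future incidents with service-health windows like the one you hit.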
It’s important to clarify that Azure OpenAI’s Realtime API does not yet implement full support for server-side, Whisper-style transcription inside a realtime session. Even though the API accepts audio, the system does not reliably trigger the same transcription pathway used by the standalone endpoint. As a result, realtime transcription with gpt-4o-transcribe can fail intermittently, exactly as you experienced.
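Because transcription inside a realtime session is opt-in, the model is selected per session via a `session.update` event. The sketch below builds that payload; the event shape follows the documented realtime events, and defaulting to whisper-1 reflects the stability difference described above rather than any hard requirement.

```python
import json

def build_session_update(transcription_model: str = "whisper-1") -> str:
    """Build a realtime `session.update` event that enables server-side
    transcription of input audio with the given model."""
    event = {
        "type": "session.update",
        "session": {
            # When the gpt-4o-transcribe realtime pathway misbehaves,
            # pointing this at whisper-1 is the more stable choice.
            "input_audio_transcription": {"model": transcription_model},
        },
    }
    return json.dumps(event)
```

You would send this JSON frame over the realtime WebSocket right after connecting, before streaming audio.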
Regarding the specific time window of 08:30 to 15:30, this behavior is consistent with a temporary service degradation or instability affecting only the gpt-4o-transcribe realtime pathway. Because whisper-1 and the direct transcription endpoint were unaffected, this points to a model-specific backend disruption rather than any issue with your configuration or usage.
At this time, realtime transcription with gpt-4o-transcribe is not fully supported and may not behave the same way as standard transcription. Your expectation is valid, but today that functionality isn’t guaranteed by the platform.
For reliable realtime audio workflows, the recommended approach is to transcribe audio externally, either client-side or via the Whisper transcription endpoint, and then send the resulting text into the Realtime API. This dual-pipeline approach is currently the most stable option and is widely used in production. If realtime audio-to-text is required directly inside the session, whisper-1 is the more stable choice compared to gpt-4o-transcribe.
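The dual-pipeline approach can be sketched as two steps: transcribe through the standalone endpoint (the pathway that stayed healthy in your tests), then inject the transcript into the realtime session as a plain text item. The deployment name and the `conversation.item.create` event shape below are assumptions to adapt to your resource; the transcription call uses the standard `audio.transcriptions.create` method from the openai SDK.

```python
import json


def transcribe_file(client, path: str, deployment: str = "whisper-1") -> str:
    """Step 1: transcribe externally via the standalone transcription
    endpoint. `client` is an openai AzureOpenAI client; `deployment`
    is your Azure deployment name (an assumption here)."""
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model=deployment, file=audio)
    return result.text


def build_text_item(text: str) -> str:
    """Step 2: wrap the transcript in a `conversation.item.create`
    event and send it over the Realtime API WebSocket as text input,
    bypassing in-session audio transcription entirely."""
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": text}],
        },
    })
```

Keeping transcription outside the session also means a model-specific realtime disruption like the one you saw degrades only audio capture, not the whole conversation loop.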
The failure you observed matches existing platform limitations and a temporary service degradation affecting gpt-4o-transcribe. No configuration changes on your side would have prevented the issue.
I hope this helps. Do let me know if you have any further queries.
Thank you!