Issue with gpt-4o-transcribe detecting wrong language (Chinese/Malay/Tamil/English mix-up)

Rosemary Raphael 5 Reputation points
2025-10-13T05:10:25.12+00:00

We are using gpt-4o-transcribe for our voice project. It was working fine previously, but for the past week our customers have been facing issues with the voice model. When they speak in English, it’s being transcribed as Chinese and when they speak in Chinese, it’s being interpreted as Malay and so on. We haven’t made any changes on our end, so we would like to know if there have been any recent updates or changes to the gpt-4o-transcribe model that could explain this behavior.

Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

1 answer

  1. Sridhar M 2,690 Reputation points Microsoft External Staff Moderator
    2025-10-13T11:17:58.57+00:00

    Hi Rosemary Raphael,

    Thank you for reaching out on the Microsoft Q&A.

    1. Model Version Retirement & Transition: The version of gpt-4o-transcribe released in March 2025 (2025-03-20) was scheduled for retirement on October 15, 2025, so customers on that version are being migrated to a newer build. This transition appears to have introduced instability in language detection.
    2. Known Bug with Language Enforcement: There is an acknowledged bug where gpt-4o-transcribe ignores or inconsistently applies the language parameter. Even when you specify "language": "en", the model sometimes switches to other languages (Chinese, Malay, etc.). This is because the language hint is treated as a “soft preference,” not a strict rule.
    3. Community Reports of Wrong Language Output: Multiple developers have reported that since early October, English audio is being transcribed as Chinese or Malay, and Chinese as other languages. This aligns with your timeline.
    4. Realtime API Changes: A new variant, gpt-4o-transcribe-latest, was introduced in the Realtime API, but it is not fully stable yet. Some users report missing or misaligned transcription events when using the new GA session: https://community.openai.com/t/realtime-transcription-mismatch-and-gpt-4o-transcribe-latest/1358789
      The retirement of the old version and rollout of a new backend likely changed the language detection pipeline.
    5. Multilingual by Design: The model is multilingual by design, and without strict enforcement it guesses the language from acoustic cues. If your audio has background noise or mixed-language phrases, the bug amplifies the misclassification.

    What you can do now:
      1. Check Your Deployment Version: In Azure Portal → OpenAI Deployments, confirm whether you are still on 2025-03-20. If so, update to the latest version once it appears in the dropdown.
      2. Force Language via Prompting: Add a strong instruction in the prompt field:
              "prompt": "The audio is entirely in English. Transcribe only in English."
        
        This helps, but does not fully fix the bug.
      3. Set Temperature Low: Use temperature: 0.2 and avoid chunking. Some developers report better consistency with these settings. [community.openai.com]
      4. Fallback Option: If accuracy is critical, consider temporarily switching to whisper-1 for stricter language handling until the new gpt-4o-transcribe build stabilizes.
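
    Steps 1–3 above can be sketched together with the openai Python SDK. This is a sketch only: the deployment name "gpt-4o-transcribe", the API version string, and the helper function name are assumptions — substitute your own values.

```python
# Sketch only: pins the language hint, adds a strong single-language prompt,
# and lowers temperature, per the mitigation steps above. The deployment
# name "gpt-4o-transcribe" and the API version are assumptions; use yours.

def build_transcription_request(deployment="gpt-4o-transcribe",
                                language="en",
                                language_name="English",
                                temperature=0.2):
    """Bundle the workaround settings into kwargs for the transcription call."""
    return {
        "model": deployment,         # Azure OpenAI deployment name
        "language": language,        # soft hint -- may still be ignored by the bug
        "prompt": (f"The audio is entirely in {language_name}. "
                   f"Transcribe only in {language_name}."),
        "temperature": temperature,  # lower = more deterministic output
    }

# Example call (requires AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_API_KEY set):
# from openai import AzureOpenAI
# client = AzureOpenAI(api_version="2025-03-01-preview")
# with open("call.wav", "rb") as audio:
#     result = client.audio.transcriptions.create(
#         file=audio, **build_transcription_request())
# print(result.text)
```

    Keeping the settings in one helper makes it easy to swap the model to whisper-1 (step 4) without touching the rest of the call site.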

    References:
    https://learn.microsoft.com/en-us/azure/ai-foundry/openai/whats-new
    https://learn.microsoft.com/en-us/azure/ai-services/translator/reference/known-issues
    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-identification?tabs=once&pivots=programming-language-csharp

    If this resolves your issue, please feel free to accept this as the answer.

    Thank you

