Whisper model limitation

Hendi Rodrigoz 20 Reputation points
2024-03-31T21:52:56.4066667+00:00

I want to switch my project to the Whisper model, but I haven't gathered all the needed information yet. Once I switch to Whisper, are there any limitations or concerns I should consider? Any experience sharing would be helpful.

Azure Machine Learning
An Azure machine learning service for building and deploying models.

Accepted answer
  1. YutongTie-MSFT 46,326 Reputation points
    2024-03-31T22:41:05.88+00:00

    @Hendi Rodrigoz

Thanks for reaching out to us. If you can share more details about your scenario, we can provide more specific information about it.

    Whisper is an Automatic Speech Recognition (ASR) system developed by OpenAI. It is trained on a large amount of multilingual and multitask supervised data collected from the web. Here are some aspects and potential limitations you might consider:

    1. Language Support: Make sure that the Whisper ASR system supports all the languages you need for your project. As it's trained on data from the web, it's likely to have broad support, but there could be limitations for certain languages or dialects.
    2. Accuracy: While Whisper is designed to be a highly accurate ASR system, accuracy can vary based on factors like the speaker's accent, the audio quality, the presence of background noise, and the context of the speech.
    3. Real-Time Processing: If you need real-time speech recognition (for example, for live transcription), you'll need to check whether Whisper can support this in your specific use case.
    4. Data Privacy: Like all ASR systems, Whisper processes audio data, which can have significant privacy implications. Make sure you fully understand OpenAI's data usage policies and that they align with your project's requirements.
    5. Cost: Using Whisper will have associated costs. You should review OpenAI's pricing details to understand what these will be for your expected usage volume.
    6. Integration Effort: Switching to a new ASR system may require significant changes to your codebase and could introduce new bugs or issues. You'll need to plan for adequate testing and debugging time.
    7. Deprecation of Old Models: OpenAI has mentioned that it will be deprecating older models in favor of Whisper ASR. So if you are using an older model, you may need to switch to Whisper sooner or later.

    Remember, it's always a good idea to do a pilot run with a small portion of your data or a test project before fully committing to a new tool or service. This will give you a better idea of its performance and any potential issues you might face.
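To judge a pilot run (especially the accuracy point above), it helps to score Whisper's transcripts against a few hand-checked reference transcripts. A minimal sketch in plain Python, using no Azure-specific APIs (the `wer` function name is just for illustration), computing word error rate via edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # DP table: d[i][j] = edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown fox"))  # 0.0
print(wer("the quick brown fox", "the quik brown"))       # 0.5
```

Running this over a small sample of your own audio (varied accents, noise levels, and formats) gives a concrete baseline before you commit to the switch.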

    Please take a look at these points and weigh them so that you can make the right decision.

    Regards,

    Yutong

    -Please kindly accept the answer if you find it helpful, to support the community. Thanks a lot.

    1 person found this answer helpful.

1 additional answer

Sort by: Most helpful
  1. Konstantinos Passadis 17,286 Reputation points
    2024-03-31T22:03:02.8133333+00:00

    Hello @Hendi Rodrigoz

    Welcome to Microsoft QnA!

    Please refer to this Documentation:

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/whisper-overview

    As we can read:

    If you decide to use the Whisper model, you have two options. You can choose whether to use the Whisper Model via Azure OpenAI or via Azure AI Speech. In either case, the readability of the transcribed text is the same. You can input mixed language audio and the output is in English.

    Whisper Model via Azure OpenAI Service might be best for:

    • Quickly transcribing audio files one at a time
    • Translating audio from other languages into English
    • Providing a prompt to the model to guide the output
    • Supported file formats: mp3, mp4, mpeg, mpga, m4a, wav, and webm

    Whisper Model via Azure AI Speech might be best for:

    • Transcribing files larger than 25MB (up to 1GB). The file size limit for the Azure OpenAI Whisper model is 25 MB.
    • Transcribing large batches of audio files
    • Diarization to distinguish between the different speakers participating in the conversation. The Speech service provides information about which speaker was speaking a particular part of transcribed speech. The Whisper model via Azure OpenAI doesn't support diarization.
    • Word-level timestamps
    • Supported file formats: mp3, wav, and ogg
    • Customization of the Whisper base model to improve accuracy for your scenario (coming soon)
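To make the choice between the two routes concrete, here is a hedged sketch (the function name, return labels, and constants are illustrative, not an official API) that applies the file-size and format limits quoted above:

```python
# Limits and formats as stated in the Microsoft docs quoted above.
AOAI_LIMIT = 25 * 1024 * 1024         # 25 MB limit for Whisper via Azure OpenAI
SPEECH_LIMIT = 1024 * 1024 * 1024     # up to 1 GB via Azure AI Speech
AOAI_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
SPEECH_FORMATS = {"mp3", "wav", "ogg"}

def choose_route(size_bytes: int, ext: str, need_diarization: bool = False) -> str:
    """Pick a Whisper route from file size, format, and diarization need."""
    ext = ext.lstrip(".").lower()
    # Diarization and files over 25 MB are only handled by Azure AI Speech.
    if need_diarization or size_bytes > AOAI_LIMIT:
        if size_bytes <= SPEECH_LIMIT and ext in SPEECH_FORMATS:
            return "azure-ai-speech"
        raise ValueError("file too large or format unsupported for either route")
    if ext in AOAI_FORMATS:
        return "azure-openai"
    if ext in SPEECH_FORMATS:
        return "azure-ai-speech"
    raise ValueError(f"unsupported format: {ext}")

print(choose_route(10 * 1024 * 1024, "mp3"))                          # azure-openai
print(choose_route(100 * 1024 * 1024, "wav"))                         # azure-ai-speech
print(choose_route(5 * 1024 * 1024, "mp3", need_diarization=True))    # azure-ai-speech
```

This is only a pre-flight check for routing your files; the actual transcription calls go through the respective service SDKs or REST APIs.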

    Regional support is another consideration.

    • The Whisper model via Azure OpenAI Service is available in the following regions: North Central US and West Europe.
    • The Whisper model via Azure AI Speech is available in the following regions: East US, Southeast Asia, and West Europe.

    --

    I hope this helps!

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards
