Speech Service: Only 30 seconds of audio is transcribed and sometimes just the first 5 seconds

Question

Hello,
I am using the Speech cognitive service and passing it an audio file to get the transcription along with the pronunciation score. However, only the first 30-35 seconds of the audio are processed, and in a few cases just the first 5 seconds, possibly because of silence in the recording.

Is there a config that allows us to process the entire file?

@Doug Bergman @YutongTie-5848

Answer

Thank you @traviswilson @Ramr-msft !

Yes, can you help me configure "RecognizeOnce" to handle a longer duration? We were able to implement the "StartContinuous" method successfully, but it gave us a pronunciation assessment for each individual chunk. We are looking for a single score for the entire file, rather than having to add our own logic (which would be a non-standard weighted average of the individual chunks).
Hope that makes sense.

Audio File: https://drive.google.com/file/d/18JUyjgPxyzV6yyan3Ahz49s80ktoG4iK/view?usp=sharing

Below is my Python code:

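import azure.cognitiveservices.speech as speechsdk

# Placeholders: supply your own Speech resource key and region.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="YOUR_REGION")
audio_input = speechsdk.AudioConfig(filename=folder + file_name)
pronunciation_assessment_config = speechsdk.PronunciationAssessmentConfig(
    reference_text="Actual Text spoken in the audio file",
    grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
    granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme)
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
pronunciation_assessment_config.apply_to(speech_recognizer)
# recognize_once() returns after the first utterance, so only the start of a long file is processed.
result = speech_recognizer.recognize_once()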

Best,
Sarthak
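
For reference, the chunk-level aggregation described in this post could look roughly like the sketch below. It is not an official scoring method: it simply weights each chunk's scores by its word count, which is the non-standard weighted average the poster wants to avoid. The `ChunkScore` class and `aggregate_scores` function are hypothetical names. With the Speech SDK, the per-chunk values would typically be read in a `recognized` handler via `speechsdk.PronunciationAssessmentResult(evt.result)`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ChunkScore:
    """Scores for one recognized chunk (hypothetical container, not an SDK type)."""
    word_count: int
    accuracy: float
    fluency: float
    pronunciation: float


def aggregate_scores(chunks: Iterable[ChunkScore]) -> Dict[str, float]:
    """Combine per-chunk scores into file-level scores, weighted by word count."""
    chunks = list(chunks)
    total_words = sum(c.word_count for c in chunks)
    if total_words == 0:
        raise ValueError("no recognized words to aggregate")

    def weighted(attr: str) -> float:
        # Each chunk contributes in proportion to the number of words it contains.
        return sum(getattr(c, attr) * c.word_count for c in chunks) / total_words

    return {
        "accuracy": weighted("accuracy"),
        "fluency": weighted("fluency"),
        "pronunciation": weighted("pronunciation"),
    }
```

Weighting by word count keeps a short, poorly scored chunk from dominating the result. Weighting by chunk duration would be an equally defensible choice, and the two can give different file-level scores.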

Answer

@Sarthak Agarwal Thanks for the details. I would recommend using offline (batch) transcription for longer audio. This article, https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/quickstarts/from-blob?pivots=programming-language-csharp, explains how to transcribe audio files that are in storage (offline, also known as batch transcription). Samples are available in our GitHub sample repository (C# and Python: https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/batch). You don't need to be constantly connected to the service: you submit jobs and collect the results at a later point in time, and the audio files can be several hours long. The functionality is REST based. A new version of the API will be available that will allow you to submit many files (or a container) in one REST request.
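
As a rough illustration of the REST-based submission described above, the sketch below builds (but does not send) a job-creation request. It assumes the v3.1 `speechtotext` endpoint and its `contentUrls`, `locale`, and `displayName` body fields; check these against the current API reference before relying on them. The function name and default values are hypothetical. As the follow-up in this thread notes, this produces transcripts, not pronunciation scores.

```python
import json
import urllib.request
from typing import Iterable


def build_batch_transcription_request(
    region: str,
    subscription_key: str,
    content_urls: Iterable[str],
    locale: str = "en-US",
    display_name: str = "Long audio transcription",
) -> urllib.request.Request:
    """Build a POST request that would create a batch transcription job."""
    # Assumed endpoint shape for the v3.1 batch transcription API.
    url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"
    body = {
        "contentUrls": list(content_urls),  # URLs of audio files in storage
        "locale": locale,
        "displayName": display_name,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would submit the job. The results are then collected later by polling the job, as the answer describes.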

For large files, here is the updated GitHub repo.

In this repo we added some sample code to demo the Speech to Text SDK.
We are checking internally about getting a score for the entire file.

Answer

Thank you for the response, @Ramr-msft

My main use case is assessing pronunciation, and it looks like the above is for transcript generation. I also looked at the properties that the batch API expects and couldn't find pronunciation there.

What would you suggest we do to assess the pronunciation of 4-minute-long audio files?

Best,
Sarthak
