2 Votes
SarthakAgarwal-6706 asked ZyadOmer-3517 commented

Speech Service: Only 30 seconds of audio is transcribed and sometimes just the first 5 seconds

Hello,
I am using the Speech cognitive service and passing an audio file to get the transcription along with the pronunciation score. However, only the first 30-35 seconds of the audio is processed, and in a few cases just the first 5 seconds, possibly because of silence in the recording.

Is there a config that allows us to process the entire file?

@DougBergman-1312 @YutongTie-5848

azure-cognitive-services azure-speech

Thank you for the reply, @ramr-msft
The same file, when uploaded to the above site, shows the entire string. My response ID is '1d68a7c4a7c34e2c9479156dcef342da'. Would you be able to check it?

Also, I have set the SpeechServiceConnection_EndSilenceTimeoutMs property to a very large value, as shown in the Python code below:

speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_EndSilenceTimeoutMs, '15000000')

Thanking you,
Sarthak

Did you find any solution?

0 Votes
SarthakAgarwal-6706 answered

Thank you @traviswilson @ramr-msft !

Yes, can you help me change the configuration of "RecognizeOnce" to allow a longer duration? We were able to implement the "StartContinuous" method successfully, but it gave us a pronunciation assessment for each individual chunk. We are looking for a single score for the entire file, rather than adding our own logic on top (which would be a non-standard weighted average of the individual chunks).
Hope that makes sense.
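
For reference, the "weighted average of the individual chunks" could look roughly like the sketch below. This is only an illustration, not the service's own scoring: `ChunkScore` and `combine_chunk_scores` are hypothetical names, and weighting by word count is an assumption (chunk duration would be another reasonable weight).

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ChunkScore:
    """Scores for one recognized chunk (hypothetical container, not an SDK type)."""
    word_count: int
    accuracy: float
    fluency: float
    completeness: float
    pronunciation: float


def combine_chunk_scores(chunks: Iterable[ChunkScore]) -> Dict[str, float]:
    """Average each score across chunks, weighting every chunk by its word count."""
    # Chunks with no words (e.g. silence) carry no weight and are dropped.
    scored = [c for c in chunks if c.word_count > 0]
    total_words = sum(c.word_count for c in scored)
    if total_words == 0:
        raise ValueError("no chunk contains any recognized words")

    def weighted(field: str) -> float:
        return sum(getattr(c, field) * c.word_count for c in scored) / total_words

    return {
        field: weighted(field)
        for field in ("accuracy", "fluency", "completeness", "pronunciation")
    }
```

A file-level number produced this way is a heuristic and may differ from what a single assessment over the whole file would return.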

Audio File: https://drive.google.com/file/d/18JUyjgPxyzV6yyan3Ahz49s80ktoG4iK/view?usp=sharing

Below is my Python code:

import azure.cognitiveservices.speech as speechsdk

# speech_config, folder, and file_name are defined earlier (not shown)
audio_input = speechsdk.AudioConfig(filename=folder + file_name)
pronunciation_assessment_config = speechsdk.PronunciationAssessmentConfig(
    reference_text="Actual Text spoken in the audio file",
    grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
    granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme)
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
pronunciation_assessment_config.apply_to(speech_recognizer)
# recognize_once() returns after the first recognized utterance
result = speech_recognizer.recognize_once()


Best,
Sarthak

0 Votes
ramr-msft answered ramr-msft edited

@SarthakAgarwal-6706 Thanks for the details. I would recommend using offline (batch) transcription for longer audio. This quickstart, https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/quickstarts/from-blob?pivots=programming-language-csharp, explains how to transcribe audio files that are in storage. Samples in C# and Python are available in our GitHub sample repository (https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/batch). You don't need to be constantly connected to the service: you submit jobs and collect the results at a later point in time, and the audio files can be several hours long. The functionality is REST based. A new version of the API will be available that will allow you to submit many files (or a container) in one REST request.
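
As a rough illustration of the "submit a job, collect results later" flow, the sketch below builds and submits a batch transcription request with only the standard library. Several things here are assumptions to verify against the current REST reference: the `/speechtotext/v3.0/transcriptions` path, the `contentUrls`/`locale`/`displayName` body fields, and the `self` field in the response. The function names, region, key, and audio URL are placeholders. Note that this covers transcription only.

```python
import json
import urllib.request
from typing import Iterable


def build_transcription_request(region: str, subscription_key: str,
                                audio_urls: Iterable[str],
                                locale: str = "en-US",
                                display_name: str = "long-audio-job") -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a batch transcription job."""
    # Assumed endpoint shape for the v3.0 REST API; check the version your resource supports.
    url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions"
    body = {
        "contentUrls": list(audio_urls),  # URLs of audio in storage (e.g. SAS URLs)
        "locale": locale,
        "displayName": display_name,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request: urllib.request.Request) -> str:
    """Send the request and return the job URL to poll later (assumed to be in 'self')."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["self"]
```

The job URL returned by `submit` would then be polled with a GET request until the job reports completion, at which point the result files can be downloaded.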

For large files, here is the updated GitHub repo.

In this repo we added some sample code to demo the Speech to Text SDK.
We are checking internally on getting a score for the entire file.


0 Votes
SarthakAgarwal-6706 answered

Thank you for the response, @ramr-msft

My main use case is to assess pronunciation, and it looks like the above is for transcript generation? I also looked at the properties that the batch API expects and couldn't find pronunciation there.

What would you suggest we do for assessing the pronunciation of 4-minute-long audio files?

Best,
Sarthak
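
One possible direction for a 4-minute file, sketched under assumptions rather than as a confirmed recommendation: run continuous recognition to the end of the file and collect one assessment result per recognized utterance. The helper below takes the recognizer as a parameter and relies only on the SDK's `recognized`, `session_stopped`, and `canceled` events plus `start_continuous_recognition()` / `stop_continuous_recognition()`. The names `collect_pronunciation_results` and `parse_result` are hypothetical.

```python
import threading
from typing import Any, Callable, List


def collect_pronunciation_results(recognizer: Any,
                                  parse_result: Callable[[Any], Any],
                                  timeout_s: float = 600.0) -> List[Any]:
    """Run continuous recognition until the session ends; return one parsed result per utterance."""
    done = threading.Event()
    results: List[Any] = []

    def on_recognized(evt: Any) -> None:
        parsed = parse_result(evt.result)
        if parsed is not None:  # parse_result returns None for results to skip
            results.append(parsed)

    def on_finished(evt: Any) -> None:
        done.set()

    recognizer.recognized.connect(on_recognized)
    recognizer.session_stopped.connect(on_finished)
    recognizer.canceled.connect(on_finished)  # also raised when the file input ends

    recognizer.start_continuous_recognition()
    finished = done.wait(timeout_s)
    recognizer.stop_continuous_recognition()
    if not finished:
        raise TimeoutError("recognition did not finish within the timeout")
    return results


# Hypothetical usage with the recognizer from the code posted earlier in this thread:
#
#   def parse(result):
#       if result.reason != speechsdk.ResultReason.RecognizedSpeech:
#           return None
#       return speechsdk.PronunciationAssessmentResult(result)
#
#   per_utterance = collect_pronunciation_results(speech_recognizer, parse)
```

This still yields per-utterance scores, so a file-level number would have to be combined afterwards by the caller.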
