Speech Service: Only 30 seconds of audio is transcribed and sometimes just the first 5 seconds

Sarthak Agarwal 16 Reputation points
2021-01-17T11:12:32.897+00:00

Hello,
I am using the Speech Cognitive Service and passing an audio file to get the transcription along with the pronunciation score. However, only the first 30-35 seconds of the audio are processed, and in a few cases just the first 5 seconds, possibly because of silence in the recording.

Is there a config that allows us to process the entire file?

@Doug Bergman @YutongTie-5848

Azure AI Speech
An Azure service that integrates speech processing into apps and services.
Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

3 answers

  1. Sarthak Agarwal 16 Reputation points
    2021-01-30T12:48:06.263+00:00

    Thank you @traviswilson @Ramr-msft !

    Yes, can you help me change the configuration of "RecognizeOnce" to allow a longer duration? We were able to implement the "StartContinuous" method successfully, but it gave us a pronunciation assessment for each individual chunk. We are looking for a single score for the entire file, rather than adding our own logic on top (which would be a non-standard weighted average of the individual chunks).
    Hope that makes sense.
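
For readers weighing the "weighted average of the individual chunks" mentioned above, here is a minimal sketch of what that extra logic might look like. It weights each chunk by its word count, which is an assumption and not an official scoring formula; `ChunkScore` and `aggregate_scores` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ChunkScore:
    """Scores for one recognized chunk (hypothetical container, not an SDK type)."""
    word_count: int
    accuracy: float
    fluency: float
    completeness: float
    pronunciation: float


def aggregate_scores(chunks: Iterable[ChunkScore]) -> Dict[str, float]:
    """Combine per-chunk scores into file-level scores, weighted by word count."""
    scored = [c for c in chunks if c.word_count > 0]
    total_words = sum(c.word_count for c in scored)
    if total_words == 0:
        raise ValueError("no scored words to aggregate")

    def weighted_mean(attr: str) -> float:
        # Longer chunks contribute proportionally more to the file-level score.
        return sum(getattr(c, attr) * c.word_count for c in scored) / total_words

    names = ("accuracy", "fluency", "completeness", "pronunciation")
    return {name: weighted_mean(name) for name in names}
```

Weighting by word count keeps a very short chunk from dominating the result; weighting by audio duration would be an equally plausible choice, and neither is guaranteed to match what the service would report for a single long utterance.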

    Audio File: https://drive.google.com/file/d/18JUyjgPxyzV6yyan3Ahz49s80ktoG4iK/view?usp=sharing

    Below is my Python code:

    import azure.cognitiveservices.speech as speechsdk

    # speech_config, folder and file_name are defined earlier (not shown).
    audio_input = speechsdk.AudioConfig(filename=folder + file_name)
    pronunciation_assessment_config = speechsdk.PronunciationAssessmentConfig(
        reference_text="Actual Text spoken in the audio file",
        grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
        granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
    pronunciation_assessment_config.apply_to(speech_recognizer)
    # recognize_once() returns after a single utterance, not after the whole file.
    result = speech_recognizer.recognize_once()
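
For comparison, the continuous-recognition approach mentioned in this post can be sketched roughly as follows. The function only relies on the `recognized`, `session_stopped` and `canceled` event signals and on `start_continuous_recognition()` / `stop_continuous_recognition()`, which the Speech SDK recognizer exposes; `collect_recognized` and `timeout_s` are hypothetical names, and this is an illustration under those assumptions, not the poster's actual code.

```python
import threading
from typing import Any, List


def collect_recognized(recognizer: Any, timeout_s: float = 600.0) -> List[Any]:
    """Run continuous recognition until the session ends and return every result."""
    results: List[Any] = []
    done = threading.Event()

    # Each 'recognized' event carries the final result for one chunk of audio.
    recognizer.recognized.connect(lambda evt: results.append(evt.result))
    # Either event marks the end of the session (end of input or an error).
    recognizer.session_stopped.connect(lambda evt: done.set())
    recognizer.canceled.connect(lambda evt: done.set())

    recognizer.start_continuous_recognition()
    try:
        # Block until the session ends, or give up after timeout_s seconds.
        done.wait(timeout_s)
    finally:
        recognizer.stop_continuous_recognition()
    return results
```

With the real SDK, each returned result could then be wrapped in `speechsdk.PronunciationAssessmentResult(result)` to read the per-chunk scores, after checking that `result.reason` is `speechsdk.ResultReason.RecognizedSpeech`. This still yields one assessment per chunk, so the scores would have to be combined separately.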

    Best,
    Sarthak


  2. Ramr-msft 17,606 Reputation points
    2021-02-01T11:55:42.327+00:00

    @Sarthak Agarwal Thanks for the details. I would recommend using offline (batch) transcription for longer audio. This quickstart, https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/quickstarts/from-blob?pivots=programming-language-csharp, explains how to transcribe audio files that are in storage. Samples are available in our GitHub sample repository (C# and Python: https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/batch). You don't need to be constantly connected to the service: you submit jobs and collect the results at a later point in time, and the audio files can be several hours long. The functionality is REST based. A new version of the API will be available that will allow you to submit many files (or a container) in one REST request.
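
The submit-then-collect flow described above might look roughly like this from Python, using only the standard library. The `v3.0` endpoint path and the `contentUrls` / `locale` / `displayName` body fields reflect the batch transcription REST API as I understand it from around the time of this thread; treat them as assumptions and check the current REST reference before relying on them. `build_transcription_request` and `submit` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_transcription_request(region: str, subscription_key: str,
                                content_urls: List[str], locale: str = "en-US",
                                display_name: str = "batch-job") -> urllib.request.Request:
    """Build (but do not send) a request that creates a batch transcription job."""
    # Assumed v3.0 endpoint shape; verify against the current REST reference.
    url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions"
    body = {"contentUrls": content_urls, "locale": locale, "displayName": display_name}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Ocp-Apim-Subscription-Key": subscription_key,
                 "Content-Type": "application/json"},
        method="POST",
    )


def submit(request: urllib.request.Request) -> Dict[str, Any]:
    """Send the request and return the decoded JSON description of the new job."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The audio URLs need to be reachable by the service (for example, blob URLs with a SAS token). The job runs asynchronously, so the caller polls the job's status and downloads the result files once it has finished.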

    For large files, here is an updated GitHub repo.

    In this repo we added some sample code to demo the Speech to Text SDK.
    We are checking internally about getting a score for the entire file.


  3. Sarthak Agarwal 16 Reputation points
    2021-02-04T11:28:23.573+00:00

    Thank you for the response, @Ramr-msft

    My main use case is to assess pronunciation, and it looks like the above is for transcript generation? I also looked at the properties that the batch API expects and couldn't find pronunciation there.

    What would you suggest we do for assessing the pronunciation of 4-minute-long audio files?

    Best,
    Sarthak
