Azure Speech To Text: How to get Recognized speech in continuous recognition?

yujing 45 Reputation points
2023-08-21T14:52:35.54+00:00

Hello, I have a requirement in my app that the response from speech-to-text needs to be the recognized spoken-form speech.

For example, instead of receiving "I have 100 apples.", I want to get a response like "I have one hundred apples."

However, I can only get the "ITN" result, even though I have already set the output format to "Detailed".

Here is my code. I used continuous recognition, and the WAV file was downloaded from a URL.

val speechConfig: SpeechConfig = SpeechConfig.fromSubscription(azureKey, azureRegion)
speechConfig.speechRecognitionLanguage = lang
speechConfig.outputFormat = OutputFormat.Detailed // Here!! I used the Detailed format

val fileByteArray = downloadFun(url)
val ps = AudioInputStream.createPushStream(AudioStreamFormat.getWaveFormatPCM(samplesPerSecond, bitsPerSample, channels))

val audioInput = AudioConfig.fromStreamInput(ps)
val speechRecognizer = SpeechRecognizer(speechConfig, audioInput)

val stopTranslationWithFileSemaphore: Semaphore = Semaphore(0)

speechRecognizer.recognized.addEventListener { _: Any?, e: SpeechRecognitionEventArgs ->
    if (e.result.reason == ResultReason.RecognizedSpeech) {
        println(e.result) // Here!!!! I can only get the ITN result
    }
}

... [other code]

speechRecognizer.startContinuousRecognitionAsync().get()
ps.write(fileByteArray)
ps.close()
stopTranslationWithFileSemaphore.acquire()

speechRecognizer.stopContinuousRecognitionAsync().get()
speechRecognizer.close()
audioInput.close()
Azure AI Speech

Accepted answer
  1. VasaviLankipalle-MSFT 18,676 Reputation points Moderator
    2023-08-22T00:22:58.4066667+00:00

    Hello @yujing, thanks for using the Microsoft Q&A platform.

    I believe you are looking for the lexical-format output. You can refer to this sample code to display the detailed recognition results as per your requirement: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/0b66f2ad693f7f6840221f98d61f524813e96d0a/samples/python/console/speech_sample.py#L98

    def speech_recognize_once_from_file_with_detailed_recognition_results():
        """performs one-shot speech recognition with input from an audio file, showing detailed recognition results
        including word-level timing """
        # <SpeechRecognitionFromFileWithDetailedRecognitionResults>
        speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    
        # Ask for detailed recognition result
        speech_config.output_format = speechsdk.OutputFormat.Detailed
    
        # If you also want word-level timing in the detailed recognition results, set the following.
        # Note that if you set the following, you can omit the previous line
        #   "speech_config.output_format = speechsdk.OutputFormat.Detailed",
        # since word-level timing implies detailed recognition results.
        speech_config.request_word_level_timestamps()
    
        audio_config = speechsdk.audio.AudioConfig(filename=weatherfilename)
    
        # Creates a speech recognizer using a file as audio input, also specify the speech language
        speech_recognizer = speechsdk.SpeechRecognizer(
            speech_config=speech_config, language="en-US", audio_config=audio_config)
    
        # Starts speech recognition, and returns after a single utterance is recognized. The end of a
        # single utterance is determined by listening for silence at the end or until a maximum of 15
        # seconds of audio is processed. It returns the recognition text as result.
        # Note: Since recognize_once() returns only a single utterance, it is suitable only for single
        # shot recognition like command or query.
        # For long-running multi-utterance recognition, use start_continuous_recognition() instead.
        result = speech_recognizer.recognize_once()
    
        # Check the result
        if result.reason == speechsdk.ResultReason.RecognizedSpeech:
            print("Recognized: {}".format(result.text))
    
            # Time units are in hundreds of nanoseconds (HNS), where 10000 HNS equals 1 millisecond
            print("Offset: {}".format(result.offset))
            print("Duration: {}".format(result.duration))
    
            # Now get the detailed recognition results from the JSON
            json_result = json.loads(result.json)
    
            # The first cell in the NBest list corresponds to the recognition results
            # (NOT the cell with the highest confidence number!)
            print("Detailed results - Lexical: {}".format(json_result['NBest'][0]['Lexical']))
            # ITN stands for Inverse Text Normalization
            print("Detailed results - ITN: {}".format(json_result['NBest'][0]['ITN']))
            print("Detailed results - MaskedITN: {}".format(json_result['NBest'][0]['MaskedITN']))
            print("Detailed results - Display: {}".format(json_result['NBest'][0]['Display']))
    
            # Print word-level timing. Time units are HNS.
            words = json_result['NBest'][0]['Words']
            print("Detailed results - Word timing:\nWord:\tOffset:\tDuration:")
            for word in words:
                print(f"{word['Word']}\t{word['Offset']}\t{word['Duration']}")
    
            # You can access alternative recognition results through json_result['NBest'][i], i=1,2,..
    
        elif result.reason == speechsdk.ResultReason.NoMatch:
            print("No speech could be recognized: {}".format(result.no_match_details))
        elif result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = result.cancellation_details
            print("Speech Recognition canceled: {}".format(cancellation_details.reason))
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                print("Error details: {}".format(cancellation_details.error_details))
        # </SpeechRecognitionFromFileWithDetailedRecognitionResults>
    
    

    I have reproduced the above code and here is the sample output:

    [Screenshot of the sample output]
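    To make the Detailed-format JSON concrete, here is a minimal, self-contained sketch of just the parsing step. The payload below is a hand-written illustration of the documented field layout (Lexical, ITN, MaskedITN, Display), not real service output:

```python
import json

# Illustrative payload only: a hand-written example matching the documented
# shape of a Detailed-format recognition result (not real service output).
sample_json = """
{
  "DisplayText": "I have 100 apples.",
  "NBest": [
    {
      "Confidence": 0.97,
      "Lexical": "i have one hundred apples",
      "ITN": "I have 100 apples",
      "MaskedITN": "i have 100 apples",
      "Display": "I have 100 apples."
    }
  ]
}
"""

result = json.loads(sample_json)
best = result["NBest"][0]  # top hypothesis, as in the sample above
lexical = best["Lexical"]  # spoken form: numbers spelled out, no punctuation
print(lexical)             # i have one hundred apples
```

    In the Kotlin code from the question, the same JSON should be retrievable inside the `recognized` handler via `e.result.properties.getProperty(PropertyId.SpeechServiceResponse_JsonResult)`, and can then be parsed the same way to read `NBest[0].Lexical`.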

    I hope this helps.

    Regards,
    Vasavi

    Please accept the answer and vote 'Yes' if you found it helpful, to support the community. Thanks.

    1 person found this answer helpful.

0 additional answers
