How to get WordBoundary event without synthesizing speech?

Abas Oladosu 0 Reputation points
2023-06-08T16:51:36.3+00:00

I am trying to get the WordBoundary events for text without synthesizing it. What I want to achieve is to determine the length of the audio prior to converting the text to speech. For example, I want to know the duration of "Hello world" without converting it to speech.

I am building a program that converts large texts into speech. I want to display how long the audio will play for before synthesis.

Any feedback will be appreciated.


1 answer

  1. VasaviLankipalle-MSFT 18,721 Reputation points Moderator
    2023-06-08T23:05:42.0833333+00:00

    Hi @Abas Oladosu, thanks for using the Microsoft Q&A platform.

    I don't think we can determine the duration of the audio without synthesizing it, because the duration is determined by the length of the synthesized speech, which in turn depends on the text being converted and the speaking rate of the voice.

    The synthesis_word_boundary event is raised at the beginning of each new spoken word, punctuation, and sentence. The event reports the current word's time offset (in ticks) from the beginning of the output audio, as well as the character position in the input text immediately before the word that's about to be spoken. https://learn.microsoft.com/en-us/azure/cognitive-services/Speech-Service/how-to-speech-synthesis?tabs=browserjs%2Cterminal&pivots=programming-language-csharp#subscribe-to-synthesizer-events

    Here is the sample code for word_boundary_event: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/python/console/speech_synthesis_sample.py

    import azure.cognitiveservices.speech as speechsdk

    # Replace with your own subscription key and service region.
    speech_key, service_region = "YourSubscriptionKey", "YourServiceRegion"

    def speech_synthesis_word_boundary_event():
        """Performs speech synthesis and shows the word boundary event."""
        # Creates an instance of a speech config with the specified subscription key and service region.
        speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    
        # Creates a speech synthesizer with a null output stream.
        # This means the audio output data will not be written to any output channel.
        # You can just get the audio from the result.
        speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
    
        # Subscribes to word boundary event
        # The unit of evt.audio_offset is tick (1 tick = 100 nanoseconds), divide it by 10,000 to convert to milliseconds.
        speech_synthesizer.synthesis_word_boundary.connect(lambda evt: print(
            "Word boundary event received: {}, audio offset in ms: {}ms".format(evt, evt.audio_offset / 10000)))
    
        # Receives a text from console input and synthesizes it to result.
        while True:
            print("Enter some text that you want to synthesize, Ctrl-Z to exit")
            try:
                text = input()
            except EOFError:
                break
            result = speech_synthesizer.speak_text_async(text).get()
            # Check result
            if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
                print("Speech synthesized for text [{}]".format(text))
                audio_data = result.audio_data
                print("{} bytes of audio data received.".format(len(audio_data)))
            elif result.reason == speechsdk.ResultReason.Canceled:
                cancellation_details = result.cancellation_details
                print("Speech synthesis canceled: {}".format(cancellation_details.reason))
                if cancellation_details.reason == speechsdk.CancellationReason.Error:
                    print("Error details: {}".format(cancellation_details.error_details))
    
    

    The output is shown in a screenshot of the console session (image not included here): one word boundary event per word, each with its audio offset in milliseconds.
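    As a side note on the sample above: because the synthesizer is created with `audio_config=None`, `result.audio_data` holds the complete audio in memory, so once synthesis has finished you can also derive the duration from the byte count. A minimal sketch, assuming the Speech SDK's default RIFF 16 kHz 16-bit mono PCM output format with a standard 44-byte WAV header (adjust the parameters if you configure a different output format):

```python
def wav_duration_seconds(audio_data: bytes,
                         sample_rate: int = 16000,
                         bytes_per_sample: int = 2,
                         channels: int = 1,
                         header_bytes: int = 44) -> float:
    """Estimate the duration of RIFF/PCM audio from its byte length.

    Defaults assume 16 kHz, 16-bit, mono PCM preceded by a
    standard 44-byte WAV header.
    """
    payload = max(0, len(audio_data) - header_bytes)
    return payload / (sample_rate * bytes_per_sample * channels)

# Example: 1 second of 16 kHz 16-bit mono PCM is 32,000 payload bytes.
print(wav_duration_seconds(b"\x00" * (44 + 32000)))  # 1.0
```

    Note that this still requires running synthesis first; it only avoids playing or saving the audio.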

    Generally, this is the process for working with word boundary events.

    If you still want to estimate the duration up front, my suggestion is to divide the word count by an average speaking rate (assuming 150 words per minute). This is just an estimate of the average speaking rate for English (taken from https://virtualspeech.com/blog/average-speaking-rate-words-per-minute); the actual duration of the synthesized speech may vary depending on the specific text and the voice used for synthesis. Unfortunately, I can't give you the exact value.
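    The words-per-minute heuristic above can be sketched in a few lines. Note that 150 WPM is only an assumed average for English; it is not a value reported by the Speech service:

```python
def estimate_speech_duration(text: str, words_per_minute: float = 150.0) -> float:
    """Rough speech-duration estimate in seconds:
    word count divided by the assumed speaking rate."""
    word_count = len(text.split())
    return word_count * 60.0 / words_per_minute

print(estimate_speech_duration("Hello world"))  # 0.8
```

    For "Hello world" (2 words), this gives 2 × 60 / 150 = 0.8 seconds, which you can display before synthesis as a rough progress estimate.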

    Try these and let us know.

    I hope this helps.

    Regards,
    Vasavi

    -Please kindly accept the answer and vote 'yes' if you feel helpful to support the community, thanks.

