@Hirofumi Kojima Yes, this should be possible by subscribing to the WordBoundary event. This event is raised at the beginning of each spoken word and provides a time offset within the output audio stream and a character offset within the input prompt.
- AudioOffset reports the output audio's elapsed time between the beginning of synthesis and the start of the word about to be spoken. It is measured in hundred-nanosecond units (HNS), with 10,000 HNS equivalent to 1 millisecond.
- WordOffset reports the character position in the input string (original text or SSML) immediately before the word that's about to be spoken.
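Since AudioOffset arrives in HNS rather than milliseconds, you'll usually convert it before using it for captioning or timing. A minimal sketch of that conversion (the `hns_to_ms` helper is just an illustrative name, not part of the SDK):

```python
# Minimal sketch: convert a WordBoundary AudioOffset, reported in
# hundred-nanosecond units (HNS), into milliseconds.
# 10,000 HNS == 1 millisecond.

HNS_PER_MS = 10_000

def hns_to_ms(audio_offset_hns: int) -> float:
    """Convert a hundred-nanosecond audio offset to milliseconds."""
    return audio_offset_hns / HNS_PER_MS

# An offset of 5,000,000 HNS is 500 ms into the synthesized audio.
print(hns_to_ms(5_000_000))  # → 500.0
```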
You can also subscribe to the viseme event along with word boundary to get output similar to AWS Polly's speech marks.
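To illustrate the Polly comparison, here's a hedged sketch that shapes the fields a WordBoundary callback receives (audio offset, text offset, word text) into a Polly-style "word" speech mark. The callback wiring itself (e.g. connecting a handler to the synthesizer's word-boundary and viseme events) is omitted so the sketch stays self-contained; `to_speech_mark` is an illustrative helper, not an SDK function:

```python
# Sketch: turn WordBoundary event fields into an AWS Polly-style
# "word" speech mark. Field names here are assumptions modeled on
# Polly's speech-mark JSON, not an SDK-provided structure.

def to_speech_mark(audio_offset_hns: int, text_offset: int, word: str) -> dict:
    """Build a Polly-style word speech mark from WordBoundary event data."""
    return {
        "time": audio_offset_hns // 10_000,   # HNS -> milliseconds
        "type": "word",
        "start": text_offset,                 # character index in the input
        "end": text_offset + len(word),
        "value": word,
    }

print(to_speech_mark(5_000_000, 0, "Hello"))
# → {'time': 500, 'type': 'word', 'start': 0, 'end': 5, 'value': 'Hello'}
```

A viseme handler could emit analogous records (with the viseme ID in place of the word) to approximate Polly's `viseme` speech-mark type.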