Text to Speech with timestamp in JSON format

Hirofumi Kojima 21 Reputation points


Does Azure text-to-speech (TTS) have a feature similar to Amazon Polly speech marks?
For example, given a text, it will provide the following output.

input: "Mary had a little lamb."
output (json format): {"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}
# "time" is the timestamp in milliseconds; "start" (0) and "end" (23) are offsets into the input text.

I'm planning to convert this JSON output to an SRT file for subtitles. If Azure TTS can produce JSON output like the one above, I would appreciate it if you could let me know.


Azure AI Speech
An Azure service that integrates speech processing into apps and services.

Accepted answer
  1. romungi-MSFT 43,526 Reputation points Microsoft Employee

    @Hirofumi Kojima Yes, this should be possible by subscribing to the WordBoundary event. This event is raised at the beginning of each new spoken word and provides a time offset within the spoken stream and a text offset within the input prompt.

    • AudioOffset reports the output audio's elapsed time between the beginning of synthesis and the start of the next word. This is measured in hundred-nanosecond units (HNS) with 10,000 HNS equivalent to 1 millisecond.
    • WordOffset reports the character position in the input string (original text or SSML) immediately before the word that's about to be spoken.
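The unit conversion above can be sketched in a few lines of Python. This is only an illustration, not output produced by Azure: the helper names (`ticks_to_ms`, `to_speech_mark`) and the sample offsets are hypothetical, and the field layout simply mimics the Polly-style JSON from the question. Note that the offsets here are character positions, whereas Polly reports byte offsets.

```python
import json

# 10,000 hundred-nanosecond units (HNS) equal 1 millisecond.
TICKS_PER_MS = 10_000


def ticks_to_ms(audio_offset_ticks: int) -> int:
    """Convert an AudioOffset value in HNS ticks to whole milliseconds."""
    return audio_offset_ticks // TICKS_PER_MS


def to_speech_mark(audio_offset_ticks: int, text_offset: int, word: str) -> dict:
    """Build a Polly-style mark (hypothetical layout) for one spoken word."""
    return {
        "time": ticks_to_ms(audio_offset_ticks),
        "type": "word",
        "start": text_offset,              # character offset of the word
        "end": text_offset + len(word),    # character offset just past the word
        "value": word,
    }


# Made-up (audio offset, text offset, word) triples, for illustration only.
sample_events = [(500_000, 0, "Mary"), (3_750_000, 5, "had")]

# One JSON object per line, like Polly's speech mark stream.
for event in sample_events:
    print(json.dumps(to_speech_mark(*event)))
```

Each printed line is a standalone JSON object, so the output can be parsed line by line when building subtitle cues.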

    You can also subscribe to viseme output along with the word boundary events to get a response similar to Amazon Polly's speech marks.
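A rough sketch of how the subscription might look with the Speech SDK for Python (`azure-cognitiveservices-speech`) follows. It reflects my understanding of that SDK (`synthesis_word_boundary`, `viseme_received`, and the event attributes `audio_offset`, `text_offset`, `word_length`, `viseme_id`), so please check it against the current SDK reference. The helper names, the mark layout, and the `key`/`region` parameters are assumptions made for this example. It also assumes plain-text input; with SSML, the text offset refers to the SSML string instead.

```python
from types import SimpleNamespace

TICKS_PER_MS = 10_000  # AudioOffset is reported in 100-nanosecond units


def make_word_handler(text, marks):
    """Return a callback that appends one Polly-style word mark per event."""
    def on_word_boundary(evt):
        start = evt.text_offset
        end = start + evt.word_length
        marks.append({
            "time": evt.audio_offset // TICKS_PER_MS,
            "type": "word",
            "start": start,
            "end": end,
            "value": text[start:end],  # assumes offsets index into `text`
        })
    return on_word_boundary


def make_viseme_handler(marks):
    """Return a callback that appends one viseme mark per event."""
    def on_viseme(evt):
        marks.append({
            "time": evt.audio_offset // TICKS_PER_MS,
            "type": "viseme",
            "value": evt.viseme_id,
        })
    return on_viseme


def synthesize_with_marks(text, key, region):
    """Synthesize `text` and collect marks (needs the SDK and a valid key)."""
    import azure.cognitiveservices.speech as speechsdk  # imported lazily

    config = speechsdk.SpeechConfig(subscription=key, region=region)
    # audio_config=None keeps the audio in the result instead of playing it.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config, audio_config=None)
    marks = []
    # Connect the handlers before starting synthesis so no event is missed.
    synthesizer.synthesis_word_boundary.connect(make_word_handler(text, marks))
    synthesizer.viseme_received.connect(make_viseme_handler(marks))
    result = synthesizer.speak_text_async(text).get()
    return result, sorted(marks, key=lambda m: m["time"])


# Offline check of the word handler with a fake event (no network needed).
fake_event = SimpleNamespace(audio_offset=500_000, text_offset=0, word_length=4)
demo_marks = []
make_word_handler("Mary had a little lamb.", demo_marks)(fake_event)
print(demo_marks)
```

The handlers are kept separate from the SDK call so they can be exercised with fake events, as in the last few lines, before running against the real service.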

1 additional answer

  1. Samir 16 Reputation points

    @romungi-MSFT I have tried to capture speech marks using WordBoundary, but I am not receiving the event. Here is the post I have created with a detailed explanation. Would you be able to tell me what might be wrong in my code?


