@Hirofumi Kojima Yes, this should be possible by subscribing to the WordBoundary event. This event is raised at the beginning of each spoken word and provides a time offset within the output audio stream and a character offset within the input prompt.
- AudioOffset reports the output audio's elapsed time between the beginning of synthesis and the start of the word about to be spoken. It is measured in hundred-nanosecond units (HNS), with 10,000 HNS equivalent to 1 millisecond.
- WordOffset reports the character position in the input string (original text or SSML) immediately before the word that's about to be spoken.
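Since AudioOffset arrives in HNS rather than milliseconds, you'll usually convert it before using it for captioning or timing. A minimal sketch of that conversion (the `hns_to_ms` helper is just an illustrative name, not part of the SDK):

```python
# Minimal sketch: convert a WordBoundary AudioOffset, reported in
# hundred-nanosecond units (HNS), into milliseconds.
# 10,000 HNS == 1 millisecond.

HNS_PER_MS = 10_000

def hns_to_ms(audio_offset_hns: int) -> float:
    """Convert a hundred-nanosecond audio offset to milliseconds."""
    return audio_offset_hns / HNS_PER_MS

# An offset of 5,000,000 HNS is 500 ms into the synthesized audio.
print(hns_to_ms(5_000_000))  # → 500.0
```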
You can also subscribe to the viseme event along with word boundary to get output similar to AWS Polly's speech marks.
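To illustrate the Polly comparison, here's a hedged sketch that shapes the fields a WordBoundary callback receives (audio offset, text offset, word text) into a Polly-style "word" speech mark. The callback wiring itself (e.g. connecting a handler to the synthesizer's word-boundary and viseme events) is omitted so the sketch stays self-contained; `to_speech_mark` is an illustrative helper, not an SDK function:

```python
# Sketch: turn WordBoundary event fields into an AWS Polly-style
# "word" speech mark. Field names here are assumptions modeled on
# Polly's speech-mark JSON, not an SDK-provided structure.

def to_speech_mark(audio_offset_hns: int, text_offset: int, word: str) -> dict:
    """Build a Polly-style word speech mark from WordBoundary event data."""
    return {
        "time": audio_offset_hns // 10_000,   # HNS -> milliseconds
        "type": "word",
        "start": text_offset,                 # character index in the input
        "end": text_offset + len(word),
        "value": word,
    }

print(to_speech_mark(5_000_000, 0, "Hello"))
# → {'time': 500, 'type': 'word', 'start': 0, 'end': 5, 'value': 'Hello'}
```

A viseme handler could emit analogous records (with the viseme ID in place of the word) to approximate Polly's `viseme` speech-mark type.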