Text to speech - time synchronization

Question

sam husson 1

Hello,

I'm using the Speech Studio > Audio Content Creation tool to produce audio files with SSML. I need the audio files to be synchronized across different languages. In an SSML document with two sentences, is there a way to set the start time of the second sentence relative to the first sentence?

Example:
<speak>
<par>
<media xml:id="test" begin="0.5s">
<speak>This is the first sentence</speak>
</media>
<media xml:id="answer" begin="test.end+2.0s">
<speak>This second sentence starts 2 seconds after the beginning of the first sentence.</speak>
</media>
</par>
</speak>

If this is not possible, is there another option that works with neural voices?

Many thanks for your help.

2 answers

Answer 1

romungi-MSFT 48,906 Microsoft Employee Moderator

@sam husson The tool has a break option, i.e. you can set a time in ms to wait before the next sentence. I tried this in text mode with two different voices, and it waits until the end of the first sentence before pronouncing the second one.

sam husson 1 Reputation point

2021-02-10T12:30:45.18+00:00

Thank you for your answer.
Yes, this function, which in SSML is <break time="2s" />, makes it possible to set a wait time before the next sentence.
However, that does not help with synchronizing two different languages, since each sentence needs to start at the same time in every language version. For example, Spanish needs about 20% more time than English to say the same thing (text expansion), so using the same break time in both versions would not keep them aligned.
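
One possible workaround with <break> alone is to measure how long the first sentence takes in each language and then pad each version with a different amount of silence, so that the second sentence starts at the same absolute time in every file. The sketch below only illustrates that arithmetic. The durations (1500 ms and 1800 ms), the 4000 ms target, the sample sentences, the voice names, and the helper names padding_ms and build_ssml are all assumptions made for the example, not something the tool provides; real durations would have to be measured from audio that has already been synthesized.

```python
from xml.sax.saxutils import escape


def padding_ms(target_start_ms: int, first_sentence_ms: int) -> int:
    """Return the silence (ms) to insert after the first sentence so that
    the second sentence starts at target_start_ms."""
    gap = target_start_ms - first_sentence_ms
    if gap < 0:
        raise ValueError("first sentence ends after the target start time")
    return gap


def build_ssml(lang: str, voice: str, first: str, second: str, break_ms: int) -> str:
    """Build a two-sentence SSML document with a per-language break."""
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f'<voice name="{voice}">'
        f'{escape(first)}<break time="{break_ms}ms" />{escape(second)}'
        f'</voice></speak>'
    )


# Hypothetical measured durations of the first sentence, in milliseconds.
measured_ms = {"en-US": 1500, "es-ES": 1800}
TARGET_START_MS = 4000  # assumed common start time for the second sentence

ssml_en = build_ssml(
    "en-US", "en-US-JennyNeural",  # example voice name, an assumption
    "This is the first sentence.", "This is the second sentence.",
    padding_ms(TARGET_START_MS, measured_ms["en-US"]),
)
ssml_es = build_ssml(
    "es-ES", "es-ES-ElviraNeural",  # example voice name, an assumption
    "Esta es la primera frase.", "Esta es la segunda frase.",
    padding_ms(TARGET_START_MS, measured_ms["es-ES"]),
)
```

With these assumed numbers the English version gets a 2500 ms break and the Spanish version a 2200 ms break. The service may cap the break length and may add short silences of its own around sentences, so the result would still need to be checked against the rendered audio.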

This page says that the prosody element is used to specify changes to pitch, contour, range, rate, duration, and volume for the text-to-speech output. However, I tried duration and it does not seem to work with neural voices.
https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup?tabs=csharp

I was wondering whether there is another way to time-synchronize the same sentence, translated into different languages, using SSML.
Many thanks!

romungi-MSFT 48,906 Reputation points Microsoft Employee Moderator

2021-02-10T14:15:05.997+00:00

@sam husson I would like to understand your use case for synchronizing the two audio files. Are you looking to create separate audio files and play them at the same time? I think this synchronization feature is not currently available directly in the tool, but we can provide feedback to our team based on your use case.

Yes, the duration attribute is only available for standard voices. From the documentation:

duration (optional): The period of time that should elapse while the speech synthesis (TTS) service reads the text, in seconds or milliseconds. For example, 2s or 1800ms. Duration supports standard voices only.
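
Since duration is not honored by neural voices, one rough approximation is to use the rate attribute of prosody instead and scale the speaking rate so that a sentence fits a target length. The sketch below is only an illustration under assumptions: it assumes rate is accepted as a signed percentage for the voice in use, that the natural duration (1800 ms here) was measured from a previous synthesis, and the helper names rate_percent and wrap_prosody are hypothetical.

```python
from xml.sax.saxutils import escape


def rate_percent(natural_ms: float, target_ms: float) -> str:
    """Return a signed percentage for prosody rate that should bring a
    sentence lasting natural_ms close to target_ms."""
    if natural_ms <= 0 or target_ms <= 0:
        raise ValueError("durations must be positive")
    # Speaking faster by a factor natural/target shortens the audio accordingly.
    change = (natural_ms / target_ms - 1.0) * 100.0
    return f"{change:+.2f}%"


def wrap_prosody(text: str, rate: str) -> str:
    """Wrap a sentence in a prosody element with the given rate."""
    return f'<prosody rate="{rate}">{escape(text)}</prosody>'


# Hypothetical numbers: the Spanish sentence takes 1800 ms at the default
# rate, and it should match the 1500 ms English sentence.
spanish_fragment = wrap_prosody(
    "Esta es la primera frase.", rate_percent(1800, 1500)
)
```

Pauses inside a sentence do not necessarily scale with the rate, so the resulting length is only approximate; measuring the new audio and adjusting the rate again would probably be needed.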

Answer 2

sam husson 1

Yes, exactly. The idea is to create two separate audio files in two different languages. Time synchronization is needed to align them with other content, such as music or video.
Again, thanks!
