How to have multiple mstts:audioduration in a single <speak>?

Question

Lucas 0

I'm trying to adjust the duration of individual phrases so that the synthesized voice matches the timing of the voice in the original audio.

It's working perfectly when done like this:

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="pt-BR">
<voice name="pt-BR-FranciscaNeural">
<mstts:audioduration value="3580ms"/>
Com esse ingrediente você faz uma massa de pastel incrível!
</voice>
</speak>

But when I try to add two "mstts:audioduration" elements, each one for a specific phrase, no adjustment is made at all (even the first phrase loses its adjustment).
Here is my SSML:

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="pt-BR">
<voice name="pt-BR-FranciscaNeural">
<mstts:audioduration value="3580ms"/>
Com esse ingrediente você faz uma massa de pastel incrível!
</voice>
<voice name="pt-BR-BrendaNeural">
<mstts:audioduration value="2040ms"/>
Eu fiz de frango, creme de ricota e tomate.
</voice>
</speak>

I thought that I could use "mstts:audioduration" in multiple <voice> elements within a single <speak>, as I read in: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice#adjust-the-audio-duration

Is that really possible? And, if so, what am I doing wrong?

dupammi 8,615 Reputation points Microsoft External Staff

2024-05-03T20:39:11.6933333+00:00

Hi @Lucas

We haven’t heard back from you since the last response, and I am just checking whether you had a chance to try the suggestions in my answer. Thank you.

Answer 1

Hi @Lucas

Thank you for your question.

Yes, it is possible to use multiple "mstts:audioduration" elements, one for each phrase in its own <voice> element, within a single <speak> element in the Speech Synthesis Markup Language (SSML) of the Azure AI Speech service. Based on the code snippets you provided, the syntax appears correct.

However, it seems you're encountering an issue where the audio duration adjustments aren't being applied when multiple <voice> elements are present. This could be due to several factors, such as syntax errors or the audioduration values being outside the acceptable range.

Below is the script I used to try to reproduce the issue with the <speak> snippet from your question.

Python script for the SSML given in the question (Portuguese sentences):

import azure.cognitiveservices.speech as speechsdk
# Replace the placeholders with your own Speech resource key and region
speech_key, service_region = "YOUR_SPEECH_KEY", "YOUR_SPEECH_REGION"
# Create a speech configuration with your subscription key and service region
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
# No recognition language is set: for synthesis, the <voice> elements in the SSML select the voices
# Play the synthesized audio on the default speaker
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
# Create a SpeechSynthesizer that uses both the speech and the audio output configuration
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
# SSML to synthesize
ssml = """
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="pt-BR">
<voice name="pt-BR-FranciscaNeural">
<mstts:audioduration value="3580ms"/>
Com esse ingrediente você faz uma massa de pastel incrível!
</voice>
<voice name="pt-BR-BrendaNeural">
<mstts:audioduration value="2040ms"/>
Eu fiz de frango, creme de ricota e tomate.
</voice>
</speak>
"""
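# Synthesize the SSML asynchronously and wait for the result
result = speech_synthesizer.speak_ssml_async(ssml).get()
# Report whether synthesis completed or was canceled
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis completed.")
elif result.reason == speechsdk.ResultReason.Canceled:
    print("Synthesis canceled:", result.cancellation_details.error_details)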

To troubleshoot further, I recommend verifying the SSML syntax to ensure each "mstts:audioduration" tag is correctly nested within its respective <voice> element. Additionally, ensure that the audioduration values are realistic and within the acceptable range.
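The nesting check described above can be done mechanically before sending the SSML to the service. The following is a minimal sketch using only the Python standard library; the helper name `audioduration_by_voice` is hypothetical and not part of the Speech SDK, and the check only looks at direct children of each <voice> element.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the <speak> element in the question
SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "http://www.w3.org/2001/mstts"


def audioduration_by_voice(ssml: str) -> list:
    """Return (voice name, [audioduration values]) for each <voice>, in document order.

    Hypothetical helper for illustration; only direct children of <voice> are inspected.
    """
    root = ET.fromstring(ssml.strip())
    report = []
    for voice in root.iter(f"{{{SSML_NS}}}voice"):
        values = [
            child.get("value")
            for child in voice
            if child.tag == f"{{{MSTTS_NS}}}audioduration"
        ]
        report.append((voice.get("name"), values))
    return report


# SSML from the question
ssml = """
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="pt-BR">
<voice name="pt-BR-FranciscaNeural">
<mstts:audioduration value="3580ms"/>
Com esse ingrediente você faz uma massa de pastel incrível!
</voice>
<voice name="pt-BR-BrendaNeural">
<mstts:audioduration value="2040ms"/>
Eu fiz de frango, creme de ricota e tomate.
</voice>
</speak>
"""

for name, values in audioduration_by_voice(ssml):
    print(name, values)
```

Each voice should be listed with exactly one value. An empty list would mean the element is missing or not a direct child of that <voice>, and a parse error would point to malformed XML.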

As per the documentation, the value of "mstts:audioduration" should be within 0.5 to 2 times the duration of the original audio, that is, the audio synthesized without any other rate settings. For example, if the requested duration of your audio is 30s, then the original audio must be between 15 and 60 seconds long. If you set a value outside of these boundaries, the duration is set according to the respective minimum or maximum multiple. Please check whether your values are within this range.
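To make the arithmetic of that rule concrete, here is a small sketch that models the documented clamping. It is not service code: the function name `effective_duration_ms` is hypothetical, and the natural (unadjusted) durations used in the example are made-up values, since the real ones are only known after synthesizing the phrase without "mstts:audioduration".

```python
def effective_duration_ms(natural_ms: float, requested_ms: float) -> float:
    """Model the documented rule: the requested duration is limited to
    the range [0.5 * natural, 2 * natural].

    natural_ms is the duration of the phrase synthesized without any
    duration or rate setting (an assumption you have to measure yourself).
    """
    lower = 0.5 * natural_ms
    upper = 2.0 * natural_ms
    return min(max(requested_ms, lower), upper)


# Hypothetical natural durations, for illustration only
print(effective_duration_ms(natural_ms=4000, requested_ms=3580))  # within range, kept as requested
print(effective_duration_ms(natural_ms=8000, requested_ms=3580))  # below 0.5x, raised to the minimum
print(effective_duration_ms(natural_ms=1000, requested_ms=2040))  # above 2x, lowered to the maximum
```

If the requested value for a phrase falls outside this range, the resulting audio will not match the requested length exactly, which can look as if the adjustment was only partly applied.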

Python code for English sentences:

import azure.cognitiveservices.speech as speechsdk
# Replace the placeholders with your own Speech resource key and region
speech_key, service_region = "YOUR_SPEECH_KEY", "YOUR_SPEECH_REGION"
# Create a speech configuration with your subscription key and service region
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
# No recognition language is set: for synthesis, the <voice> elements in the SSML select the voices
# Play the synthesized audio on the default speaker
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
# Create a SpeechSynthesizer that uses both the speech and the audio output configuration
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
# SSML to synthesize
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-AvaMultilingualNeural">
<mstts:audioduration value="20s"/>
If we're home schooling, the best we can do is roll with what each day brings and try to have fun along the way.
A good place to start is by trying out the slew of educational apps that are helping children stay happy and smash their schooling at the same time.
</voice>
<voice name="en-US-RyanMultilingualNeural">
<mstts:audioduration value="15s"/>
If we're home schooling, the best we can do is roll with what each day brings and try to have fun along the way.
A good place to start is by trying out the slew of educational apps that are helping children stay happy and smash their schooling at the same time.
</voice>
</speak>
"""
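# Synthesize the SSML asynchronously and wait for the result
result = speech_synthesizer.speak_ssml_async(ssml).get()
# Report whether synthesis completed or was canceled
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis completed.")
elif result.reason == speechsdk.ResultReason.Canceled:
    print("Synthesis canceled:", result.cancellation_details.error_details)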

For the English example, the audioduration values that I set in the code above seem to be acceptable, particularly when comparing the voice "en-US-RyanMultilingualNeural" with "en-US-AvaMultilingualNeural". For more details, please refer to adjust-the-audio-duration and mstts-audio-duration-examples. It's important to ensure that the requested audio duration is within the acceptable range, which is 0.5 to 2 times the original audio duration, and to use the appropriate language code (en-US or pt-BR) in the respective script.

Because of the limit on files that can be attached to this thread, I am unable to upload the audio files for the above inputs. However, it is working fine for me.
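Since the audio cannot be attached, one way to check the result yourself is to save the synthesized output to a file and measure its length. The sketch below assumes the output was written as a PCM WAV file, for example by creating the audio configuration with speechsdk.audio.AudioOutputConfig(filename="output.wav") instead of the default speaker; the file name and the helper `wav_duration_ms` are hypothetical.

```python
import wave


def wav_duration_ms(path: str) -> float:
    """Return the playback length of a PCM WAV file in milliseconds."""
    with wave.open(path, "rb") as wav_file:
        # Duration = number of audio frames divided by frames per second
        return 1000.0 * wav_file.getnframes() / wav_file.getframerate()


# Example call, assuming the synthesized audio was saved as "output.wav":
# print(wav_duration_ms("output.wav"))
```

Note that a file containing both <voice> segments gives the total length, so to compare each phrase against its requested duration it may be easier to synthesize and measure the phrases one at a time.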

I hope you understand. Thank you.
