How to have multiple mstts:audioduration in a single <speak>?

Lucas 0 Reputation points
2024-05-02T21:27:23.34+00:00

I'm trying to adjust the duration of individual phrases so that the synthesized voice matches with the voice in the original audio.

It's working perfectly when done like this:

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="pt-BR">
<voice name="pt-BR-FranciscaNeural">
<mstts:audioduration value="3580ms"/>
Com esse ingrediente você faz uma massa de pastel incrível!
</voice>
</speak>

But, when I try to add two "mstts:audioduration", each one for a specific phrase, there's no adjustment being made at all (even the first phrase loses its adjustment).
I'll show below how my SSML is:

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="pt-BR">
<voice name="pt-BR-FranciscaNeural">
<mstts:audioduration value="3580ms"/>
Com esse ingrediente você faz uma massa de pastel incrível!
</voice>
<voice name="pt-BR-BrendaNeural">
<mstts:audioduration value="2040ms"/>
Eu fiz de frango, creme de ricota e tomate.
</voice>
</speak>

I thought that I could use "mstts:audioduration" in multiple <voice> elements within a single <speak>, as I read in: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice#adjust-the-audio-duration

Is that really possible? And, if so, what am I doing wrong?

Azure AI Speech

1 answer

  1. dupammi 7,135 Reputation points Microsoft Vendor
    2024-05-03T05:11:09.3433333+00:00

    Hi @Lucas

    Thank you for your question.

    It's absolutely possible to use multiple "mstts:audioduration" tags for different phrases within a single <speak> element in Microsoft Azure Cloud's Speech Synthesis Markup Language (SSML). Based on the code snippets you provided, the syntax appears correct.

    However, it seems you're encountering an issue where the audio duration adjustments aren't being applied when multiple <voice> elements are present. This could be due to several factors, such as syntax errors or the audioduration values being outside the acceptable range.

    Below is what I tried to repro based on the <speak> tag code snippets you mentioned while posting the question. It seems that the syntax of your SSML code is correct.

    Python script for SSML given in question (for Portuguese sentence):

    import azure.cognitiveservices.speech as speechsdk

    speech_key, service_region = "YOUR_SPEECH_KEY", "YOUR_SPEECH_REGION"
    # Create a speech configuration with your subscription key and service region
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    # Send the synthesized audio to the default speaker
    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
    # Create a SpeechSynthesizer that uses the audio output config
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    # SSML with one mstts:audioduration per <voice> element
    ssml = """
    <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="pt-BR">
    <voice name="pt-BR-FranciscaNeural">
    <mstts:audioduration value="3580ms"/>
    Com esse ingrediente você faz uma massa de pastel incrível!
    </voice>
    <voice name="pt-BR-BrendaNeural">
    <mstts:audioduration value="2040ms"/>
    Eu fiz de frango, creme de ricota e tomate.
    </voice>
    </speak>
    """
    # Synthesize the SSML and wait for the result
    result = speech_synthesizer.speak_ssml_async(ssml).get()
    # Report whether synthesis completed or was canceled
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Synthesis completed")
    elif result.reason == speechsdk.ResultReason.Canceled:
        print("Synthesis canceled:", result.cancellation_details.error_details)
    

    To troubleshoot further, I recommend verifying the SSML syntax to ensure each "mstts:audioduration" tag is correctly nested within its respective <voice> element. Additionally, ensure that the audioduration values are realistic and within the acceptable range.
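
The nesting check suggested above can be automated. The following is only a sketch, not an official validator: the function name `check_audioduration` is made up for illustration, and the accepted value pattern (a number followed by `ms` or `s`) is an assumption based on the values used in this thread. It parses the SSML with the standard library and reports any "mstts:audioduration" element that is not a direct child of a <voice> element, any <voice> that holds more than one, and any value that does not match the assumed pattern.

```python
import re
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "http://www.w3.org/2001/mstts"
# Assumed value format: a number followed by "ms" or "s", e.g. 3580ms or 20s.
DURATION_RE = re.compile(r"^\d+(\.\d+)?(ms|s)$")


def check_audioduration(ssml: str) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    root = ET.fromstring(ssml.strip())
    voice_tag = f"{{{SSML_NS}}}voice"
    duration_tag = f"{{{MSTTS_NS}}}audioduration"
    inside_voice = 0
    for voice in root.iter(voice_tag):
        # findall only returns direct children of this <voice> element.
        durations = voice.findall(duration_tag)
        inside_voice += len(durations)
        name = voice.get("name", "<unnamed>")
        if len(durations) > 1:
            problems.append(f"{name}: more than one audioduration element")
        for duration in durations:
            value = duration.get("value", "")
            if not DURATION_RE.match(value):
                problems.append(f"{name}: unexpected value {value!r}")
    # Any audioduration not counted above sits outside a <voice> element.
    total = sum(1 for _ in root.iter(duration_tag))
    if total != inside_voice:
        problems.append("audioduration found outside a <voice> element")
    return problems


QUESTION_SSML = """
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" version="1.0" xml:lang="pt-BR">
<voice name="pt-BR-FranciscaNeural">
<mstts:audioduration value="3580ms"/>
Com esse ingrediente você faz uma massa de pastel incrível!
</voice>
<voice name="pt-BR-BrendaNeural">
<mstts:audioduration value="2040ms"/>
Eu fiz de frango, creme de ricota e tomate.
</voice>
</speak>
"""

if __name__ == "__main__":
    print(check_audioduration(QUESTION_SSML))
```

A clean result only says the markup is structured as expected; it does not show that the service applied the requested durations.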

    As per the documentation, the value of "mstts:audioduration" should be within 0.5 to 2 times the duration of the original audio, that is, the audio produced without any other rate settings. For example, if the requested duration of your audio is 30s, then the original audio must otherwise be between 15 and 60 seconds long. If you set a value outside of these boundaries, the duration is set according to the respective minimum or maximum multiple. Please check that your values fall within this acceptable range.
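
To make the range rule concrete, here is a small sketch of the arithmetic. The helper name `effective_duration_ms` is hypothetical, and it assumes the service clamps exactly at 0.5 and 2 times the original duration, as the documentation wording suggests. The original (unadjusted) duration has to be measured by synthesizing the same text without the tag; the numbers used in the example call are illustrative, not measurements.

```python
def effective_duration_ms(original_ms: float, requested_ms: float) -> float:
    """Clamp the requested duration to 0.5x..2x of the original duration."""
    lower = 0.5 * original_ms  # shortest duration the rule allows
    upper = 2.0 * original_ms  # longest duration the rule allows
    return min(max(requested_ms, lower), upper)


if __name__ == "__main__":
    # Illustrative only: original audio of 30 s, request of 10 s.
    print(effective_duration_ms(30000, 10000))
```

Under this assumption, a request is honored as-is only when it falls between half and double the original duration; anything else ends up at the nearer boundary.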

    Python code for English sentences:

    import azure.cognitiveservices.speech as speechsdk

    speech_key, service_region = "YOUR_SPEECH_KEY", "YOUR_SPEECH_REGION"
    # Create a speech configuration with your subscription key and service region
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    # Send the synthesized audio to the default speaker
    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
    # Create a SpeechSynthesizer that uses the audio output config
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    # SSML with one mstts:audioduration per <voice> element
    ssml = """
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-AvaMultilingualNeural">
    <mstts:audioduration value="20s"/>
    If we're home schooling, the best we can do is roll with what each day brings and try to have fun along the way.
    A good place to start is by trying out the slew of educational apps that are helping children stay happy and smash their schooling at the same time.
    </voice>
    <voice name="en-US-RyanMultilingualNeural">
    <mstts:audioduration value="15s"/>
    If we're home schooling, the best we can do is roll with what each day brings and try to have fun along the way.
    A good place to start is by trying out the slew of educational apps that are helping children stay happy and smash their schooling at the same time.
    </voice>
    </speak>
    """
    # Synthesize the SSML and wait for the result
    result = speech_synthesizer.speak_ssml_async(ssml).get()
    # Report whether synthesis completed or was canceled
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Synthesis completed")
    elif result.reason == speechsdk.ResultReason.Canceled:
        print("Synthesis canceled:", result.cancellation_details.error_details)
    

    For the English example, the audioduration values that I set in the code above seem to be acceptable for both voices, "en-US-AvaMultilingualNeural" and "en-US-RyanMultilingualNeural". For more details, please refer to adjust-the-audio-duration and mstts-audio-duration-examples. It's important to ensure that the requested duration is within the acceptable range, which is 0.5 to 2 times the original audio duration, and to use the appropriate language code, either en-US or pt-BR, in the respective script.

    Due to the limit on files that can be attached to this thread, I am unable to upload the audio files for the above inputs. However, it is working fine for me.
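
Since the audio itself cannot be shared here, one way to check the result locally is to measure the length of the synthesized output. The sketch below assumes the audio was saved as an uncompressed RIFF/WAV (PCM) file with a valid header, for example by passing `AudioOutputConfig(filename=...)` to the synthesizer, and that the bytes of that file are then handed to the helper. The names `wav_duration_ms` and `make_silence` are illustrative; `make_silence` only generates test input so the example runs without the Speech service.

```python
import io
import wave


def wav_duration_ms(wav_bytes: bytes) -> float:
    """Return the playback length of an in-memory PCM WAV file in milliseconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return 1000.0 * wav.getnframes() / wav.getframerate()


def make_silence(duration_ms: int, rate: int = 16000) -> bytes:
    """Build a 16-bit mono WAV of silence, used only to demonstrate the helper."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * (rate * duration_ms // 1000))
    return buf.getvalue()


if __name__ == "__main__":
    print(wav_duration_ms(make_silence(3580)))
```

Synthesizing each <voice> block in its own request and measuring each file makes it easier to compare the result against the requested value for that phrase, because a combined file only gives the total length.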

    I hope you understand. Thank you.
