How to add break times in Azure SSML

Jakub Chudiak 0

How can I add breaks in SSML without dedicating a separate voice tag to each one, as in the example below? The limit is 50 voice tags.

<?xml version="1.0" encoding="UTF-8"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" version="1.0" xml:lang="en-EN">
    <voice name="en-GB-RyanNeural"><break time="1230ms"/></voice>
	<voice name="en-GB-RyanNeural"><mstts:audioduration value="2440ms"/>This is a jammy Margarita.</voice>
    <voice name="en-GB-RyanNeural"><break time="1850ms"/></voice>
	<voice name="en-GB-RyanNeural"><mstts:audioduration value="4600ms"/>Let me show you a little secret ingredient. 2 teaspoons of your favourite jam.</voice>
    <voice name="en-GB-RyanNeural"><break time="4790ms"/></voice>
</speak>

But I also want to keep the duration of my text.

navba-MSFT 23,095 Microsoft Employee

@Jakub Chudiak Welcome to the Microsoft Q&A forum, and thank you for posting your query here!

You can simplify your SSML by using the <prosody> tag to add breaks and control the duration of your text without needing a separate <voice> tag for each break. Here’s an example of how you can do it:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">

    <voice name="en-GB-RyanNeural">

        <prosody rate="0%" pitch="0%">

            <break time="1230ms"/>

            <mstts:audioduration value="2440ms"/>This is a jammy Margarita.

            <break time="1850ms"/>

            <mstts:audioduration value="4600ms"/>Let me show you a little secret ingredient. 2 teaspoons of your favourite jam.

            <break time="4790ms"/>

        </prosody>

    </voice>

</speak>

More info here.

Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.

Jakub Chudiak 0

@navba-MSFT
This also doesn't work for some reason. The audio is now 8 seconds long and it should be 14. The break times and durations are incorrect.

<?xml version="1.0" encoding="UTF-8"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" version="1.0" xml:lang="en-EN">
	<voice name="en-GB-RyanNeural">

        <prosody rate="0%" pitch="0%">

            <break time="1230ms"/>

            <mstts:audioduration value="2440ms"/>This is a jammy Margarita.

            <break time="1850ms"/>

            <mstts:audioduration value="4600ms"/>Let me show you a little secret ingredient. 2 teaspoons of your favourite jam.

            <break time="4790ms"/>

        </prosody>
    </voice>
</speak>

navba-MSFT 23,095 Microsoft Employee

@Jakub Chudiak Thanks for getting back. I tested the sample SSML below in Speech Studio and it worked fine.

Please set the rate attribute to one of the values below, depending on your requirement:

A constant value:
- x-slow (equivalently 0.5, -50%)
- slow (equivalently 0.64, -46%)
- medium (equivalently 1, default value)
- fast (equivalently 1.55, +55%)
- x-fast (equivalently 2, +100%)

<?xml version="1.0" encoding="UTF-8"?>

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" version="1.0" xml:lang="en-EN">

    <voice name="en-GB-RyanNeural">

        <break time="1230ms"/>

        <prosody rate="medium">This is a jammy Margarita.</prosody>

        <break time="1850ms"/>

        <prosody rate="medium">Let me show you a little secret ingredient. 2 teaspoons of your favourite jam.</prosody>

        <break time="4790ms"/>

    </voice>

</speak>
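
When prosody rate is used instead of mstts:audioduration, the rate needed for a target length can be estimated from the length of an unmodified synthesis of the same text. The Python sketch below shows only that arithmetic. `rate_for_target` is a hypothetical helper, the natural length has to be measured from a first synthesis pass, and it is an assumption that the service applies the percentage as a plain speed multiplier, so the resulting length is not guaranteed to be exact.

```python
def rate_for_target(natural_ms: float, target_ms: float) -> str:
    """Return a relative prosody rate (e.g. '+25%') that would turn
    natural_ms of speech into target_ms, assuming rate is a pure
    speed multiplier."""
    if natural_ms <= 0 or target_ms <= 0:
        raise ValueError("durations must be positive")
    multiplier = natural_ms / target_ms       # 2.0 means twice as fast
    percent = (multiplier - 1.0) * 100.0      # relative change in percent
    return f"{percent:+.0f}%"


print(rate_for_target(10000, 5000))  # +100%
print(rate_for_target(2000, 4000))   # -50%
```

The two printed values correspond to the x-fast and x-slow entries in the list above.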

Jakub Chudiak 0 Reputation points

2024-08-01T06:28:28.35+00:00

@navba-MSFT
Hi, but now my audio is longer. I really need the exact duration, as when using mstts:audioduration. I don't mind having mstts:audioduration in separate voice tags, but I need the breaks inside them so that I don't have to dedicate a separate voice tag to each break.
Jakub Chudiak 0 Reputation points

2024-08-01T06:31:29.2033333+00:00

@navba-MSFT
Hi, my durations are not accurate now. I don't mind having mstts:audioduration in a separate voice tag, but I need the durations to be accurate, because the audio is now longer. The problem is with the breaks: how can I use them without dedicating a separate tag just to a break?
navba-MSFT 23,095 Reputation points Microsoft Employee

2024-08-01T06:31:53.9333333+00:00

@Jakub Chudiak Please let me know how you are using this SSML content. Is it through Speech Studio, the SDK, or the REST API?

Did you try updating the rate attribute to fast or x-fast?

Awaiting your reply.
Jakub Chudiak 0 Reputation points

2024-08-01T06:34:08.7033333+00:00

@navba-MSFT I am trying it in Postman. Yes, I tried those rates, but I want to rely on audio durations to get accurate timing.
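
To compare results objectively when testing through the API, the length of the returned audio can be measured rather than judged by ear. The Python sketch below reads the length from WAV bytes with the standard `wave` module. It assumes the response was requested in a RIFF/PCM output format and that the WAV header reports the data length correctly. `wav_duration_ms` is a hypothetical helper, and the silent clip only stands in for a real response body.

```python
import io
import wave


def wav_duration_ms(wav_bytes: bytes) -> float:
    """Return the playback length of PCM WAV data in milliseconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate() * 1000.0


def make_silence(seconds: int, frame_rate: int = 24000) -> bytes:
    """Build a silent 16-bit mono WAV clip, used here as a stand-in
    for the bytes returned by a synthesis request."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(frame_rate)
        wav_file.writeframes(b"\x00\x00" * frame_rate * seconds)
    return buffer.getvalue()


print(wav_duration_ms(make_silence(2)))  # 2000.0
```

In practice the bytes saved from the Postman or API response would be passed to `wav_duration_ms` in place of the silent clip.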

navba-MSFT 23,095 Microsoft Employee

@Jakub Chudiak Could you please test the SSML below in Speech Studio instead of Postman?

<?xml version="1.0" encoding="UTF-8"?>

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">

    <voice name="en-GB-RyanNeural">

        <break time="1230ms"/>

        <mstts:audioduration value="2440ms">This is a jammy Margarita.</mstts:audioduration>

        <break time="1850ms"/>

        <mstts:audioduration value="4600ms">Let me show you a little secret ingredient. 2 teaspoons of your favourite jam.</mstts:audioduration>

        <break time="4790ms"/>

    </voice>

</speak>

Jakub Chudiak 0 Reputation points

2024-08-01T06:50:45.64+00:00

@navba-MSFT Which one is it? Also, I need it to be accurate because I will be using it with the API, and when I tried it in Postman it didn't work correctly.
Jakub Chudiak 0 Reputation points

2024-08-01T06:54:04.25+00:00

@navba-MSFT


Unknown tag: mstts:audioduration. It's an unsupported tuning tag, but it will take effect when you generate the audio.
But I need it to work with the API, and it doesn't: the audio is 8 seconds long.
Jakub Chudiak 0 Reputation points

2024-08-01T06:55:14.2166667+00:00

@navba-MSFT


Unknown tag: mstts:audioduration. It's an unsupported tuning tag, but it will take effect when you generate the audio.
And when I try it in Postman I get 8 seconds of audio, meaning it doesn't work. Also, I need to use it with the API.
Jakub Chudiak 0 Reputation points

2024-08-01T06:56:10.1966667+00:00

@navba-MSFT Hello, it didn't work in Speech Studio. Also, I need to make it work with the API, so I won't be using Speech Studio for it.
jakub chudiak 20 Reputation points

2024-08-01T07:06:04.6766667+00:00

@navba-MSFT How, and in what tool, do you test it? In Speech Studio I got "Unknown tag". I checked the docs, and audioduration doesn't support text inside it.
Jakub Chudiak 0 Reputation points

2024-08-02T11:08:32.13+00:00

Does anyone have any idea how to accomplish this?
navba-MSFT 23,095 Reputation points Microsoft Employee

2024-08-05T02:35:45.3266667+00:00

@Jakub Chudiak Apologies for the late reply. I have reached out to the Product Owners. I will keep you posted, once I hear from them.
Jakub Chudiak 0 Reputation points

2024-08-06T06:25:47.03+00:00

@navba-MSFT OK, thank you.
navba-MSFT 23,095 Reputation points Microsoft Employee

2024-08-07T03:59:23.4333333+00:00
@Jakub Chudiak Please refer to the documentation below:

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice#adjust-the-audio-duration

The doc says audioduration is applied to the whole content of the enclosing voice tag, which means this is an SSML-level tag.

There are also constraints on the duration: it must correspond to between 0.5 and 2 times the rate of the original audio.

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" version="1.0" xml:lang="en-EN">
    <voice name="en-GB-RyanNeural">
        Let me show you a little secret ingredient. 2 teaspoons of your favourite jam.
        <break time="4790ms"/>
    </voice>
</speak>

Take this as an example: the original audio length is 10 s (without audioduration). Setting the audio duration to 4600ms exceeds what the service can do, so it uses the maximum rate of 2 times to generate the audio, which gives an output of 5 s. This 5 s output also cuts the break time in half.
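
The arithmetic in this example can be written out as a short Python sketch. `achievable_duration_ms` is a hypothetical helper, not part of any SDK. It only models the 0.5 to 2 times limit described in this reply and assumes the limit acts as a simple clamp on the speed multiplier.

```python
def achievable_duration_ms(
    natural_ms: float,
    requested_ms: float,
    min_rate: float = 0.5,
    max_rate: float = 2.0,
) -> float:
    """Return the duration that results when the rate needed to reach
    requested_ms is clamped to the range [min_rate, max_rate]."""
    required_rate = natural_ms / requested_ms
    applied_rate = min(max(required_rate, min_rate), max_rate)
    return natural_ms / applied_rate


# 10 s of natural audio (text plus break) with 4600 ms requested:
# the required rate is about 2.17, which is capped at 2.
print(achievable_duration_ms(10000, 4600))  # 5000.0
```

A request inside the limits, such as 8000 ms for the same 10 s of audio, comes back unchanged because the required rate of 1.25 needs no clamping.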
Jakub Chudiak 0 Reputation points

2024-08-07T12:31:41.1033333+00:00

@navba-MSFT Hello, I am sorry, but this doesn't fix the issue. My issue is with break tags inside separate voice tags. I am also aware of the limits on those durations, but they are essential for me.

1 answer

Sina Salam 9,406 Reputation points

2024-07-31T11:38:20.6333333+00:00
Hello Jakub Chudiak,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

Problem

I understand that you would like to add break times in Azure SSML and still keep the duration of your text.

Solution

To manage breaks in SSML without dedicating a separate <voice> tag for each, you can consolidate the text and use the <break> tag within a single <voice> block. This approach allows you to specify pauses while keeping the same voice settings throughout the text.

<?xml version="1.0" encoding="UTF-8"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" version="1.0" xml:lang="en-EN">
    <voice name="en-GB-RyanNeural">
        <break time="1230ms"/>
        <mstts:audioduration value="2440ms"/>This is a jammy Margarita.
        <break time="1850ms"/>
        <mstts:audioduration value="4600ms"/>Let me show you a little secret ingredient. 2 teaspoons of your favourite jam.
        <break time="4790ms"/>
    </voice>
</speak>

By using a single <voice> tag, you avoid the limit of 50 <voice> tags and still maintain control over the pacing and pauses in the speech synthesis output. You can check the links for more reading: https://www.w3.org/TR/speech-synthesis11/#S3 and https://www.w3.org/TR/speech-synthesis/#Voice and https://www.w3.org/TR/speech-synthesis/#S3.2.3
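
Whether a given SSML document stays under the limit can be checked before sending it. The Python sketch below counts the voice elements with the standard library. The limit of 50 is the figure quoted in this thread, `count_voice_elements` is a hypothetical helper, and the sample document is generated only for illustration.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MAX_VOICE_ELEMENTS = 50  # limit as quoted in this thread


def count_voice_elements(ssml: str) -> int:
    """Count the <voice> elements in an SSML document."""
    root = ET.fromstring(ssml)
    return sum(1 for _ in root.iter(f"{{{SSML_NS}}}voice"))


# Five voice elements, the same number as in the original question.
voices = "".join(
    '<voice name="en-GB-RyanNeural"><break time="500ms"/></voice>'
    for _ in range(5)
)
ssml = f'<speak xmlns="{SSML_NS}" version="1.0" xml:lang="en-GB">{voices}</speak>'

count = count_voice_elements(ssml)
print(count, count <= MAX_VOICE_ELEMENTS)  # 5 True
```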

I hope this is helpful! Do not hesitate to let me know if you have any other questions.

** Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful ** so that others in the community facing similar issues can easily find the solution.

Best Regards,

Sina Salam
Jakub Chudiak 0 Reputation points

2024-07-31T11:43:02.4966667+00:00

Hi, this doesn't work for me. The resulting audio is 8 seconds long, the pauses are shorter, and my durations seem to be ignored.

<?xml version="1.0" encoding="UTF-8"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" version="1.0" xml:lang="en-EN">
    <voice name="en-GB-RyanNeural">
        <break time="1230ms"/>
        <mstts:audioduration value="2440ms"/>This is a jammy Margarita.
        <break time="1850ms"/>
        <mstts:audioduration value="4600ms"/>Let me show you a little secret ingredient. 2 teaspoons of your favourite jam.
        <break time="4790ms"/>
    </voice>
</speak>