Text to Speech does not set the right pitch if two pitches are used.

Sabir Ahmed 11 Reputation points
2022-11-20T08:54:35.66+00:00

Hey team,

Sample SSML:

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-SaraNeural">Much against his will Reddy obeyed. <prosody rate="default" pitch="28%" volume="default">“It isn’t the least bit of use,”</prosody> he grumbled, as he trotted towards the Big River. <prosody rate="default" pitch="28%" volume="default">“There won’t be anything there. It is just a waste of time.”</prosody></voice></speak>  

I have a sentence in which two parts are set to pitch=28%.
The first part, "It isn’t the least bit of use," sounds off, more like pitch=8%, even though it's set to 28%.
The second part, "There won’t be anything there. It is just a waste of time." sounds correct at pitch=28%.

Please note that this happens with all the voices and looks like a major bug.
It only happens when the pitch is set on more than one part of the text.
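
To narrow this down, it may help to generate the SSML programmatically so that variants with one or two pitched spans can be compared side by side. The following is a minimal sketch in Python using only the standard library. `build_ssml` and its segment format are hypothetical helpers for illustration, not part of any Azure SDK, and the resulting strings would still need to be sent to the Speech service separately.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Same root element as the sample SSML in this question.
SPEAK_OPEN = (
    '<speak xmlns="http://www.w3.org/2001/10/synthesis" '
    'xmlns:mstts="http://www.w3.org/2001/mstts" '
    'xmlns:emo="http://www.w3.org/2009/10/emotionml" '
    'version="1.0" xml:lang="en-US">'
)


def build_ssml(segments, voice="en-US-SaraNeural"):
    """Build an SSML string from (text, pitch) pairs (hypothetical helper).

    pitch is None for plain narration, or a string such as "28%" to wrap
    the text in a <prosody> element with that pitch.
    """
    parts = []
    for text, pitch in segments:
        body = escape(text)  # escape &, < and > in the spoken text
        if pitch is None:
            parts.append(body)
        else:
            parts.append(
                f'<prosody rate="default" pitch={quoteattr(pitch)} '
                f'volume="default">{body}</prosody>'
            )
    ssml = (
        f"{SPEAK_OPEN}<voice name={quoteattr(voice)}>"
        f"{''.join(parts)}</voice></speak>"
    )
    ET.fromstring(ssml)  # raises ParseError if the result is not well-formed
    return ssml


# Variant from the question: two spans set to the same pitch.
two_spans = build_ssml([
    ("Much against his will Reddy obeyed. ", None),
    ("“It isn’t the least bit of use,”", "28%"),
    (" he grumbled, as he trotted towards the Big River. ", None),
    ("“There won’t be anything there. It is just a waste of time.”", "28%"),
])

# Control variant: only the first span is pitched.
one_span = build_ssml([
    ("Much against his will Reddy obeyed. ", None),
    ("“It isn’t the least bit of use,”", "28%"),
    (" he grumbled, as he trotted towards the Big River.", None),
])
```

Synthesizing `two_spans` and `one_span` with the same voice should show whether the first span only sounds wrong when a second pitched span follows it.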

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

1 answer

  1. romungi-MSFT 41,961 Reputation points Microsoft Employee
    2022-12-01T13:14:31.867+00:00

    @Sabir Ahmed I see the same behavior in the East US region too. After testing some scenarios, I think the pitch is applied correctly if you use a full stop instead of a comma in your original SSML. I think the comma causes the API to interpret the sentence as incomplete, so the pitch is not applied to that part of the sentence, as this seems to be applicable only at sentence level.

    This is the section where I changed a comma to a full stop.

    “It isn’t the least bit of use.”

    The entire SSML is:

    <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-SaraNeural">Much against his will Reddy obeyed. <prosody rate="default" pitch="28%" volume="default">“It isn’t the least bit of use.”</prosody> he grumbled, as he trotted towards the Big River. <prosody rate="default" pitch="28%" volume="default">“There won’t be anything there. It is just a waste of time.”</prosody></voice></speak>  
    

    This renders as follows in the Audio Content Creation (ACC) tool in Speech Studio:
    266179-image.png
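
The comma-to-full-stop workaround described in this answer could be applied automatically before sending text to the service. The sketch below is only an illustration in Python: `end_pitched_spans_with_full_stop` is a hypothetical helper, and it assumes the only problematic case is a comma directly before the closing quote of a `<prosody>` element, as in the sample SSML. It changes the written punctuation, so it may not suit every text.

```python
import re

# A comma that sits right at the end of a <prosody> element, optionally
# followed by a closing quote, e.g.  ...use,”</prosody>
_TRAILING_COMMA = re.compile(r',(?=[”"]?</prosody>)')


def end_pitched_spans_with_full_stop(ssml: str) -> str:
    """Replace a trailing comma inside each <prosody> span with a full stop."""
    return _TRAILING_COMMA.sub(".", ssml)


original = (
    '<prosody rate="default" pitch="28%" volume="default">'
    "“It isn’t the least bit of use,”</prosody> he grumbled, as he trotted"
)
patched = end_pitched_spans_with_full_stop(original)
```

Commas outside the pitched spans, such as the one in "he grumbled, as he trotted", are left untouched because they are not followed by `</prosody>`.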
