Text to Speech does not set right pitch if two pitches are used.

Sabir Ahmed 11

Hey team,

Sample SSML:

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-SaraNeural">Much against his will Reddy obeyed. <prosody rate="default" pitch="28%" volume="default">“It isn’t the least bit of use,”</prosody> he grumbled, as he trotted towards the Big River. <prosody rate="default" pitch="28%" volume="default">“There won’t be anything there. It is just a waste of time.”</prosody></voice></speak>

I have a sentence with two parts of it set to pitch=28%.
The first part, "It isn’t the least bit of use," sounds off, more like pitch=8%, even though it's set to 28%.
The second part, "There won’t be anything there. It is just a waste of time." sounds correct at pitch=28%.

Please note this is happening with all the voices and looks like a major bug.
It only happens when a pitch is set on more than one part of the text.
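
For anyone who wants to reproduce this, here is a minimal Python sketch that assembles the same SSML from its four parts. The helper name `build_ssml` is hypothetical and not part of any Speech SDK. It also leaves out the `mstts` and `emo` namespaces and the `rate="default"` / `volume="default"` attributes, on the assumption that they do not affect the result.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_ssml(segments, voice="en-US-SaraNeural", lang="en-US"):
    """Build an SSML string from (text, pitch) pairs.

    pitch is None for plain narration, or a string such as "28%".
    Hypothetical helper for illustration only.
    """
    parts = []
    for text, pitch in segments:
        body = escape(text)  # escape &, < and > in the spoken text
        if pitch is None:
            parts.append(body)
        else:
            parts.append(f'<prosody pitch="{pitch}">{body}</prosody>')
    return (
        '<speak xmlns="http://www.w3.org/2001/10/synthesis" '
        f'version="1.0" xml:lang="{lang}">'
        f'<voice name="{voice}">{" ".join(parts)}</voice></speak>'
    )


segments = [
    ("Much against his will Reddy obeyed.", None),
    ("“It isn’t the least bit of use,”", "28%"),
    ("he grumbled, as he trotted towards the Big River.", None),
    ("“There won’t be anything there. It is just a waste of time.”", "28%"),
]
ssml = build_ssml(segments)

# Sanity check: the result must parse as well-formed XML.
ET.fromstring(ssml)
print(ssml)
```

Both quoted parts get an identical `pitch="28%"` wrapper here, so any audible difference between them comes from the service and not from the markup.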

romungi-MSFT 41,961 Reputation points Microsoft Employee

2022-11-21T09:11:39.84+00:00
@Sabir Ahmed I tried to use the speech studio with the following XML and I think the settings work as expected. Could you try the same and confirm?
As per the documentation, pitch changes can be applied at the sentence level. The pitch changes should be within 0.5 to 1.5 times the original audio.
I think sentence detection with the other settings applied might have been the issue with the previous XML.

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-SaraNeural">Much against his will Reddy obeyed. <prosody pitch="+28.00%">“It isn’t the least bit of use,” </prosody> he grumbled, as he trotted towards the Big River. <prosody pitch="+28.00%">“There won’t be anything there. It is just a waste of time.”</prosody></voice></speak>
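
As a rough illustration of the documented limit, the sketch below formats a relative pitch value the way it appears in the XML above (`+28.00%`) and rejects values outside the range. It assumes that "0.5 to 1.5 times the original audio" corresponds to a relative change between -50% and +50%. The function name `relative_pitch` is hypothetical.

```python
def relative_pitch(percent):
    """Format a relative pitch change as an SSML value such as "+28.00%".

    Assumption: the documented 0.5x to 1.5x limit is read here as a
    relative change between -50% and +50%.
    """
    if not -50.0 <= percent <= 50.0:
        raise ValueError(f"pitch change {percent}% is outside -50%..+50%")
    # The "+" flag forces an explicit sign, marking the value as relative.
    return f"{percent:+.2f}%"


print(relative_pitch(28))   # +28.00%
print(relative_pitch(-10))  # -10.00%
```

The value 28% used in this thread is well inside that range, so the limit itself should not explain the difference between the two sentences.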
Sabir Ahmed 11 Reputation points

2022-11-21T09:59:38.187+00:00
Thank you for helping me out.

I can confirm that the voice for these lines is completely different:

2. <prosody pitch="+28.00%">“It isn’t the least bit of use,” </prosody>
4. <prosody pitch="+28.00%">“There won’t be anything there. It is just a waste of time.”</prosody></voice></speak>

I've tried this in Speech Studio as well as the API.

Here is the final output: https://fliki.ai/share/audio/microsoft-pitch-issue-637b4b26dde64016ddbd2a51

Please listen carefully to sentences 2 and 4: they should sound the same but are completely different.
romungi-MSFT 41,961 Reputation points Microsoft Employee

2022-11-21T10:17:02.497+00:00

The file that is shared does sound different for the sentences you mentioned. I have tried the XML shared above with my voice resource in westeurope, and the pitch seems consistent. Please download the file from this link to check if this is the output you are expecting.
Sabir Ahmed 11 Reputation points

2022-11-21T10:24:30.077+00:00

Thanks for the quick response.

I can confirm I have tried this in East US and also the sample editor: https://azure.microsoft.com/en-us/products/cognitive-services/text-to-speech/#overview
Since a lot of the major voices and voice styles are available in East US, it would be hard to move the resource to a different region.

I will still try it out in West Europe, though.
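
A side-by-side comparison of the two regions could look roughly like the sketch below. It assumes the public text-to-speech REST endpoint pattern `https://<region>.tts.speech.microsoft.com/cognitiveservices/v1` and the header names from the Speech service REST documentation, both of which should be checked against the current docs. The function names, the `User-Agent` value, and the output file names are made up for illustration, and each region needs a key from a Speech resource created in that region.

```python
import urllib.request


def build_tts_request(region, key, ssml):
    """Build (but do not send) a text-to-speech REST request for one region.

    Assumption: endpoint pattern and header names follow the Speech
    service REST documentation; verify before relying on them.
    """
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        "User-Agent": "pitch-region-check",  # hypothetical client name
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


def synthesize(region, key, ssml, out_path):
    """Send the request and save the returned audio to out_path."""
    with urllib.request.urlopen(build_tts_request(region, key, ssml)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)


# Example usage (needs real keys, so it is left commented out):
# synthesize("eastus", EASTUS_KEY, ssml, "pitch-eastus.wav")
# synthesize("westeurope", WESTEUROPE_KEY, ssml, "pitch-westeurope.wav")
```

Sending the identical SSML string to both regions keeps the markup out of the comparison, so only the regional deployment differs between the two files.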
Sabir Ahmed 11 Reputation points

2022-11-30T08:33:09.287+00:00

Hey @romungi-MSFT ,
Just wanted to follow up to see if this bug has been reported and if a ticket has been opened.
To recap the issue:
When multiple pitches are used in a call, the result is not accurate. This issue is present across the "East-US" region.

Hope this is resolved soon, as a lot of our critical users are being impacted by this issue.

Looking forward to the resolution.

1 answer

romungi-MSFT 41,961 Reputation points Microsoft Employee

2022-12-01T13:14:31.867+00:00
@Sabir Ahmed I see the same behavior with the East US region too. After testing some scenarios, I think the pitch is applied correctly if you use a full stop instead of a comma in your original SSML. I think the comma causes the API to treat the sentence as incomplete, so the pitch is not applied to that part of the sentence, as it is only applicable at the sentence level.

This is the section where I changed a comma to a full stop.

“It isn’t the least bit of use.”

The entire SSML is:

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-SaraNeural">Much against his will Reddy obeyed. <prosody rate="default" pitch="28%" volume="default">“It isn’t the least bit of use.”</prosody> he grumbled, as he trotted towards the Big River. <prosody rate="default" pitch="28%" volume="default">“There won’t be anything there. It is just a waste of time.”</prosody></voice></speak>

This renders as follows in the ACC tool in Speech Studio.

Sabir Ahmed 11 Reputation points

2022-12-02T06:41:14.28+00:00

@romungi-MSFT : Thank you so much for helping out.
I can confirm that replacing the comma (",") with a full stop (".") solves the issue.
Thank you for this, though since the original sentence had correct grammar, I was hoping this issue would not occur.
But it looks like, for the pitch to work correctly, the marked-up text should always end with a full stop.
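
Until that changes, a pre-flight check along these lines could flag the affected spans before the SSML is sent. This is only a sketch based on what was observed in this thread: the function name `unterminated_prosody` is hypothetical, and treating "!" and "?" as valid sentence endings alongside the full stop is an assumption that has not been tested here.

```python
import xml.etree.ElementTree as ET

SSML_NS = "{http://www.w3.org/2001/10/synthesis}"
TERMINATORS = ".!?"   # only "." was confirmed in this thread; "!" and "?" are assumed
CLOSERS = "\"'”’)"    # closing quotes or brackets that may follow the terminator


def unterminated_prosody(ssml):
    """Return the text of each <prosody> span that does not end a sentence."""
    root = ET.fromstring(ssml)
    flagged = []
    for node in root.iter(f"{SSML_NS}prosody"):
        text = "".join(node.itertext()).strip()
        # Look past trailing quotes to find the last real character.
        core = text.rstrip(CLOSERS).rstrip()
        if core and core[-1] not in TERMINATORS:
            flagged.append(text)
    return flagged


sample = (
    '<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">'
    '<voice name="en-US-SaraNeural">Much against his will Reddy obeyed. '
    '<prosody pitch="28%">“It isn’t the least bit of use,”</prosody> '
    'he grumbled, as he trotted towards the Big River. '
    '<prosody pitch="28%">“There won’t be anything there. It is just a waste of time.”</prosody>'
    '</voice></speak>'
)

flagged = unterminated_prosody(sample)
print(flagged)  # only the first quotation, which ends with a comma
```

The check only reports the spans. Whether to swap the comma for a full stop or restructure the sentence is still a decision for whoever owns the text.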

Hopefully this can be fixed so that applying the right pitch does not depend on the punctuation.

Thanks for your help again, really appreciate it.