Change word emphasis in a sentence with Azure Text-To-Speech SSML

Question

fnx 1

(I asked this question already on stackoverflow, but didn't get an answer: https://stackoverflow.com/questions/70165340/change-word-emphasis-in-a-sentence-with-azure-text-to-speech-ssml)

I want to move the emphasis to a different word in a sentence. SSML defines the <emphasis> element, but with Azure TTS it seems that currently the only way is to use the <prosody> element.

I tried using the <prosody> element (with the pitch attribute) to emphasize certain words in sentences. Sometimes it is OK, sometimes it doesn't sound right (if I always use the same +30% value, for example).

I tried pitch and contour:
First <prosody pitch="+18.00%">test</prosody> sentence.
Second <prosody contour="(20%, +31%) (43%, +11%)">test sentence</prosody>.
Third <prosody contour="(48%, +37%)">test</prosody> sentence.

I would find it much easier if there were an <emphasis> element, because otherwise I have to find the exact values for each sentence. Maybe I'm missing something. If the <prosody> element is the only way, what would be the typical approach to emphasize a word in a sentence (pitch? contour? which percentage values?)
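For reference, the pitch-based approach described above can be scripted so that the values are at least easy to change per sentence. This is only a minimal sketch: the helper name `emphasize_with_pitch` is hypothetical, the default pitch is simply the +18.00% value from the first example, and the default voice name is the one used in the answer in this thread. It builds the SSML string only and does not call the Speech service.

```python
from xml.sax.saxutils import escape, quoteattr

# Voice name as used elsewhere in this thread; swap in the voice you actually test with.
DEFAULT_VOICE = "Microsoft Server Speech Text to Speech Voice (en-US, ChristopherNeural)"


def emphasize_with_pitch(sentence, word, pitch="+18.00%", voice=DEFAULT_VOICE):
    """Wrap the first occurrence of `word` in <prosody pitch=...> and return full SSML.

    This is a plain substring match, so "test" would also match inside "testing".
    """
    before, found, after = sentence.partition(word)
    if not found:
        raise ValueError(f"{word!r} does not occur in the sentence")
    body = (
        escape(before)
        + f"<prosody pitch={quoteattr(pitch)}>{escape(found)}</prosody>"
        + escape(after)
    )
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


ssml = emphasize_with_pitch("First test sentence.", "test")
print(ssml)
```

The pitch value still has to be judged by ear for each sentence; the sketch only removes the manual string editing.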

3 answers

Answer 1

@fnx The usage seems correct with respect to the attributes that are supported by Azure text to speech. I think you are not observing a noticeable difference because of the voice used in your testing. I have tested this scenario with the same sentence in the Audio Content Creation feature of Speech Studio. Here are the SSML inputs I used.

Normal:

<speak  
	xmlns="http://www.w3.org/2001/10/synthesis"  
	xmlns:mstts="http://www.w3.org/2001/mstts"  
	xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">  
	<voice name="Microsoft Server Speech Text to Speech Voice (en-US, ChristopherNeural)">test sentence</voice>  
</speak>

With intonation applied:

<speak  
	xmlns="http://www.w3.org/2001/10/synthesis"  
	xmlns:mstts="http://www.w3.org/2001/mstts"  
	xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">  
	<voice name="Microsoft Server Speech Text to Speech Voice (en-US, ChristopherNeural)">  
		<prosody contour="(1%, +85%)">test sentence</prosody>  
	</voice>  
</speak>

Could you please try the scenario with ChristopherNeural and the above SSML inputs? In Speech Studio you can set any of the input parameters by drag and drop instead of manually editing the SSML file. Because of the limit on files that can be attached to this thread, I am unable to upload the audio files for the above inputs. Thanks!!
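If the SSML is edited by hand rather than in Speech Studio, the contour value can also be generated from a list of points. The following is a rough sketch, assuming the "(position%, change%)" format used in the examples in this thread; `build_contour` is a hypothetical helper and not part of any SDK.

```python
def build_contour(points):
    """Format (position in %, pitch change in %) integer pairs as a contour value."""
    parts = []
    for position, change in points:
        if not 0 <= position <= 100:
            raise ValueError("position must be between 0 and 100")
        # The "+d" format always writes the sign, e.g. +31 or -5.
        parts.append(f"({position}%, {change:+d}%)")
    return " ".join(parts)


contour = build_contour([(20, 31), (43, 11)])
print(f'<prosody contour="{contour}">test sentence</prosody>')
# prints: <prosody contour="(20%, +31%) (43%, +11%)">test sentence</prosody>
```

Generating the value this way makes it easier to try several contours for the same sentence and compare them by ear.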

Answer 2

The issue is more that each sentence has to be checked individually to see whether the prosody values work, and I also don't know what values are typically used for emphasis.
That is why I was asking at the end of my post: I would find it much easier if there would be a <emphasis> element, because otherwise I have to get the exact values for each sentence. [...] If the <prosody> element is the only way, what would be typical approach to emphasize a word in a sentence (pitch? contour? which percentage values?)
And also: are there any plans to add an <emphasis> element?

Btw, previous TTS versions did support an <emphasis> element:
https://learn.microsoft.com/en-us/previous-versions/office/developer/communication-server-2007/bb801230(v=office.12)
And it is also part of the SSML definition:
https://www.w3.org/TR/speech-synthesis11/#S3.2.2

Answer 3

fnx 1

Since the <emphasis> element is now supported, I was wondering what the plans are to support it for more voices, including non-English ones (currently only some English voices are supported).
Link to documentation
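For voices that do support it, word-level emphasis might be generated as in the sketch below. Treat it as an illustration under assumptions: the level names are the ones defined in the W3C SSML 1.1 specification linked earlier in this thread, the helper name `emphasize` is hypothetical, and the default voice `en-US-GuyNeural` is only assumed to be one of the supported English voices, so check the documentation before relying on it.

```python
from xml.sax.saxutils import escape, quoteattr

# Emphasis levels defined by the W3C SSML 1.1 specification.
LEVELS = {"strong", "moderate", "none", "reduced"}


def emphasize(sentence, word, level="moderate", voice="en-US-GuyNeural"):
    """Wrap the first occurrence of `word` in <emphasis level=...> and return full SSML.

    The default voice is an assumption; only some English voices support the element.
    """
    if level not in LEVELS:
        raise ValueError(f"unknown emphasis level: {level!r}")
    before, found, after = sentence.partition(word)
    if not found:
        raise ValueError(f"{word!r} does not occur in the sentence")
    body = (
        escape(before)
        + f"<emphasis level={quoteattr(level)}>{escape(found)}</emphasis>"
        + escape(after)
    )
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


print(emphasize("Third test sentence.", "test", level="strong"))
```

Unlike the prosody approach, no per-sentence pitch or contour values are needed here, which is the convenience the original question was asking for.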
