Change word emphasis in a sentence with Azure Text-To-Speech SSML

fnx 1 Reputation point
2021-12-14T19:31:31.7+00:00

(I asked this question already on stackoverflow, but didn't get an answer: https://stackoverflow.com/questions/70165340/change-word-emphasis-in-a-sentence-with-azure-text-to-speech-ssml)

I want to change the emphasis to a different word and SSML supports the <emphasis> element, but with Azure TTS it seems like currently the only way is to use the <prosody> element.

I tried using the <prosody> element (with the pitch parameter) to emphasize certain words in sentences. Sometimes it is OK, sometimes it doesn't sound right (if I always use the same +30% value, for example).

I tried pitch and contour:
First <prosody pitch="+18.00%">test</prosody> sentence.
Second <prosody contour="(20%, +31%) (43%, +11%)">test sentence</prosody>.
Third <prosody contour="(48%, +37%)">test</prosody> sentence.

I would find it much easier if there were an <emphasis> element, because otherwise I have to find the exact values for each sentence. Maybe I'm missing something. If the <prosody> element is the only way, what would be the typical approach to emphasize a word in a sentence (pitch? contour? which percentage values?)

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

3 answers

  1. romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator
    2021-12-15T08:12:11.733+00:00

    @fnx The usage seems correct with respect to the attributes that Azure text to speech supports. I think you are not observing a noticeable difference because of the voice used in your testing. I tested this scenario with the same sentence in the Audio Content Creation feature of Speech Studio. Here are the SSML inputs I used.

    Normal:

    <speak
        xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:mstts="http://www.w3.org/2001/mstts"
        xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">
        <voice name="Microsoft Server Speech Text to Speech Voice (en-US, ChristopherNeural)">test sentence</voice>
    </speak>
    

    With intonation applied:

    <speak
        xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:mstts="http://www.w3.org/2001/mstts"
        xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">
        <voice name="Microsoft Server Speech Text to Speech Voice (en-US, ChristopherNeural)">
            <prosody contour="(1%, +85%)">test sentence</prosody>
        </voice>
    </speak>
    

    Could you please try the scenario with ChristopherNeural and the SSML inputs above? In Speech Studio you can set any of the input parameters by drag and drop instead of manually editing the SSML file. Because of the limits on files that can be attached to this thread, I am unable to upload the audio files for these inputs. Thanks!
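
For anyone who wants to script the comparison instead of editing SSML by hand, here is a minimal Python sketch that generates the two inputs from this answer as strings. The helper name `build_ssml` is hypothetical (it is not part of any Azure SDK), the `mstts` and `emo` namespace declarations are left out because nothing in these two inputs uses them, and the sketch only builds the text; it does not call the speech service.

```python
from xml.sax.saxutils import escape, quoteattr

# Voice name copied from the SSML in this answer.
VOICE = "Microsoft Server Speech Text to Speech Voice (en-US, ChristopherNeural)"


def build_ssml(text, voice=VOICE, contour=None):
    """Return an SSML document for `text`.

    Hypothetical helper: when `contour` is given, the text is wrapped in
    <prosody contour="...">; otherwise it is left unmodified.
    """
    body = escape(text)  # escape &, < and > in the spoken text
    if contour is not None:
        body = f"<prosody contour={quoteattr(contour)}>{body}</prosody>"
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


normal = build_ssml("test sentence")
with_contour = build_ssml("test sentence", contour="(1%, +85%)")
print(normal)
print(with_contour)
```

Changing only the `contour` argument while keeping the sentence and voice fixed makes it easier to hear what a single value does.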


  2. fnx 1 Reputation point
    2021-12-15T08:54:33.56+00:00

    The issue is more that each sentence has to be checked individually to see whether the prosody values work, and I also don't know what values are typically used for emphasis.
    That is why I was asking at the end of my post: I would find it much easier if there were an <emphasis> element, because otherwise I have to find the exact values for each sentence. [...] If the <prosody> element is the only way, what would be the typical approach to emphasize a word in a sentence (pitch? contour? which percentage values?)
    And also: are there any plans to add an <emphasis> element?

    Btw, previous TTS versions did support an <emphasis> element:
    https://learn.microsoft.com/en-us/previous-versions/office/developer/communication-server-2007/bb801230(v=office.12)
    And it is also part of the SSML definition:
    https://www.w3.org/TR/speech-synthesis11/#S3.2.2


  3. fnx 1 Reputation point
    2022-07-14T08:57:50.19+00:00

    Since the <emphasis> element is now supported, I was wondering what the plans are to support it for more voices, in particular non-English ones (currently only some English voices are supported).
    Link to documentation
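
For readers landing here later, a rough Python sketch of what using the element could look like. The helper names `emphasize` and `build_ssml` are hypothetical, the four level values come from the W3C SSML 1.1 specification linked earlier in this thread, and the voice `en-US-GuyNeural` is only an assumption about which voices accept the element; check the documentation for the current list. The sketch only builds the SSML string and does not call the speech service.

```python
from xml.sax.saxutils import escape, quoteattr

# Emphasis levels defined by the W3C SSML 1.1 specification.
LEVELS = ("strong", "moderate", "none", "reduced")


def emphasize(sentence, word, level="moderate"):
    """Wrap the first occurrence of `word` in <emphasis level="...">."""
    if level not in LEVELS:
        raise ValueError(f"unknown emphasis level: {level!r}")
    before, found, after = sentence.partition(word)
    if not found:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return (
        f"{escape(before)}"
        f'<emphasis level="{level}">{escape(found)}</emphasis>'
        f"{escape(after)}"
    )


def build_ssml(inner, voice="en-US-GuyNeural"):
    """Wrap an SSML fragment in <speak>/<voice>. The default voice is an assumption."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>{inner}</voice>"
        "</speak>"
    )


fragment = emphasize("First test sentence.", "test", level="strong")
ssml = build_ssml(fragment)
print(ssml)
```

Compared with the <prosody> approach from the question, only the level has to be chosen per word, not a pitch or contour value per sentence.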

