Use Speech Synthesis Markup Language

Completed

While the Azure AI Speech SDK enables you to submit plain text to be synthesized into speech (for example, by using the SpeakTextAsync() method), the service also supports an XML-based syntax for describing characteristics of the speech you want to generate. This Speech Synthesis Markup Language (SSML) syntax offers greater control over how the spoken output sounds, enabling you to:

  • Specify a speaking style, such as "excited" or "cheerful" when using a neural voice.
  • Insert pauses or silence.
  • Specify phonemes (phonetic pronunciations), for example to pronounce the text "SQL" as "sequel".
  • Adjust the prosody of the voice (affecting the pitch, timbre, and speaking rate).
  • Use common "say-as" rules, for example to specify that a given string should be expressed as a date, time, telephone number, or other form.
  • Insert recorded speech or audio, for example to include a standard recorded message or simulate background noise.

For example, consider the following SSML:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" 
                     xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US"> 
    <voice name="en-US-AriaNeural"> 
        <mstts:express-as style="cheerful"> 
          I say tomato 
        </mstts:express-as> 
    </voice> 
    <voice name="en-US-GuyNeural"> 
        I say <phoneme alphabet="sapi" ph="t ao m ae t ow"> tomato </phoneme>. 
        <break strength="weak"/>Lets call the whole thing off! 
    </voice> 
</speak>

This SSML specifies a spoken dialog between two different neural voices, like this:

  • Ariana (cheerfully): "I say tomato:
  • Guy: "I say tomato (pronounced tom-ah-toe) ... Let's call the whole thing off!"

To submit an SSML description to the Speech service, you can use the SpeakSsmlAsync() method, like this:

speechSynthesizer.SpeakSsmlAsync(ssml_string);

For more information about SSML, see the Azure AI Speech SDK documentation.