SSML document structure and events
The Speech Synthesis Markup Language (SSML) with input text determines the structure, content, and other characteristics of the text to speech output. For example, you can use SSML to define a paragraph, a sentence, a break or a pause, or silence. You can wrap text with event tags such as bookmark or viseme that can be processed later by your application.
Refer to the sections below for details about how to structure elements in the SSML document.
Note
In addition to Azure AI Speech neural (non HD) voices, you can also use Azure AI Speech high definition (HD) voices and Azure OpenAI neural (HD and non HD) voices. The HD voices provide a higher quality for more versatile scenarios.
Some voices don't support all Speech Synthesis Markup Language (SSML) tags. This includes neural text to speech HD voices, personal voices, and embedded voices.
- For Azure AI Speech high definition (HD) voices, check the SSML support here.
- For personal voice, you can find the SSML support here.
- For embedded voices, check the SSML support here.
Document structure
The Speech service implementation of SSML is based on the World Wide Web Consortium's Speech Synthesis Markup Language Version 1.0. The elements supported by the Speech service can differ from the W3C standard.
Each SSML document is created with SSML elements or tags. These elements are used to adjust the voice, style, pitch, prosody, volume, and more.
Here's a subset of the basic structure and syntax of an SSML document:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="string">
<mstts:backgroundaudio src="string" volume="string" fadein="string" fadeout="string"/>
<voice name="string" effect="string">
<audio src="string"></audio>
<bookmark mark="string"/>
<break strength="string" time="string" />
<emphasis level="value"></emphasis>
<lang xml:lang="string"></lang>
<lexicon uri="string"/>
<math xmlns="http://www.w3.org/1998/Math/MathML"></math>
<mstts:audioduration value="string"/>
<mstts:ttsembedding speakerProfileId="string"></mstts:ttsembedding>
<mstts:express-as style="string" styledegree="value" role="string"></mstts:express-as>
<mstts:silence type="string" value="string"/>
<mstts:viseme type="string"/>
<p></p>
<phoneme alphabet="string" ph="string"></phoneme>
<prosody pitch="value" contour="value" range="value" rate="value" volume="value"></prosody>
<s></s>
<say-as interpret-as="string" format="string" detail="string"></say-as>
<sub alias="string"></sub>
</voice>
</speak>
Some examples of contents that are allowed in each element are described in the following list:
- audio: The body of the audio element can contain plain text or SSML markup that's spoken if the audio file is unavailable or unplayable. The audio element can also contain text and the following elements: audio, break, p, s, phoneme, prosody, say-as, and sub.
- bookmark: This element can't contain text or any other elements.
- break: This element can't contain text or any other elements.
- emphasis: This element can contain text and the following elements: audio, break, emphasis, lang, phoneme, prosody, say-as, and sub.
- lang: This element can contain all other elements except mstts:backgroundaudio, voice, and speak.
- lexicon: This element can't contain text or any other elements.
- math: This element can only contain text and MathML elements.
- mstts:audioduration: This element can't contain text or any other elements.
- mstts:backgroundaudio: This element can't contain text or any other elements.
- mstts:ttsembedding: This element can contain text and the following elements: audio, break, emphasis, lang, phoneme, prosody, say-as, and sub.
- mstts:express-as: This element can contain text and the following elements: audio, break, emphasis, lang, phoneme, prosody, say-as, and sub.
- mstts:silence: This element can't contain text or any other elements.
- mstts:viseme: This element can't contain text or any other elements.
- p: This element can contain text and the following elements: audio, break, phoneme, prosody, say-as, sub, mstts:express-as, and s.
- phoneme: This element can only contain text and no other elements.
- prosody: This element can contain text and the following elements: audio, break, p, phoneme, prosody, say-as, sub, and s.
- s: This element can contain text and the following elements: audio, break, phoneme, prosody, say-as, mstts:express-as, and sub.
- say-as: This element can only contain text and no other elements.
- sub: This element can only contain text and no other elements.
- speak: The root element of an SSML document. This element can contain the following elements: mstts:backgroundaudio and voice.
- voice: This element can contain all other elements except mstts:backgroundaudio and speak.
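The containment rules can be checked mechanically before a document is sent for synthesis. The following Python sketch is illustrative only: it encodes a small subset of the rules listed above, matches elements by local name (so mstts:express-as is treated as express-as), and the names ALLOWED_CHILDREN and find_nesting_errors are hypothetical helpers, not part of any Speech SDK.

```python
import xml.etree.ElementTree as ET

# Subset of the containment rules from the list above, keyed by local name.
# Elements missing from this table are simply not checked by the sketch.
ALLOWED_CHILDREN = {
    "p": {"audio", "break", "phoneme", "prosody", "say-as", "sub", "express-as", "s"},
    "s": {"audio", "break", "phoneme", "prosody", "say-as", "express-as", "sub"},
    "say-as": set(),    # text only
    "sub": set(),       # text only
    "phoneme": set(),   # text only
    "break": set(),     # no text, no children
    "bookmark": set(),  # no text, no children
}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def find_nesting_errors(ssml):
    """Return a message for each child element that its parent may not contain."""
    root = ET.fromstring(ssml)
    errors = []
    for parent in root.iter():
        parent_name = local_name(parent.tag)
        allowed = ALLOWED_CHILDREN.get(parent_name)
        if allowed is None:
            continue  # parent is outside the subset covered here
        for child in parent:
            child_name = local_name(child.tag)
            if child_name not in allowed:
                errors.append(f"<{child_name}> is not allowed inside <{parent_name}>")
    return errors
```

An empty list means no violation was found among the rules the table covers; it doesn't prove that the service accepts the document.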
The Speech service automatically handles punctuation as appropriate, such as pausing after a period, or using the correct intonation when a sentence ends with a question mark.
Special characters
To use the characters &, <, and > within an SSML element's value or text, you must use the entity format. Specifically, use &amp; in place of &, use &lt; in place of <, and use &gt; in place of >. Otherwise the SSML isn't parsed correctly.
For example, specify green &amp; yellow instead of green & yellow. The following SSML is parsed as expected:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaNeural">
My favorite colors are green &amp; yellow.
</voice>
</speak>
Special characters such as quotation marks, apostrophes, and brackets must be escaped. For more information, see Extensible Markup Language (XML) 1.0: Appendix D.
Double or single quotation marks must enclose the attribute values. For example, <prosody volume="90"> and <prosody volume='90'> are well-formed, valid elements, but <prosody volume=90> isn't recognized.
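When SSML is assembled from arbitrary input text, the escaping and quoting rules above are easiest to satisfy with a library rather than by hand. The following is a minimal Python sketch using the standard library's xml.sax.saxutils module; build_ssml is a hypothetical helper name, and the voice and locale defaults are taken from the examples in this article.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-AvaNeural", lang="en-US"):
    """Wrap plain text in a minimal SSML document.

    escape() replaces &, <, and > with &amp;, &lt;, and &gt;.
    quoteattr() escapes an attribute value and wraps it in quotation marks.
    """
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


print(build_ssml("My favorite colors are green & yellow."))
```

Escaping at the point where text enters the document keeps the rest of the application free to work with unescaped strings.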
Speak root element
The speak element contains information such as version, language, and the markup vocabulary definition. The speak element is the root element that's required for all SSML documents. You must specify the default language within the speak element, whether or not the language is adjusted elsewhere such as within the lang element.
Here's the syntax for the speak element:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="string"></speak>
Attribute | Description | Required or optional |
---|---|---|
version | Indicates the version of the SSML specification used to interpret the document markup. The current version is "1.0". | Required |
xml:lang | The language of the root document. The value can contain a language code such as en (English), or a locale such as en-US (English - United States). | Required |
xmlns | The URI to the document that defines the markup vocabulary (the element types and attribute names) of the SSML document. The current URI is "http://www.w3.org/2001/10/synthesis". | Required |
The speak element must contain at least one voice element.
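The requirements on the root element can be verified locally before a request is made. The following Python sketch is an illustration under stated assumptions, not a service-side validator: check_speak_root is a hypothetical helper, and it only tests the requirements described above (namespace, version, xml:lang, and at least one voice element).

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def check_speak_root(ssml):
    """Return a list of problems found on the root element (empty if none)."""
    root = ET.fromstring(ssml)
    problems = []
    if root.tag != f"{{{SSML_NS}}}speak":
        problems.append("root must be <speak> in the SSML namespace")
    if root.get("version") != "1.0":
        problems.append('version must be "1.0"')
    if not root.get(XML_LANG):
        problems.append("xml:lang is required")
    if root.find(f"{{{SSML_NS}}}voice") is None:
        problems.append("at least one <voice> element is required")
    return problems
```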
speak examples
The supported values for attributes of the speak element were described previously.
Single voice example
This example uses the en-US-AvaNeural voice. For more examples, see voice examples.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaNeural">
This is the text that is spoken.
</voice>
</speak>
Add a break
Use the break element to override the default behavior of breaks or pauses between words. Otherwise the Speech service automatically inserts pauses.
The break element's attributes are described in the following table.
Attribute | Description | Required or optional |
---|---|---|
strength | The relative duration of a pause, specified with one of the following values: x-weak, weak, medium (default), strong, or x-strong. | Optional |
time | The absolute duration of a pause in seconds (such as 2s) or milliseconds (such as 500ms). Valid values range from 0 to 20000 milliseconds. If you set a value greater than the supported maximum, the service uses 20000ms. If the time attribute is set, the strength attribute is ignored. | Optional |
Here are more details about the strength attribute.
Strength | Relative duration |
---|---|
X-weak | 250 ms |
Weak | 500 ms |
Medium | 750 ms |
Strong | 1,000 ms |
X-strong | 1,250 ms |
Break examples
The supported values for attributes of the break element were described previously. The following three ways all add 750 ms breaks.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaNeural">
Welcome <break /> to text to speech.
Welcome <break strength="medium" /> to text to speech.
Welcome <break time="750ms" /> to text to speech.
</voice>
</speak>
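The interaction between strength, time, and the 20000 ms ceiling can be summarized in a short Python sketch. It only models the behavior described in this section and doesn't call the Speech service. The function names are hypothetical, and the medium default is inferred from the example above, where a bare break element produces a 750 ms pause.

```python
# Durations from the strength table in this section, in milliseconds.
STRENGTH_MS = {
    "x-weak": 250,
    "weak": 500,
    "medium": 750,
    "strong": 1000,
    "x-strong": 1250,
}
MAX_BREAK_MS = 20000  # larger time values are replaced by this maximum


def parse_duration_ms(value):
    """Convert a duration such as '2s' or '500ms' to milliseconds."""
    value = value.strip()
    if value.endswith("ms"):  # test 'ms' first, because it also ends with 's'
        return float(value[:-2])
    if value.endswith("s"):
        return float(value[:-1]) * 1000
    raise ValueError(f"unsupported duration: {value!r}")


def effective_break_ms(strength=None, time=None):
    """Return the pause length implied by a break element's attributes."""
    if time is not None:
        # When time is set, strength is ignored.
        return min(parse_duration_ms(time), MAX_BREAK_MS)
    return STRENGTH_MS[strength or "medium"]
```

For instance, effective_break_ms(strength="x-weak", time="2s") follows the time attribute and yields 2000 ms.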
Add silence
Use the mstts:silence element to insert pauses before or after text, or between two adjacent sentences.
One of the differences between mstts:silence and break is that a break element can be inserted anywhere in the text. Silence only works at the beginning or end of input text or at the boundary of two adjacent sentences.
The silence setting is applied to all input text within its enclosing voice element. To reset or change the silence setting again, you must use a new voice element with either the same voice or a different voice.
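Because a silence setting covers its whole voice element, text that needs different settings has to be split across several voice elements. The following Python sketch shows one way to generate such a document; voice_with_silence and speak are hypothetical helper names, and the 200ms and 800ms values are arbitrary illustrations.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "http://www.w3.org/2001/mstts"


def voice_with_silence(voice, silence_type, silence_value, text):
    """Return one voice element whose text shares a single silence setting."""
    return (
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:silence type={quoteattr(silence_type)} "
        f"value={quoteattr(silence_value)}/>"
        f"{escape(text)}"
        "</voice>"
    )


def speak(*voice_elements, lang="en-US"):
    """Wrap voice elements in a speak root that declares the mstts prefix."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xmlns:mstts="{MSTTS_NS}" '
        f"xml:lang={quoteattr(lang)}>"
        + "".join(voice_elements)
        + "</speak>"
    )


# Same voice twice: the second voice element changes the silence setting.
ssml = speak(
    voice_with_silence("en-US-AvaNeural", "Sentenceboundary", "200ms",
                       "First part. Short pauses here."),
    voice_with_silence("en-US-AvaNeural", "Sentenceboundary", "800ms",
                       "Second part. Longer pauses here."),
)
print(ssml)
```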
The mstts:silence element's attributes are described in the following table.
Attribute | Description | Required or optional |
---|---|---|
type | Specifies where and how to add silence. Supported silence types include Leading, Leading-exact, Sentenceboundary, Comma-exact, Semicolon-exact, and Enumerationcomma-exact. An absolute silence type (with the -exact suffix) replaces any otherwise natural leading or trailing silence. Absolute silence types take precedence over the corresponding non-absolute type. For example, if you set both Leading and Leading-exact types, the Leading-exact type takes effect. The WordBoundary event takes precedence over punctuation-related silence settings including Comma-exact, Semicolon-exact, or Enumerationcomma-exact. When you use both the WordBoundary event and punctuation-related silence settings, the punctuation-related silence settings don't take effect. | Required |
value | The duration of a pause in seconds (such as 2s) or milliseconds (such as 500ms). Valid values range from 0 to 20000 milliseconds. If you set a value greater than the supported maximum, the service uses 20000ms. | Required |
mstts silence examples
The supported values for attributes of the mstts:silence element were described previously.
In this example, mstts:silence is used to add 200 ms of silence between two sentences.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-AvaNeural">
<mstts:silence type="Sentenceboundary" value="200ms"/>
If we're home schooling, the best we can do is roll with what each day brings and try to have fun along the way.
A good place to start is by trying out the slew of educational apps that are helping children stay happy and smash their schooling at the same time.
</voice>
</speak>
In this example, mstts:silence is used to add 50 ms of silence at the comma, 100 ms of silence at the semicolon, and 150 ms of silence at the enumeration comma.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="zh-CN">
<voice name="zh-CN-YunxiNeural">
<mstts:silence type="comma-exact" value="50ms"/><mstts:silence type="semicolon-exact" value="100ms"/><mstts:silence type="enumerationcomma-exact" value="150ms"/>你好呀,云希、晓晓;你好呀。
</voice>
</speak>
Specify paragraphs and sentences
The p and s elements are used to denote paragraphs and sentences, respectively. In the absence of these elements, the Speech service automatically determines the structure of the SSML document.
Paragraph and sentence examples
The following example defines two paragraphs that each contain sentences. In the second paragraph, the Speech service automatically determines the sentence structure, because the sentences aren't marked up in the SSML document.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaNeural">
<p>
<s>Introducing the sentence element.</s>
<s>Used to mark individual sentences.</s>
</p>
<p>
Another simple paragraph.
Sentence structure in this paragraph is not explicitly marked.
</p>
</voice>
</speak>
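If an application already knows its paragraph and sentence boundaries, it can emit explicit p and s elements instead of relying on automatic detection. The following Python sketch assumes the input is a list of paragraphs, each a list of sentence strings; paragraphs_to_ssml is a hypothetical helper, and its output is meant to be placed inside a voice element.

```python
from xml.sax.saxutils import escape


def paragraphs_to_ssml(paragraphs):
    """Render a list of paragraphs (each a list of sentences) as p/s markup."""
    parts = []
    for sentences in paragraphs:
        # Each sentence is escaped and wrapped in its own s element.
        body = "".join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
        parts.append(f"<p>{body}</p>")
    return "".join(parts)


print(paragraphs_to_ssml([
    ["Introducing the sentence element.", "Used to mark individual sentences."],
]))
```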
Bookmark element
You can use the bookmark element in SSML to reference a specific location in the text or tag sequence. Then use the Speech SDK and subscribe to the BookmarkReached event to get the offset of each marker in the audio stream. The bookmark element isn't spoken. For more information, see Subscribe to synthesizer events.
The bookmark element's attributes are described in the following table.
Attribute | Description | Required or optional |
---|---|---|
mark | The reference text of the bookmark element. | Required |
Bookmark examples
The supported values for attributes of the bookmark element were described previously.
As an example, you might want to know the time offset of each flower word in the following snippet:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaNeural">
We are selling <bookmark mark='flower_1'/>roses and <bookmark mark='flower_2'/>daisies.
</voice>
</speak>
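Before subscribing to the BookmarkReached event, it can help to know which mark values a document defines, so that each event can be matched to an expected name. The following Python sketch only inspects the SSML text with the standard library and doesn't use the Speech SDK; bookmark_marks is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def bookmark_marks(ssml):
    """Return the mark attribute of every bookmark element, in document order."""
    root = ET.fromstring(ssml)
    return [b.get("mark") for b in root.iter(f"{{{SSML_NS}}}bookmark")]


ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    '<voice name="en-US-AvaNeural">'
    "We are selling <bookmark mark='flower_1'/>roses and "
    "<bookmark mark='flower_2'/>daisies."
    "</voice></speak>"
)
print(bookmark_marks(ssml))
```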
Viseme element
A viseme is the visual description of a phoneme in spoken language. It defines the position of the face and mouth while a person is speaking. You can use the mstts:viseme element in SSML to request viseme output. For more information, see Get facial position with viseme.
The viseme setting is applied to all input text within its enclosing voice element. To reset or change the viseme setting again, you must use a new voice element with either the same voice or a different voice.
The mstts:viseme element's attributes are described in the following table.
Attribute | Description | Required or optional |
---|---|---|
type | The type of viseme output: redlips_front or FacialExpression. | Required |
Note
Currently, redlips_front only supports neural voices in the en-US locale, and FacialExpression supports neural voices in the en-US and zh-CN locales.
Viseme examples
The supported values for attributes of the mstts:viseme element were described previously.
This SSML snippet illustrates how to request blend shapes with your synthesized speech.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-AvaNeural">
<mstts:viseme type="FacialExpression"/>
Rainbow has seven colors: Red, orange, yellow, green, blue, indigo, and violet.
</voice>
</speak>