Pronunciation with SSML
You can use Speech Synthesis Markup Language (SSML) with text to speech to specify how the speech is pronounced. For example, you can use SSML with phonemes and a custom lexicon to improve pronunciation. You can also use SSML to define how a word or mathematical expression is pronounced.
Refer to the following sections for details about how to use SSML elements to improve pronunciation. For more information about SSML syntax, see SSML document structure and events.
phoneme element
The phoneme element is used for phonetic pronunciation in SSML documents. Always provide human-readable speech as a fallback.
Phonetic alphabets are composed of phones, which are made up of letters, numbers, or characters, sometimes in combination. Each phone describes a unique sound of speech. The phonetic alphabet is in contrast to the Latin alphabet, where any letter might represent multiple spoken sounds. Consider the different en-US pronunciations of the letter "c" in the words "candy" and "cease," or the different pronunciations of the letter combination "th" in the words "thing" and "those."
Note
For a list of locales that support phonemes, see footnotes in the language support table.
The phoneme element's attributes are described in the following table.
| Attribute | Description | Required or optional |
|---|---|---|
| alphabet | The phonetic alphabet to use when you synthesize the pronunciation of the string in the ph attribute. Specify the alphabet in lowercase letters. The possible values are ipa, sapi, ups, and x-sampa, as shown in the examples that follow. The alphabet applies only to the phoneme in the element. | Optional |
| ph | A string containing phones that specify the pronunciation of the word in the phoneme element. If the specified string contains unrecognized phones, text to speech rejects the entire SSML document and produces none of the speech output specified in the document. For ipa, to stress a syllable by placing a stress symbol before it, you must mark all syllables of the word; otherwise, the syllable before the stress symbol is stressed. For sapi, to stress a syllable, place the stress symbol after that syllable, whether or not all syllables of the word are marked. | Required |
phoneme examples
The supported values for attributes of the phoneme element were described previously. In the first two examples, the values ph="tə.ˈmeɪ.toʊ" and ph="təmeɪˈtoʊ" are specified to stress the syllable meɪ.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaNeural">
<phoneme alphabet="ipa" ph="tə.ˈmeɪ.toʊ"> tomato </phoneme>
</voice>
</speak>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaNeural">
<phoneme alphabet="ipa" ph="təmeɪˈtoʊ"> tomato </phoneme>
</voice>
</speak>
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaNeural">
<phoneme alphabet="sapi" ph="iy eh n y uw eh s"> en-US </phoneme>
</voice>
</speak>
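The preceding sapi example doesn't mark stress. As a minimal sketch of the stress rule from the attribute table, placing the stress symbol after a syllable stresses that syllable. The phone string here is illustrative and follows the syllable-boundary ("-") and stress ("1") conventions shown in the x-microsoft-sapi lexicon example later in this article:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaNeural">
<!-- The stress symbol "1" after the syllable "m ey" stresses that syllable (illustrative phone string). -->
<phoneme alphabet="sapi" ph="t ax - m ey 1 - t ow"> tomato </phoneme>
</voice>
</speak>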
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaNeural">
<s>His name is Mike <phoneme alphabet="ups" ph="JH AU"> Zhou </phoneme></s>
</voice>
</speak>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaNeural">
<phoneme alphabet='x-sampa' ph='he."lou'>hello</phoneme>
</voice>
</speak>
Custom lexicon
You can define how single entities (such as a company name, a medical term, or an emoji) are read in SSML by using the phoneme and sub elements. To define how multiple entities are read, create an XML structured custom lexicon file. Then upload the custom lexicon XML file and reference it with the SSML lexicon element.
Note
For a list of locales that support custom lexicon, see footnotes in the language support table.
The lexicon element isn't supported by the Long Audio API. For long-form text to speech, use the batch synthesis API (Preview) instead.
The lexicon element's attributes are described in the following table.
| Attribute | Description | Required or optional |
|---|---|---|
| uri | The URI of the publicly accessible custom lexicon XML file with either the .xml or .pls file extension. Using Azure Blob Storage is recommended but not required. For more information about the custom lexicon file, see Pronunciation Lexicon Specification (PLS) Version 1.0. | Required |
Custom lexicon examples
The supported values for attributes of the lexicon element were described previously.
After you publish your custom lexicon, you can reference it from your SSML. The following SSML example references a custom lexicon that was uploaded to https://www.example.com/customlexicon.xml. We support lexicon URLs from Azure Blob Storage, Azure Media Services (AMS) Storage, and GitHub. However, other public URLs might not be compatible.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="http://www.w3.org/2001/mstts"
xml:lang="en-US">
<voice name="en-US-AvaNeural">
<lexicon uri="https://www.example.com/customlexicon.xml"/>
BTW, we will be there probably at 8:00 tomorrow morning.
Could you help leave a message to Robert Benigni for me?
</voice>
</speak>
Custom lexicon file
To define how multiple entities are read, you can define them in a custom lexicon XML file with either the .xml or .pls file extension.
Note
The custom lexicon file is a valid XML document, but it cannot be used as an SSML document.
Here are some limitations of the custom lexicon file:
- File size: The custom lexicon file size is limited to a maximum of 100 KB. If the file size exceeds the 100-KB limit, the synthesis request fails. If your lexicon is larger than 100 KB, you can split it into multiple lexicons and include them all in SSML, as shown in the sketch after this list.
- Lexicon cache refresh: The custom lexicon is cached with the URI as the key when text to speech first loads it. A lexicon with the same URI isn't reloaded within 15 minutes, so a change to the custom lexicon can take up to 15 minutes to take effect.
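As a minimal sketch of the split-lexicon approach (the two part URLs are hypothetical), each part is referenced with its own lexicon element:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaNeural">
<!-- Each lexicon element references one part of the split lexicon (hypothetical URLs). -->
<lexicon uri="https://www.example.com/customlexicon-part1.xml"/>
<lexicon uri="https://www.example.com/customlexicon-part2.xml"/>
BTW, we will be there probably at 8:00 tomorrow morning.
</voice>
</speak>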
The supported elements and attributes of a custom lexicon XML file are described in the Pronunciation Lexicon Specification (PLS) Version 1.0. Here are some examples of the supported elements and attributes:
- The lexicon element contains at least one lexeme element. The lexicon element contains the necessary xml:lang attribute to indicate which locale the lexicon applies to. One custom lexicon is limited to one locale by design, so if you apply it to a different locale, it doesn't work. The lexicon element also has an alphabet attribute to indicate the alphabet used in the lexicon. The possible values are ipa and x-microsoft-sapi.
- Each lexeme element contains at least one grapheme element and one or more grapheme, alias, and phoneme elements. The lexeme element is case sensitive in the custom lexicon. For example, if you only provide a phoneme for the lexeme "Hello", it doesn't work for the lexeme "hello".
- The grapheme element contains text that describes the orthography.
- The alias elements are used to indicate the pronunciation of an acronym or an abbreviated term.
- The phoneme element provides text that describes how the lexeme is pronounced. The syllable boundary is '.' in the IPA alphabet. The phoneme element can't contain white space when you use the IPA alphabet.
- When the alias and phoneme elements are provided for the same grapheme element, alias has higher priority.
Microsoft provides a validation tool for the custom lexicon that helps you find errors (with detailed error messages) in the custom lexicon file. Using the tool is recommended before you use the custom lexicon XML file in production with the Speech service.
Custom lexicon file examples
The following XML example (not SSML) would be contained in a custom lexicon .xml file. When you use this custom lexicon, "BTW" is read as "By the way." "Benigni" is read with the provided IPA "bɛˈniːnji."
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="ipa" xml:lang="en-US">
<lexeme>
<grapheme>BTW</grapheme>
<alias>By the way</alias>
</lexeme>
<lexeme>
<grapheme>Benigni</grapheme>
<phoneme>bɛˈniːnji</phoneme>
</lexeme>
<lexeme>
<grapheme>😀</grapheme>
<alias>test emoji</alias>
</lexeme>
</lexicon>
You can't directly set the pronunciation of a phrase by using the custom lexicon. If you need to set the pronunciation for an acronym or an abbreviated term, first provide an alias, and then associate the phoneme with that alias. For example:
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="ipa" xml:lang="en-US">
<lexeme>
<grapheme>Scotland MV</grapheme>
<alias>ScotlandMV</alias>
</lexeme>
<lexeme>
<grapheme>ScotlandMV</grapheme>
<phoneme>ˈskɒtlənd.ˈmiːdiəm.weɪv</phoneme>
</lexeme>
</lexicon>
You could also directly provide your expected alias for the acronym or abbreviated term. For example:
<lexeme>
<grapheme>Scotland MV</grapheme>
<alias>Scotland Media Wave</alias>
</lexeme>
The preceding custom lexicon XML file examples use the IPA alphabet, which is also known as the IPA phone set. We suggest that you use the IPA because it's the international standard. Some IPA characters have both a "precomposed" and a "decomposed" version when they're represented with Unicode. The custom lexicon only supports the decomposed Unicode characters. For example, the precomposed code point U+00E3 (ã) and the decomposed sequence U+0061 U+0303 (a followed by a combining tilde) display the same character, but only the decomposed form works in the custom lexicon.
The Speech service defines a phonetic set for these locales: en-US, fr-FR, de-DE, es-ES, ja-JP, zh-CN, zh-HK, and zh-TW. For more information on the detailed Speech service phonetic alphabet, see the Speech service phonetic sets.
You can use x-microsoft-sapi as the value for the alphabet attribute with custom lexicons, as demonstrated here:
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="x-microsoft-sapi" xml:lang="en-US">
<lexeme>
<grapheme>BTW</grapheme>
<alias> By the way </alias>
</lexeme>
<lexeme>
<grapheme> Benigni </grapheme>
<phoneme> b eh 1 - n iy - n y iy </phoneme>
</lexeme>
</lexicon>
say-as element
The say-as element indicates the content type, such as number or date, of the element's text. This element provides guidance to the speech synthesis engine about how to pronounce the text.
The say-as element's attributes are described in the following table.
| Attribute | Description | Required or optional |
|---|---|---|
| interpret-as | Indicates the content type of an element's text. For a list of types, see the following table. | Required |
| format | Provides additional information about the precise formatting of the element's text for content types that might have ambiguous formats. SSML defines formats for content types that use them. See the following table. | Optional |
| detail | Indicates the level of detail to be spoken. For example, this attribute might request that the speech synthesis engine pronounce punctuation marks. There are no standard values defined for detail. | Optional |
The following content types are supported for the interpret-as and format attributes. Include the format attribute only if the format column isn't empty in this table.
Note
The characters and spell-out values for the interpret-as attribute are supported for all text to speech locales. Other interpret-as attribute values are supported for all locales of the following languages: Arabic, Catalan, Chinese, Danish, Dutch, English, French, Finnish, German, Hindi, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, and Swedish.
| interpret-as | format | Interpretation |
|---|---|---|
| characters, spell-out | | The text is spoken as individual letters (spelled out). The speech synthesis engine pronounces: <say-as interpret-as="characters">test</say-as> As "T E S T." |
| cardinal, number | None | The text is spoken as a cardinal number. The speech synthesis engine pronounces: There are <say-as interpret-as="cardinal">10</say-as> options As "There are ten options." |
| ordinal | None | The text is spoken as an ordinal number. The speech synthesis engine pronounces: Select the <say-as interpret-as="ordinal">3rd</say-as> option As "Select the third option." |
| number_digit | None | The text is spoken as a sequence of individual digits. The speech synthesis engine pronounces: <say-as interpret-as="number_digit">123456789</say-as> As "1 2 3 4 5 6 7 8 9." |
| fraction | None | The text is spoken as a fractional number. The speech synthesis engine pronounces: <say-as interpret-as="fraction">3/8</say-as> of an inch As "three eighths of an inch." |
| date | dmy, mdy, ymd, ydm, ym, my, md, dm, d, m, y | The text is spoken as a date. The format attribute specifies the date's format (d=day, m=month, and y=year). The speech synthesis engine pronounces: Today is <say-as interpret-as="date">10-12-2016</say-as> As "Today is October twelfth two thousand sixteen." It pronounces: Today is <say-as interpret-as="date" format="dmy">10-12-2016</say-as> As "Today is December tenth two thousand sixteen." |
| time | hms12, hms24 | The text is spoken as a time. The format attribute specifies whether the time is specified by using a 12-hour clock (hms12) or a 24-hour clock (hms24). Use a colon to separate numbers representing hours, minutes, and seconds. Here are some valid time examples: 12:35, 1:14:32, 08:15, and 02:50:45. The speech synthesis engine pronounces: The train departs at <say-as interpret-as="time" format="hms12">4:00am</say-as> As "The train departs at four A M." |
| duration | hms, hm, ms | The text is spoken as a duration. The format attribute specifies the duration's format (h=hour, m=minute, and s=second). The speech synthesis engine pronounces: <say-as interpret-as="duration">01:18:30</say-as> As "one hour eighteen minutes and thirty seconds." It pronounces: <say-as interpret-as="duration" format="ms">01:18</say-as> As "one minute and eighteen seconds." This tag is only supported in English and Spanish. |
| telephone | None | The text is spoken as a telephone number. The speech synthesis engine pronounces: The number is <say-as interpret-as="telephone">(888) 555-1212</say-as> As "The number is area code eight eight eight five five five one two one two." |
| currency | None | The text is spoken as a currency. The speech synthesis engine pronounces: <say-as interpret-as="currency">99.9 USD</say-as> As "ninety-nine US dollars and ninety cents." |
| address | None | The text is spoken as an address. The speech synthesis engine pronounces: I'm at <say-as interpret-as="address">150th CT NE, Redmond, WA</say-as> As "I'm at 150th Court Northeast Redmond Washington." |
| name | None | The text is spoken as a person's name. The speech synthesis engine pronounces: <say-as interpret-as="name">ED</say-as> As [æd]. In Chinese names, some characters are pronounced differently when they appear in a family name. For example, the speech synthesis engine says 仇 in <say-as interpret-as="name">仇先生</say-as> As [qiú] instead of [chóu]. |
say-as examples
The supported values for attributes of the say-as element were described previously.
The speech synthesis engine speaks the following example as "Your first request was for one room on October nineteenth twenty ten with early arrival at twelve thirty five PM."
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaMultilingualNeural">
<p>
Your <say-as interpret-as="ordinal"> 1st </say-as> request was for <say-as interpret-as="cardinal"> 1 </say-as> room
on <say-as interpret-as="date" format="mdy"> 10/19/2010 </say-as>, with early arrival at <say-as interpret-as="time" format="hms12"> 12:35pm </say-as>.
</p>
</voice>
</speak>
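The table above only demonstrates the hms12 time format. As a minimal sketch of the hms24 format, a 24-hour time can be marked up as follows; the expected output is an assumption based on the format's description:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaMultilingualNeural">
<!-- hms24 interprets the value on a 24-hour clock; likely read along the lines of "fourteen thirty". -->
The meeting starts at <say-as interpret-as="time" format="hms24">14:30</say-as>.
</voice>
</speak>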
sub element
Use the sub element to indicate that the alias attribute's text value should be pronounced instead of the element's enclosed text. In this way, the SSML contains both a spoken and a written form.
The sub element's attributes are described in the following table.
| Attribute | Description | Required or optional |
|---|---|---|
| alias | The text value that should be pronounced instead of the element's enclosed text. | Required |
sub examples
The supported values for attributes of the sub element were described previously.
The speech synthesis engine speaks the following example as "World Wide Web Consortium."
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AvaMultilingualNeural">
<sub alias="World Wide Web Consortium">W3C</sub>
</voice>
</speak>
Pronunciation with MathML
The Mathematical Markup Language (MathML) is an XML-compliant markup language that describes mathematical content and structure. The Speech service can use MathML as input text to properly pronounce mathematical notations in the output audio.
Note
The MathML elements (tags) are currently supported in the following locales: de-DE, en-AU, en-GB, en-US, es-ES, es-MX, fr-CA, fr-FR, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.
All elements from the MathML 2.0 and MathML 3.0 specifications are supported, except the MathML 3.0 Elementary Math elements.
Take note of these MathML elements and attributes:
- The xmlns attribute in <math xmlns="http://www.w3.org/1998/Math/MathML"> is optional.
- The semantics, annotation, and annotation-xml elements don't output speech, so they're ignored.
- If an element isn't recognized, it's ignored, and the child elements within it are still processed.
The XML syntax doesn't support the MathML entities, so you must use the corresponding Unicode characters to represent the entities. For example, the entity &copy; should be represented by its Unicode character ©; otherwise, an error occurs.
MathML examples
The text to speech output for this example is "a squared plus b squared equals c squared".
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
<voice name='en-US-JennyNeural'>
<math xmlns='http://www.w3.org/1998/Math/MathML'>
<msup>
<mi>a</mi>
<mn>2</mn>
</msup>
<mo>+</mo>
<msup>
<mi>b</mi>
<mn>2</mn>
</msup>
<mo>=</mo>
<msup>
<mi>c</mi>
<mn>2</mn>
</msup>
</math>
</voice>
</speak>
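As another sketch under the MathML 2.0 coverage described earlier, a simple fraction built with the mfrac element would likely be read along the lines of "one half"; the exact output wording is an assumption.
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
<voice name='en-US-JennyNeural'>
<math xmlns='http://www.w3.org/1998/Math/MathML'>
<!-- mfrac renders a fraction: numerator 1 over denominator 2. -->
<mfrac>
<mn>1</mn>
<mn>2</mn>
</mfrac>
</math>
</voice>
</speak>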