Hello @Pawel Pelka ,
Welcome to Microsoft Q&A. Thank you for reaching out to us.
Speech synthesis uses Speech Synthesis Markup Language (SSML), which is an XML-based format that controls pronunciation, pacing, and structure of the generated audio. During this process, input text undergoes normalization and tokenization before being converted into speech. Some Unicode symbols do not consistently map to speech tokens, which can result in silent skipping or truncation without explicit API errors.
To answer your questions:
- There is currently no officially published list of unsupported Unicode characters for text-to-speech or SSML. Symbol support depends on internal normalization and speech mapping, and may vary across different characters even within the same Unicode range.
- The following practices help ensure stable behavior and prevent silent truncation. Preprocessing the input text is highly recommended for production scenarios, because converting symbols into speech-friendly text gives consistent and complete audio output. Suggested approaches include:
- Replace symbolic characters with descriptive text
- ← becomes “left arrow”
- → becomes “right arrow”
- ↑ becomes “up arrow”
- ↓ becomes “down arrow”
- Replace punctuation such as the em dash with speech-compatible alternatives: use a comma or a sentence break, or insert a controlled pause using SSML
- Use SSML substitution when the meaning must be preserved. For example: <sub alias="left arrow">←</sub>
- Prefer explicit pauses using the SSML <break> element instead of relying on punctuation for timing control. SSML supports inserting pauses directly in the speech sequence
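As a rough illustration of the preprocessing described above, here is a minimal Python sketch that replaces symbols with descriptive text and swaps em dashes for a comma pause. The names SYMBOL_WORDS and to_speech_friendly are hypothetical, and the mapping only covers the four arrows mentioned here; you would extend it with whatever symbols appear in your own content.

```python
import re

# Hypothetical mapping (an assumption, not an official list):
# extend it with the symbols that occur in your own input.
SYMBOL_WORDS = {
    "←": "left arrow",
    "→": "right arrow",
    "↑": "up arrow",
    "↓": "down arrow",
}


def to_speech_friendly(text: str) -> str:
    """Return text with mapped symbols spelled out and em dashes replaced."""
    # Replace an em dash (U+2014) and any surrounding spaces with a comma pause.
    text = re.sub(r"\s*\u2014\s*", ", ", text)
    # Spell out each mapped symbol, padded with spaces so words do not fuse.
    for symbol, words in SYMBOL_WORDS.items():
        text = text.replace(symbol, f" {words} ")
    # Collapse the whitespace runs introduced by the padding.
    return re.sub(r"\s+", " ", text).strip()
```

Running the text through a function like this before it reaches the synthesizer means the service only ever sees plain words and ordinary punctuation.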
- Voice and locale differences can influence how text is normalized and spoken. SSML supports multiple voices, languages, and speech configurations, and each may process input text slightly differently. While some characters may work in specific combinations, the behavior is not deterministic enough to rely on for consistent output. Unsupported or weakly supported symbols may not always be gracefully skipped, and in some cases can disrupt the synthesis stream. Since such conditions do not always return explicit errors, they appear as audio gaps or truncation. To ensure reliable results, please consider the following approaches:
- Normalize input text (Unicode normalization before synthesis)
- Replace or map symbolic Unicode characters to natural language equivalents
- Use SSML elements such as <break> for pauses and <sub> for substitutions
- Avoid passing raw symbolic characters directly to speech synthesis pipelines
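The approaches above could be combined along these lines. This is only a sketch under a few assumptions: build_ssml and ALIASES are hypothetical names, NFKC is one reasonable choice of Unicode normalization form, and the voice name and 300 ms pause are example values you would replace with your own.

```python
import unicodedata
from xml.sax.saxutils import escape, quoteattr

# Hypothetical alias table (an assumption): symbols that should be read aloud.
ALIASES = {
    "←": "left arrow",
    "→": "right arrow",
    "↑": "up arrow",
    "↓": "down arrow",
}


def build_ssml(sentences, voice="en-US-JennyNeural", lang="en-US", pause_ms=300):
    """Build one SSML document from sentences, with a <break> between them."""
    parts = []
    for sentence in sentences:
        # Unicode normalization before synthesis (NFKC chosen as an example).
        normalized = unicodedata.normalize("NFKC", sentence)
        # Wrap mapped symbols in <sub>; XML-escape every other character.
        chunk = "".join(
            f"<sub alias={quoteattr(ALIASES[ch])}>{ch}</sub>"
            if ch in ALIASES
            else escape(ch)
            for ch in normalized
        )
        parts.append(chunk)
    # Explicit pauses instead of relying on punctuation for timing.
    body = f'<break time="{pause_ms}ms"/>'.join(parts)
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f'<voice name="{voice}">{body}</voice></speak>'
    )
```

The resulting string can then be passed to the Speech SDK's SSML synthesis call (speak_ssml_async in the Python SDK). Escaping the text also matters here, since a stray & or < would otherwise make the SSML invalid.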
The following references might be helpful; please check them out:
- Speech Synthesis Markup Language (SSML) overview - Speech service - Foundry Tools | Microsoft Learn
- Speech Synthesis Markup Language (SSML) document structure and events - Speech service - Azure AI services | Azure Docs
- Voice and sound with Speech Synthesis Markup Language (SSML) - Speech service - Azure AI services | Azure Docs
Thank you