Azure TTS: Non-speech audio bytes at the beginning and end of synthesized speech

Tom Westrick 20 Reputation points
2025-03-06T21:22:56.8833333+00:00

We use the Azure TTS REST API to generate audio for one of our products. According to our logs, starting on February 28, 2025, the returned audio began to include non-speech bytes (two audio blips) at the beginning and end when using the voice zh-CN-XiaochenMultilingualNeural in English. I have an example MP3 file, but it seems audio files cannot be uploaded here.

Here is an example request to replicate the issue:

POST /cognitiveservices/v1 HTTP/1.1
Host: eastus.tts.speech.microsoft.com
Content-Type: application/ssml+xml
X-Microsoft-OutputFormat: audio-48khz-192kbitrate-mono-mp3
Content-Length: 309

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice xml:lang="en-US" name="zh-CN-XiaochenMultilingualNeural">
        <lang xml:lang="en-US">
            Good morning, this is for testing.
        </lang>
    </voice>
</speak>

After a lot of trial and error, we found that removing all line breaks and extra white space from the XML prevents the non-speech bytes from being generated. This seems like a workaround rather than a permanent fix.

This works as expected:

POST /cognitiveservices/v1 HTTP/1.1
Host: eastus.tts.speech.microsoft.com
Content-Type: application/ssml+xml
X-Microsoft-OutputFormat: audio-48khz-192kbitrate-mono-mp3
Content-Length: 267

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US"><voice xml:lang="en-US" name="zh-CN-XiaochenMultilingualNeural"><lang xml:lang="en-US">Good morning, this is for testing.</lang></voice></speak>
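
The whitespace removal described above can be automated before the request is sent. The following is a minimal Python sketch, not an official fix: `minify_ssml` is a hypothetical helper name, and it assumes a simple document like the one in this question, where the only line breaks are the indentation between tags and around the spoken text. It removes whitespace runs that contain a line break and touch a tag, and it leaves single spaces inside the spoken text untouched.

```python
import re

# The pretty-printed SSML from the failing request in this thread.
PRETTY = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice xml:lang="en-US" name="zh-CN-XiaochenMultilingualNeural">
        <lang xml:lang="en-US">
            Good morning, this is for testing.
        </lang>
    </voice>
</speak>"""


def minify_ssml(ssml: str) -> str:
    """Hypothetical helper: strip line breaks and indentation from SSML."""
    s = ssml.strip()
    # Drop a line break (plus surrounding indentation) that follows a tag.
    s = re.sub(r">[^\S\n]*\n\s*", ">", s)
    # Drop a line break (plus surrounding indentation) that precedes a tag.
    s = re.sub(r"\s*\n[^\S\n]*<", "<", s)
    # Any line break still left sits inside text; fold it into one space.
    s = re.sub(r"\s*\n\s*", " ", s)
    return s


compact = minify_ssml(PRETTY)
print(compact)
```

For the request above, the result is the same single-line body as the working request. Documents with attribute values that span lines or that contain a `>` character would need a real XML parser instead of regular expressions.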

As noted above, this only started recently, and all the other Azure voices we use work fine.

My questions: can this be confirmed as a bug with this specific voice? And can my fix be considered a permanent solution, or is it only by chance that it resolves the issue?

The docs show requests made with XML that contains line breaks and white space.


Accepted answer
  1. santoshkc 15,325 Reputation points Microsoft External Staff Moderator
    2025-03-12T11:16:35.5966667+00:00

    Hi @Tom Westrick

    Apologies for the delay in response and thank you for reporting this issue. The behavior you’re experiencing with the zh-CN-XiaochenMultilingualNeural voice seems to be related to how the SSML is processed. Removing extra white spaces and line breaks is a valid workaround, as SSML formatting can sometimes introduce unintended pauses or artifacts in certain voices.

    To ensure consistent results, we recommend keeping SSML formatting minimal by writing it as a single line without unnecessary spaces or line breaks. Additionally, you can try the <mstts:silence> element to control pauses manually. If switching to another multilingual voice is feasible for your use case, that may also be worth considering.
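
A way to follow the single-line recommendation is to generate the SSML in code rather than minify a hand-formatted template. The Python sketch below is illustrative only: `build_ssml` and its parameters are hypothetical names, the `mstts` namespace URI is copied from the request in the question, and the `mstts:silence` usage assumes the `type` and `value` attributes described in the Azure SSML documentation. Whether adjusting silence has any effect on the blips reported here is not confirmed in this thread.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"  # as written in the question's request


def build_ssml(text, voice, lang="en-US", silence=None):
    """Hypothetical helper: return SSML on a single line.

    `silence` optionally maps an mstts:silence type (for example "Leading")
    to a duration string such as "0ms"; it is an assumption, not a known fix.
    """
    silence_tags = "".join(
        f"<mstts:silence type={quoteattr(kind)} value={quoteattr(value)}/>"
        for kind, value in (silence or {}).items()
    )
    # Collapse whitespace in the spoken text, then escape XML special characters.
    body = escape(" ".join(text.split()))
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xmlns:mstts="{MSTTS_NS}" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice xml:lang={quoteattr(lang)} name={quoteattr(voice)}>"
        f"{silence_tags}"
        f"<lang xml:lang={quoteattr(lang)}>{body}</lang>"
        "</voice></speak>"
    )


ssml = build_ssml(
    "Good morning, this is for testing.",
    "zh-CN-XiaochenMultilingualNeural",
)
print(ssml)
```

Called without `silence`, this produces the same body as the working single-line request in the question. Passing something like `silence={"Leading": "0ms"}` adds an `mstts:silence` element as the first child of `voice`.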

    I hope you understand! Thank you.


1 additional answer

  1. Tom Westrick 20 Reputation points
    2025-03-17T14:49:37.94+00:00

    "Removing extra white spaces and line breaks is a valid workaround, as SSML formatting can sometimes introduce unintended pauses or artifacts in certain voices."

    Thanks @santoshkc for confirming.

    We are sticking with removing all extra white space and line breaks for now, since it works.

