Azure TTS: Getting non speech audio bytes at beginning and ending of TTS speech

Question

Tom Westrick 20

We use Azure's REST API with the TTS service to generate audio for one of our products. According to our logs, starting on February 28, 2025, the audio returned for the voice zh-CN-XiaochenMultilingualNeural speaking English began to include non-speech bytes (two audio blips) at the beginning and end. I have an example MP3 file, but it seems we cannot upload audio files here.

Here is an example request to replicate the issue:

POST /cognitiveservices/v1 HTTP/1.1
Host: eastus.tts.speech.microsoft.com
Content-Type: application/ssml+xml
X-Microsoft-OutputFormat: audio-48khz-192kbitrate-mono-mp3
Content-Length: 309

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice xml:lang="en-US" name="zh-CN-XiaochenMultilingualNeural">
        <lang xml:lang="en-US">
            Good morning, this is for testing.
        </lang>
    </voice>
</speak>

After a lot of trial and error, it seems that when all line breaks and extra white space are removed from the XML, the non-speech bytes are not generated. This feels like a workaround rather than a permanent fix.

This works as expected:

POST /cognitiveservices/v1 HTTP/1.1
Host: eastus.tts.speech.microsoft.com
Content-Type: application/ssml+xml
X-Microsoft-OutputFormat: audio-48khz-192kbitrate-mono-mp3
Content-Length: 267

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US"><voice xml:lang="en-US" name="zh-CN-XiaochenMultilingualNeural"><lang xml:lang="en-US">Good morning, this is for testing.</lang></voice></speak>
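For anyone applying the same workaround in code, here is a minimal Python sketch that turns the indented SSML into the single-line form shown above. The function name `compact_ssml` is hypothetical, and the regex approach is an assumption that fits simple SSML like this example; it does not treat comments or CDATA sections specially. It only removes whitespace runs that contain a line break next to a tag, so single spaces between inline elements are left alone.

```python
import re


def compact_ssml(ssml: str) -> str:
    """Collapse pretty-printed SSML onto one line (hypothetical helper)."""
    s = ssml.strip()
    # Remove indentation/newlines that directly follow a tag.
    s = re.sub(r">\s*\n\s*", ">", s)
    # Remove indentation/newlines that directly precede a tag.
    s = re.sub(r"\s*\n\s*<", "<", s)
    # Collapse any remaining whitespace run (e.g. wrapped text) to one space.
    s = re.sub(r"\s+", " ", s)
    return s


if __name__ == "__main__":
    pretty = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice xml:lang="en-US" name="zh-CN-XiaochenMultilingualNeural">
        <lang xml:lang="en-US">
            Good morning, this is for testing.
        </lang>
    </voice>
</speak>"""
    body = compact_ssml(pretty)
    print(body)
    # Content-Length must be computed from the encoded body actually sent.
    print(len(body.encode("utf-8")))
```

Running the compaction just before sending keeps the SSML readable in source control while the service only ever receives the single-line form.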

As I said, this only just started happening, and all the other Azure voices we use seem to work fine.

My question is: can this be confirmed as a bug with this specific voice? And can my fix be considered a permanent solution, or is it just chance that it fixes the issue?

The docs show requests being made with XML that contains line breaks and white space.

santoshkc 15,325 Reputation points Microsoft External Staff Moderator

2025-03-17T08:22:31.46+00:00

Hi @Tom Westrick,

We haven’t heard from you on the last response and were just checking back to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others.

Thank you.

Accepted answer

Answer 1

Hi @Tom Westrick,

Apologies for the delay in response and thank you for reporting this issue. The behavior you’re experiencing with the zh-CN-XiaochenMultilingualNeural voice seems to be related to how the SSML is processed. Removing extra white spaces and line breaks is a valid workaround, as SSML formatting can sometimes introduce unintended pauses or artifacts in certain voices.

To ensure consistent results, we recommend keeping SSML formatting minimal by writing it as a single line without unnecessary spaces or breaks. Additionally, you can try using the mstts:silence element to control pauses manually. If switching to another multilingual voice is feasible for your use case, that may also be worth considering.

I hope you understand! Thank you.
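The two suggestions in this answer (single-line SSML plus mstts:silence) could be combined roughly as in the Python sketch below. This is an untested assumption, not a confirmed fix: the helper name `build_ssml` and its parameters are hypothetical, the `Leading-exact` and `Tailing-exact` type values are taken from the Azure SSML documentation as I understand it, and nothing in this thread confirms that they remove the blips for this voice.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"  # namespace as used in the question


def build_ssml(text, voice="zh-CN-XiaochenMultilingualNeural",
               lang="en-US", leading_ms=0, tailing_ms=0):
    """Build single-line SSML with explicit leading/trailing silence.

    Hypothetical helper: the silence types below are assumed from the
    Azure SSML docs and have not been verified against this issue.
    """
    parts = [
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang={quoteattr(lang)}>',
        f'<voice xml:lang={quoteattr(lang)} name={quoteattr(voice)}>',
        # mstts:silence is placed inside <voice>, before the spoken text.
        f'<mstts:silence type="Leading-exact" value="{int(leading_ms)}ms"/>',
        f'<mstts:silence type="Tailing-exact" value="{int(tailing_ms)}ms"/>',
        # Escape &, < and > so arbitrary text cannot break the XML.
        f'<lang xml:lang={quoteattr(lang)}>{escape(text.strip())}</lang>',
        '</voice></speak>',
    ]
    # Joining without separators keeps the document on a single line.
    return "".join(parts)


if __name__ == "__main__":
    print(build_ssml("Good morning, this is for testing."))
```

Because the parts are joined with no separator, the output never contains the line breaks or indentation that triggered the issue in the original request.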

Answer 2

Tom Westrick 20

"Removing extra white spaces and line breaks is a valid workaround, as SSML formatting can sometimes introduce unintended pauses or artifacts in certain voices."

Thanks @santoshkc for confirming.

We are sticking with removing all extra white space and line breaks for now as it does work.

santoshkc 15,325 Reputation points Microsoft External Staff Moderator

2025-03-18T13:27:22.23+00:00

Hi @Tom Westrick

I'm glad to hear that my response was helpful to you, and thanks for sharing the information, which might benefit other community members reading this thread as a solution. Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll convert the previous response to an answer in case you'd like to accept it. This will help other users who may have a similar query find the solution more easily.

If you have any further questions or concerns, please don't hesitate to ask. We're always here to help.
