SSML: Using <lang xml:lang=""> within a multilingual voice sounds incorrect / unlike when used with the language-specific voice

mkb13 11 Reputation points
2024-05-05T16:18:51.91+00:00

I am developing a TTS application that pronounces "nonsense words" with specific language pronunciations. For example, I am using Polish voices to pronounce non-Polish words. If I use a Polish-specific voice, I hear what I expect (the words read as a Polish speaker would read them). However, if I use an English multilingual voice and wrap the text in <lang xml:lang="pl-PL">...</lang> inside the voice tag, the voice seems to disregard the lang tag and guess the language instead. Since the words are not actually Polish, the pronunciation is all over the place.

In the short examples below, I use the word "mij" with Andrew and Ava and specify that it should be pronounced as Polish (that is, like English "me"). Instead, Andrew seems to guess that it is English and pronounces it like English "Midge", while Ava seems to guess that it is Spanish and pronounces the "j" like English "h".

Andrew:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-AndrewMultilingualNeural">
        <lang xml:lang="pl-PL">mij</lang>
        <lang xml:lang="en-US"> hello</lang>
    </voice>
</speak>

Ava:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-AvaMultilingualNeural">
        <lang xml:lang="pl-PL">mij</lang>
        <lang xml:lang="en-US"> hello</lang>
    </voice>
</speak>

I think I did the format correctly -- I don't see any difference between these and the example found in the "Lang examples" section at https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice#multilingual-voices-with-the-lang-element. (I just included the English word "hello" to show how I am mixing the languages.)
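To rule out a plain markup mistake, the same two documents can be generated from one template and parsed back before they are sent to the service. This is a minimal sketch using only the Python standard library. The helper name build_ssml and its parameters are hypothetical, the unused mstts namespace declaration is left out, and the check only confirms that the XML is well-formed with the intended xml:lang values. It says nothing about how the service interprets <lang>.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize SSML elements without a prefix (xmlns="...") instead of ns0:.
ET.register_namespace("", SSML_NS)


def build_ssml(voice_name, segments, default_lang="en-US"):
    """Hypothetical helper: one <voice> with one <lang> per (locale, text) pair."""
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": default_lang},
    )
    voice = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice_name})
    for locale, text in segments:
        lang = ET.SubElement(
            voice, f"{{{SSML_NS}}}lang", {f"{{{XML_NS}}}lang": locale}
        )
        lang.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    # Same segments as in the examples above; only the voice name changes.
    segments = [("pl-PL", "mij"), ("en-US", " hello")]
    for name in ("en-US-AndrewMultilingualNeural", "en-US-AvaMultilingualNeural"):
        ssml = build_ssml(name, segments)
        ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
        print(ssml)
```

Since both documents come from the same template, any difference in pronunciation between Andrew and Ava cannot come from a typo in one of the two SSML strings.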

How can I make the multilingual voice honor the language I select with the lang tag?

Edit: I just realized this also happens with real words, including a case that used to work (valid Turkish embedded in an English multilingual voice). Am I doing something wrong with the syntax? Does the <lang> tag not do anything anymore?

Azure AI Speech