Different word pronunciation

Steinkrug, Michelle 71 Reputation points

Hello,

I'm currently working with MS Speech Studio, using the text-to-speech function. I produce German audio files and was wondering why the same word is sometimes pronounced correctly and sometimes not.

Here is an example. If I write the following sentence in MS Speech Studio:

"Diese Phase ist als Dichteunterschied von 10 bis 30 Hounsfield-Einheiten zwischen Aorta abdominalis und Vena cava inferior definiert."

The speaker then pronounces the word "inferior" with an English accent, for no apparent reason. If I add an extra space between "inferior" and "definiert", the speaker suddenly pronounces the word correctly and recognizes that it is a Latin term. In some other positions in the text, the word "inferior" is also recognized and pronounced correctly; it only fails when it is followed by the word "definiert".

Best,

romungi-MSFT 41,866 Reputation points Microsoft Employee

2022-07-19T11:05:43.153+00:00

@Steinkrug, Michelle Which voice name are you using in this scenario? I just tried the same sentence in the studio, but it seems to sound the same with or without the extra space.
It would be great if you could download the audio, if possible, so that we can check it and report any issues with the voice. Thanks!!

Steinkrug, Michelle 71 Reputation points

2022-07-19T13:45:18.12+00:00

Hi,

I hope there is no misunderstanding. I'm talking about different pronunciations of one word within a complete text snippet.

Here is an example of the complete text snippet:

Das Anreicherungsmuster korreliert mit dem Durchfluss des Kontrastmittels durch den Blutkreislauf. Sie definiert sich als Dichteunterschied von mindestens 30 Hounsfield-Einheiten zwischen Aorta abdominalis und Vena cava inferior. Diese Phase ist als Dichteunterschied zwischen Aorta abdominalis und Vena cava inferior definiert. Sie definiert sich als Dichteunterschied von weniger als 10 Hounsfield-Einheiten zwischen Aorta abdominalis und Vena cava inferior. Wählen Sie den Pfeil unten auf dem Bildschirm aus, um mehr über die Einflussfaktoren für eine optimale Kontrastmittelanreicherung der interessierenden Region zu erfahren.

The word "inferior" appears three times in this text snippet, but in one instance it is pronounced completely differently from the other two.
The different pronunciation is audible here: Diese Phase ist als Dichteunterschied zwischen Aorta abdominalis und Vena cava inferior definiert.

I used the German language and the voice Conrad. I wanted to upload the MP3 file, but unfortunately this is not possible.

Best,

Steinkrug, Michelle 71 Reputation points

2022-07-26T07:42:42.49+00:00

Hi,

Can you maybe give me some advice or a hint as to why the pronunciation is different?

best

romungi-MSFT 41,866 Reputation points Microsoft Employee

In this case you could use the following:

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="de-DE">
<voice name="Microsoft Server Speech Text to Speech Voice (de-DE, ConradNeural)" xml:lang="de-DE" xml:gender="Male">
Sie definiert sich als Dichteunterschied von mindestens 30 Hounsfield-Einheiten zwischen Aorta abdominalis und Vena cava <lang xml:lang="de-DE"> inferior </lang>.
Diese Phase ist als Dichteunterschied zwischen Aorta abdominalis und Vena cava <lang xml:lang="de-DE"> inferior </lang>definiert.
Sie definiert sich als Dichteunterschied von weniger als 10 Hounsfield-Einheiten zwischen Aorta abdominalis und Vena cava <lang xml:lang="de-DE"> inferior </lang>.
</voice>
</speak>

Steinkrug, Michelle 71 Reputation points

2022-07-26T08:19:58.123+00:00

Thanks for the quick reply. It was very helpful.

Accepted answer

romungi-MSFT 41,866 Reputation points Microsoft Employee

2022-07-26T08:00:31.707+00:00

@Steinkrug, Michelle I have just received feedback from the product team that this could be an issue with the model's language detection. In some cases the model assigns a word a different language ID; here it is en-US, so the word is pronounced as English by a German voice. The suggested workaround is to wrap the affected word in the <lang> tag in the SSML, so that the model explicitly pronounces it in German. This is not ideal if you are synthesizing input in real time, but if you are creating audio files for offline use you can apply the workaround and generate a file that sounds correct.
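
If many texts are affected, the <lang> wrapping could be scripted before the SSML is sent for synthesis. The following is only a minimal sketch, not an official tool: the helper names `wrap_in_lang` and `build_ssml` are made up for illustration, the word list has to be maintained by hand, and the voice name is simply copied from the SSML posted in this thread.

```python
import re
from xml.sax.saxutils import escape

# Voice name taken from the SSML in this thread; adjust it for other voices.
CONRAD = "Microsoft Server Speech Text to Speech Voice (de-DE, ConradNeural)"


def wrap_in_lang(text, words, locale="de-DE"):
    """XML-escape the text, then wrap whole-word matches of `words` in <lang>."""
    escaped = escape(text)
    if not words:
        return escaped
    # \b keeps the match to whole words; matching is case-sensitive.
    alternatives = "|".join(re.escape(escape(w)) for w in words)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(
        lambda m: '<lang xml:lang="{}">{}</lang>'.format(locale, m.group(1)),
        escaped,
    )


def build_ssml(text, words, voice=CONRAD, locale="de-DE"):
    """Build a complete SSML document with the listed words pinned to `locale`."""
    body = wrap_in_lang(text, words, locale)
    return (
        '<speak xmlns="http://www.w3.org/2001/10/synthesis" '
        'version="1.0" xml:lang="{0}">'
        '<voice name="{1}">{2}</voice></speak>'
    ).format(locale, voice, body)


if __name__ == "__main__":
    sentence = (
        "Diese Phase ist als Dichteunterschied zwischen Aorta abdominalis "
        "und Vena cava inferior definiert."
    )
    print(build_ssml(sentence, ["inferior"]))
```

The resulting string can be pasted into the SSML view of Speech Studio or passed to whatever synthesis call you already use. Whether the tag is needed for a given word still has to be checked by listening to the output.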

If an answer is helpful, please accept or upvote it, which might help other community members reading this thread.