I found different behaviors when using IPA phonemes in text-to-speech:

Vietnamese: "không phải [May] xin lỗi"

Flows naturally without pauses

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-AvaMultilingualNeural">không phải <phoneme alphabet="ipa" ph="meɪ">May</phoneme> xin lỗi </voice></speak>

Thai: "ฉันไม่ใช่ [May] ขอโทษ"

Has pause before IPA only

Continues smoothly after

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-AvaMultilingualNeural">ฉันไม่ใช่ <phoneme alphabet="ipa" ph="meɪ">May</phoneme> ขอโทษ</voice></speak>

Is this expected behavior? Any way to remove the pre-IPA pause in Thai?

Thai text has pause before IPA phoneme, Vietnamese doesn't - why?

Accepted answer

RevelinoB 3,345 Reputation points

2024-10-30T08:38:19.83+00:00
Hi i'm MariOhn,

The difference you’re observing between Vietnamese and Thai when using IPA phonemes in text-to-speech (TTS) may stem from how the TTS system handles language-specific prosody, especially for languages with tonal or complex phonetic structures. We’ve seen this with our customers using Azure Cognitive Services and identified a few patterns:

Vietnamese: TTS systems often handle phoneme insertions in Vietnamese smoothly, possibly because of the language’s relatively consistent syllable structure and the absence of certain complex prosodic markers that trigger pauses. Vietnamese TTS may also be optimized for flowing around phonetic tags, given how often phonemes are intermixed with native phrases in TTS applications.

Thai: Thai, however, has more intricate tonal rules and phoneme spacing, which may lead to pauses being inserted around phonemes. In some TTS implementations, an IPA phoneme can trigger a slight pause because of how tonal adjustments are handled around foreign phonemes.

Solutions: To attempt to minimize the pause before the IPA in Thai, try:

Switching voices: Some TTS voices handle IPA tags differently, especially when working with non-English phrases.

Adjusting speed and pitch: This can sometimes encourage smoother blending across IPA tags.

Alternative phonetic input: Using alternative phonetic spellings that approximate the sound without IPA can occasionally produce smoother output in languages sensitive to pauses.
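
As a rough illustration of the last two suggestions, the SSML below plays two variants back to back so they can be compared by ear. It is a sketch, not a confirmed fix: the `+5%` rate and the Thai respelling `เมย์` for "May" are assumptions chosen for illustration, and `<sub>` is used only so the Latin spelling stays in the source text.

```xml
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" version="1.0" xml:lang="en-US">
  <!-- Variant A: one prosody span around the whole sentence, phoneme included.
       The rate value is an arbitrary example, not a recommended setting. -->
  <voice name="en-US-AvaMultilingualNeural"><prosody rate="+5%">ฉันไม่ใช่ <phoneme alphabet="ipa" ph="meɪ">May</phoneme> ขอโทษ</prosody></voice>
  <!-- Variant B: no IPA; a native-script respelling is spoken instead.
       The alias text is an assumed Thai spelling of "May". -->
  <voice name="en-US-AvaMultilingualNeural">ฉันไม่ใช่ <sub alias="เมย์">May</sub> ขอโทษ</voice>
</speak>
```

If Variant B removes the pause, that would suggest the pause is tied to the `<phoneme>` element itself rather than to the surrounding Thai text.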

These variations in handling may not be entirely avoidable without modifying the TTS engine itself, as they are often based on language-specific models in the backend processing of multilingual TTS systems. When using Azure Cognitive Services, where speech synthesis and TTS are core features, these solutions can help improve the experience in multilingual applications.
i'm MariOhn 81 Reputation points

2024-10-30T09:15:26.34+00:00

I've tested both suggested solutions but found:

1. Issue persists regardless of:

Switching voices (even with Thai-specific voices)

Speed/pitch adjustments

2. Key observations:

The pause ONLY occurs before IPA in Thai text

Text after IPA flows smoothly

Using <lexicon> tag solves this issue completely, but unfortunately isn't supported in multilingual voices

My test case remains:

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="th-TH"><voice name="th-TH-PremwadeeNeural" tailingsilence="200ms">ฉันไม่ได้ชื่อ <phoneme alphabet="ipa" ph="seː̄.lēn.sā.kiː́̋̀">เซเลนสกี้</phoneme> นะคะ </voice></speak>

Do you have any other solutions, given that <lexicon> isn't available for multilingual voices and the suggested adjustments don't resolve the pause?

RevelinoB 3,345 Reputation points

2024-10-30T09:22:04.36+00:00

Given that the <lexicon> tag, which would typically resolve this issue, is unsupported for multilingual voices, and the voice switching and speed/pitch adjustments haven’t helped, we can explore a few additional approaches to try minimizing the pause:

IPA as Plain Text: Since the Thai output inserts a pause before IPA phonemes, consider writing the IPA transcription as plain text rather than inside a <phoneme> tag. Thai-specific voices may interpret phonetic text the same way as native script, which might flow more naturally.

Break or Prosody Tag: Experiment with a <break> or <prosody> tag immediately before the <phoneme> tag to smooth the transition. Sometimes specifying a very short break (e.g., <break time="1ms"/>) can help bridge the transition without introducing a noticeable pause.

Custom Phrase Rephrasing: Consider rephrasing or adding filler sounds or words directly before the <phoneme> tag to smooth over the transition. For instance, adding a small filler word in Thai, such as a neutral sound or expression, can sometimes mitigate unwanted pauses.

Direct Voice Feedback: Since Azure has made recent updates in multilingual TTS, contacting Azure Cognitive Services support with this specific use case can help as they may provide backend adjustments or additional troubleshooting options. Given that your use case involves a common TTS feature request, they may offer insights or prioritization for a future <lexicon> tag feature.
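
To make the first two approaches concrete, here is a hedged SSML sketch that plays several variants in sequence for side-by-side listening. None of these is verified to remove the pause. Variant C is purely an assumption added for comparison: Thai script does not put spaces between words, so the space before the tag might be read as a phrase boundary, but whether the engine treats it that way is untested.

```xml
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" version="1.0" xml:lang="en-US">
  <!-- Variant A: explicit 1 ms break immediately before the phoneme -->
  <voice name="en-US-AvaMultilingualNeural">ฉันไม่ใช่ <break time="1ms"/><phoneme alphabet="ipa" ph="meɪ">May</phoneme> ขอโทษ</voice>
  <!-- Variant B: IPA transcription written as plain text, with no phoneme tag -->
  <voice name="en-US-AvaMultilingualNeural">ฉันไม่ใช่ meɪ ขอโทษ</voice>
  <!-- Variant C (assumption): same as the original test, minus the space before the tag -->
  <voice name="en-US-AvaMultilingualNeural">ฉันไม่ใช่<phoneme alphabet="ipa" ph="meɪ">May</phoneme> ขอโทษ</voice>
</speak>
```

Listening to the three in order against the original test case should show whether the pause follows the tag, the whitespace, or neither.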

If none of these approaches resolve the pause, it’s likely a limitation in the current TTS engine's processing of multilingual phoneme integrations, particularly for tonal languages like Thai.