How to get the generated phonemes in Azure's TTS service

Question

huihuihuihui 0

I am using Azure's text-to-speech service for zh-CN. We can get the viseme ID and the start time of each viseme. Is there any way to get the phoneme sequence corresponding to each viseme ID at the same time?
I know there is a speech phonetic alphabet table (https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-ssml-phonetic-sets), but it is difficult to recover the exact phoneme sequence from it because the mapping from phonemes to visemes is many-to-one.

Thank you for your answer.

YutongTie-MSFT 53,976 Reputation points Moderator

2023-05-12T01:21:36.4466667+00:00

Hello @huihuihuihui

Thanks for reaching out to us. The mapping from phonemes to visemes is indeed many-to-one, which makes it difficult to determine the exact phoneme sequence from the viseme IDs alone. However, one possible approach is to run speech recognition on the synthesized audio and obtain a phoneme sequence from the recognition result.

You could run Azure Speech service recognition on the synthesized audio to obtain a timed phoneme sequence. Note that plain Speech-to-Text transcription returns text; phoneme-level output is tied to the pronunciation assessment feature (phoneme granularity, with an IPA option for some locales), so please check whether zh-CN is covered. You could then align the phoneme sequence with the viseme IDs and start times to determine the corresponding phonemes for each viseme.
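The alignment step described above can be sketched roughly as follows. This is only a minimal illustration, assuming that both the viseme events and the recognized phonemes carry offsets in 100-nanosecond ticks measured on the same audio, and that each phoneme belongs to the latest viseme that started at or before it. The names `VisemeSpan` and `align` are hypothetical and not part of any SDK.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

TICKS_PER_MS = 10_000  # assumption: offsets are given in 100-nanosecond ticks


@dataclass
class VisemeSpan:
    viseme_id: int
    start_ms: float
    phonemes: list = field(default_factory=list)


def align(viseme_events, phoneme_events):
    """Attach each phoneme to the viseme whose start time most recently precedes it.

    viseme_events:  iterable of (offset_ticks, viseme_id)
    phoneme_events: iterable of (offset_ticks, phoneme_symbol)
    """
    spans = [VisemeSpan(vid, off / TICKS_PER_MS) for off, vid in sorted(viseme_events)]
    starts = [span.start_ms for span in spans]
    for off, phoneme in sorted(phoneme_events):
        # Index of the last viseme that starts at or before this phoneme.
        i = bisect_right(starts, off / TICKS_PER_MS) - 1
        if i >= 0:  # phonemes before the first viseme are dropped
            spans[i].phonemes.append(phoneme)
    return spans
```

Because recognition timestamps and viseme timestamps come from different models, small offsets between them are likely, so a tolerance window may work better than this strict "latest start" rule.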

Note that automatic speech recognition systems are not perfect and may produce errors in the transcription. Additionally, the accuracy of the alignment between the phoneme sequence and the viseme IDs may depend on the quality of the viseme and phoneme models used by the TTS system.

I hope this information is helpful.

huihuihuihui 0 Reputation points

2023-05-14T03:47:16.1766667+00:00

Thanks for helping out!

huihuihuihui 0 Reputation points

2023-05-14T03:55:23.5033333+00:00

I have another question about Azure TTS. The Azure documentation says that each blend shape coefficient is a decimal value between 0 and 1, but I found that some of the generated values are negative. For example, in the PDF file on this page (https://learn.microsoft.com/en-us/answers/questions/1185396/azure-speech-poor-viseme-blendshape-quality), MouthSmileRight is sometimes negative. I am driving my model in Maya. I changed the negative numbers to 0, which resulted in a poor expression, and I am wondering if this is the reason. I would like to know why the negative numbers are there and how to deal with them.

1 answer

Answer 1

In Azure's text-to-speech service, there is currently no direct API or method provided to obtain the phoneme sequence corresponding to the viseme ID. The visemes represent visual speech information, while the phonemes represent the individual speech sounds. As you mentioned, there is a many-to-one mapping between visemes and phonemes, which makes it challenging to precisely determine the phoneme sequence from viseme information alone.

The Speech Synthesis Markup Language (SSML) phonetic sets you referred to provide a mapping between phonemes and visemes for specific languages. However, it's important to note that these mappings are approximate and can vary depending on factors such as voice quality, language variations, and speech styles.
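To make the many-to-one problem concrete, here is a small sketch that inverts a phoneme-to-viseme table and lists the candidate phonemes for each viseme in a sequence. The table below is a made-up placeholder for illustration only; the real IDs and groupings for zh-CN would have to be taken from the phonetic sets documentation.

```python
from collections import defaultdict

# Hypothetical phoneme -> viseme ID table, for illustration only.
PHONEME_TO_VISEME = {"p": 21, "b": 21, "m": 21, "f": 18, "v": 18, "a": 2}


def candidates_by_viseme(table):
    """Invert a phoneme -> viseme table into viseme -> sorted candidate phonemes."""
    inverse = defaultdict(list)
    for phoneme, viseme_id in table.items():
        inverse[viseme_id].append(phoneme)
    return {viseme_id: sorted(phonemes) for viseme_id, phonemes in inverse.items()}


def candidate_sequences(viseme_ids, table):
    """For each viseme ID, return every phoneme that could have produced it."""
    inverse = candidates_by_viseme(table)
    return [inverse.get(viseme_id, []) for viseme_id in viseme_ids]
```

With this placeholder table, a viseme sequence such as 21, 2 yields three candidates for the first position and one for the second, which is why the viseme stream alone cannot pin down the phoneme sequence.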

If you require the exact phoneme sequence for your specific use case, you may need to explore other approaches or technologies that focus specifically on phonetic analysis and transcription. These techniques typically involve more advanced speech processing and analysis algorithms.

Alternatively, you can consider running Azure Speech service recognition over the synthesized audio. Plain Speech-to-Text is designed to convert spoken language into written text; phoneme-level output is tied to the pronunciation assessment feature, so check the documentation for whether it covers zh-CN before relying on it.
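If a recognition result with phoneme-level detail is available, extracting the timed phonemes could look roughly like the sketch below. The field names (`NBest`, `Words`, `Phonemes`, `Phoneme`, `Offset`, `Duration`) and the tick unit are assumptions about the detailed JSON result and should be verified against a real response; `SAMPLE` is hand-written, not actual service output.

```python
import json

TICKS_PER_MS = 10_000  # assumption: Offset and Duration are in 100-nanosecond ticks

# Hand-written sample in the assumed shape, not real service output.
SAMPLE = json.dumps({
    "NBest": [{
        "Words": [{
            "Word": "ni",
            "Phonemes": [
                {"Phoneme": "n", "Offset": 1000000, "Duration": 500000},
                {"Phoneme": "i", "Offset": 1500000, "Duration": 800000},
            ],
        }]
    }]
})


def phoneme_timings(result_json):
    """Return (phoneme, start_ms, duration_ms) tuples from the top hypothesis."""
    data = json.loads(result_json)
    hypotheses = data.get("NBest", [])
    if not hypotheses:
        return []
    timings = []
    for word in hypotheses[0].get("Words", []):
        for item in word.get("Phonemes", []):
            timings.append((
                item["Phoneme"],
                item["Offset"] / TICKS_PER_MS,
                item["Duration"] / TICKS_PER_MS,
            ))
    return timings
```

The resulting start times can then be compared with the viseme start times reported by the synthesizer.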

It's recommended to evaluate your specific requirements and consult the official Azure documentation and resources for further guidance on integrating speech recognition or phonetic analysis capabilities into your application.

huihuihuihui 0 Reputation points

2023-05-18T07:27:31.73+00:00

I have another question about Azure TTS. The Azure documentation says that each blend shape coefficient is a decimal value between 0 and 1, but I found that some of the generated values are negative. For example, in the PDF file on this page (https://learn.microsoft.com/en-us/answers/questions/1185396/azure-speech-poor-viseme-blendshape-quality), MouthSmileRight is sometimes negative. I am driving my model in Maya. I changed the negative numbers to 0, which resulted in a poor expression, and I am wondering if this is the reason. I would like to know why the negative numbers are there and how to deal with them.
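One way to check whether clamping is what degrades the expression is to measure how often, and by how much, each coefficient leaves the documented 0 to 1 range before clamping. The sketch below does not explain why negative values appear; it only reports them. It assumes each animation chunk is a JSON string with a `BlendShapes` list of frames (each frame a list of floats), and the function names are hypothetical.

```python
import json


def out_of_range_report(animation_chunks, low=0.0, high=1.0):
    """Map blend shape index -> (count, lowest, highest) over out-of-range values."""
    stats = {}
    for chunk in animation_chunks:
        for frame in json.loads(chunk)["BlendShapes"]:
            for index, value in enumerate(frame):
                if value < low or value > high:
                    count, lowest, highest = stats.get(index, (0, value, value))
                    stats[index] = (count + 1, min(lowest, value), max(highest, value))
    return stats


def clamp_frame(frame, low=0.0, high=1.0):
    """Clamp every coefficient in one frame to [low, high]."""
    return [min(max(value, low), high) for value in frame]
```

If the report shows only small excursions (for example a few hundredths below zero), clamping is unlikely to be the main cause of a poor expression, and the mapping between the service's blend shape order and the Maya rig would be the next thing to check.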
