How to get the generated phonemes in Azure's TTS service

huihuihuihui 0 Reputation points
2023-05-11T08:09:41.4733333+00:00

I am using Azure's text-to-speech service for zh-CN. We can get the viseme ID and the start time of each viseme. Is there any way to get the phoneme sequence corresponding to the viseme IDs at the same time?
I know there is a speech phonetic alphabet table (https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-ssml-phonetic-sets), but it is difficult to get the exact phoneme sequence from it because the mapping from phonemes to visemes is many-to-one.

Thank you for your answer.

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

1 answer

  1. VasimTamboli 4,410 Reputation points
    2023-05-14T10:57:36.8466667+00:00

    In Azure's text-to-speech service, there is currently no direct API or method for obtaining the phoneme sequence that corresponds to each viseme ID. Visemes represent the visual shape of speech (mouth and face positions), while phonemes represent the individual speech sounds. As you mentioned, the mapping from phonemes to visemes is many-to-one: several phonemes share a single viseme, so the phoneme sequence cannot be recovered precisely from viseme information alone.
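
    To make the ambiguity concrete, here is a minimal sketch in plain Python (no Azure SDK involved). The phoneme-to-viseme table is a small illustrative subset with assumed IDs and IPA symbols, not the real zh-CN table, and the function names are hypothetical. It inverts the table and counts how many phoneme sequences are consistent with a given viseme sequence.

    ```python
    from collections import defaultdict

    # ASSUMPTION: illustrative subset only. Check the Azure viseme
    # documentation for the actual table that applies to your locale.
    PHONEME_TO_VISEME = {
        "p": 21, "b": 21, "m": 21,
        "f": 18, "v": 18,
        "s": 15, "z": 15,
    }


    def candidates_by_viseme(phoneme_to_viseme):
        """Invert a phoneme -> viseme map into viseme -> sorted candidate phonemes."""
        inverse = defaultdict(list)
        for phoneme, viseme_id in phoneme_to_viseme.items():
            inverse[viseme_id].append(phoneme)
        return {viseme_id: sorted(ps) for viseme_id, ps in inverse.items()}


    def count_possible_sequences(viseme_ids, inverse):
        """Count the phoneme sequences consistent with a viseme ID sequence."""
        total = 1
        for viseme_id in viseme_ids:
            # An ID missing from the table contributes zero candidates.
            total *= len(inverse.get(viseme_id, []))
        return total


    inverse = candidates_by_viseme(PHONEME_TO_VISEME)
    ambiguity = count_possible_sequences([21, 18, 15], inverse)
    ```

    Even with this tiny table, three visemes already admit 3 × 2 × 2 = 12 phoneme sequences, which is why the viseme stream alone is not enough.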

    The Speech Synthesis Markup Language (SSML) phonetic sets you referred to provide a mapping between phonemes and visemes for specific languages. However, these mappings are approximate and can vary with factors such as voice, language variant, and speaking style.

    If you require the exact phoneme sequence for your specific use case, you may need to explore other approaches or technologies that focus specifically on phonetic analysis and transcription. These techniques typically involve more advanced speech processing and analysis algorithms.
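
    One such approach, sketched below under stated assumptions, works in the other direction: if you already know the phoneme sequence of the input (for example from your own grapheme-to-phoneme step), the many-to-one map is unambiguous going from phoneme to viseme, so each phoneme can be paired with the timing of the viseme it should produce. The table, the function name `align_phonemes`, the millisecond offsets, and the one-viseme-event-per-phoneme assumption are all hypothetical and may not hold for real service output.

    ```python
    # ASSUMPTION: illustrative subset, not the actual Azure table for zh-CN.
    PHONEME_TO_VISEME = {"p": 21, "b": 21, "m": 21, "f": 18, "v": 18}


    def align_phonemes(phonemes, viseme_events, phoneme_to_viseme):
        """Pair known phonemes with viseme events given as (viseme_id, offset_ms).

        Assumes exactly one viseme event per phoneme, in the same order.
        Raises ValueError when the two sequences disagree.
        """
        if len(phonemes) != len(viseme_events):
            raise ValueError("phoneme and viseme sequences differ in length")
        aligned = []
        for phoneme, (viseme_id, offset_ms) in zip(phonemes, viseme_events):
            expected = phoneme_to_viseme.get(phoneme)
            if expected != viseme_id:
                raise ValueError(
                    f"phoneme {phoneme!r} expects viseme {expected}, got {viseme_id}"
                )
            aligned.append((phoneme, offset_ms))
        return aligned


    # Hypothetical viseme events as they might be collected during synthesis.
    events = [(21, 50), (18, 120)]
    timed_phonemes = align_phonemes(["m", "f"], events, PHONEME_TO_VISEME)
    ```

    Raising on a mismatch rather than guessing keeps silent misalignments out of the result; real output may need a more tolerant alignment, for example when silence visemes are interleaved.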

    Alternatively, you can consider running the synthesized audio through Azure Speech Service's Speech-to-Text capability. Speech-to-Text primarily converts spoken language into written text; phoneme-level output is tied to its pronunciation assessment feature, so check the documentation to confirm whether that covers zh-CN before relying on it.

    It's recommended to evaluate your specific requirements and consult the official Azure documentation and resources for further guidance on integrating speech recognition or phonetic analysis capabilities into your application.