Azure's text-to-speech service currently does not provide a direct API for obtaining the phoneme sequence that corresponds to a sequence of viseme IDs. Visemes describe the visual shape of the mouth during speech, while phonemes are the individual speech sounds. As you mentioned, the mapping from phonemes to visemes is many-to-one: several phonemes share the same viseme ID, so the phoneme sequence cannot be recovered unambiguously from viseme information alone.
The Speech Synthesis Markup Language (SSML) phonetic sets you referred to list, for specific languages, which phonemes correspond to which viseme IDs. However, these mappings are approximate and can vary with the voice, the language variant, and the speaking style.
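To make the ambiguity concrete, here is a minimal sketch that inverts a phoneme-to-viseme table and enumerates every phoneme sequence consistent with a given list of viseme IDs. The `PHONEME_TO_VISEME` table is a small illustrative excerpt whose values are assumptions, and the helper names `invert` and `candidate_sequences` are hypothetical; check the viseme table in the Azure documentation for the real values for your locale.

```python
from itertools import product

# Illustrative excerpt only: a few phoneme -> viseme ID pairs.
# These values are assumptions; verify them against the Azure viseme table.
PHONEME_TO_VISEME = {
    "p": 21, "b": 21, "m": 21,
    "f": 18, "v": 18,
    "s": 15, "z": 15,
}


def invert(mapping):
    """Group phonemes by the viseme ID they map to."""
    inverse = {}
    for phoneme, viseme_id in mapping.items():
        inverse.setdefault(viseme_id, []).append(phoneme)
    return inverse


def candidate_sequences(viseme_ids, mapping=PHONEME_TO_VISEME):
    """Return every phoneme sequence consistent with the viseme IDs."""
    inverse = invert(mapping)
    # Raises KeyError if a viseme ID is missing from the table.
    options = [inverse[viseme_id] for viseme_id in viseme_ids]
    return [list(sequence) for sequence in product(*options)]


candidates = candidate_sequences([21, 18, 15])
print(len(candidates))  # 3 * 2 * 2 = 12
```

Even with only three visemes there are already twelve candidate sequences, which is why viseme IDs alone are not enough. If you already know the input text, its phonemes can be used to pick the right candidate instead of guessing.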
If you require the exact phoneme sequence for your use case, you may need to explore other approaches or tools that focus specifically on phonetic analysis and transcription, for example a grapheme-to-phoneme converter applied to the input text or forced alignment of the text against the synthesized audio. These techniques work from the text or the audio rather than from the viseme stream.
Alternatively, you can consider running the synthesized audio back through Azure Speech Service's Speech-to-Text capability. Note that standard Speech-to-Text converts spoken language into written text, not phonemes; phoneme-level output is available through the pronunciation assessment feature when phoneme granularity is requested, and only for supported locales, so check whether your language is covered.
It's recommended to evaluate your specific requirements and consult the official Azure documentation and resources for further guidance on integrating speech recognition or phonetic analysis capabilities into your application.