I am using "Azure Speech" to synthesize speech from a text input, and also to generate Viseme events with Blendshape data.
My settings are: Language: en-US. Voice name: en-US-DavisNeural
I am using the blendshape data as input to animate a 3D character in Unreal Engine. According to the documentation (https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-speech-synthesis-viseme?tabs=3dblendshapes&pivots=programming-language-python) the blendshapes have the same naming and order as the ARKit blendshapes, which makes it easy to drive a 3D character's facial animation.
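Conceptually, applying a frame just means zipping its weight values with the ARKit curve names in the documented order. A minimal sketch (only the first few of the 55 documented names are shown here, and the helper name is mine):

```python
# First few blendshape names in the documented ARKit ordering.
# The full documented list has 55 entries; only a prefix is shown here.
ARKIT_PREFIX = ["eyeBlinkLeft", "eyeLookDownLeft", "eyeLookInLeft"]

def frame_to_dict(frame, names=ARKIT_PREFIX):
    """Map a frame's leading weights onto named ARKit curves."""
    return dict(zip(names, frame))
```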
I am attaching an example file with the blendshape data that was generated by Azure Speech.
Blendshapes_Example.pdf
This is the input text: "Norway is a Scandinavian country located in Northern Europe. It is home to a population of 5.3 million people, and its capital city is Oslo. Norway is known for its stunning natural beauty, with mountains, fjords, and..."
The issues that I am facing are as follows:
- The "JawOpen" values are way too high throughout the whole list of viseme events, resulting in an animation with a very wide open mouth. I would expect the values to drop to almost zero, at least for some of the visemes.
- Most of the consonant visemes are not properly captured. Sounds such as "p", "b", "m", and "n", where the lips are supposed to touch, don't have a good representation in the viseme data.
- Poor overall quality of viseme generation. The frequency of viseme events is not sufficient to capture the mouth motion: with 37 words in the input text, only 53 viseme events are recorded, an average of roughly 1.4 visemes per word!
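As a stopgap for the first issue, the jawOpen channel can be rescaled before each frame is applied to the character, though this only masks the problem. A minimal sketch; the 0-based index 17 for jawOpen is my assumption based on the documented ARKit ordering, so verify it against the blendshape table in the docs:

```python
def rescale_channel(frame, index=17, factor=0.4):
    """Return a copy of a 55-value blendshape frame with one channel scaled.

    index=17 assumes jawOpen's position in the documented ARKit ordering
    (0-based); verify against the blendshape table before relying on it.
    """
    out = list(frame)
    out[index] = max(0.0, min(1.0, out[index] * factor))
    return out
```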
Here is a demonstration of the final animation (Note that I interpolate between blendshapes for a smoother animation):
https://youtu.be/Bxd_I6K8qHQ
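The interpolation mentioned above is plain linear blending between consecutive blendshape frames; roughly (pure Python sketch, names are mine):

```python
def lerp_frames(frame_a, frame_b, t):
    """Linearly interpolate between two blendshape frames, with t in [0, 1]."""
    return [a + (b - a) * t for a, b in zip(frame_a, frame_b)]

# e.g. halfway between two frames: lerp_frames([0.0], [1.0], 0.5) -> [0.5]
```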
I understand that the quality of the viseme data depends on the quality of the speech synthesis. However, I cannot find any parameter or setting that would improve the synthesis in a way that also improves the viseme data.
So my question is: Is there an inherent flaw in the way Azure text-to-speech generates blendshape viseme events? Or is there anything I need to change in the config settings (or something else), to get a better result?
This is the Python code I use, in case it is relevant:

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="...", region="...")
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">'
    f'<voice name="en-US-DavisNeural"><mstts:viseme type="FacialExpression"/>{text}</voice>'
    '</speak>'
)

# Subscribe to the synthesis completed event (currently routed to the same callback)
speech_synthesizer.synthesis_completed.connect(viseme_cb)
# Subscribe to the viseme received event
speech_synthesizer.viseme_received.connect(viseme_cb)

# Synthesize speech from SSML
result = speech_synthesizer.speak_ssml_async(ssml).get()
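For context, viseme_cb parses each event's animation payload. A minimal sketch of that parsing, assuming the payload follows the FacialExpression JSON shape shown in the documentation ({"FrameIndex": ..., "BlendShapes": [[...], ...]}); the helper name parse_animation is mine:

```python
import json

def parse_animation(animation_json):
    """Extract (frame_index, frames) from a FacialExpression animation payload.

    Each frame is a list of blendshape weights in ARKit order, as documented
    for mstts:viseme type="FacialExpression".
    """
    data = json.loads(animation_json)
    return data["FrameIndex"], data["BlendShapes"]

# In the SDK callback the payload arrives as evt.animation (empty for
# plain viseme-ID events), e.g.:
# def viseme_cb(evt):
#     if evt.animation:
#         frame_index, frames = parse_animation(evt.animation)
```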