Azure Speech - Poor Viseme Blendshape Quality

RealtimeGraphX 20 Reputation points
2023-03-01T10:25:07.2733333+00:00

I am using "Azure Speech" to synthesize speech from a text input, and also to generate Viseme events with Blendshape data.

My settings are: Language: en-US. Voice name: en-US-DavisNeural

I am using the blendshape data as input to animate a 3D character in Unreal Engine. According to the documentation (https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-speech-synthesis-viseme?tabs=3dblendshapes&pivots=programming-language-python) the blendshapes have the same naming and order as the ARKit blendshapes, which makes it easy to drive a 3D character's facial animation.

I am attaching an example file with the blendshape data that were created by Azure Speech.

Blendshapes_Example.pdf

This is the input text: "Norway is a Scandinavian country located in Northern Europe. It is home to a population of 5.3 million people, and its capital city is Oslo. Norway is known for its stunning natural beauty, with mountains, fjords, and..."

The issues that I am facing are as follows:

  1. The "JawOpen" values are far too high throughout the whole list of viseme events, resulting in an animation with a very wide open mouth. I would expect the values to drop to almost zero, at least for some of the visemes.
  2. Most of the consonant visemes are not properly captured. The "p", "b", "n", "m", etc. sounds, where the lips are supposed to touch, don't have a good representation in the viseme data.
  3. Poor quality of viseme generation. The frequency of viseme events is not sufficient to capture the mouth motion. With 37 words in the input text, only 53 viseme events are recorded, giving an average of 1.4 visemes per word!

Here is a demonstration of the final animation (Note that I interpolate between blendshapes for a smoother animation):

https://youtu.be/Bxd_I6K8qHQ
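
For context, the interpolation mentioned above could look roughly like the sketch below. It assumes the frames are evenly spaced at 60 FPS, which is the rate the viseme documentation describes for blendshape frames. `sample_pose` is a hypothetical helper written for illustration, not the code used in the video.

```python
def sample_pose(frames, t_seconds, fps=60.0):
    """Linearly interpolate one blendshape row at time t from evenly spaced frames.

    `frames` is a list of rows, each row a list of blendshape weights.
    The 60 FPS default is an assumption taken from the documentation.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    # Convert time to a fractional frame position and clamp it to the valid range.
    pos = max(0.0, min(t_seconds * fps, len(frames) - 1))
    lo = int(pos)
    hi = min(lo + 1, len(frames) - 1)
    w = pos - lo  # blend weight between the two neighbouring frames
    return [(1.0 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])]
```

Interpolating like this only smooths between the frames that are supplied; it cannot restore lip closures that fall between widely spaced frames.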

I understand that the quality of viseme data depends on the quality of the speech synthesis. However, I cannot find any parameter or setting that could improve the quality of synthesis in a way that would have an effect on the quality of viseme data.

So my question is: Is there an inherent flaw in the way Azure text-to-speech generates blendshape viseme events? Or is there anything I need to change in the config settings (or something else), to get a better result?

This is the Python code I use, in case it might be of relevance:

import azure.cognitiveservices.speech as speechsdk

speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

ssml = f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US"><voice name="en-US-DavisNeural"><mstts:viseme type="FacialExpression"/>{text}</voice></speak>'

# Subscribe to the synthesis completed event
speech_synthesizer.synthesis_completed.connect(viseme_cb)

# Subscribe to the viseme received event
speech_synthesizer.viseme_received.connect(viseme_cb)

# Synthesize speech from SSML
result = speech_synthesizer.speak_ssml_async(ssml).get()
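
The snippet above does not show `viseme_cb`. As a minimal sketch of what such a callback might look like, the code below assumes the event object exposes `audio_offset` (in 100-nanosecond ticks), `viseme_id`, and `animation` (a JSON string that can be empty), as the Speech SDK's viseme event does. The `make_viseme_cb` helper and the `collected` list are hypothetical names used only for illustration.

```python
import json


def make_viseme_cb(collected):
    """Return a callback that appends each viseme event's data to `collected`."""

    def viseme_cb(evt):
        # audio_offset is reported in 100 ns ticks; 10,000 ticks = 1 ms.
        offset_ms = evt.audio_offset / 10000
        # With type="FacialExpression" the animation field carries a JSON string.
        # It may be empty for events that have no blendshape frames attached.
        raw = getattr(evt, "animation", "")
        animation = json.loads(raw) if raw else None
        collected.append(
            {"offset_ms": offset_ms, "viseme_id": evt.viseme_id, "animation": animation}
        )

    return viseme_cb
```

A callback like this is meant for `viseme_received` only, for example `speech_synthesizer.viseme_received.connect(make_viseme_cb(events))` with `events = []` defined beforehand.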

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

Accepted answer
  1. romungi-MSFT 42,586 Reputation points Microsoft Employee
    2023-03-02T08:11:53.0166667+00:00

    RealtimeGraphX I do not have much experience with this feature of the Speech service, but according to the documentation the frame index value indicates how many frames preceded the current list of frames. The frames in your list show only some of the indexes, without the values in between. For example, indexes 0 and 11 are present but 1 to 10 are missing. Given what you observed, this could be the reason the animation is not generated as expected.

    I ran a test of the same sentence with my resource in westeurope, using the same voice, and observed that each event lists every frame starting from its FrameIndex value, with all the facial positions in the documented order.


    For example, the event with FrameIndex 59 contains all the indexes from 59 to 135, the event with FrameIndex 0 contains all the indexes from 0 to 58, and so on. Is your result similar, or does it contain only the indexes you mentioned in the list?
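
The layout described above can be checked with a short sketch. It assumes each event's animation payload is a JSON string with a `FrameIndex` field and a `BlendShapes` matrix holding one row per frame, as the viseme documentation describes. The function names `flatten_frames` and `missing_frames` are hypothetical.

```python
import json


def flatten_frames(animation_payloads):
    """Merge animation payloads into {absolute_frame_index: blendshape_row}."""
    timeline = {}
    for payload in animation_payloads:
        if not payload:
            continue  # skip events that carry no animation data
        chunk = json.loads(payload)
        start = chunk["FrameIndex"]  # number of frames that preceded this chunk
        for i, row in enumerate(chunk["BlendShapes"]):
            timeline[start + i] = row
    return timeline


def missing_frames(timeline):
    """Return the frame indexes absent between 0 and the highest index seen."""
    if not timeline:
        return []
    return [i for i in range(max(timeline) + 1) if i not in timeline]
```

If `missing_frames` returns a long list, only part of each event's `BlendShapes` matrix is reaching the animation, which would match the sparse indexes in the attached example.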

    To debug this issue I printed only the animation events. I am attaching that output so you can check whether it works with your existing setup to generate the animation.

    I hope this helps!!

    viseme_animation_events.txt

    If this answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful". If you have any further questions, do let us know.


1 additional answer

  1. Singh Mahesh Kumar 0 Reputation points
    2023-07-18T11:31:39.4+00:00

    Hello guys, I am new to this. Could someone guide me on how to make a 3D character lip-sync with the input text? I read the documentation at https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-speech-synthesis-viseme?tabs=3dblendshapes&pivots=programming-language-python, but could not follow the whole flow. Please help.
