I am using "Azure Speech" to synthesize speech from a text input, and also to generate Viseme events with Blendshape data.
My settings are: Language: en-US. Voice name: en-US-DavisNeural
I am using the blendshape data as input to animate a 3D character in Unreal Engine. According to the documentation (https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-speech-synthesis-viseme?tabs=3dblendshapes&pivots=programming-language-python) the blendshapes have the same naming and order as the ARKit blendshapes, which makes it easy to drive a 3D character's facial animation.
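Conceptually, applying a frame just means zipping its weight values with the ARKit curve names in the documented order. A minimal sketch (only the first few of the 55 documented names are shown here, and the helper name is mine):

```python
# First few blendshape names in the documented ARKit ordering.
# The full documented list has 55 entries; only a prefix is shown here.
ARKIT_PREFIX = ["eyeBlinkLeft", "eyeLookDownLeft", "eyeLookInLeft"]

def frame_to_dict(frame, names=ARKIT_PREFIX):
    """Map a frame's leading weights onto named ARKit curves."""
    return dict(zip(names, frame))
```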
I am attaching an example file with the blendshape data that was generated by Azure Speech.
Blendshapes_Example.pdf
This is the input text: "Norway is a Scandinavian country located in Northern Europe. It is home to a population of 5.3 million people, and its capital city is Oslo. Norway is known for its stunning natural beauty, with mountains, fjords, and..."
The issues that I am facing are as follows:
- The "JawOpen" values are way too high throughout the whole list of viseme events, resulting in an animation with a very wide open mouth. I would expect the values to drop to almost zero, at least for some of the visemes.
- Most of the consonant visemes are not properly captured. Sounds such as "p", "b", "m", and "n", where the lips are supposed to touch, don't have a good representation in the viseme data.
- Poor overall quality of viseme generation. The frequency of viseme events is not sufficient to capture the mouth motion: with 37 words in the input text, only 53 viseme events are recorded, an average of roughly 1.4 visemes per word!
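As a stopgap for the first issue, the jawOpen channel can be rescaled before each frame is applied to the character, though this only masks the problem. A minimal sketch; the 0-based index 17 for jawOpen is my assumption based on the documented ARKit ordering, so verify it against the blendshape table in the docs:

```python
def rescale_channel(frame, index=17, factor=0.4):
    """Return a copy of a 55-value blendshape frame with one channel scaled.

    index=17 assumes jawOpen's position in the documented ARKit ordering
    (0-based); verify against the blendshape table before relying on it.
    """
    out = list(frame)
    out[index] = max(0.0, min(1.0, out[index] * factor))
    return out
```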
Here is a demonstration of the final animation (Note that I interpolate between blendshapes for a smoother animation):
https://youtu.be/Bxd_I6K8qHQ
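The interpolation mentioned above is plain linear blending between consecutive blendshape frames; roughly (pure Python sketch, names are mine):

```python
def lerp_frames(frame_a, frame_b, t):
    """Linearly interpolate between two blendshape frames, with t in [0, 1]."""
    return [a + (b - a) * t for a, b in zip(frame_a, frame_b)]

# e.g. halfway between two frames: lerp_frames([0.0], [1.0], 0.5) -> [0.5]
```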
I understand that the quality of the viseme data depends on the quality of the speech synthesis. However, I cannot find any parameter or setting that would improve the synthesis in a way that also improves the viseme data.
So my question is: Is there an inherent flaw in the way Azure text-to-speech generates blendshape viseme events? Or is there anything I need to change in the config settings (or something else), to get a better result?
This is the Python code I use, in case it is relevant:

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="...", region="...")
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">'
    f'<voice name="en-US-DavisNeural"><mstts:viseme type="FacialExpression"/>{text}</voice>'
    '</speak>'
)

# Subscribe to the synthesis completed event (currently routed to the same callback)
speech_synthesizer.synthesis_completed.connect(viseme_cb)
# Subscribe to the viseme received event
speech_synthesizer.viseme_received.connect(viseme_cb)

# Synthesize speech from SSML
result = speech_synthesizer.speak_ssml_async(ssml).get()
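For context, viseme_cb parses each event's animation payload. A minimal sketch of that parsing, assuming the payload follows the FacialExpression JSON shape shown in the documentation ({"FrameIndex": ..., "BlendShapes": [[...], ...]}); the helper name parse_animation is mine:

```python
import json

def parse_animation(animation_json):
    """Extract (frame_index, frames) from a FacialExpression animation payload.

    Each frame is a list of blendshape weights in ARKit order, as documented
    for mstts:viseme type="FacialExpression".
    """
    data = json.loads(animation_json)
    return data["FrameIndex"], data["BlendShapes"]

# In the SDK callback the payload arrives as evt.animation (empty for
# plain viseme-ID events), e.g.:
# def viseme_cb(evt):
#     if evt.animation:
#         frame_index, frames = parse_animation(evt.animation)
```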