Timing of Audio, Visemes and Blendshapes in the Speech SDK

Matt Ma 0 Reputation points
2024-02-03T20:57:19.8333333+00:00

Hi, I'm collecting visemes and blendshapes from the callback event when generating speech with the speak_ssml_async().get() call. All the data is coming back okay, but I'm confused about how to align the visemes and blendshapes. A few questions I'm hoping you can clarify:

i) The code snippet in the docs (https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis-viseme?tabs=3dblendshapes&pivots=programming-language-csharp#get-viseme-events-with-the-speech-sdk) implies that the blendshape data is inside the same event that is triggered when a viseme is available. However, when I implement this, all the visemes (with non-zero audio_offset) are sent first, and after that a batch of events with zero audio_offset and zero viseme_id is fired, and those events provide the blendshape data. Is this expected behaviour? (See the log below.)

ii) The blendshapes appear to be generated at 60 frames per second. Is there a way to change this? My application will only ever use animations at 30 FPS, so half of the blendshape data will be thrown away. Can I set the frame rate used to generate the blendshapes?

iii) The Audio Duration property of the generated speech is always longer than the duration derived from both the audio offsets of the visemes and the frame count of the viseme events containing blendshape data. The difference is typically of the order of 0.7 seconds. Is this expected behaviour?

Thanks in advance.
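For context, here is a minimal sketch along the lines of what I'm doing (the subscription key, region, voice and text below are placeholders, not my actual code):

import json
import logging
import azure.cognitiveservices.speech as speechsdk

logger = logging.getLogger("YakChatAPI")

speech_config = speechsdk.SpeechConfig(subscription="SPEECH_KEY", region="SPEECH_REGION")
# audio_config=None: only the synthesis result and the viseme events are needed here.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

def on_viseme(evt: speechsdk.SpeechSynthesisVisemeEventArgs):
    # audio_offset is in 100-nanosecond ticks.
    msg = f"SpeechSynthesisVisemeEventArgs(audio_offset={evt.audio_offset}, viseme_id={evt.viseme_id})"
    if evt.animation:  # non-empty only on the events that carry blendshape data
        msg += f" + Animation FrameIndex: {json.loads(evt.animation)['FrameIndex']}"
    logger.debug(msg)

synthesizer.viseme_received.connect(on_viseme)

ssml = (
    "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
    "xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>"
    "<voice name='en-US-JennyNeural'>"
    "<mstts:viseme type='FacialExpression'/>"
    "Some text to speak."
    "</voice></speak>"
)
result = synthesizer.speak_ssml_async(ssml).get()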


Example of invocation log for the viseme callback, which only prints Animation FrameIndex when event.animation is not an empty string:

DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=500000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=1000000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=2000000, viseme_id=7)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=2750000, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=3625000, viseme_id=15)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=4375000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=5000000, viseme_id=21)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=5750000, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=6625000, viseme_id=20)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=7500000, viseme_id=21)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=8250000, viseme_id=13)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=8875000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=9500000, viseme_id=20)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=10125000, viseme_id=18)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=10750000, viseme_id=1)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=11875000, viseme_id=15)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=12750000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=13750000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=15500000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=16625000, viseme_id=7)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=18000000, viseme_id=21)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=19125000, viseme_id=1)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=19875000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=20625000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=21125000, viseme_id=16)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=21625000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=22250000, viseme_id=3)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=24125000, viseme_id=18)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=25500000, viseme_id=13)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=26250000, viseme_id=11)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=27625000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=28750000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=30500000, viseme_id=20)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=31375000, viseme_id=15)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=32750000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=34250000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=35000000, viseme_id=7)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=35625000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=36000000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=36375000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=36625000, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=37375000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=37937500, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=38500000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=39000000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=39500000, viseme_id=2)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=40250000, viseme_id=14)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=41375000, viseme_id=1)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=42375000, viseme_id=15)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=44500000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 0
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 2
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 19
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 33
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 47
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 69
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 83
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 100
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 115
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=52250000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=52750000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=54500000, viseme_id=13)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=55625000, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=58370000, viseme_id=0)

Azure AI Speech

1 answer

  1. dupammi 4,170 Reputation points Microsoft Vendor
    2024-02-07T04:22:09.94+00:00

    Hi @Matt Ma,

    Thank you for your response, and for confirming that my earlier comments clarified things and that you managed to get it working. To capture the resolution here, let me summarize the gist of my two comment responses and my reply to your follow-up question, so that other community members with the same question can easily reference this.

    It seems to be expected behaviour: the viseme events with non-zero audio_offset represent the start of each viseme, while the events with zero audio_offset and zero viseme_id represent the end of each viseme. The blendshape data is included in the events that represent the end of each viseme, which is why it arrives in a separate batch from the per-viseme events.
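    As a minimal sketch (using the Python SDK field names from your log, and assuming the FrameIndex/BlendShapes keys shown in the docs page you linked), you can separate the two kinds of events inside the same callback:

    import json

    viseme_timeline = []    # (offset_in_seconds, viseme_id) from the non-zero-offset events
    blendshape_frames = []  # one list of blendshape values per 60 FPS frame

    def on_viseme(evt):
        if evt.animation:
            # End-of-viseme/blendshape event: the JSON payload carries FrameIndex
            # plus a chunk of blendshape frames.
            payload = json.loads(evt.animation)
            blendshape_frames.extend(payload["BlendShapes"])
        else:
            # Start-of-viseme event: audio_offset is in 100-nanosecond ticks.
            viseme_timeline.append((evt.audio_offset / 10_000_000, evt.viseme_id))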

    If your application only uses animations at 30 FPS, you can downsample the blendshape data to your desired frame rate by discarding every other frame.
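    For example, with the frames collected above (illustrative only):

    # 60 FPS -> 30 FPS: keep every other frame; kept frame i then maps to i / 30.0 seconds.
    frames_30fps = blendshape_frames[::2]
    frame_times = [i / 30.0 for i in range(len(frames_30fps))]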

    Regarding the Audio Duration property, it represents the total duration of the synthesized speech, including any pauses or silent periods. It may be longer than the sum of individual viseme durations due to additional processing time, pauses, or other factors introduced during synthesis. The duration derived from the viseme events and blendshape data only represents the duration of the speech that has corresponding viseme and blendshape data.
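    As a rough check with the data collected above (audio_offset is in 100-nanosecond ticks, and result.audio_duration is a datetime.timedelta in the Python SDK), you can compare the three durations directly:

    last_viseme_s = viseme_timeline[-1][0] if viseme_timeline else 0.0  # start of the last viseme
    blendshape_s = len(blendshape_frames) / 60.0                        # frames are generated at 60 FPS
    total_s = result.audio_duration.total_seconds()                     # includes trailing silence

    print(f"last viseme at {last_viseme_s:.2f}s, "
          f"blendshapes cover {blendshape_s:.2f}s, "
          f"audio duration {total_s:.2f}s")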

    Regarding your follow-up question, visemes and blendshapes can conflict with each other since they both control mouth, tongue, and jaw movements.

    To merge blendshapes and visemes for better lip sync, collect the viseme data alongside speech synthesis with the Azure Speech SDK, and keep that data associated with the generated audio clip so the two stay synchronized. During playback, use the audio as the timing source and apply the viseme data to drive the character's mouth movements. To resolve conflicts between blendshapes and visemes, you can try techniques such as weighted blending or custom mapping. You can also optimize blendshape usage by discarding redundant frames or smoothing the values to match your animation's frame rate.
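    For instance, a simple smoothing pass (purely illustrative, not an SDK feature) that you could apply to the collected frames before driving the character:

    def smooth_frames(frames, alpha=0.5):
        # Blend each frame with the previous smoothed frame. alpha=1.0 keeps the
        # raw values; smaller alpha values smooth more aggressively.
        if not frames:
            return []
        smoothed = [list(frames[0])]
        for frame in frames[1:]:
            prev = smoothed[-1]
            smoothed.append([alpha * v + (1.0 - alpha) * p for v, p in zip(frame, prev)])
        return smoothed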

    By following these steps, you can ensure synchrony of blendshapes and visemes, resulting in realistic lip sync within your application.

    I hope this helps!

    Thank you again for your time and patience throughout this issue.


    Please don't forget to Accept Answer and select Yes for "was this answer helpful" wherever the information provided helps you, as this can be beneficial to other community members.
