Timing of Audio, Visemes and Blendshapes in the Speech SDK

Matt Ma 0 Reputation points
2024-02-03T20:57:19.8333333+00:00

Hi, I'm collecting visemes and blendshapes from the callback event when generating speech with the speak_ssml_async().get() call. All the data is coming back okay, but I'm confused about how to align the visemes and blendshapes. A few questions I'm hoping you can clarify:

i) The code snippet in the docs (https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis-viseme?tabs=3dblendshapes&pivots=programming-language-csharp#get-viseme-events-with-the-speech-sdk) implies that the blendshape data is inside the same event that is triggered when a viseme is available. However, when I implement this, all the visemes (with non-zero audio_offset) are sent first, and after that a batch of events with zero audio_offset and zero viseme_id is fired, and those events provide the blendshape data. Is this expected behaviour? (See the log below.)

ii) The blendshapes appear to be generated at 60 frames per second. Is there a way to change this? My application will only ever use animations at 30 FPS, so half of the blendshape data will be thrown away. Can I set the frame rate used to generate the blendshapes?

iii) The Audio Duration property of the generated speech is always longer than the duration derived from both the audio offsets of the visemes and the frame count of the viseme events containing blendshape data. The difference is typically of the order of 0.7 seconds. Is this expected behaviour?

Thanks in advance.
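For context, here is a minimal sketch along the lines of what I'm doing (the subscription key, region, voice and text below are placeholders, not my actual code):

import json
import logging
import azure.cognitiveservices.speech as speechsdk

logger = logging.getLogger("YakChatAPI")

speech_config = speechsdk.SpeechConfig(subscription="SPEECH_KEY", region="SPEECH_REGION")
# audio_config=None: only the synthesis result and the viseme events are needed here.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

def on_viseme(evt: speechsdk.SpeechSynthesisVisemeEventArgs):
    # audio_offset is in 100-nanosecond ticks.
    msg = f"SpeechSynthesisVisemeEventArgs(audio_offset={evt.audio_offset}, viseme_id={evt.viseme_id})"
    if evt.animation:  # non-empty only on the events that carry blendshape data
        msg += f" + Animation FrameIndex: {json.loads(evt.animation)['FrameIndex']}"
    logger.debug(msg)

synthesizer.viseme_received.connect(on_viseme)

ssml = (
    "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
    "xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>"
    "<voice name='en-US-JennyNeural'>"
    "<mstts:viseme type='FacialExpression'/>"
    "Some text to speak."
    "</voice></speak>"
)
result = synthesizer.speak_ssml_async(ssml).get()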


Example of invocation log for the viseme callback, which only prints Animation FrameIndex when event.animation is not an empty string:

DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=500000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=1000000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=2000000, viseme_id=7)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=2750000, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=3625000, viseme_id=15)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=4375000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=5000000, viseme_id=21)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=5750000, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=6625000, viseme_id=20)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=7500000, viseme_id=21)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=8250000, viseme_id=13)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=8875000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=9500000, viseme_id=20)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=10125000, viseme_id=18)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=10750000, viseme_id=1)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=11875000, viseme_id=15)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=12750000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=13750000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=15500000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=16625000, viseme_id=7)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=18000000, viseme_id=21)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=19125000, viseme_id=1)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=19875000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=20625000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=21125000, viseme_id=16)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=21625000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=22250000, viseme_id=3)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=24125000, viseme_id=18)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=25500000, viseme_id=13)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=26250000, viseme_id=11)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=27625000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=28750000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=30500000, viseme_id=20)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=31375000, viseme_id=15)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=32750000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=34250000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=35000000, viseme_id=7)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=35625000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=36000000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=36375000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=36625000, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=37375000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=37937500, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=38500000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=39000000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=39500000, viseme_id=2)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=40250000, viseme_id=14)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=41375000, viseme_id=1)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=42375000, viseme_id=15)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=44500000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 0
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 2
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 19
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 33
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 47
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 69
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 83
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 100
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 115
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=52250000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=52750000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=54500000, viseme_id=13)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=55625000, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=58370000, viseme_id=0)

Azure AI Speech

1 answer

  1. dupammi 4,170 Reputation points Microsoft Vendor
    2024-02-07T04:22:09.94+00:00

    Hi @Matt Ma,

    Thank you for your response, and for confirming that my earlier comments clarified things and that you managed to get it working. To capture the resolution here, let me summarize the gist of my two comment responses and my reply to your follow-up question, so that other community members with the same question can easily reference this.

    It seems to be expected behaviour: the viseme events with non-zero audio_offset represent the start of each viseme, while the events with zero audio_offset and zero viseme_id represent the end of each viseme. The blendshape data is included in the events that represent the end of each viseme, which is why it arrives in a separate batch from the per-viseme events.
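    As a minimal sketch (using the Python SDK field names from your log, and assuming the FrameIndex/BlendShapes keys shown in the docs page you linked), you can separate the two kinds of events inside the same callback:

    import json

    viseme_timeline = []    # (offset_in_seconds, viseme_id) from the non-zero-offset events
    blendshape_frames = []  # one list of blendshape values per 60 FPS frame

    def on_viseme(evt):
        if evt.animation:
            # End-of-viseme/blendshape event: the JSON payload carries FrameIndex
            # plus a chunk of blendshape frames.
            payload = json.loads(evt.animation)
            blendshape_frames.extend(payload["BlendShapes"])
        else:
            # Start-of-viseme event: audio_offset is in 100-nanosecond ticks.
            viseme_timeline.append((evt.audio_offset / 10_000_000, evt.viseme_id))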

    If your application only uses animations at 30 FPS, you can downsample the blendshape data to your desired frame rate by discarding every other frame.
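    For example, with the frames collected above (illustrative only):

    # 60 FPS -> 30 FPS: keep every other frame; kept frame i then maps to i / 30.0 seconds.
    frames_30fps = blendshape_frames[::2]
    frame_times = [i / 30.0 for i in range(len(frames_30fps))]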

    Regarding the Audio Duration property, it represents the total duration of the synthesized speech, including any pauses or silent periods. It may be longer than the sum of individual viseme durations due to additional processing time, pauses, or other factors introduced during synthesis. The duration derived from the viseme events and blendshape data only represents the duration of the speech that has corresponding viseme and blendshape data.
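    As a rough check with the data collected above (audio_offset is in 100-nanosecond ticks, and result.audio_duration is a datetime.timedelta in the Python SDK), you can compare the three durations directly:

    last_viseme_s = viseme_timeline[-1][0] if viseme_timeline else 0.0  # start of the last viseme
    blendshape_s = len(blendshape_frames) / 60.0                        # frames are generated at 60 FPS
    total_s = result.audio_duration.total_seconds()                     # includes trailing silence

    print(f"last viseme at {last_viseme_s:.2f}s, "
          f"blendshapes cover {blendshape_s:.2f}s, "
          f"audio duration {total_s:.2f}s")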

    Regarding your follow-up question, visemes and blendshapes can conflict with each other since they both control mouth, tongue, and jaw movements.

    To merge blendshapes and visemes for better lip sync, collect the viseme data alongside speech synthesis with the Azure Speech SDK, and keep that data associated with the generated audio clip so the two stay synchronized. During playback, use the audio as the timing source and apply the viseme data to drive the character's mouth movements. To resolve conflicts between blendshapes and visemes, you can try techniques such as weighted blending or custom mapping. You can also optimize blendshape usage by discarding redundant frames or smoothing the values to match your animation's frame rate.
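    For instance, a simple smoothing pass (purely illustrative, not an SDK feature) that you could apply to the collected frames before driving the character:

    def smooth_frames(frames, alpha=0.5):
        # Blend each frame with the previous smoothed frame. alpha=1.0 keeps the
        # raw values; smaller alpha values smooth more aggressively.
        if not frames:
            return []
        smoothed = [list(frames[0])]
        for frame in frames[1:]:
            prev = smoothed[-1]
            smoothed.append([alpha * v + (1.0 - alpha) * p for v, p in zip(frame, prev)])
        return smoothed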

    By following these steps, you can ensure synchrony of blendshapes and visemes, resulting in realistic lip sync within your application.

    I hope this helps!

    Thank you again for your time and patience throughout this issue.


    Please don't forget to Accept Answer and select Yes for "was this answer helpful" wherever the information provided helps you, as this can be beneficial to other community members.
