Hi,
I'm collecting visemes and blendshapes from the callback event when generating speech using the speak_ssml_async().get() call. All the data is coming back okay. However, I am confused about how to align the visemes and blendshapes, so I have a few questions I'm hoping you can clarify:
i) The code snippet in the docs https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis-viseme?tabs=3dblendshapes&pivots=programming-language-csharp#get-viseme-events-with-the-speech-sdk implies that the blendshape data is inside the same event that is triggered by a viseme being available. However, when I implement this, it appears that all the visemes (with non-zero audio_offset) are sent first, and after that a batch of events with zero audio_offset and zero viseme_id is fired, and those provide the blendshape data. Is this expected behaviour? (See below)
ii) The blendshapes appear to be generated at 60 frames per second. Is there a way to change this? My application will only ever use animations at 30 FPS, so half of the blendshape data will be thrown away. Can I set the frame rate used to generate the blendshapes?
iii) The Audio Duration property of the generated speech is always longer than the duration derived from either the audio offsets of the visemes or the frame count from the viseme events containing blendshape data. The difference is typically on the order of 0.7 seconds. Is this expected behaviour?
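To make questions ii) and iii) concrete, here is a minimal sketch of the arithmetic involved. It assumes that audio_offset is reported in 100-nanosecond ticks, and that each non-empty animation string is JSON with a "FrameIndex" (the index of the chunk's first frame) and a "BlendShapes" list of per-frame weight arrays, as the linked docs describe. The helper names are hypothetical and not part of the Speech SDK.

```python
import json

TICKS_PER_SECOND = 10_000_000  # assumption: audio_offset is in 100 ns ticks
SOURCE_FPS = 60                # frame rate observed in the question


def ticks_to_seconds(audio_offset):
    """Convert a viseme audio_offset (ticks) to seconds."""
    return audio_offset / TICKS_PER_SECOND


def flatten_frames(animation_chunks):
    """Turn raw animation JSON strings into (global_frame_index, weights) pairs."""
    frames = []
    for raw in animation_chunks:
        chunk = json.loads(raw)
        start = chunk["FrameIndex"]  # index of the first frame in this chunk
        for i, weights in enumerate(chunk["BlendShapes"]):
            frames.append((start + i, weights))
    frames.sort(key=lambda frame: frame[0])
    return frames


def downsample_to_half_rate(frames):
    """Keep every even source frame and renumber it for half the frame rate."""
    return [(index // 2, weights) for index, weights in frames if index % 2 == 0]


def frame_time_seconds(frame_index, fps=SOURCE_FPS):
    """Time of a frame relative to the start of the audio."""
    return frame_index / fps
```

With this, a viseme at audio_offset=58370000 lands at about 5.84 seconds, and blendshape frame 115 at 60 FPS lands at about 1.92 seconds, so both streams can be compared on one timeline. Dropping every other frame is only a client-side workaround for 30 FPS playback; it does not change what the service generates.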
Thanks in advance
Example invocation log for the viseme callback, which prints the Animation FrameIndex only when event.animation is not an empty string:
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=500000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=1000000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=2000000, viseme_id=7)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=2750000, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=3625000, viseme_id=15)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=4375000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=5000000, viseme_id=21)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=5750000, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=6625000, viseme_id=20)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=7500000, viseme_id=21)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=8250000, viseme_id=13)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=8875000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=9500000, viseme_id=20)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=10125000, viseme_id=18)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=10750000, viseme_id=1)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=11875000, viseme_id=15)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=12750000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=13750000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=15500000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=16625000, viseme_id=7)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=18000000, viseme_id=21)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=19125000, viseme_id=1)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=19875000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=20625000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=21125000, viseme_id=16)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=21625000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=22250000, viseme_id=3)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=24125000, viseme_id=18)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=25500000, viseme_id=13)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=26250000, viseme_id=11)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=27625000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=28750000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=30500000, viseme_id=20)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=31375000, viseme_id=15)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=32750000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=34250000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=35000000, viseme_id=7)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=35625000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=36000000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=36375000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=36625000, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=37375000, viseme_id=4)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=37937500, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=38500000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=39000000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=39500000, viseme_id=2)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=40250000, viseme_id=14)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=41375000, viseme_id=1)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=42375000, viseme_id=15)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=44500000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 0
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 2
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 19
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 33
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 47
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 69
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 83
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 100
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=0, viseme_id=0) + Animation FrameIndex: 115
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=52250000, viseme_id=0)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=52750000, viseme_id=19)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=54500000, viseme_id=13)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=55625000, viseme_id=6)
DEBUG:YakChatAPI:SpeechSynthesisVisemeEventArgs(audio_offset=58370000, viseme_id=0)
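Since the log above shows two kinds of events arriving through the same callback, one way to keep them apart is sketched below. It assumes only the three event attributes already used in this post (audio_offset, viseme_id, animation). VisemeCollector is a hypothetical helper, and the demo events are stand-ins built with SimpleNamespace, not real SDK objects; the animation payload in the demo is made up.

```python
from types import SimpleNamespace

TICKS_PER_SECOND = 10_000_000  # assumption: audio_offset is in 100 ns ticks


class VisemeCollector:
    """Separates viseme-ID events from blendshape (animation) events."""

    def __init__(self):
        self.visemes = []           # (audio_offset_ticks, viseme_id)
        self.animation_chunks = []  # raw animation JSON strings, in arrival order

    def on_viseme(self, evt):
        if evt.animation:
            # Blendshape events in the log carry audio_offset=0 and viseme_id=0,
            # so their timing has to come from the FrameIndex in the payload.
            self.animation_chunks.append(evt.animation)
        else:
            self.visemes.append((evt.audio_offset, evt.viseme_id))

    def last_viseme_seconds(self):
        """Offset of the latest viseme in seconds, or 0.0 if none arrived."""
        if not self.visemes:
            return 0.0
        return max(offset for offset, _ in self.visemes) / TICKS_PER_SECOND


# Stand-in events shaped like the two kinds of log line above.
collector = VisemeCollector()
collector.on_viseme(SimpleNamespace(audio_offset=44500000, viseme_id=0, animation=""))
collector.on_viseme(
    SimpleNamespace(
        audio_offset=0,
        viseme_id=0,
        animation='{"FrameIndex": 0, "BlendShapes": []}',  # made-up payload
    )
)
```

In a real application the method would be registered with something like synthesizer.viseme_received.connect(collector.on_viseme) before calling speak_ssml_async().get(), and last_viseme_seconds() could then be compared against the audio duration of the result.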