@林老师 Are you using the sample from the quickstart as mentioned in the documentation?
I think the main reason the result could be empty is that the SSML does not contain an element with the following tag:
<mstts:viseme type='FacialExpression'/>
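For reference, the element sits inside the <voice> element, before the text to be spoken. A minimal sketch (the voice name and text here are placeholders):

<voice name='en-US-DavisNeural'>
    <mstts:viseme type='FacialExpression'/>
    The text to synthesize goes here.
</voice>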
I worked with a user on a previous issue where the events were delivered through the synthesizer's viseme_received event. Here is a snippet of the code used to print the events, which that user later used to generate animation.
import azure.cognitiveservices.speech as speechsdk
# Creates an instance of a speech config with the specified subscription key and service region.
speech_config = speechsdk.SpeechConfig(subscription='<your_key>', region='<your_region>')
# Creates a speech synthesizer with a null output stream.
# This means the audio output data will not be written to any output channel;
# you can still get the audio from the result.
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
ssml = "<speak version='1.0' xml:lang='en-US' xmlns='http://www.w3.org/2001/10/synthesis' " \
       "xmlns:mstts='http://www.w3.org/2001/mstts'>" \
       "<voice name='Microsoft Server Speech Text to Speech Voice (en-US, DavisNeural)'>" \
       "<mstts:viseme type='FacialExpression'/>" \
       "Norway is a Scandinavian country located in Northern Europe. It is home to a population of 5.3 million people, and its capital city is Oslo. Norway is known for its stunning natural beauty, with mountains, fjords, and…" \
       "</voice></speak>"
def viseme_cb(evt):
    # The unit of evt.audio_offset is ticks (1 tick = 100 nanoseconds);
    # divide by 10,000 to convert to milliseconds.
    print("Viseme event received: audio offset: {}ms, viseme id: {}.".format(
        evt.audio_offset / 10000, evt.viseme_id))
    # `animation` is an XML string for SVG or a JSON string for blend shapes.
    animation = evt.animation
    print(animation)
# Subscribes to the viseme received event.
speech_synthesizer.viseme_received.connect(viseme_cb)

# If the viseme ID is the only thing you want, you can also use `speak_text_async()`
# with plain text (see the short sketch after this script).
result = speech_synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized.")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech synthesis canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))
Could you cross-check your SSML against the above, or try the script as-is with your resource key and region? The events should arrive as JSON, as mentioned in the referenced thread, which you can then process for animation.
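If it helps, here is a rough sketch of consuming the blend-shape payload inside the callback. It assumes the JSON carries a "FrameIndex" integer and a "BlendShapes" array of per-frame coefficient rows; please verify these field names against the output you actually receive:

import json

def viseme_cb(evt):
    # Some events may carry only a viseme ID and no animation payload.
    if not evt.animation:
        return
    # Assumed shape: {"FrameIndex": 0, "BlendShapes": [[...], [...], ...]}
    payload = json.loads(evt.animation)
    frames = payload["BlendShapes"]
    print("Chunk starts at frame {}, contains {} frames of {} coefficients each.".format(
        payload["FrameIndex"], len(frames), len(frames[0]) if frames else 0))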
If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And if you have any further queries, do let us know.