@林老师 Are you using the sample from the quickstart as mentioned in the documentation?
I think the main reason the result could be empty is that the SSML does not contain an element with the following tag:
<mstts:viseme type='FacialExpression'/>
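For reference, the element sits inside the <voice> element, before the text to be spoken. A minimal sketch (the voice name and text here are placeholders):

<voice name='en-US-DavisNeural'>
    <mstts:viseme type='FacialExpression'/>
    The text to synthesize goes here.
</voice>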
I worked with a user on a previous issue where the events were delivered through the synthesizer's viseme_received event. Here is a snippet of the code used to print the events, which that user later used to generate animation.
import azure.cognitiveservices.speech as speechsdk
# Creates an instance of a speech config with the specified subscription key and service region.
speech_config = speechsdk.SpeechConfig(subscription='<your_key>', region='<your_region>')
# Creates a speech synthesizer with a null output stream.
# This means the audio output data will not be written to any output channel;
# you can still get the audio from the result.
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
ssml = "<speak version='1.0' xml:lang='en-US' xmlns='http://www.w3.org/2001/10/synthesis' " \
       "xmlns:mstts='http://www.w3.org/2001/mstts'>" \
       "<voice name='Microsoft Server Speech Text to Speech Voice (en-US, DavisNeural)'>" \
       "<mstts:viseme type='FacialExpression'/>" \
       "Norway is a Scandinavian country located in Northern Europe. It is home to a population of 5.3 million people, and its capital city is Oslo. Norway is known for its stunning natural beauty, with mountains, fjords, and…" \
       "</voice></speak>"
def viseme_cb(evt):
    # The unit of evt.audio_offset is ticks (1 tick = 100 nanoseconds);
    # divide by 10,000 to convert to milliseconds.
    print("Viseme event received: audio offset: {}ms, viseme id: {}.".format(
        evt.audio_offset / 10000, evt.viseme_id))
    # `animation` is an XML string for SVG or a JSON string for blend shapes.
    animation = evt.animation
    print(animation)
# Subscribes to the viseme received event.
speech_synthesizer.viseme_received.connect(viseme_cb)

# If the viseme ID is the only thing you want, you can also use `speak_text_async()`
# with plain text (see the short sketch after this script).
result = speech_synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized.")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech synthesis canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))
Could you cross-check your SSML against the above, or try the script as-is with your resource key and region? The events should arrive as JSON, as mentioned in the referenced thread, which you can then process for animation.
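If it helps, here is a rough sketch of consuming the blend-shape payload inside the callback. It assumes the JSON carries a "FrameIndex" integer and a "BlendShapes" array of per-frame coefficient rows; please verify these field names against the output you actually receive:

import json

def viseme_cb(evt):
    # Some events may carry only a viseme ID and no animation payload.
    if not evt.animation:
        return
    # Assumed shape: {"FrameIndex": 0, "BlendShapes": [[...], [...], ...]}
    payload = json.loads(evt.animation)
    frames = payload["BlendShapes"]
    print("Chunk starts at frame {}, contains {} frames of {} coefficients each.".format(
        payload["FrameIndex"], len(frames), len(frames[0]) if frames else 0))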
If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And if you have any further queries, do let us know.