Why is SpeechSynthesisVisemeEventArgs.Animation an empty string?

林老师 0 Reputation points
2023-04-25T02:54:51.1233333+00:00

My code is like this:

using (var synthesizer = new SpeechSynthesizer(speechConfig, audioConfig))
{
    // Subscribes to viseme received event
    synthesizer.VisemeReceived += (s, e) =>
    {
        Console.WriteLine($"Viseme event received. Audio offset: " +
            $"{e.AudioOffset / 10000}ms, viseme id: {e.VisemeId}.");

        var animation = e.Animation; // I get an empty string
    };

    // If VisemeID is the only thing you want, you can also use `SpeakTextAsync()`
    var result = await synthesizer.SpeakSsmlAsync(ssml);
}

and I get an empty string from e.Animation.

Azure AI Speech

1 answer

  1. romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator
    2023-04-25T10:40:34.2633333+00:00

    @林老师 Are you using the sample from the quickstart as mentioned in the documentation?

    I think the main reason the result could be empty is that the SSML does not contain the following element:

    <mstts:viseme type='FacialExpression'/>
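
    Applied to the C# code in your question, that means the viseme element has to sit inside the <voice> element of the SSML you pass to SpeakSsmlAsync(). Here is a minimal sketch, assuming the rest of your synthesizer setup stays the same (the voice name and the spoken text are placeholders only):

    // Minimal sketch: include <mstts:viseme type='FacialExpression'/> inside
    // the <voice> element so that e.Animation is populated in VisemeReceived.
    string ssml =
        "<speak version='1.0' xml:lang='en-US' " +
        "xmlns='http://www.w3.org/2001/10/synthesis' " +
        "xmlns:mstts='http://www.w3.org/2001/mstts'>" +
        "<voice name='en-US-JennyNeural'>" +
        "<mstts:viseme type='FacialExpression'/>" +
        "Norway is a Scandinavian country located in Northern Europe." +
        "</voice></speak>";

    var result = await synthesizer.SpeakSsmlAsync(ssml);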

    I have worked with a user on a previous issue where the animation events were available through the synthesizer's viseme_received event. Here is a snippet of the code used to print the events, which the user later used to generate the animation.

    
    import azure.cognitiveservices.speech as speechsdk

    # Creates an instance of a speech config with the specified subscription key and service region.
    speech_config = speechsdk.SpeechConfig(subscription='<your_key>', region='<your_region>')

    # Creates a speech synthesizer with a null output stream.
    # The audio output data will not be written to any output channel;
    # you can get the audio from the result instead.
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

    # The <mstts:viseme type='FacialExpression'/> element is what makes the
    # animation payload available in the viseme events.
    ssml = "<speak version='1.0' xml:lang='en-US' xmlns='http://www.w3.org/2001/10/synthesis' " \
           "xmlns:mstts='http://www.w3.org/2001/mstts'>" \
           "<voice name='Microsoft Server Speech Text to Speech Voice (en-US, DavisNeural)'>" \
           "<mstts:viseme type='FacialExpression'/>" \
           "Norway is a Scandinavian country located in Northern Europe. It is home to a population of 5.3 million people, and its capital city is Oslo. Norway is known for its stunning natural beauty, with mountains, fjords, and…" \
           "</voice></speak>"

    def viseme_cb(evt):
        # The unit of evt.audio_offset is ticks (1 tick = 100 nanoseconds);
        # divide it by 10,000 to convert to milliseconds.
        print("Viseme event received: audio offset: {}ms, viseme id: {}.".format(
            evt.audio_offset / 10000, evt.viseme_id))

        # `animation` is an XML string for SVG or a JSON string for blend shapes.
        animation = evt.animation
        print(animation)

    # Subscribes to the viseme received event.
    speech_synthesizer.viseme_received.connect(viseme_cb)

    # If the viseme ID is the only thing you want, you can also use `speak_text_async()`.
    result = speech_synthesizer.speak_ssml_async(ssml).get()

    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Speech synthesized.")
    elif result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = result.cancellation_details
        print("Speech synthesis canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))

    Could you cross-check the same, or try the above script with your resource key and region? The events should be available as JSON, as mentioned in the referenced thread, so you can process them for animation.
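
    As a rough illustration of what consuming that JSON could look like on the C# side, here is a minimal, hypothetical sketch. With type='FacialExpression', each event's Animation payload is a JSON chunk containing a frame index and an array of blend-shape frames; the BlendShapeChunk class below is my own name for it, not an SDK type, and the sample payload is abbreviated (real frames carry many more values per frame):

    using System;
    using System.Text.Json;

    // Hypothetical container for one animation chunk; the property names follow
    // the documented blend-shapes JSON ("FrameIndex", "BlendShapes").
    class BlendShapeChunk
    {
        public int FrameIndex { get; set; }
        public double[][] BlendShapes { get; set; }
    }

    class Demo
    {
        static void Main()
        {
            // Abbreviated sample payload; a real event carries full frames.
            string animation = "{\"FrameIndex\":0,\"BlendShapes\":[[0.0,0.1],[0.0,0.2]]}";

            var chunk = JsonSerializer.Deserialize<BlendShapeChunk>(animation);
            Console.WriteLine($"Chunk starts at frame {chunk.FrameIndex} " +
                              $"and contains {chunk.BlendShapes.Length} frames.");
        }
    }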

    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

