Can Azure TTS API use both Custom neural voice and Facial position in BlendShapes?

Question

wave test 20

I know the Azure TTS API can return facial positions with visemes (3D blend shapes):

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis-viseme?tabs=visemeid&pivots=programming-language-javascript

and that it can also use Custom Neural Voice:

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice#use-speaking-styles-and-roles

Is it possible to specify a Custom Neural Voice and get viseme data back in a single API call?

dupammi 8,615 Reputation points Microsoft External Staff

2023-09-25T11:00:28.15+00:00

Hi @wave test ,

Thank you for your question about using both Custom Neural Voice and viseme data with the Azure TTS API. I will be happy to assist you with this.

Regarding your query, it is possible to use both Custom Neural Voice and return viseme data in one API call using the Azure TTS API.

To achieve this, set the synthesis voice to your Custom Neural Voice and subscribe to the viseme event that the Speech SDK raises during synthesis. The output format (for example "riff-16khz-16bit-mono-pcm", or "raw-16khz-16bit-mono-pcm" for headerless PCM) applies to the audio only; the viseme data arrives separately, through that event.
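For blend shapes specifically, the viseme documentation linked in the question describes an SSML element, mstts:viseme with type="FacialExpression", placed inside the voice element. A minimal sketch of building such a request body follows. The helper name build_viseme_ssml and the voice name in the usage note are hypothetical, and this is not the code from the answer's screenshots.

```python
from xml.sax.saxutils import escape, quoteattr


def build_viseme_ssml(voice_name: str, text: str, lang: str = "en-US") -> str:
    """Return SSML that selects one voice and requests blend-shape visemes.

    voice_name is whatever name your voice is deployed under (hypothetical
    here); the text is XML-escaped so characters such as & stay valid.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice_name)}>"
        # Ask the service for facial positions (blend shapes), not just viseme IDs.
        '<mstts:viseme type="FacialExpression"/>'
        f"{escape(text)}"
        "</voice></speak>"
    )
```

For example, build_viseme_ssml("MyCustomVoiceNeural", "Hello there") yields a single SSML document that names the voice and requests facial positions in the same request.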

Here's the example Python code that I used on my end to generate audio output and viseme data output using Custom Neural Voice:

Here is a step-by-step explanation of the Python code in the screenshots above. The code generates text-to-speech (TTS) audio using the Azure Cognitive Services Speech SDK and then extracts viseme data from the audio stream. It also shows how to use the extracted viseme data to animate a 3D model.

Import necessary modules:

azure.cognitiveservices.speech: This imports the Azure Cognitive Services Speech SDK for TTS.
struct: This module is used for working with binary data.
time: Used for introducing delays to simulate real-time animation.
OpenGL.GL, OpenGL.GLU, OpenGL.GLUT: These modules are part of PyOpenGL, a Python wrapper for OpenGL, and are used for 3D graphics rendering.

Set up Azure Cognitive Services Speech configuration:

speech_key and service_region: These variables store your subscription key and the Azure region where the TTS service is hosted.
custom_neural_voice_name: This variable specifies the custom neural voice to be used.
viseme_output_format: This variable specifies the format of viseme data.
input_text: This variable stores the text to be synthesized into speech.

Create a speech synthesis configuration:

speech_config: This object is created with the subscription key and service region, and it configures the TTS service.

Configure the speech synthesis voice:

speech_config.speech_synthesis_voice_name: This sets the voice to be used for synthesis to the custom neural voice specified earlier.

Create a speech synthesizer:

speech_synthesizer: This object is created using the speech configuration.

Generate speech:

speech_synthesizer.speak_text_async(input_text).get()

This generates the TTS output for the input text and stores the result in speech_synthesis_result.

Extract viseme data from the audio output:

This defines the format of viseme data (viseme_data_format) and its size (viseme_data_size). It processes the audio output stream to extract viseme data and appends it to the viseme_events list.

Print viseme events:

The code then iterates through viseme_events and prints the viseme ID, start time, and end time for each viseme event.

Animate a 3D model using viseme data:

It iterates through viseme_events again and extracts the viseme ID. If the viseme ID is found in viseme_animation_frames, it retrieves the corresponding animation frame data.

The code simulates setting the animation frame for a 3D model (here using OpenGL) based on the viseme ID and animation frame data. It uses glTranslatef to translate the model.

Simulate real-time animation:

time.sleep(0.1): This introduces a short delay to simulate real-time animation.
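Since the screenshots are not reproduced here, the following is a separate, minimal sketch rather than the code described above. It takes the route shown in the Speech SDK viseme documentation: connect a handler to the synthesizer's viseme_received event and collect the events while the audio is synthesized. It assumes the custom voice is already deployed and is addressed through the SDK's endpoint_id setting; the function names, VisemeRecord, and the parameter values are hypothetical.

```python
from dataclasses import dataclass

TICKS_PER_MS = 10_000  # the SDK reports audio_offset in 100-nanosecond ticks


@dataclass
class VisemeRecord:
    offset_ms: float  # position in the audio where the viseme starts
    viseme_id: int
    animation: str    # JSON blend-shape chunk; empty when none was requested


def to_record(evt) -> VisemeRecord:
    """Convert one viseme event into a plain record."""
    return VisemeRecord(evt.audio_offset / TICKS_PER_MS, evt.viseme_id, evt.animation)


def synthesize_with_visemes(key: str, region: str, endpoint_id: str, ssml: str):
    """Synthesize SSML with a custom voice; return (audio bytes, viseme records)."""
    # Imported lazily so the helpers above work without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(subscription=key, region=region)
    # Assumption: the custom voice deployment is selected by its endpoint ID,
    # and the SSML's voice element names the custom voice itself.
    config.endpoint_id = endpoint_id

    # audio_config=None keeps the audio in memory instead of playing it.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config, audio_config=None)

    records = []
    synthesizer.viseme_received.connect(lambda evt: records.append(to_record(evt)))

    result = synthesizer.speak_ssml_async(ssml).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"Synthesis did not complete: {result.reason}")
    return result.audio_data, records
```

A call such as synthesize_with_visemes(speech_key, service_region, my_endpoint_id, ssml) returns the audio and the viseme events from one synthesis request, which is the combination the question asks about.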

Please see the documentation below for more details:

Azure_AI-Services_Speech-Service_How-to-speech-synthesis-viseme - Get facial position with viseme

Azure_AI_Services_Speech-service_custom_neural_voice

I hope this information helps!

dupammi 8,615 Reputation points Microsoft External Staff

2023-09-26T12:39:54.56+00:00

Hi @wave test ,

We haven't heard back from you since the last response, so I am checking in to see whether you have found a resolution. If you have, please share it with the community, as it may help others.

祯贾 0 Reputation points

2024-03-20T02:10:58.4166667+00:00

Why is the viseme ID so long?

Accepted answer

Answer 1

dupammi 8,615 Microsoft External Staff

@wave test ,

I am following up to check whether my answer in the comments section of this thread helped. Do let us know if you have any further queries.

To restate the resolution here, this is the gist of that comment.

Yes, it is possible to use both Custom Neural Voice and return viseme data in one API call using the Azure TTS API.

To achieve this, set the synthesis voice to your Custom Neural Voice and subscribe to the viseme event that the Speech SDK raises during synthesis. The output format (for example "riff-16khz-16bit-mono-pcm", or "raw-16khz-16bit-mono-pcm" for headerless PCM) applies to the audio only; the viseme data arrives separately, through that event.

For working sample Python code and documentation, please refer to the comments section of this thread.

Please 'Accept as answer' and ‘Upvote’ if it helped so that it can help others in the community looking for help on similar topics.

wave test 20

Thanks a lot for giving such detailed code and corresponding instructions.

I mainly wanted to confirm that custom voice audio and facial positions can be returned at the same time.

Over the past few days I have already tested sending mstts:express-as XML (with mstts:viseme type="FacialExpression") and a built-in voice name through the Python API, getting back the audio and the facial positions as blend shapes, and then sending them over the Live Link protocol to UE5 for MetaHuman rendering.

The result for English speech is very good; for Chinese, the mouth matching is not as good.
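For anyone building a similar pipeline, here is a minimal sketch of turning one blend-shape chunk into timestamped frames before forwarding them to a renderer. It assumes the layout described in the viseme documentation: each event's animation string is JSON with a FrameIndex and a BlendShapes array, with 55 values per frame at 60 frames per second. The function name is hypothetical, and the Live Link side is not covered.

```python
import json

FRAME_RATE = 60              # frames per second, per the viseme documentation
BLEND_SHAPES_PER_FRAME = 55  # facial positions per frame, per the same docs


def frames_from_animation(animation: str):
    """Parse one animation chunk into a list of (time_in_seconds, weights)."""
    chunk = json.loads(animation)
    first_index = chunk["FrameIndex"]  # number of frames that preceded this chunk
    frames = []
    for i, weights in enumerate(chunk["BlendShapes"]):
        if len(weights) != BLEND_SHAPES_PER_FRAME:
            raise ValueError(f"expected {BLEND_SHAPES_PER_FRAME} values, got {len(weights)}")
        frames.append(((first_index + i) / FRAME_RATE, weights))
    return frames
```

Each returned pair gives the time at which a frame should be shown and its 55 weights, so the frames can be scheduled against audio playback.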

I'll take a look at the effect of customizing the voice later. Thank you very much for your reply, and Happy Mid-Autumn Festival!
