Hello @Huiyuan SUN
Thanks for reaching out to us. Yes, there are AI services and APIs that can drive digital humans using audio input. One example is Azure Cognitive Services Speech Service, which includes a Speech-to-Text API that transcribes audio input into text. This text can then be used to drive a digital human's facial expressions or gestures.
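For example, a minimal transcription call with the Speech SDK for Python could look like the sketch below; the subscription key, region, and input.wav path are placeholders you would replace with your own values.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials and input file - replace with your own Speech resource values.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
audio_config = speechsdk.audio.AudioConfig(filename="input.wav")

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    # The transcript can now be fed into your digital human pipeline.
    print("Transcript:", result.text)
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized.")
```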
Currently, Azure Speech Services supports 2D and 3D animation.
For 2D characters, you can design a character that suits your scenario and prepare a Scalable Vector Graphics (SVG) image for each viseme ID; the viseme events then give you a time-based face position for the character, as in the sketch below.
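As a rough illustration, assuming you synthesize the avatar's speech with the Speech SDK for Python, you can subscribe to viseme events and map each viseme ID to one of your SVG poses. The viseme_<id>.svg naming here is purely hypothetical.

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
# audio_config=None keeps the synthesized audio in the result instead of playing it.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

viseme_timeline = []

def on_viseme(evt):
    # audio_offset is in ticks (100 ns); viseme_id identifies the mouth shape.
    viseme_timeline.append((evt.audio_offset / 10_000, evt.viseme_id))

synthesizer.viseme_received.connect(on_viseme)
synthesizer.speak_text_async("Hello, I am a digital human.").get()

for offset_ms, viseme_id in viseme_timeline:
    # Map each viseme ID to the SVG frame you designed for it (hypothetical naming).
    print(f"{offset_ms:.0f} ms -> viseme_{viseme_id}.svg")
```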
For a 3D character that you designed, you can use blend shapes to drive its facial movements. The blend shapes are delivered as a JSON string containing a 2-dimensional matrix: each row represents one frame (at 60 FPS), and each frame is an array of 55 facial positions.
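Here is a rough sketch of how that payload could be consumed, assuming the blend shapes arrive in the viseme event's animation property as {"FrameIndex": ..., "BlendShapes": [[...], ...]} after requesting them with the <mstts:viseme type="FacialExpression"/> SSML element; the apply_to_character hook is hypothetical and stands in for your own 3D engine. Please verify the exact field names against the current Speech Service documentation.

```python
import json

def handle_blend_shape_animation(animation_json: str) -> None:
    # Assumed payload shape: {"FrameIndex": <int>, "BlendShapes": [[55 floats], ...]}
    data = json.loads(animation_json)
    first_frame = data["FrameIndex"]
    for i, frame in enumerate(data["BlendShapes"]):
        # Each row is one frame at 60 FPS; each frame holds 55 facial positions.
        apply_to_character(first_frame + i, frame)

def apply_to_character(frame_index: int, weights: list) -> None:
    # Hypothetical hook: forward the 55 blend shape weights to your 3D renderer.
    print(frame_index, weights[:3], "...")
```

In practice this handler would be called from the same viseme_received callback shown above, using the event's animation string.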
For more information about how to do this in Azure Speech Service, please refer to the document below -
Another example is the Azure Kinect, whose Body Tracking SDK can track a person's movements and gestures in real time. This information can be used to drive a digital human's body movements and gestures.
There are also third-party tools and platforms that specialize in creating and animating digital humans, such as Reallusion's iClone and Unreal Engine's MetaHuman Creator. These tools often include AI-driven features for generating facial expressions and animations based on audio input.
I hope this helps.
Regards,
Yutong
-Please kindly accept the answer and vote 'Yes' if you find it helpful, to support the community. Thanks a lot.