I need to create a real-time 2D conversational avatar using custom portrait image

SpandanB 0 Reputation points
2024-06-07T09:35:14.63+00:00

I want to create a webpage with a conversational 2D AI avatar using cognitive services with custom portrait image. Is it possible with Azure ?

Azure AI Bot Service
Azure AI Bot Service
An Azure service that provides an integrated environment for bot development.
765 questions
Azure
Azure
A cloud computing platform and infrastructure for building, deploying and managing applications and services through a worldwide network of Microsoft-managed datacenters.
1,025 questions
Azure AI services
Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.
2,508 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. YutongTie-MSFT 47,421 Reputation points
    2024-06-10T02:19:46.8+00:00

    Hello @SpandanB

    Thanks for reaching out to us, I think you want to do a AI avatar which will act the conversation? Is this correct?

    If yes, you can connect the two APIs to accomplish the target - Azure OpenAI and Azure Speech.

    For conversational, the first part will be a chat bot, which you can leverage Language Service or OpenAI Service.

    https://learn.microsoft.com/en-us/azure/ai-services/openai/chatgpt-quickstart?tabs=command-line%2Cpython-new&pivots=programming-language-studio

    For Avatar, you can consider Text to Speech feature.

    If yes, please check on below document about how to make a "conversational" avatar, it can be 3D or 2D.- https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-speech-announces-public-preview-of-text-to-speech/ba-p/3981448

    What is text to speech avatar?

    The text to speech avatar system is a text to speech feature with vision capabilities, that allow customers to create synthetic videos of a 2D photorealistic avatar speaking. The Neural text to speech Avatar models are trained by deep neural networks based on the human video recording samples, and the voice of the avatar is provided by text to speech voice model.  

     

    Why do we build avatars? There are two main reasons:

    • Traditional video content creation requires a lot of time and budget, including setting up video shooting environment, filming videos, editing, etc. With text to speech avatar, users can more efficiently create video. Users can use the avatar to build training videos, product introductions, customer testimonials, etc., simply with text input. 
    • With the release of Azure OpenAI Service and neural text to speech, interactive conversation is more natural than before. With text to speech avatar, the users can create more engaging digital interactions. You can use the avatar to build conversational agents, virtual assistants, chatbots, and more. 

    There are three components in an avatar content generation workflow: text analyzer, the TTS audio synthesizer, and TTS avatar video synthesizer. To generate avatar video, text is first input into the text analyzer, which provides the output in the form of phoneme sequence. Then, the TTS audio synthesizer predicts the acoustic features of the input text and synthesize the voice. These two parts are provided by text to speech voice models. Next, the Neural text to speech Avatar model predicts the image of lip sync with the acoustic features, so that the synthetic video is generated.  

    Below is an overview of the workflow: 

    thumbnail image 1 of blog post titled  Azure AI Speech announces public preview of text to speech avatar

    Please take a look and have a try, I hope this helps.

    Regards,

    Yutong

    -Please kindly accept the answer if you feel helpful to support the community, thanks a lot.

    0 comments No comments