Need Guidance to Build a Voice-Enabled AI Agent That Supports Image-Based Responses and Feedback

Pragadish 20 Reputation points
2025-06-05T07:20:01.9433333+00:00

Hi everyone!

I'm new to Microsoft Azure and currently exploring what's possible using Copilot Studio. So far, I’ve built a basic conversational agent and connected Azure AI Search with vectorized data from my Blob Storage. The documents I’ve uploaded (PDFs) include work instructions—like wiring setups, camera installations, etc.

Right now, my agent works great for answering text-based questions using the document content. But I want to take it further:

Image Retrieval – If the PDF contains images relevant to the user's question, I want the agent to also return those images along with the answer. Is this even possible? If yes, how should I go about it?

User Image Feedback – I want the user to upload an image (e.g., their wiring setup), and then ask the agent whether it’s correct. Ideally, the agent should look at the image and respond like, “You’ve missed connecting that cable,” etc. Any ideas on how to implement this kind of visual validation?

Voice Support – I’d love to add voice capabilities to my agent so users can interact via speech. What are the best options in Azure for integrating this?

Any guidance, architecture ideas, or tools I should explore would be super helpful!


Accepted answer
  1. Prashanth Veeragoni 4,765 Reputation points Microsoft External Staff Moderator
    2025-06-05T09:15:40.4933333+00:00

    Hi Pragadish,

    You’re thinking in exactly the right direction for building a multimodal, voice-enabled AI assistant using Azure’s powerful ecosystem! Let me break down your three goals and how to achieve them using Azure tools like Azure AI Studio (Copilot Studio), Azure AI Vision, Azure AI Speech, and OpenAI models.

    1. Image Retrieval from PDF via AI Agent (User asks question → Image + Answer)

    Yes, it is possible, though not out-of-the-box. You’ll need to extract the images and their context during your indexing process and make them retrievable by your agent.

    How to implement:

    Step 1: Extract Images and Text from the PDF

    Use Azure AI Document Intelligence (formerly Form Recognizer):

    - Upload your PDFs
    - Extract:
      - Text content → for vectorization and Azure AI Search
      - Images → store them separately (e.g., in Azure Blob Storage)
      - Mapping → map each text section to its image file references
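    As a rough sketch of that mapping step: assuming extraction (e.g., via the Document Intelligence layout model) gives you text sections and figures tagged with page numbers, a small helper can attach each page's image blob names to its text before indexing. The record shapes and field names here are illustrative, not a fixed schema.

```python
def map_text_to_images(sections, figures):
    """Attach image file references to the text section on the same page.

    sections: list of {"page": int, "text": str} extracted from the PDF
    figures:  list of {"page": int, "blob_name": str} saved to Blob Storage
    Returns search-ready records pairing each text chunk with its images.
    """
    # Group image blob names by the page they were found on.
    by_page = {}
    for fig in figures:
        by_page.setdefault(fig["page"], []).append(fig["blob_name"])
    # Emit one record per text section, carrying that page's images along.
    return [
        {"text": s["text"], "page": s["page"], "images": by_page.get(s["page"], [])}
        for s in sections
    ]
```

    A section with no figure on its page simply gets an empty `images` list, so downstream code can treat the field uniformly.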

    Step 2: Index the Data Intelligently

    - Store the extracted image references (e.g., image_id_123.jpg) alongside the associated text in Azure AI Search or your vector store.
    - When a user asks a question, the search results can then return both:
      - The text answer
      - The related image URL (hosted in Blob Storage or a CDN)
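    A minimal sketch of what one such index record could look like. The field names (`content`, `contentVector`, `imageUrls`) are examples and must match whatever schema you define in your Azure AI Search index; uploading is shown as a comment using the `azure-search-documents` package.

```python
def build_search_doc(doc_id, text, embedding, image_urls):
    """Shape one record for an index that stores a text chunk, its vector,
    and the related image URLs. In the index schema, imageUrls would be a
    Collection(Edm.String) field and contentVector a vector field."""
    return {
        "id": doc_id,
        "content": text,
        "contentVector": embedding,
        "imageUrls": image_urls,
    }

# Upload (sketch) with the azure-search-documents package:
# from azure.search.documents import SearchClient
# client.upload_documents([
#     build_search_doc("1", "Wiring step 3 ...", vec,
#                      ["https://yourstorage.blob.core.windows.net/setup/image123.jpg"]),
# ])
```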

    Step 3: Enhance the Chat Agent

    In Azure AI Studio, customize your grounded Copilot or RAG pipeline:

    - When generating answers, include image references or Markdown so the client can render them:

    ![Cable Setup](https://yourstorage.blob.core.windows.net/setup/image123.jpg)
    

    2. User Image Feedback → "Is My Setup Correct?"

    Is it possible?

    Yes, but it requires a custom image classification or object detection model, which you can integrate into your Azure agent.

    Approach:

    Option A: Use Azure AI Vision (Custom Vision)

    - Train a Custom Vision model:
      - Upload a labeled dataset of correct vs. incorrect setups
      - Train it to classify or detect issues (e.g., "Cable not connected")
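    One way to turn the model's raw output into the kind of feedback described above: given the (tag, probability) pairs returned by the Custom Vision prediction endpoint, flag any issue tag above a confidence threshold. The tag names (`correct_setup`, issue tags) and the 0.7 threshold are hypothetical choices for your own labeling scheme.

```python
def interpret_predictions(predictions, threshold=0.7):
    """Turn Custom Vision results into a user-facing message.

    predictions: list of (tag_name, probability) pairs, e.g. from the
    Custom Vision prediction API. Any tag other than 'correct_setup'
    that clears the threshold is reported as a likely problem.
    """
    issues = [tag for tag, p in predictions
              if p >= threshold and tag != "correct_setup"]
    if not issues:
        return "Setup looks correct."
    return "Possible issues: " + ", ".join(issues)
```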

    Option B: Use GPT-4o / GPT-4 Turbo with Vision (Azure OpenAI)

    - GPT-4o is a multimodal model that accepts combined image + text input
    - Workflow:
      - User uploads an image (their wiring setup)
      - Send the image plus a prompt ("Is this wiring setup correct?") to your GPT-4o deployment via the Azure OpenAI API
      - It returns a natural-language reply: "The blue wire is not connected to the top pin."

    GPT-4o is best for zero-shot reasoning and works well when labeled training data is scarce.
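    A minimal sketch of the request shape for that workflow: the uploaded image goes in as a base64 data URL alongside the user's question. The deployment name in the comment is a placeholder; the call itself uses the `openai` package's chat completions API against your Azure OpenAI resource.

```python
import base64

def build_vision_messages(image_bytes, question, mime="image/jpeg"):
    """Build the chat 'messages' payload pairing the user's question with
    their uploaded image, encoded as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }]

# Then (sketch):
# response = client.chat.completions.create(
#     model="<your-gpt-4o-deployment>",
#     messages=build_vision_messages(img_bytes, "Is this wiring setup correct?"))
# print(response.choices[0].message.content)
```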

    3. Voice Support for the Agent

    How to add voice interaction:

    Use Azure Speech Services (Speech-to-Text and Text-to-Speech)

    - Integrate the Speech SDK in your web or mobile client
    - Flow:
      - User speaks → speech is converted to text
      - Text is sent to the agent (chat interface)
      - Agent replies → the reply text is converted back to speech
    - Tools:
      - Azure Speech SDK (JavaScript / Python / C#)
      - Azure Bot Framework SDK (if extending Copilot with bot logic)
      - Microphone + speaker integration in the browser/mobile app

    Combine with Copilot Studio:

    - Use Azure Communication Services (ACS) to build a voice-based chat app
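    The speech flow above boils down to three pluggable pieces, and keeping them as injected callables makes the loop testable before wiring in the real services. In the `azure-cognitiveservices-speech` package, `recognize` would wrap `SpeechRecognizer.recognize_once()` and `synthesize` would wrap `SpeechSynthesizer.speak_text()`; `ask_agent` stands in for your Copilot Studio / bot endpoint call.

```python
def voice_turn(recognize, ask_agent, synthesize):
    """One voice round-trip: speech -> text -> agent reply -> speech.

    recognize():      returns the user's utterance as text
                      (e.g. SpeechRecognizer.recognize_once().text)
    ask_agent(text):  sends the text to your agent, returns its reply
    synthesize(text): speaks the reply aloud
                      (e.g. SpeechSynthesizer.speak_text(text))
    """
    user_text = recognize()
    reply = ask_agent(user_text)
    synthesize(reply)
    return user_text, reply
```

    Swapping the three callables between fakes (for testing) and the real SDK objects is the only change needed to go from a unit test to a live voice client.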

    Full Architecture Overview:

    graph TD
        A["User (Voice/Image/Question)"] --> B[Frontend UI]
        B --> |Voice| C[Azure Speech to Text]
        B --> |Image Upload| D[Azure Blob Storage]
        C --> E[Copilot Agent / Azure AI Studio]
        D --> F["Vision Model (Custom Vision or GPT-4o Vision)"]
        E --> G[Azure AI Search with Indexed Text+Images]
        E --> H["Azure OpenAI (GPT-4 / GPT-4o)"]
        H --> I[Answer + Image References]
        F --> H
        H --> J[Azure Text-to-Speech]
        J --> B
        I --> B
    

    Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.

    Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.

    Thank you! 

