Image RAG: does it exist? Where to start?

Mohamed Hussein 650 Reputation points
2025-03-07T04:38:21.24+00:00

Good Day,

I have a set of images; each image has a description and a caption.

When a user enters a prompt, I need two assistants (if that is feasible):

  1. Reply with the normal Chat Completions.
  2. If the user prompt matches any image description/caption, the system shall send that image to the user as an additional completion.

Does such an approach exist? What is it called?

Where to start?

Thank you,

Mohamed Hussein

Azure AI Search

Accepted answer
  1. Prashanth Veeragoni 1,520 Reputation points Microsoft External Staff
    2025-03-07T06:32:07.15+00:00

    Hi Mohamed Hussein,

    Welcome to Microsoft Q&A forum. Thanks for posting your query.

    Your approach is related to Retrieval-Augmented Generation (RAG) for Images, which can be termed "Image RAG" or Multimodal RAG. However, traditional RAG primarily deals with text retrieval, and for images, a Hybrid Multimodal Retrieval System is needed.

    Yes, an Image RAG (Multimodal RAG) approach exists. You can implement it using Azure OpenAI + Vector Search. Start by generating embeddings for the images, storing them in a vector database, retrieving relevant matches for each query, and returning a hybrid response (text + images).

    Solution Approach

    1. Set up Azure OpenAI Chat Model

    Deploy GPT-4 Turbo for text-based responses using Azure OpenAI.

    Implement an API endpoint using FastAPI or Flask.

    2. Generate Image Embeddings

    Use Azure Cognitive Services Vision API or CLIP to extract embeddings for images.

    Store these embeddings along with image URLs in Azure Cognitive Search / Pinecone / FAISS.

    3. Implement Retrieval Logic

    When a user enters a prompt, generate the embedding and search for nearest matches in the vector database.

    4. Combine & Send Response

    If a match is found, return both the text completion and the matched image(s).

    If no match is found, return only the text completion.

    Detailed Explanation:

    Step 1: Prepare Image Data

    Collect images and their corresponding descriptions/captions.

    Use CLIP (Contrastive Language-Image Pretraining) to generate vector embeddings for the images. Alternatively, use GPT-4 Turbo with Vision to generate richer text descriptions of the images and embed those descriptions with a text embedding model.

    Store these embeddings in a Vector Database (Azure Cognitive Search, FAISS, or Pinecone).
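
    As a minimal sketch (not an official sample), assuming CLIP via the sentence-transformers package and a local FAISS index; the file paths, captions, and index file name below are placeholders for your own catalogue:

    ```python
    # Hypothetical image catalogue: swap in your own paths and captions.
    from PIL import Image
    from sentence_transformers import SentenceTransformer
    import faiss
    import numpy as np

    clip = SentenceTransformer("clip-ViT-B-32")    # CLIP exposed through sentence-transformers

    image_records = [
        {"path": "images/cat.jpg",  "caption": "A cat sleeping on a sofa"},
        {"path": "images/lake.jpg", "caption": "Sunrise over a mountain lake"},
    ]

    # Encode every image into a vector (512 dimensions for ViT-B/32).
    vectors = clip.encode([Image.open(r["path"]) for r in image_records])
    vectors = np.asarray(vectors, dtype="float32")
    faiss.normalize_L2(vectors)                    # normalize so inner product == cosine similarity

    index = faiss.IndexFlatIP(vectors.shape[1])    # exact inner-product index
    index.add(vectors)
    faiss.write_index(index, "image_index.faiss")  # persist next to the image_records metadata
    ```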

    Step 2: Process User Prompt

    When a user submits a prompt, first send it to Azure OpenAI Chat Completion for a standard response.

    Convert the prompt into an embedding with the same model that produced the image vectors (for example, CLIP's text encoder). A text-only model such as the OpenAI Embeddings API only fits here if you embedded the captions rather than the images themselves.
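
    A hedged sketch of this step, assuming the openai 1.x Python SDK and an Azure OpenAI chat deployment named "gpt-4-turbo"; the endpoint, key, and deployment name are placeholders:

    ```python
    import os
    from openai import AzureOpenAI
    from sentence_transformers import SentenceTransformer

    chat_client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-02-01",
    )
    clip = SentenceTransformer("clip-ViT-B-32")    # same model used for the image embeddings

    def answer_and_embed(prompt: str):
        # 1) Normal chat completion for the text answer.
        completion = chat_client.chat.completions.create(
            model="gpt-4-turbo",                   # your Azure OpenAI deployment name
            messages=[{"role": "user", "content": prompt}],
        )
        text_answer = completion.choices[0].message.content

        # 2) Embed the same prompt with CLIP so it lives in the image vector space.
        prompt_vector = clip.encode(prompt)
        return text_answer, prompt_vector
    ```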

    Step 3: Retrieve Matching Images

    Search the Vector Database for the closest matching images using cosine similarity between the user’s query embedding and stored image embeddings.

    If a match is found (above a certain similarity threshold), retrieve the image and its description.
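
    Continuing the sketch, a retrieval helper over the FAISS index built in Step 1; the similarity threshold of 0.25 is an arbitrary starting point you would tune on your own data:

    ```python
    import faiss
    import numpy as np

    def find_matching_images(prompt_vector, image_records, index, top_k=3, threshold=0.25):
        query = np.asarray([prompt_vector], dtype="float32")
        faiss.normalize_L2(query)                  # cosine similarity via normalized inner product
        scores, ids = index.search(query, top_k)

        matches = []
        for score, idx in zip(scores[0], ids[0]):
            if idx != -1 and score >= threshold:   # -1 means "no result returned"
                record = image_records[idx]
                matches.append({"path": record["path"],
                                "caption": record["caption"],
                                "score": float(score)})
        return matches
    ```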

    Step 4: Respond with Both Text & Images

    Return a combined response:

    Text Response from Chat Completion.

    Relevant Image(s) with their description, if a match is found.
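
    To tie the pieces together, a minimal FastAPI endpoint that calls the hypothetical helpers from the earlier sketches (answer_and_embed, find_matching_images) and attaches images only when a match clears the threshold:

    ```python
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class ChatRequest(BaseModel):
        prompt: str

    @app.post("/chat")
    def chat(req: ChatRequest):
        text_answer, prompt_vector = answer_and_embed(req.prompt)
        matches = find_matching_images(prompt_vector, image_records, index)

        response = {"answer": text_answer}
        if matches:                                # only attach images when something matched
            response["images"] = matches
        return response
    ```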

    Also, please refer to the documents below:

    RAG Overview: https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/retrieval-augmented-generation

    RAG Implementation: https://learn.microsoft.com/en-us/training/modules/use-own-data-azure-openai/

    Hope this helps. Do let us know if you have any further queries.

    ------------- 

    If this answers your query, do click "Accept Answer" and "Yes" for "Was this answer helpful".

    Thank you.

    1 person found this answer helpful.
