How to Update Changes in a Vector Database for PDF Content?

Archana Chaudhary 0 Reputation points
2024-11-18T09:56:48.7833333+00:00

I have a PDF file whose content has already been embedded into vectors and stored in a vector database. Recently, there were some changes made to the PDF. I want to update the corresponding vectors in the vector database to reflect these changes.

What would be the best approach to efficiently update or replace the existing vectors in the database without causing inconsistencies? Are there any specific APIs, tools, or best practices available for this purpose when working with Azure services?

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

2 answers

  1. Archana Chaudhary 0 Reputation points
    2024-11-20T06:13:56.26+00:00

    Hi @Shree Hima Bindu Maganti,
    Thanks for providing the solution approach.
    It is similar to the approach we have already implemented. That approach works, but it is somewhat time-consuming.

    I was looking for a solution that targets only the updated content of the PDF.
    Our ingestion process (what I have been calling training) works as follows:

    1. Pass the document files (PDFs) to Azure Document Intelligence to extract the text chunks.
    2. Generate vector embeddings for the text chunks and document metadata using Azure OpenAI embeddings.
    3. Store these embeddings in the vector database (Azure AI Search).

    This is a fairly standard process.
    When a document file is updated, I want to replace the vector embeddings for the updated content only.
    For example:
    If I have a 5-page PDF and I update the content of page 2 while leaving the other pages unchanged, then when the file is re-processed I expect only the embeddings of the text chunks from the updated part to be replaced. The embeddings of all other text chunks should stay as they are, since their content has not changed.

    I also expect the same behavior for CSV document files.

    Thanks in advance. I appreciate your efforts.


  2. Shree Hima Bindu Maganti 1,065 Reputation points Microsoft Vendor
    2024-11-21T06:19:03.1633333+00:00

    Hi Archana Chaudhary,
    Thank you for your response.
    To update only the changed parts of a document (PDF or CSV) while keeping the embeddings of the unchanged portions intact, follow the steps below.
    Chunk the document into segments:

    For PDFs: Use Azure Document Intelligence to split the document into chunks, such as pages or sections.

    For CSVs: Split the file by rows or columns.
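
    As a rough illustration of the chunking step, the sketch below splits a CSV into fixed-size row groups and gives each group a stable identifier plus a hash of its text. The function name `chunk_csv_rows`, the ID scheme, and the `content_hash` field are assumptions made for this example, not part of any Azure API.

```python
import csv
import hashlib
import io


def chunk_csv_rows(doc_id: str, csv_text: str, rows_per_chunk: int = 2) -> list[dict]:
    """Split CSV text into row groups with stable IDs and content hashes.

    The ID scheme "<doc_id>-rows-<start>" is a hypothetical convention.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return []  # empty file, nothing to chunk
    rows = list(reader)
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        # Repeat the header in every chunk so each one is self-describing.
        # Naive re-join: adequate for simple values without embedded commas.
        text = "\n".join(",".join(row) for row in [header, *group])
        chunks.append({
            "id": f"{doc_id}-rows-{start}",
            "text": text,
            "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        })
    return chunks
```

    Position-based IDs like these are simple, but inserting or deleting a row shifts every later group, so those groups would all look changed. If the CSV has a natural key column, deriving the chunk ID from it is likely to be more stable.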

    Identify which chunks changed (e.g., page 2 of a 5-page PDF), for example by comparing a stored hash of each chunk's text with the hash of the re-extracted text, and keep track of the identifiers of those chunks.

    Extract the text of the updated chunks (e.g., page 2).

    Generate new embeddings for these updated chunks using Azure OpenAI embeddings.

    Delete outdated embeddings for the changed chunks using their unique chunk identifiers.

    Insert the new embeddings for the updated chunks.

    Leave unchanged chunks intact:

    Do not reprocess or update the embeddings of unchanged chunks. This avoids unnecessary computation.

    For example, if only page 2 is updated, reprocess and update the embeddings for page 2 only. The embeddings for pages 1, 3, 4, and 5 remain unchanged.
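
    The steps above can be sketched as a small planning function that decides which chunks need new embeddings. It assumes a hash of each chunk's text was saved alongside its chunk ID at the last indexing run. The names `plan_update` and `content_hash` and the `manual-pN` IDs are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash of the chunk text, used only to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(stored_hashes: dict[str, str], new_chunks: dict[str, str]) -> dict[str, list[str]]:
    """Compare re-extracted chunks against the hashes saved at the last run.

    stored_hashes: chunk ID -> hash saved when the chunk was last indexed
    new_chunks:    chunk ID -> text extracted from the updated document
    """
    plan = {"upsert": [], "delete": [], "unchanged": []}
    for chunk_id, text in new_chunks.items():
        if stored_hashes.get(chunk_id) == content_hash(text):
            plan["unchanged"].append(chunk_id)  # keep the existing embedding
        else:
            plan["upsert"].append(chunk_id)     # new or modified: re-embed
    # Chunks that no longer exist in the document should be removed.
    plan["delete"] = [cid for cid in stored_hashes if cid not in new_chunks]
    return plan


# Example: a 5-page PDF where only page 2 changed.
pages_v1 = {f"manual-p{n}": f"text of page {n}" for n in range(1, 6)}
stored = {cid: content_hash(text) for cid, text in pages_v1.items()}
pages_v2 = dict(pages_v1, **{"manual-p2": "revised text of page 2"})

plan = plan_update(stored, pages_v2)
print(plan["upsert"])  # ['manual-p2']
```

    Only the IDs under `upsert` need to go through the embedding model again. With the `azure-search-documents` Python SDK, those documents would typically be sent with `SearchClient.merge_or_upload_documents` and the IDs under `delete` with `SearchClient.delete_documents`. Keeping the hash as an extra field in the index is an assumption of this sketch, so check the SDK reference for your version before relying on it.
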
    If the answer is helpful, please click "Accept Answer" and kindly upvote it.
