How to design a pipeline that splits PDF or Word documents to be indexed in a vector database?

Tamer Abdulghani 50 Reputation points
2023-09-20T12:34:46.6933333+00:00

We are trying to implement a pipeline that loads files (PDF, Word) from an Azure Storage data lake, splits those documents into pages (maybe), and then stores the resulting pages in another storage account. Whenever a new document arrives, it must trigger the splitting process.

Which Azure services can be used to implement this pipeline?

Is Azure Functions suitable for this purpose, or should we go with Azure Data Factory?

This piece should be part of an LLM architecture, so the files must ultimately be indexed in a vector database.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. Amira Bedhiafi 34,101 Reputation points Volunteer Moderator
    2023-09-20T14:59:42.9366667+00:00

    You can use Azure Event Grid, which detects when a new file is uploaded to Azure Data Lake Storage and can trigger downstream processing in Azure Functions or Logic Apps.

    Azure Functions is a serverless compute service that runs event-driven code, which fits your case of processing each new document on upload. The function can be triggered by Event Grid and can then process the uploaded PDF or Word document.
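    As a rough illustration of the trigger step, here is a minimal sketch of how a function might decide whether an incoming Event Grid event is worth processing. It assumes the event arrives as a dictionary in the Event Grid schema for Blob Storage (`eventType` of `Microsoft.Storage.BlobCreated`, `subject` of the form `/blobServices/default/containers/<container>/blobs/<path>`). The helper name `parse_blob_created` and the list of supported extensions are hypothetical choices for this example, not part of any Azure SDK.

```python
from typing import Optional, Tuple

# Assumption: only these file types should enter the splitting pipeline.
SUPPORTED_EXTENSIONS = (".pdf", ".docx")
SUBJECT_PREFIX = "/blobServices/default/containers/"


def parse_blob_created(event: dict) -> Optional[Tuple[str, str]]:
    """Return (container, blob_path) for a newly created PDF/Word blob, else None."""
    # Ignore deletes, renames, and any other event types.
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        return None

    subject = event.get("subject", "")
    if not subject.startswith(SUBJECT_PREFIX):
        return None

    # "<container>/blobs/<path>" -> ("<container>", "/blobs/", "<path>")
    container, sep, blob_path = subject[len(SUBJECT_PREFIX):].partition("/blobs/")
    if not sep or not blob_path.lower().endswith(SUPPORTED_EXTENSIONS):
        return None
    return container, blob_path
```

    Inside an Event Grid triggered function, the handler would call this first and return early on `None`, so that unrelated uploads never reach the splitting code.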

    For splitting and reading PDFs, you can use libraries such as PyPDF2 (Python) or PDFBox (Java). For Word documents, python-docx (Python) or Apache POI (Java) are good choices.
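    To make the PDF case concrete, here is a minimal sketch of splitting a PDF into single-page PDFs in memory. It uses `pypdf`, the maintained successor to PyPDF2, which exposes the same `PdfReader` and `PdfWriter` classes as PyPDF2 3.x. The function names and the `page-0001.pdf` naming scheme are assumptions made for this example.

```python
import io
from pathlib import PurePosixPath
from typing import Iterator, Tuple


def page_blob_name(source_path: str, page_number: int) -> str:
    """Hypothetical naming scheme: 'reports/q3.pdf' -> 'reports/q3/page-0001.pdf'."""
    path = PurePosixPath(source_path)
    return str(path.with_suffix("") / f"page-{page_number:04d}.pdf")


def split_pdf(pdf_bytes: bytes, source_path: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (blob_name, single_page_pdf_bytes) for every page of the input PDF."""
    # Imported here so the naming helper stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(io.BytesIO(pdf_bytes))
    for number, page in enumerate(reader.pages, start=1):
        # One writer per page, so each output file contains exactly one page.
        writer = PdfWriter()
        writer.add_page(page)
        buffer = io.BytesIO()
        writer.write(buffer)
        yield page_blob_name(source_path, number), buffer.getvalue()
```

    Note that python-docx exposes paragraphs and tables rather than rendered pages, so a Word document would more likely be split by paragraph or heading than by page.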

    After processing, the split pages can be stored in Azure Blob Storage for further indexing or retrieval. Once a document is split, you will want to convert its content into vectors. If you prefer Azure-native services, Azure Cognitive Search is an option.
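    Before converting text into vectors, a page is often cut into smaller overlapping chunks so that each piece fits the embedding model's input limit. The sketch below shows one simple character-based way to do that. The function name `chunk_text` and the default sizes are assumptions for illustration, and the embedding call itself is left out because it depends on the model you choose.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of at most max_chars, each sharing `overlap` chars with the previous one."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")

    # Collapse runs of whitespace left over from PDF/Word extraction.
    normalized = " ".join(text.split())

    chunks: List[str] = []
    step = max_chars - overlap
    for start in range(0, len(normalized), step):
        chunks.append(normalized[start:start + max_chars])
        # Stop once this chunk has reached the end of the text.
        if start + max_chars >= len(normalized):
            break
    return chunks
```

    Each chunk would then be embedded and stored in the vector index together with metadata such as the source file and page number.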

    I think ADF might be overkill for your use case unless you have additional data transformation and integration needs that aren’t mentioned.
