How to design a pipeline that splits PDF or Word documents to be indexed in a vector database?

Question

Tamer Abdulghani 50

We are trying to implement a pipeline that loads files (PDF, Word) from an Azure Data Lake Storage account, splits those documents into pages (maybe), and then stores the resulting pages in another storage account. Whenever a new document arrives, it must trigger the splitting process.

Which Azure services can be used to implement this pipeline?

Is Azure Functions suitable for this purpose, or should we go with Azure Data Factory?

This piece should be part of an LLM architecture, so the files must ultimately be indexed in a vector database.

KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator

2023-09-25T23:03:59.2333333+00:00

@Tamer Abdulghani Just checking in to see if the information below was helpful. If it answers your query, please click Accept Answer and Yes for "was this answer helpful", as it might be beneficial to other community members reading this thread. If you have any further queries, do let us know.

Thank you

Accepted answer

Answer 1

You can use Azure Event Grid, which detects when a new file is uploaded to Azure Data Lake Storage and can trigger downstream processes such as Azure Functions or Logic Apps.

Azure Functions fits well here: it is a serverless compute service that runs event-driven code in response to a variety of events (in your case, a new document upload). The function can be triggered by Event Grid and can then process the uploaded PDF or Word document.
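To make the trigger step more concrete, here is a minimal sketch of the first thing such a function might do: look at the Event Grid payload and decide whether it describes a new PDF or Word file. It uses only the standard library and works on the event as a plain dict. The `eventType` and `data.url` fields follow the Event Grid schema for storage events; the function name `blob_from_event`, the extension filter, and the sample account and container names are my own assumptions for illustration.

```python
from urllib.parse import urlparse, unquote

# Assumption: only these document types should start the splitting step.
SUPPORTED_EXTENSIONS = (".pdf", ".docx")


def blob_from_event(event: dict):
    """Return (container, blob_path) for a newly created PDF/Word blob, else None."""
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        return None  # ignore deletes, renames, and other event types

    url = event.get("data", {}).get("url", "")
    # The URL path has the shape /<container>/<folder>/<file>
    path = unquote(urlparse(url).path).lstrip("/")
    container, _, blob_path = path.partition("/")

    if not blob_path.lower().endswith(SUPPORTED_EXTENSIONS):
        return None  # not a document type we want to split
    return container, blob_path


# Hypothetical sample event, trimmed to the fields used above.
sample_event = {
    "eventType": "Microsoft.Storage.BlobCreated",
    "subject": "/blobServices/default/containers/incoming/blobs/reports/q1.pdf",
    "data": {"url": "https://example.blob.core.windows.net/incoming/reports/q1.pdf"},
}

print(blob_from_event(sample_event))  # ('incoming', 'reports/q1.pdf')
```

Inside a real Azure Function you would get the same information from the trigger binding rather than a raw dict, then download the blob and hand it to the splitting code.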

For reading and splitting PDFs, you can use libraries such as PyPDF2 (Python) or PDFBox (Java). For Word documents, python-docx (Python) or Apache POI (Java) are good choices.
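As a rough illustration of the PDF case, the sketch below splits a PDF held in memory into one single-page PDF per page. It assumes the `PdfReader` / `PdfWriter` class names used by recent PyPDF2 releases (3.x); the output naming scheme in `page_blob_name` is a made-up convention, not something Azure requires.

```python
import io
import posixpath


def page_blob_name(source_blob: str, page_number: int) -> str:
    """Build an output name such as 'reports/q1/page-0003.pdf' (hypothetical scheme)."""
    stem, _ext = posixpath.splitext(source_blob)
    return f"{stem}/page-{page_number:04d}.pdf"


def split_pdf(pdf_bytes: bytes, source_blob: str) -> dict:
    """Return {output_blob_name: single_page_pdf_bytes} for every page of the PDF."""
    # Imported here so the naming helper still works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 3.x naming

    reader = PdfReader(io.BytesIO(pdf_bytes))
    pages = {}
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)      # copy exactly one page into a new document
        buffer = io.BytesIO()
        writer.write(buffer)       # serialize the one-page PDF to memory
        pages[page_blob_name(source_blob, number)] = buffer.getvalue()
    return pages
```

Each entry of the returned dict can then be uploaded to the destination container. Note that Word files do not store fixed page boundaries, so with python-docx you would more likely split by paragraph or heading than by page.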

After processing, the split pages can be stored in Azure Blob Storage for further indexing or retrieval. Once a document is split, you'd want to convert its content into vectors. If you want to stay with Azure-specific services, Azure Cognitive Search can be an option for indexing them.
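For the last step, a sketch of how the extracted page text could be turned into records ready for a vector index. Everything here is an assumption for illustration: the field names (`id`, `source`, `page`, `content`, `vector`) are hypothetical and would need to match your own index schema, and `embed` stands for whatever embedding model you call; the toy function at the bottom is only a stand-in so the example runs.

```python
import hashlib
from typing import Callable, Iterable, List, Sequence


def build_index_records(
    source_blob: str,
    pages: Iterable[str],
    embed: Callable[[str], Sequence[float]],
) -> List[dict]:
    """Turn the text of each page into one record holding the text and its vector."""
    records = []
    for number, text in enumerate(pages, start=1):
        text = text.strip()
        if not text:
            continue  # blank page: nothing to embed
        # Stable key derived from source and page, so re-runs overwrite rather than duplicate.
        key = hashlib.sha1(f"{source_blob}#{number}".encode("utf-8")).hexdigest()
        records.append(
            {
                "id": key,
                "source": source_blob,
                "page": number,
                "content": text,
                "vector": list(embed(text)),
            }
        )
    return records


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model call; NOT a meaningful embedding."""
    return [float(len(text)), float(text.count(" "))]


records = build_index_records("reports/q1.pdf", ["First page text", "", "Third page"], toy_embed)
print([r["page"] for r in records])  # [1, 3]
```

The resulting list of dicts is the kind of payload most vector stores accept for upload, though the exact upload call depends on the service you pick.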

I think Azure Data Factory (ADF) might be overkill for your use case unless you have additional data transformation and integration needs that aren’t mentioned.
