Automated Chunking before Uploading to Blob Storage

johananmahendran 120 Reputation points
2023-08-21T06:46:29.84+00:00

Hi,

I am currently implementing a Retrieval Augmented Generation (RAG) pattern for my data.

Within the pipeline, data is first uploaded to Blob storage and then automatically indexed by a Cognitive Search indexer.

The files uploaded to Blob storage can be .pdf, .jpeg, .docx, or .xlsx.

The files can be large, sometimes exceeding the Cognitive Search indexer limit, so I need to chunk the documents.

As of now, I prechunk the data manually before uploading to Blob storage. This involves running a script on my local machine that converts the documents to PDF, chunks the PDF file, and then uploads the chunks to Blob storage.
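For context, the core of that script looks roughly like the following: a minimal sketch using pypdf and azure-storage-blob, assuming the file has already been converted to PDF, with placeholder connection string, container name, and chunk size.

```python
import io
from pypdf import PdfReader, PdfWriter
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "<storage-connection-string>"  # placeholder
CONTAINER_NAME = "documents"                       # placeholder
PAGES_PER_CHUNK = 10                               # placeholder

def chunk_and_upload(pdf_path: str) -> None:
    """Split a PDF into fixed page ranges and upload each piece as a blob."""
    container = BlobServiceClient.from_connection_string(
        CONNECTION_STRING).get_container_client(CONTAINER_NAME)
    reader = PdfReader(pdf_path)

    for start in range(0, len(reader.pages), PAGES_PER_CHUNK):
        writer = PdfWriter()
        for i in range(start, min(start + PAGES_PER_CHUNK, len(reader.pages))):
            writer.add_page(reader.pages[i])
        out = io.BytesIO()
        writer.write(out)
        out.seek(0)
        name = f"{pdf_path.rsplit('.', 1)[0]}-part{start // PAGES_PER_CHUNK}.pdf"
        container.upload_blob(name=name, data=out, overwrite=True)
```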

In the future, I plan on connecting other sources, like SharePoint, to the Blob storage, and this uploading may happen automatically.

Is there a way to chunk documents automatically as they are uploaded to Blob storage? I was thinking of running a preprocessing script whenever Blob storage receives a new document, but I am not sure how to go about it or what services are available for me to do this.

Thanks.

Azure Blob Storage

1 answer

  1. Q&A Assist
    2023-08-21T06:46:29.84+00:00
    AI-generated content: this answer was created with Q&A Assist.

    It is possible to chunk documents automatically with a script that runs inside the pipeline. One approach is to use an Azure Function with a Blob storage trigger: whenever a new document lands in a staging container, the function converts it to PDF, chunks the PDF file, and uploads the chunks to the container that the Cognitive Search indexer monitors. This approach can be extended to other sources like SharePoint by configuring the trigger to monitor multiple storage accounts.

    Here are the general steps to implement this approach:

    1. Create an Azure Function with a Blob storage trigger that listens for new documents uploaded to Blob storage.
    2. Configure the function to use the appropriate input and output bindings for Blob storage.
    3. Write a script that converts the document to PDF, chunks the PDF file, and uploads the chunks to Blob storage (a minimal sketch follows these steps).
    4. Deploy the script to the Azure Function.
    5. Test the function by uploading a document to Blob storage and verifying that the chunks are created and uploaded to Blob storage.
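    As a rough illustration of steps 1-4, here is a minimal sketch using the Azure Functions Python v2 programming model. The "incoming" and "chunked" container names and the PAGES_PER_CHUNK value are assumptions, and the sketch assumes the triggering blob is already a PDF (other formats would need a conversion step first):

    ```python
    import io
    import os
    import azure.functions as func
    from pypdf import PdfReader, PdfWriter
    from azure.storage.blob import BlobServiceClient

    app = func.FunctionApp()
    PAGES_PER_CHUNK = 10  # assumed; tune so chunks stay under the indexer limit

    @app.blob_trigger(arg_name="blob", path="incoming/{name}",
                      connection="AzureWebJobsStorage")
    def chunk_document(blob: func.InputStream):
        # Read the newly uploaded blob and split it into page-range chunks.
        reader = PdfReader(io.BytesIO(blob.read()))
        container = BlobServiceClient.from_connection_string(
            os.environ["AzureWebJobsStorage"]).get_container_client("chunked")

        base = os.path.splitext(os.path.basename(blob.name))[0]
        for start in range(0, len(reader.pages), PAGES_PER_CHUNK):
            writer = PdfWriter()
            for i in range(start, min(start + PAGES_PER_CHUNK, len(reader.pages))):
                writer.add_page(reader.pages[i])
            out = io.BytesIO()
            writer.write(out)
            out.seek(0)
            container.upload_blob(
                name=f"{base}-part{start // PAGES_PER_CHUNK}.pdf",
                data=out, overwrite=True)
    ```

    Pointing the Cognitive Search indexer at the "chunked" container rather than "incoming" keeps it from indexing the oversized originals.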

    It is important to note that the chunk size should be optimized for the models used for summarization and embeddings, and that overlapping strategies can help preserve context and semantic richness for queries that target specific information. Additionally, some libraries that provide chunking include LangChain and Semantic Kernel.
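    For example, LangChain's RecursiveCharacterTextSplitter implements such an overlapping strategy; the chunk_size and chunk_overlap values below are illustrative and should be tuned for the embedding model:

    ```python
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Text extracted from one of the uploaded documents (placeholder source).
    document_text = open("document.txt").read()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # illustrative: characters per chunk
        chunk_overlap=200,  # illustrative: overlap preserves context across chunk boundaries
    )
    chunks = splitter.split_text(document_text)
    ```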


