You can use a script to chunk documents as they arrive in Blob storage. One approach is an Azure Function with a Blob storage trigger that fires whenever a new document is uploaded. The function can then convert the document to PDF, chunk the PDF file, and upload the chunks back to Blob storage. This approach can be extended to other sources such as SharePoint by configuring triggers to monitor multiple storage accounts.
Here are the general steps to implement this approach:
- Create an Azure Function with a Blob storage trigger that listens for new documents uploaded to Blob storage.
- Configure the function to use the appropriate input and output bindings for Blob storage.
- Write a script that converts the document to PDF, chunks the PDF file, and uploads the chunks to Blob storage (a minimal sketch follows this list).
- Deploy the script to the Azure Function.
- Test the function by uploading a document and verifying that the chunks are created and uploaded to Blob storage.
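As a rough illustration, here is a minimal sketch of such a function using the Azure Functions Python v2 programming model. The container names (`incoming-docs`, `chunks`), the fixed character-based chunk size, and the use of `pypdf` for text extraction are assumptions for illustration only; the sketch also assumes the uploaded document is already a PDF, leaving conversion from other formats as an earlier step.

```python
import io
import os

import azure.functions as func
from azure.storage.blob import BlobServiceClient
from pypdf import PdfReader  # assumed extractor; any PDF text library works

app = func.FunctionApp()

CHUNK_SIZE = 1000  # characters per chunk; tune for your models

@app.blob_trigger(arg_name="blob", path="incoming-docs/{name}",
                  connection="AzureWebJobsStorage")
def chunk_document(blob: func.InputStream):
    """Runs whenever a new blob lands in the incoming-docs container."""
    # Extract the text (assumes the upload is already a PDF; converting
    # other formats to PDF would happen before this point).
    reader = PdfReader(io.BytesIO(blob.read()))
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Naive fixed-size chunking; see the overlapping strategy below.
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

    # Write each chunk to a separate blob in the "chunks" container.
    service = BlobServiceClient.from_connection_string(
        os.environ["AzureWebJobsStorage"])
    container = service.get_container_client("chunks")
    base = os.path.splitext(os.path.basename(blob.name))[0]
    for i, chunk in enumerate(chunks):
        container.upload_blob(name=f"{base}-chunk-{i:04d}.txt",
                              data=chunk, overwrite=True)
```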
Note that the chunk size should be tuned to the models used for summarization and embeddings, and that overlapping chunks help preserve context and semantic richness for queries that target specific information. Libraries such as LangChain and Semantic Kernel provide ready-made chunking utilities.
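For example, LangChain's `RecursiveCharacterTextSplitter` implements such an overlapping strategy; the chunk size and overlap below are illustrative values to tune against your own models, not recommendations:

```python
# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "Some long document text. " * 200  # stand-in for extracted PDF text

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # tune to the summarization/embedding models in use
    chunk_overlap=200,  # overlap carries context across chunk boundaries
)
chunks = splitter.split_text(text)
print(len(chunks), "chunks; first chunk:", chunks[0][:60])
```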