Hi @Dirk Broenink, thanks for the question.
When using Azure AI Search Service, the chunking of documents is determined by several factors and techniques.
Here’s a summary of how it works:
- Common Chunking Techniques: The service can use fixed-size chunks, variable-size chunks based on content, or a combination of both. Fixed-size chunks might be defined by a certain number of words or tokens, with some overlap between adjacent chunks to preserve context.
- Content Overlap: A small amount of text overlap between chunks is recommended to help maintain context; an overlap of approximately 10% is a reasonable starting point for testing.
- Factors for Chunking: Considerations include the maximum token input limits of the embedding models, the type of data, and the specific use case.
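To make the fixed-size-with-overlap idea concrete, here is a minimal sketch in plain Python. It assumes simple whitespace tokenization and an illustrative chunk size of 200 words; a real indexer would typically count model tokens rather than words.

```python
# Minimal sketch of fixed-size chunking with ~10% overlap.
# Assumes whitespace tokenization; chunk_size and overlap_ratio
# are illustrative values, not Azure defaults.
def chunk_text(text, chunk_size=200, overlap_ratio=0.10):
    """Split text into word chunks of up to chunk_size, overlapping ~10%."""
    words = text.split()
    overlap = int(chunk_size * overlap_ratio)   # 20 words for size 200
    step = chunk_size - overlap                 # advance 180 words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail of the text
    return chunks

# 450 words -> chunks covering words [0:200], [180:380], [360:450]
sample = " ".join(f"w{i}" for i in range(450))
parts = chunk_text(sample)
print(len(parts))  # 3
```

Because each chunk starts 180 words after the previous one, the final 20 words of one chunk reappear at the start of the next, which is what preserves context across chunk boundaries.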
Please note that the exact number of characters after which chunking occurs varies with the chunking technique used and the specific configuration of your indexer. For more detailed guidance, you may want to refer to the Azure documentation on chunking large documents.
The chunking behavior you observed, where large files are split into multiple citations, is expected and is part of the Azure Cognitive Search indexer's normal operation when handling large files, as described in the Indexing Blob Data documentation.
For large documents, Azure AI Search Service might chunk them into multiple citations to ensure that each chunk stays under the maximum input size imposed by the models used for indexing. If you're using integrated vectorization, which is currently in preview, it handles data chunking and embedding internally.
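If you want to control chunking yourself rather than rely on the defaults, one common approach is the Text Split cognitive skill in an indexer skillset. As a rough sketch, the skill definition might look like the JSON below (built here as a Python dict); the 2000-character page length and 200-character overlap are example settings I've chosen for illustration, not service defaults, so check the skill reference for the fields your API version supports.

```python
import json

# Illustrative Text Split skill fragment for an indexer skillset.
# maximumPageLength / pageOverlapLength values are example settings.
split_skill = {
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "description": "Split /document/content into overlapping pages",
    "textSplitMode": "pages",
    "maximumPageLength": 2000,   # upper bound per chunk, in characters
    "pageOverlapLength": 200,    # ~10% overlap to preserve context
    "inputs": [
        {"name": "text", "source": "/document/content"}
    ],
    "outputs": [
        {"name": "textItems", "targetName": "pages"}
    ],
}

print(json.dumps(split_skill, indent=2))
```

Each emitted page then becomes one searchable chunk, which is why a single large blob can surface as several citations in query results.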
If you need to adjust the chunk size or optimize the indexing process for your specific requirements, you can refer to the Indexer Configuration section and consider scaling up your Azure Cognitive Search service, as outlined in the Choose a SKU tier documentation.
Hope that helps.
Best,
Grace