Hi @Dirk Broenink, thanks for the question.
When using Azure AI Search Service, the chunking of documents is determined by several factors and techniques.
Here’s a summary of how it works:
- Common Chunking Techniques: The service can use fixed-size chunks, variable-size chunks based on content, or a combination of both. Fixed-size chunks might be defined by a certain number of words or tokens, with some overlap between adjacent chunks to preserve context.
- Content Overlap: A small amount of text overlap between chunks is recommended to help maintain context; an overlap of approximately 10% is a reasonable starting point for testing.
- Factors for Chunking: Considerations include the maximum token input limits of the embedding models, the type of data, and the specific use case.
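To make the fixed-size-with-overlap idea concrete, here is a minimal sketch in plain Python. It assumes simple whitespace tokenization and an illustrative chunk size of 200 words; a real indexer would typically count model tokens rather than words.

```python
# Minimal sketch of fixed-size chunking with ~10% overlap.
# Assumes whitespace tokenization; chunk_size and overlap_ratio
# are illustrative values, not Azure defaults.
def chunk_text(text, chunk_size=200, overlap_ratio=0.10):
    """Split text into word chunks of up to chunk_size, overlapping ~10%."""
    words = text.split()
    overlap = int(chunk_size * overlap_ratio)   # 20 words for size 200
    step = chunk_size - overlap                 # advance 180 words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail of the text
    return chunks

# 450 words -> chunks covering words [0:200], [180:380], [360:450]
sample = " ".join(f"w{i}" for i in range(450))
parts = chunk_text(sample)
print(len(parts))  # 3
```

Because each chunk starts 180 words after the previous one, the final 20 words of one chunk reappear at the start of the next, which is what preserves context across chunk boundaries.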
Please note that the exact number of characters after which chunking occurs varies with the chunking technique used and the specific configuration of your indexer. For more detailed guidance, you may want to refer to the Azure documentation on chunking large documents.
The chunking behavior you observed, where large files are split into multiple citations, is expected and is part of the Azure Cognitive Search indexer's normal operation when handling large files, as described in the Indexing Blob Data documentation.
For large documents, Azure AI Search Service might chunk them into multiple citations to ensure that each chunk stays under the maximum input size imposed by the models used for indexing. If you're using integrated vectorization, which is currently in preview, it handles data chunking and embedding internally.
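If you want to control chunking yourself rather than rely on the defaults, one common approach is the Text Split cognitive skill in an indexer skillset. As a rough sketch, the skill definition might look like the JSON below (built here as a Python dict); the 2000-character page length and 200-character overlap are example settings I've chosen for illustration, not service defaults, so check the skill reference for the fields your API version supports.

```python
import json

# Illustrative Text Split skill fragment for an indexer skillset.
# maximumPageLength / pageOverlapLength values are example settings.
split_skill = {
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "description": "Split /document/content into overlapping pages",
    "textSplitMode": "pages",
    "maximumPageLength": 2000,   # upper bound per chunk, in characters
    "pageOverlapLength": 200,    # ~10% overlap to preserve context
    "inputs": [
        {"name": "text", "source": "/document/content"}
    ],
    "outputs": [
        {"name": "textItems", "targetName": "pages"}
    ],
}

print(json.dumps(split_skill, indent=2))
```

Each emitted page then becomes one searchable chunk, which is why a single large blob can surface as several citations in query results.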
If you need to adjust the chunk size or optimize the indexing process for your specific requirements, you can refer to the Indexer Configuration section and consider scaling up your Azure Cognitive Search service, as outlined in the Choose a SKU tier documentation.
Hope that helps.
Best,
Grace