AI Search Service Indexer: Is there a max size of one file/citation?

Dirk Broenink 85 Reputation points
2024-04-25T14:58:46.69+00:00

I have a bunch of files in a container, and I am searching through those files with a Search Service, and displaying the results on my Web App.

Some files are quite big and I've noticed that the algorithm chunks them up in multiple citations:

[Screenshot: a large file split into multiple citations]

How does it decide to chunk them up? After how many characters?


Accepted answer
  1. Grmacjon-MSFT 16,186 Reputation points
    2024-04-26T00:15:26.3366667+00:00

    Hi @Dirk Broenink, thanks for the question.

    When using Azure AI Search Service, the chunking of documents is determined by several factors and techniques.

    Here’s a summary of how it works:

    • Common chunking techniques: the service can use fixed-size chunks, variable-sized chunks based on content, or a combination of both. Fixed-size chunks are typically defined by a certain number of words or tokens, with some overlap to preserve context.
    • Content overlap: a small amount of text overlap between chunks is recommended to maintain context; an overlap of approximately 10% is a reasonable starting point for testing.
    • Factors for chunking: considerations include the maximum token input limits of the embedding models, the type of data, and the specific use case.
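    To make the fixed-size-with-overlap idea concrete, here is a minimal sketch of that technique in Python. This is not how Azure AI Search implements chunking internally; it just illustrates the fixed-size and overlap parameters described above, counting characters rather than tokens for simplicity.

    ```python
    def chunk_text(text: str, chunk_size: int = 512, overlap_ratio: float = 0.10) -> list[str]:
        """Split text into fixed-size chunks with a small overlap.

        chunk_size is measured in characters here for simplicity; a real
        pipeline would typically count tokens against the embedding
        model's input limit instead.
        """
        overlap = int(chunk_size * overlap_ratio)
        step = chunk_size - overlap  # each chunk starts `step` chars after the last
        chunks = []
        for start in range(0, len(text), step):
            chunk = text[start:start + chunk_size]
            if chunk:
                chunks.append(chunk)
            if start + chunk_size >= len(text):
                break  # this chunk already reached the end of the text
        return chunks
    ```

    With a 10% overlap, the tail of each chunk is repeated at the head of the next, so a sentence cut at a chunk boundary still appears intact in one of the two chunks.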

    Please note that the exact number of characters after which chunking occurs varies with the chunking technique used and the specific configuration of your indexer. For more detailed guidance, you may want to refer to the Azure documentation on chunking large documents.

    The chunking behavior you observed, where large files are split into multiple citations, is expected and is part of the Azure Cognitive Search indexer's normal operation when handling large files, as described in the Indexing Blob Data documentation.

    For large documents, Azure AI Search might chunk them into multiple citations to ensure that each chunk stays under the maximum input size imposed by the models used for indexing. If you're using integrated vectorization, which is currently in preview, it offers built-in data chunking and embedding.
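    If you want explicit control over chunk size in an indexing pipeline, the built-in Text Split skill is the usual place to set it. As an illustration (not a drop-in definition), a skillset entry might look like the following, written here as a Python dict; the field names follow the SplitSkill schema in the REST API, and the specific values are assumptions you should tune — verify the property names against the API version you're using.

    ```python
    # Hypothetical Text Split skill definition for an Azure AI Search skillset,
    # shown as a Python dict for illustration. Values are example settings,
    # not recommendations; pageOverlapLength requires a recent API version.
    split_skill = {
        "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
        "description": "Split document content into pages before embedding",
        "textSplitMode": "pages",       # or "sentences"
        "maximumPageLength": 2000,      # max characters per chunk
        "pageOverlapLength": 200,       # ~10% overlap to preserve context
        "inputs": [
            {"name": "text", "source": "/document/content"}
        ],
        "outputs": [
            {"name": "textItems", "targetName": "pages"}
        ],
    }
    ```

    Each emitted page then becomes a separate searchable chunk, which is what surfaces as multiple citations for one source file.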

    If you need to adjust the chunk size or optimize the indexing process for your specific requirements, you can refer to the Indexer Configuration section and consider scaling up your Azure Cognitive Search service, as outlined in the Choose a SKU tier documentation.

    Hope that helps.

    Best,

    Grace

