Regarding the automatic chunk splitting behavior in Azure AI Search without explicit configuration

Kento Kawasaki 20 Reputation points
2025-03-25T01:31:40.9733333+00:00

Background

Currently, we are using Streamlit together with the LlamaIndex library to store documents in Azure AI Search. Our configuration is as follows:


from llama_index.vector_stores.azureaisearch import (
    AzureAISearchVectorStore,
    IndexManagement,
)

vector_store = AzureAISearchVectorStore(
    search_or_index_client=index_client,  # SearchIndexClient created elsewhere
    filterable_metadata_field_keys={},
    index_name=index_name,
    index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
    id_field_key="id",
    chunk_field_key="chunk",
    embedding_field_key="text_vector",
    embedding_dimensionality=3072,  # depends on the embedding model
    metadata_string_field_key="metadata",
    doc_id_field_key="doc_id",
    language_analyzer="ja.microsoft",
)

We are currently experiencing an issue where documents stored in Azure AI Search are automatically split into chunks, despite not explicitly specifying any chunk size or character count.

Desired Outcome

We aim to store each document as a single, unaltered chunk within Azure AI Search. Since automatic chunking is undesirable in our use case, we wish to disable or clearly understand the limits that trigger this automatic chunking behavior.

Observed Phenomenon and Investigation Results

We conducted research to determine why chunk splitting occurs, but neither official documentation nor related resources provided clear rules or restrictions regarding chunk splitting.

We also considered whether the 1024-character limit on document keys might be causing this issue, but since our key is set to id, we concluded that this restriction is unlikely to be relevant.

To further investigate, we conducted the following tests:

Test 1: Storing 5000 full-width Hiragana characters (Japanese characters)

We stored a single document consisting of 5000 full-width hiragana characters. Upon checking, we found the document had been automatically divided into multiple chunks. We randomly selected 4 chunks to measure their token counts:

Chunk 1: 973 tokens

Chunk 2: 974 tokens

Chunk 3: 974 tokens

Chunk 4: 972 tokens

From these results, we hypothesized that Azure AI Search internally splits documents to ensure token counts remain under approximately 1000 tokens.
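If a splitter were capping chunks at roughly 1,024 tokens, and each full-width hiragana character tokenized to about one token, the arithmetic would match what we observed. A minimal sketch of that hypothesis (the 1024 chunk size and the one-token-per-character assumption are our guesses, not confirmed service behavior):

```python
# Hypothetical fixed-size splitter: assumes ~1 token per hiragana
# character and a 1024-token chunk cap (both unconfirmed guesses).
CHUNK_SIZE = 1024

text = "あ" * 5000  # the test document from Test 1
chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

print(len(chunks))               # 5 chunks
print([len(c) for c in chunks])  # roughly 1000 characters each, close to the observed ~973 tokens
```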

Test 2: Storing 5000 ASCII (half-width) characters

When we stored a single document consisting of 5000 half-width (ASCII) characters, the entire document was stored within a single chunk. This indicates that the behavior of automatic chunk splitting may also depend on character type (full-width vs. half-width).

Clarification Needed

Considering our findings, we have the following specific questions:

Are there explicit conditions (such as token limits, character limits, or byte limits) that trigger Azure AI Search's automatic chunk splitting behavior?

Does Azure AI Search's chunk splitting logic depend on token counts or the type of characters (full-width vs. half-width)?

If Azure AI Search internally splits chunks based on token counts or character counts, could you provide the exact numeric limitations or restrictions?

If there is a method to disable automatic chunk splitting or intentionally store the entirety of a document in a single chunk, could you please provide guidance on how to achieve this?

We are currently unable to identify the specifications or appropriate actions due to limited information. Therefore, detailed and specific responses would be greatly appreciated.

Thank you very much for your support.

Azure AI Search

Accepted answer
  1. Bhargavi Naragani 6,055 Reputation points Microsoft External Staff Moderator
    2025-03-25T19:41:57.6633333+00:00

    Hi @Kento Kawasaki,

    Azure AI Search does not inherently split documents into chunks unless configured to do so. However, when integrated with certain libraries or tools, automatic chunking can occur based on predefined settings. For instance, the Text Split skill in Azure AI Search can partition documents into smaller sections based on parameters like textSplitMode, maximumPageLength, and pageOverlapLength. These settings control how documents are divided, often to optimize processing and retrieval.
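    For reference, a Text Split skill definition inside a skillset looks roughly like the following REST fragment (the field values here are illustrative placeholders, not recommendations):

    ```json
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "textSplitMode": "pages",
      "maximumPageLength": 10000,
      "pageOverlapLength": 0,
      "defaultLanguageCode": "ja",
      "inputs": [
        { "name": "text", "source": "/document/content" }
      ],
      "outputs": [
        { "name": "textItems", "targetName": "pages" }
      ]
    }
    ```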

    The type of characters in your document (e.g., full-width Hiragana vs. half-width ASCII) can affect tokenization. Full-width characters may result in higher token counts, potentially triggering chunking mechanisms designed to manage large token volumes.

    Tools like LlamaIndex or LangChain may have default chunking behaviors to handle large documents effectively. These defaults can lead to automatic splitting if not explicitly configured otherwise.

    To keep each document as a single, unchanged chunk, consider the following approaches:

    1. Check the configuration settings of the library you are using (e.g., LlamaIndex, LangChain) for any default chunking behavior, and turn off automatic chunking by adjusting those settings. For instance, LangChain provides text splitters that can be configured or disabled based on your requirements.
    2. If you're using the Text Split skill in Azure AI Search, make sure its parameters accommodate your document sizes without triggering a split. Setting textSplitMode to pages with a sufficiently large maximumPageLength helps keep each document as a single chunk.
    3. Be mindful of the token limits of your embedding model. For example, the Azure OpenAI text-embedding-ada-002 model has an 8,191-token maximum. Keeping your documents within such limits can prevent the need for chunking.
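    Applying point 1 to the LlamaIndex setup in the question: LlamaIndex's default ingestion pipeline runs a SentenceSplitter whose default chunk_size is 1,024 tokens in recent llama-index-core versions, which would produce chunks very close to the ~973-token chunks you observed. An untested sketch of two ways to avoid the split (API names assume llama-index-core 0.10+ and the vector_store/documents objects from your own code; verify against your installed version):

    ```python
    from llama_index.core import Settings, StorageContext, VectorStoreIndex
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.core.schema import TextNode

    # Option A: raise the splitter's chunk size well above any document
    # length so the default pipeline never splits in practice.
    Settings.text_splitter = SentenceSplitter(chunk_size=100_000, chunk_overlap=0)

    # Option B: bypass the node parser entirely by building one node
    # per document and indexing the nodes directly.
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    nodes = [TextNode(text=doc.text) for doc in documents]
    index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)
    ```

    Either way, the embedding model's own token limit (point 3) still applies, so a document that exceeds it will fail at embedding time rather than being silently split.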

    For more detail, kindly refer to the following documentation:
    https://video2.skills-academy.com/en-us/azure/search/vector-search-how-to-chunk-documents
    https://learn.microsoft.com/en-us/azure/search/search-how-to-semantic-chunking
    https://learn.microsoft.com/en-us/azure/search/vector-search-integrated-vectorization

    If the answer is helpful, please click Accept Answer and upvote it so that other people who face a similar issue can benefit from it.

    Let me know if you have any further queries.


0 additional answers
