Hi @Kento Kawasaki,
Azure AI Search does not inherently split documents into chunks unless configured to do so. However, when integrated with certain libraries or tools, automatic chunking can occur based on predefined settings. For instance, the Text Split skill in Azure AI Search can partition documents into smaller sections based on parameters like `textSplitMode`, `maximumPageLength`, and `pageOverlapLength`. These settings control how documents are divided, often to optimize processing and retrieval.
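As a rough sketch of how those parameters fit together, here is a Text Split skill definition (a REST skillset fragment shown as a Python dict). The field names follow the Azure AI Search skillset API; the specific values are illustrative only, not recommendations for your workload.

```python
# Sketch of a Text Split skill definition. The values below are
# illustrative; tune them to your document sizes.
split_skill = {
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "textSplitMode": "pages",     # "pages" or "sentences"
    "maximumPageLength": 2000,    # maximum characters per chunk
    "pageOverlapLength": 500,     # characters repeated across adjacent chunks
    "inputs": [{"name": "text", "source": "/document/content"}],
    "outputs": [{"name": "textItems", "targetName": "pages"}],
}

print(split_skill["textSplitMode"], split_skill["maximumPageLength"])
```

If `maximumPageLength` is larger than your longest document, the skill effectively emits one chunk per document.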
The type of characters in your document (e.g., full-width Hiragana vs. half-width ASCII) can also affect tokenization. Full-width characters typically consume more tokens per character, so the same visible amount of text can have a much higher token count, potentially triggering chunking mechanisms designed to manage large token volumes.
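A quick way to see why full-width text is "heavier" per character is to compare UTF-8 byte lengths. This is only a proxy, not a real tokenizer — actual token counts depend on the embedding model's tokenizer — but it illustrates that full-width characters carry more data per character, which tends to translate into more tokens:

```python
# Rough illustration (not a real tokenizer): full-width Hiragana encodes to
# 3 bytes per character in UTF-8, while half-width ASCII encodes to 1 byte.
# Many BPE tokenizers similarly spend more tokens per full-width character.
half_width = "konnichiwa"  # 10 half-width ASCII characters
full_width = "こんにちは"    # 5 full-width Hiragana characters

print(len(half_width), len(half_width.encode("utf-8")))  # 10 chars, 10 bytes
print(len(full_width), len(full_width.encode("utf-8")))  # 5 chars, 15 bytes
```

For exact counts, run your text through the tokenizer used by your embedding model rather than relying on character or byte length.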
Tools like LlamaIndex or LangChain may have default chunking behaviors to handle large documents effectively. These defaults can lead to automatic splitting if they are not explicitly configured otherwise.
To keep each document as a single, unchanged chunk, consider the following approaches:
- Check the configuration settings of the library you are using (e.g., LlamaIndex, LangChain) for any default chunking modes, and turn off automatic chunking by adjusting those settings. LangChain, for instance, provides text splitters whose chunk size and overlap can be configured to suit your requirements.
- If you're using the Text Split skill in Azure AI Search, set its parameters so that they accommodate your document sizes without invoking automatic chunking. Setting `textSplitMode` to `pages` with a large enough `maximumPageLength` can help keep each document as a single chunk.
- Be mindful of the token limits of the embedding model you use. For example, the Azure OpenAI text-embedding-ada-002 model has a maximum input of 8,191 tokens. Keeping each document within such limits avoids the need for chunking.
Please refer to the following documentation for more details:
- https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents
- https://learn.microsoft.com/en-us/azure/search/search-how-to-semantic-chunking
- https://learn.microsoft.com/en-us/azure/search/vector-search-integrated-vectorization
If the answer is helpful, please click Accept Answer and kindly upvote it so that other people who face a similar issue may benefit from it.
Let me know if you have any further queries.