How do you set document chunk length and overlap when creating a chatbot that uses your own data?

Bao, Jeremy (Cognizant) 105 Reputation points
2024-02-21T00:02:05.4033333+00:00

I think that when you upload documents to make a chatbot that can draw on information you provide, Azure automatically chops those documents into non-overlapping chunks of 1,024 tokens, and relevant chunks are then fetched to provide context for answering user questions. LangChain's RAG functionality allows you to specify the size of the document chunks and the degree to which they should overlap.

Can this be done in Azure OpenAI Service? If so, how? Not allowing overlap may cause the latter part of a JSON object to be separated from its start, in which case the information in that latter part would effectively be lost.
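
For reference, here is roughly what I mean by setting those parameters in LangChain (just a sketch; `document_text` stands in for the loaded document, and the values are only examples):

```python
# Illustrative only: LangChain lets you choose chunk size and overlap explicitly.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,    # maximum size of each chunk (characters by default)
    chunk_overlap=100,  # amount of text shared between adjacent chunks
)
chunks = splitter.split_text(document_text)  # document_text is a placeholder
```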


1 answer

  1. navba-MSFT 24,890 Reputation points Microsoft Employee
    2024-02-21T04:12:43.2733333+00:00

    @Bao, Jeremy (Cognizant) Welcome to the Microsoft Q&A Forum, and thank you for posting your query here!

    Why is chunking important?

    The models used to generate embedding vectors have maximum limits on the text fragments provided as input. For example, the maximum length of input text for the Azure OpenAI embedding models is 8,191 tokens. Given that each token is around 4 characters of text for common OpenAI models, this maximum limit is equivalent to around 6,000 words of text. If you're using these models to generate embeddings, it's critical that the input text stays under the limit. Partitioning your content into chunks ensures that your data can be processed by the large language models (LLMs) used for indexing and queries.
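
    As a quick sanity check, you can count tokens yourself before sending text to the embedding model. This is only an illustrative sketch using the tiktoken package (cl100k_base is the encoding used by text-embedding-ada-002):

    ```python
    # Sketch: verify a text fragment stays under the embedding model's input limit.
    import tiktoken

    MAX_TOKENS = 8191  # input limit for Azure OpenAI embedding models
    encoding = tiktoken.get_encoding("cl100k_base")

    def is_within_limit(text: str) -> bool:
        return len(encoding.encode(text)) <= MAX_TOKENS
    ```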

    Content overlap considerations

    When you chunk data, overlapping a small amount of text between chunks can help preserve context. We recommend starting with an overlap of approximately 10%. For example, given a fixed chunk size of 256 tokens, you would begin testing with an overlap of 25 tokens. The actual amount of overlap varies depending on the type of data and the specific use case, but we have found that 10-15% works for many scenarios.
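
    To make the arithmetic concrete, a fixed-size token chunker with a 10% overlap could look something like the sketch below (for illustration only; this is not the splitter that the built-in ingestion pipeline uses):

    ```python
    # Sketch: fixed-size token chunks with overlap, e.g. 256-token chunks and a 25-token overlap.
    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")

    def chunk_text(text: str, chunk_size: int = 256, overlap: int = 25) -> list[str]:
        tokens = encoding.encode(text)
        step = chunk_size - overlap  # each new chunk starts 231 tokens after the previous one
        return [encoding.decode(tokens[i:i + chunk_size])
                for i in range(0, len(tokens), step)]
    ```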

    Sentence chunking with "10% overlap"

    With this approach, you create an overlap between chunks according to a fixed ratio. For example, a 10% overlap on a maximum chunk size of 10 tokens is one token. See the details here.
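
    A minimal sketch of that idea, assuming sentences are split on simple punctuation and roughly 10% of the tokens are carried over between neighbouring chunks:

    ```python
    # Sketch: sentence chunking that carries roughly 10% of the tokens into the next chunk.
    import re
    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")

    def sentence_chunks(text: str, max_tokens: int = 256, overlap_ratio: float = 0.10) -> list[str]:
        sentences = re.split(r"(?<=[.!?])\s+", text)
        overlap_tokens = int(max_tokens * overlap_ratio)  # e.g. 25 tokens for a 256-token chunk
        chunks, current, current_len = [], [], 0
        for sentence in sentences:
            n = len(encoding.encode(sentence))
            if current and current_len + n > max_tokens:
                chunks.append(" ".join(current))
                # keep trailing sentences worth ~overlap_tokens to start the next chunk
                kept, kept_len = [], 0
                for s in reversed(current):
                    kept.insert(0, s)
                    kept_len += len(encoding.encode(s))
                    if kept_len >= overlap_tokens:
                        break
                current, current_len = kept, kept_len
            current.append(sentence)
            current_len += n
        if current:
            chunks.append(" ".join(current))
        return chunks
    ```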

    If you want to try the chunking and vector embedding generation sample, refer to this.
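
    If you just want to see the chunk-then-embed flow end to end, a minimal sketch with the openai Python package against an Azure OpenAI resource could look like this (the deployment name, API version, and environment variable names are assumptions, not values from the sample above):

    ```python
    # Sketch: generate embeddings for a list of chunks with the Azure OpenAI client.
    import os
    from openai import AzureOpenAI

    client = AzureOpenAI(
        api_key=os.environ["AZURE_OPENAI_API_KEY"],           # assumed environment variable
        api_version="2023-05-15",
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],   # assumed environment variable
    )

    def embed_chunks(chunks: list[str]) -> list[list[float]]:
        # "text-embedding-ada-002" stands in for your embedding deployment name
        response = client.embeddings.create(model="text-embedding-ada-002", input=chunks)
        return [item.embedding for item in response.data]
    ```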

    Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.

