
Disallow automatic chunking

Anonymous
2024-11-28T10:04:28.81+00:00

When using JSON files as a data source, I do not want the Azure OpenAI service to chunk them automatically. The files are of reasonable length and should easily fit in the context window of the OpenAI models.

  1. Where does this chunking happen? When I retrieve my files in the Azure portal (search service, respective index, search with "*"), I get the complete, unchunked files. But when I use the chatbot and click on a reference it gives, the files are chunked into very small pieces. Where in the workflow does this happen?
  2. How can I disable the chunking? I want to use the unchunked files as input to the OpenAI model because, with chunked files, the model hallucinates about the missing parts of the document (i.e. the other chunks that were not retrieved). I assume these hallucinations would be less of a problem if the complete, unchunked file were given to the model as input. Furthermore, the displayed reference needs to be unchunked to be useful, because my files contain important information at both the beginning and the end.

Thank you!

Azure OpenAI in Foundry Models

2 answers

  1. Anonymous
    2025-02-05T10:21:52.8066667+00:00

    Hi @Pavankumar Purilla

    Sorry to come back to this. It still does not work for our use case because the answers so far are not concrete enough for us. We need more specific help to stay with Microsoft Azure. Please try it out yourself to see whether it works for you, and let me know if it does not. Exact code for the index, indexer, and skillset would be very helpful. Thank you!

    I think concrete help is needed mainly on the Azure Portal side. Our setup in the Azure Portal:

    Azure blob container: The JSON files used as input look like this, with long but still reasonable content (for example, 1-2 A4 pages):

    {
        "chunk_id": 10,
        "content": "LONG CONTENT COMES HERE",
        "filepath": "TITLE_10",
        "last_updated": "20241127173113",
        "title": "TITLE_10"
    }
    

    Skillset:

    I tried a skillset that sets maximumPageLength to the highest possible value (50000). I checked that all my documents are below this length, so in theory they should not be chunked:

    {
      "@odata.etag": "TAG",
      "name": "skillset-prevent-chunking",
      "description": "Skillset to prevent unwanted chunking of documents whose content is below 50,000 characters.",
      "skills": [
        {
          "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
          "name": "TextSplitSkill",
          "description": "Splits content into large chunks only if needed. With maximumPageLength set high, documents below this length remain unsplit.",
          "context": "/document",
          "defaultLanguageCode": "en",
          "textSplitMode": "pages",
          "maximumPageLength": 50000,
          "pageOverlapLength": 0,
          "maximumPagesToTake": 0,
          "unit": "characters",
          "inputs": [
            {
              "name": "text",
              "source": "/document/content",
              "inputs": []
            }
          ],
          "outputs": [
            {
              "name": "textItems",
              "targetName": "unsplit_content"
            }
          ]
        }
      ]
    }
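
The claim that every document stays below maximumPageLength can be checked locally before indexing. The following is a minimal sketch, assuming the JSON files are available in a local folder (the folder name is hypothetical) and that the text sits under the "content" key as in the sample above. Python counts Unicode code points, which may differ slightly from how the service counts characters.

```python
import json
from pathlib import Path

MAX_PAGE_LENGTH = 50000  # same value as maximumPageLength in the skillset


def find_oversized(folder, limit=MAX_PAGE_LENGTH):
    """Return (file name, character count) for every JSON file whose
    "content" value is longer than `limit` characters."""
    oversized = []
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            doc = json.load(handle)
        length = len(doc.get("content", ""))
        if length > limit:
            oversized.append((path.name, length))
    return oversized


if __name__ == "__main__":
    # "./json_docs" is a hypothetical local copy of the blob container.
    for name, length in find_oversized("./json_docs"):
        print(f"{name}: {length} characters")
```

An empty result means no file exceeds the limit, so any splitting seen later is unlikely to come from this skill's page length.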
    

    Indexer:

    {
      "@odata.context": "CONTEXT",
      "@odata.etag": "\"TAG\"",
      "name": "indexer-no-chunking",
      "description": null,
      "dataSourceName": "new-datasource-switzerland-json",
      "skillsetName": "skillset-prevent-chunking",
      "targetIndexName": "index-no-chunking",
      "disabled": null,
      "schedule": null,
      "parameters": {
        "batchSize": null,
        "maxFailedItems": null,
        "maxFailedItemsPerBatch": null,
        "base64EncodeKeys": null,
        "configuration": {
          "indexedFileNameExtensions": ".json",
          "dataToExtract": "contentAndMetadata",
          "parsingMode": "json"
        }
      },
      "fieldMappings": [
        {
          "sourceFieldName": "filepath",
          "targetFieldName": "filepath",
          "mappingFunction": null
        },
        {
          "sourceFieldName": "title",
          "targetFieldName": "title",
          "mappingFunction": null
        },
        {
          "sourceFieldName": "chunk_id",
          "targetFieldName": "chunk_id",
          "mappingFunction": null
        },
        {
          "sourceFieldName": "last_updated",
          "targetFieldName": "last_updated",
          "mappingFunction": null
        }
      ],
      "outputFieldMappings": [
        {
          "sourceFieldName": "/document/unsplit_content",
          "targetFieldName": "unsplit_content",
          "mappingFunction": null
        },
        {
          "sourceFieldName": "/document/chunk_id",
          "targetFieldName": "id",
          "mappingFunction": null
        }
      ],
      "cache": null,
      "encryptionKey": null
    }
    

    Index:

    {
      "@odata.etag": "ETAG",
      "name": "index-no-chunking",
      "fields": [
        {
          "name": "id",
          "type": "Edm.String",
          "searchable": false,
          "filterable": true,
          "retrievable": true,
          "stored": true,
          "sortable": true,
          "facetable": false,
          "key": true,
          "synonymMaps": []
        },
        {
          "name": "unsplit_content",
          "type": "Edm.String",
          "searchable": true,
          "filterable": false,
          "retrievable": true,
          "stored": true,
          "sortable": false,
          "facetable": false,
          "key": false,
          "synonymMaps": []
        },
        {
          "name": "filepath",
          "type": "Edm.String",
          "searchable": false,
          "filterable": false,
          "retrievable": true,
          "stored": true,
          "sortable": false,
          "facetable": false,
          "key": false,
          "synonymMaps": []
        },
        {
          "name": "title",
          "type": "Edm.String",
          "searchable": true,
          "filterable": false,
          "retrievable": true,
          "stored": true,
          "sortable": false,
          "facetable": false,
          "key": false,
          "synonymMaps": []
        },
        {
          "name": "chunk_id",
          "type": "Edm.String",
          "searchable": false,
          "filterable": false,
          "retrievable": true,
          "stored": true,
          "sortable": false,
          "facetable": false,
          "key": false,
          "synonymMaps": []
        },
        {
          "name": "last_updated",
          "type": "Edm.String",
          "searchable": false,
          "filterable": false,
          "retrievable": true,
          "stored": true,
          "sortable": false,
          "facetable": false,
          "key": false,
          "synonymMaps": []
        }
      ],
      "scoringProfiles": [],
      "corsOptions": {
        "allowedOrigins": [
          "*"
        ]
      },
      "suggesters": [],
      "analyzers": [],
      "normalizers": [],
      "tokenizers": [],
      "tokenFilters": [],
      "charFilters": [],
      "similarity": {
        "@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
      }
    }
    

    For your information, the setup in Azure OpenAI Studio is very basic: we only add data with Azure AI Search as the data source and select the Azure AI Search index we created in the Azure Portal (index-no-chunking in this case).

    In the Azure Portal, the unsplit_content field looks as expected when I search the index (e.g. with * as the query). In the chatbot citations, however, the documents are split (chunked) again. I cannot control the length of these citations, and I do not know where the splitting happens.
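
To narrow down where the splitting happens, the citations returned with a chat response can be compared against the full text stored in the index. The following is a minimal sketch under assumptions: it expects the response as a plain dict with citations under choices[0].message.context.citations, each carrying "title" and "content" keys. That shape is an assumption about the response format and should be verified against an actual response. The function name and the full_docs mapping are hypothetical.

```python
def citation_report(response, full_docs):
    """Compare each citation's length with the full indexed document.

    response  : chat completion response as a dict (assumed shape)
    full_docs : mapping of title -> full text taken from the index
    """
    context = response["choices"][0]["message"].get("context", {})
    report = []
    for citation in context.get("citations", []):
        title = citation.get("title")
        cited = citation.get("content", "")
        full = full_docs.get(title, "")
        report.append({
            "title": title,
            "cited_chars": len(cited),
            "full_chars": len(full),
            # True when the citation is a shorter piece of the full text.
            "is_partial": bool(full)
            and len(cited) < len(full)
            and cited.strip() in full,
        })
    return report
```

If is_partial is true while the index search returns the whole field, the splitting happens after retrieval from the index rather than during indexing.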

    Best, Tim


  2. Pavankumar Purilla 11,575 Reputation points Microsoft External Staff Moderator
    2024-11-28T20:09:36.15+00:00

    Hi Ehrensperger Tim,
    Greetings & Welcome to Microsoft Q&A forum! Thanks for posting your query!

    The chunking of JSON files in Azure OpenAI Service typically happens either during indexing in Azure AI Search (formerly Azure Cognitive Search) or at query time, when documents are split into smaller chunks to fit the model’s token limit. Querying the index with * shows the unchunked files, but the chatbot retrieves smaller chunks to construct its responses.

    To prevent chunking, ensure that your files are indexed as single units by storing the full content in one field (e.g., content), and configure your chatbot’s retrieval workflow to fetch entire documents without splitting. In Azure AI Studio or your integration pipeline, adjust retrieval and formatting logic to pass unchunked files to the model, ensuring references and context remain intact, which can reduce hallucinations and improve relevance.
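
As an illustration of configuring the retrieval side, the sketch below builds a data source definition that points the chat request at the field holding the full document text. It is a minimal sketch, not a confirmed fix: the field names (unsplit_content, title, filepath) come from the index posted in this thread, the endpoint and key are placeholders, the helper name is hypothetical, and the thread does not confirm whether this mapping alone stops citations from being split.

```python
import json


def build_data_source(endpoint, index_name, api_key, top_n=3):
    """Build an Azure AI Search data source entry for a chat request.

    content_fields names the index field that holds the whole document,
    so the service reads that field as the document content.
    """
    return {
        "type": "azure_search",
        "parameters": {
            "endpoint": endpoint,
            "index_name": index_name,
            "authentication": {"type": "api_key", "key": api_key},
            "query_type": "simple",
            "top_n_documents": top_n,  # number of documents passed to the model
            "fields_mapping": {
                "content_fields": ["unsplit_content"],
                "title_field": "title",
                "filepath_field": "filepath",
            },
        },
    }


if __name__ == "__main__":
    # Placeholder endpoint and key, for illustration only.
    source = build_data_source(
        "https://example.search.windows.net", "index-no-chunking", "SEARCH_KEY"
    )
    print(json.dumps({"data_sources": [source]}, indent=2))
```

The resulting dict is meant to be sent in the data_sources list of the chat completions request body. Keeping top_n_documents small leaves room in the context window for complete documents.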

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful.
