Text split cognitive skill

The Text Split skill breaks text into chunks. You can split the text into sentences or into pages of a specified maximum length. This skill is especially useful when downstream skills have maximum text length requirements.

Note

This skill isn't bound to Azure AI services. It's non-billable and has no Azure AI services key requirement.

@odata.type

Microsoft.Skills.Text.SplitSkill

Skill Parameters

Parameters are case-sensitive.

Parameter name Description
textSplitMode Either pages or sentences. Pages have a configurable maximum length, but the skill attempts to avoid truncating a sentence, so the actual length might be smaller. A sentence is a string that terminates at sentence-ending punctuation, such as a period, question mark, or exclamation point, assuming the language has sentence-ending punctuation.
maximumPageLength Only applies if textSplitMode is set to pages. This parameter refers to the maximum page length in characters as measured by String.Length. The minimum value is 300, the maximum is 50000, and the default value is 5000. The algorithm does its best to break the text on sentence boundaries, so the size of each chunk might be slightly less than maximumPageLength.
pageOverlapLength Only applies if textSplitMode is set to pages. Each page starts with this number of characters from the end of the previous page. If this parameter is set to 0, there's no overlapping text on successive pages. This parameter is supported in the 2023-10-01-Preview REST API and in Azure SDK beta packages that have been updated to support integrated vectorization. The example for chunking and vectorization in this article includes the parameter.
maximumPagesToTake Only applies if textSplitMode is set to pages. Number of pages to return. The default is 0, which means to return all pages. Set this value if only a subset of pages is needed. This parameter is supported in the 2023-10-01-Preview REST API and in Azure SDK beta packages that have been updated to support integrated vectorization. The example for chunking and vectorization in this article includes the parameter.
defaultLanguageCode (optional) One of the following language codes: am, bs, cs, da, de, en, es, et, fr, he, hi, hr, hu, fi, id, is, it, ja, ko, lv, no, nl, pl, pt-PT, pt-BR, ru, sk, sl, sr, sv, tr, ur, zh-Hans. Default is English (en). A few things to consider:
  • Providing a language code is useful to avoid cutting a word in half for nonwhitespace languages such as Chinese, Japanese, and Korean.
  • If you don't know the language in advance (for example, if you're using the LanguageDetectionSkill to detect language), we recommend the en default.
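
The way textSplitMode and maximumPageLength interact can be pictured with a short Python sketch. This is not the service's algorithm; it is a simplified model under stated assumptions. The helper names split_sentences and split_pages are hypothetical, and the punctuation rule assumes a language that puts whitespace after sentence-ending punctuation.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after '.', '?' or '!' when followed by whitespace (a simplification)."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]


def split_pages(text: str, maximum_page_length: int = 5000) -> list[str]:
    """Greedily pack whole sentences into pages no longer than the limit."""
    pages: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        # Assumption: a sentence longer than the limit is hard-cut into pieces.
        while len(sentence) > maximum_page_length:
            if current:
                pages.append(current)
                current = ""
            pages.append(sentence[:maximum_page_length])
            sentence = sentence[maximum_page_length:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= maximum_page_length:
            current = candidate
        else:
            # Close the page at the sentence boundary, so it ends up
            # shorter than the maximum rather than cutting the sentence.
            pages.append(current)
            current = sentence
    if current:
        pages.append(current)
    return pages


if __name__ == "__main__":
    # A tiny limit is used only to keep the demo readable; the skill itself
    # accepts maximumPageLength values from 300 to 50000.
    for page in split_pages("One. Two? Three! Four.", maximum_page_length=10):
        print(len(page), repr(page))
```

Because each page is closed at a sentence boundary, pages in this model are often shorter than the limit, which mirrors the behavior described for maximumPageLength.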

Skill Inputs

Parameter name Description
text The text to split into substrings.
languageCode (Optional) Language code for the document. If you don't know the language of the text inputs (for example, if you're using LanguageDetectionSkill to detect the language), you can omit this parameter. If you set languageCode to a language that isn't in the supported list for defaultLanguageCode, a warning is emitted and the text isn't split.

Skill Outputs

Parameter name Description
textItems Output is an array of substrings that were extracted. textItems is the default name of the output. targetName is optional, but if you have multiple Text Split skills, make sure to set targetName so that you don't overwrite the data from the first skill with the second one. If targetName is set, use it in output field mappings or in downstream skills that use the skill output.

Sample definition

{
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "textSplitMode" : "pages", 
    "maximumPageLength": 1000,
    "defaultLanguageCode": "en",
    "inputs": [
        {
            "name": "text",
            "source": "/document/content"
        },
        {
            "name": "languageCode",
            "source": "/document/language"
        }
    ],
    "outputs": [
        {
            "name": "textItems",
            "targetName": "mypages"
        }
    ]
}
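
Before submitting a definition like the one above, it can help to check the parameter values against the documented constraints. The following Python sketch is a local convenience check only, not part of any SDK. The function name check_split_skill is hypothetical, and treating textSplitMode as required is an assumption of this sketch.

```python
def check_split_skill(skill: dict) -> list[str]:
    """Return a list of problems found in a Text Split skill definition."""
    problems: list[str] = []

    mode = skill.get("textSplitMode")
    # Assumption of this sketch: a missing textSplitMode is reported as a problem.
    if mode not in ("pages", "sentences"):
        problems.append("textSplitMode should be 'pages' or 'sentences'")

    if mode == "pages":
        # 5000 is the documented default; 300 and 50000 are the documented bounds.
        length = skill.get("maximumPageLength", 5000)
        if not 300 <= length <= 50000:
            problems.append("maximumPageLength should be between 300 and 50000")

    return problems


if __name__ == "__main__":
    sample = {
        "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
        "textSplitMode": "pages",
        "maximumPageLength": 1000,
        "defaultLanguageCode": "en",
    }
    print(check_split_skill(sample))
    print(check_split_skill({**sample, "maximumPageLength": 100}))
```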

Sample input

{
    "values": [
        {
            "recordId": "1",
            "data": {
                "text": "This is the loan application for Joe Romero, a Microsoft employee who was born in Chile and who then moved to Australia...",
                "languageCode": "en"
            }
        },
        {
            "recordId": "2",
            "data": {
                "text": "This is the second document, which will be broken into several pages...",
                "languageCode": "en"
            }
        }
    ]
}

Sample output

{
    "values": [
        {
            "recordId": "1",
            "data": {
                "textItems": [
                    "This is the loan...",
                    "In the next section, we continue..."
                ]
            }
        },
        {
            "recordId": "2",
            "data": {
                "textItems": [
                    "This is the second document...",
                    "In the next section of the second doc..."
                ]
            }
        }
    ]
}

Example for chunking and vectorization

This example is for integrated vectorization, currently in preview. It adds preview-only parameters to the sample definition, and shows the resulting output.

  • pageOverlapLength: Overlapping text is useful in data chunking scenarios because it preserves continuity between chunks generated from the same document.

  • maximumPagesToTake: Limits on page intake are useful in vectorization scenarios because they help you stay under the maximum input limits of the embedding models providing the vectorization.
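
The effect of these two parameters can be sketched in Python as a sliding window over characters. This is an illustrative model only: it ignores sentence boundaries, the function name split_with_overlap is hypothetical, and the requirement that the overlap be smaller than the page length is a constraint of this sketch (it guarantees forward progress), not a documented service rule.

```python
def split_with_overlap(
    text: str,
    maximum_page_length: int = 5000,
    page_overlap_length: int = 0,
    maximum_pages_to_take: int = 0,
) -> list[str]:
    """Cut text into fixed-size pages, each starting with the tail of the previous one."""
    if not 0 <= page_overlap_length < maximum_page_length:
        raise ValueError("page_overlap_length must be >= 0 and below maximum_page_length")

    pages: list[str] = []
    step = maximum_page_length - page_overlap_length
    start = 0
    while start < len(text):
        pages.append(text[start:start + maximum_page_length])
        # 0 means "return all pages", matching the documented default.
        if maximum_pages_to_take and len(pages) == maximum_pages_to_take:
            break
        if start + maximum_page_length >= len(text):
            break  # the page just added already reached the end of the text
        start += step
    return pages


if __name__ == "__main__":
    text = "abcdefghij"
    print(split_with_overlap(text, maximum_page_length=4, page_overlap_length=1))
    print(split_with_overlap(text, maximum_page_length=4, page_overlap_length=1,
                             maximum_pages_to_take=1))
```

In this model, each page after the first begins with the last page_overlap_length characters of the page before it, and a nonzero maximum_pages_to_take stops the loop early, so text beyond the taken pages is never emitted.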

Sample definition

This definition adds pageOverlapLength of 100 characters and maximumPagesToTake of one.

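Assuming the maximumPageLength is 5,000 characters (the default), "maximumPagesToTake": 1 processes the first 5,000 characters of each source document. With maximumPageLength set to 1,000, as in the following definition, it processes up to the first 1,000 characters.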

This example sets textItems to mypages through targetName. Because targetName is set, mypages is the value you should use to select the output from the Text Split skill. Use /document/mypages/* in downstream skills, indexer output field mappings, knowledge store projections, and index projections.

{
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "textSplitMode" : "pages", 
    "maximumPageLength": 1000,
    "pageOverlapLength": 100,
    "maximumPagesToTake": 1,
    "defaultLanguageCode": "en",
    "inputs": [
        {
            "name": "text",
            "source": "/document/content"
        },
        {
            "name": "languageCode",
            "source": "/document/language"
        }
    ],
    "outputs": [
        {
            "name": "textItems",
            "targetName": "mypages"
        }
    ]
}
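
To see why the targetName matters when you reference the output, the following Python sketch resolves a path such as /document/mypages/* against a nested dictionary that stands in for an enriched document. It is a toy model, not the service's path evaluator. The function name resolve and the dictionary layout are hypothetical, and names are matched exactly in this sketch.

```python
def resolve(enriched: dict, path: str) -> list:
    """Resolve a slash-separated path; '*' expands every item of a list node."""
    nodes: list = [enriched]
    for segment in path.strip("/").split("/"):
        next_nodes: list = []
        for node in nodes:
            if segment == "*" and isinstance(node, list):
                next_nodes.extend(node)
            elif isinstance(node, dict) and segment in node:
                next_nodes.append(node[segment])
        nodes = next_nodes
    return nodes


if __name__ == "__main__":
    # Stand-in for a document after the Text Split skill has run with
    # "targetName": "mypages".
    enriched = {
        "document": {
            "content": "full text of the document",
            "mypages": ["first page", "second page"],
        }
    }
    print(resolve(enriched, "/document/mypages/*"))    # finds both pages
    print(resolve(enriched, "/document/textItems/*"))  # finds nothing
```

In this model, the default name textItems resolves to nothing once targetName renames the output, so every downstream reference needs the new name.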

Sample input (same as previous example)

{
    "values": [
        {
            "recordId": "1",
            "data": {
                "text": "This is the loan application for Joe Romero, a Microsoft employee who was born in Chile and who then moved to Australia...",
                "languageCode": "en"
            }
        },
        {
            "recordId": "2",
            "data": {
                "text": "This is the second document, which will be broken into several sections...",
                "languageCode": "en"
            }
        }
    ]
}

Sample output (notice the overlap)

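Within each "textItems" array, trailing text from the first item is copied into the beginning of the second item. Two items are shown per record to illustrate the overlap; with "maximumPagesToTake": 1, only the first item is returned.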

{
    "values": [
        {
            "recordId": "1",
            "data": {
                "textItems": [
                    "This is the loan...Here is the overlap part",
                    "Here is the overlap part...In the next section, we continue..."
                ]
            }
        },
        {
            "recordId": "2",
            "data": {
                "textItems": [
                    "This is the second document...Here is the overlap part...",
                    "Here is the overlap part...In the next section of the second doc..."
                ]
            }
        }
    ]
}
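
When inspecting real output, you can measure the overlap between two consecutive items with a small helper. The Python sketch below finds the longest suffix of one string that is also a prefix of the next. The function name shared_overlap is hypothetical, and the strings are the abbreviated placeholders from the sample output, not real page text.

```python
def shared_overlap(previous: str, current: str) -> str:
    """Return the longest suffix of `previous` that is also a prefix of `current`."""
    for size in range(min(len(previous), len(current)), 0, -1):
        if previous.endswith(current[:size]):
            return current[:size]
    return ""


if __name__ == "__main__":
    first = "This is the loan...Here is the overlap part"
    second = "Here is the overlap part...In the next section, we continue..."
    overlap = shared_overlap(first, second)
    print(len(overlap), repr(overlap))
```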

Error cases

If languageCode is set to a language that isn't supported, a warning is generated and the text isn't split.

See also