Text split cognitive skill
Important
Some parameters are in public preview under Supplemental Terms of Use. The preview REST API supports these parameters.
The Text Split skill breaks text into chunks. You can specify whether to break the text into sentences or into pages of a particular length. This skill is useful when downstream skills have maximum text length requirements, such as embedding skills that pass data chunks to embedding models on Azure OpenAI and other model providers. For more information about this scenario, see Chunk documents for vector search.
Several parameters are version-specific. The skill parameters table notes the API version in which a parameter was introduced so that you know whether a version upgrade is required. To use version-specific features such as token chunking in 2024-09-01-preview, use the Azure portal, target a REST API version that supports the feature, or check the Azure SDK change log to see whether your SDK version supports it.
The Azure portal supports most preview features and can be used to create or update a skillset. For updates to the Text Split skill, edit the skillset JSON definition to add new preview parameters.
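As a rough illustration of targeting a preview REST API version, the following Python sketch builds (but doesn't send) a request that creates or updates a skillset containing a Text Split skill. It assumes the usual `https://<service>.search.windows.net/skillsets/<name>?api-version=...` endpoint shape and an `api-key` header. The service name, skillset name, key placeholder, and the `build_skillset_request` helper are hypothetical, and a real skillset typically needs more properties than shown here.

```python
import json
import urllib.request

# Preview version named in this article for token chunking.
API_VERSION = "2024-09-01-preview"


def build_skillset_request(service_name: str, skillset_name: str,
                           skill: dict, api_key: str) -> urllib.request.Request:
    """Build, without sending, a PUT request that creates or updates a skillset."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/skillsets/{skillset_name}?api-version={API_VERSION}"
    )
    body = {"name": skillset_name, "skills": [skill]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="PUT",
    )


# A Text Split skill that uses the preview token-chunking parameters.
split_skill = {
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "textSplitMode": "pages",
    "unit": "azureOpenAITokens",
    "maximumPageLength": 512,
    "inputs": [{"name": "text", "source": "/document/content"}],
    "outputs": [{"name": "textItems", "targetName": "pages"}],
}

# Hypothetical service name, skillset name, and key placeholder.
request = build_skillset_request("my-service", "my-skillset", split_skill, "<admin-api-key>")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. The sketch stops short of that so it can run without a search service.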
Note
This skill isn't bound to Azure AI services. It's non-billable and has no Azure AI services key requirement.
@odata.type: Microsoft.Skills.Text.SplitSkill
Parameters are case-sensitive.
Parameter name | Version | Description |
---|---|---|
textSplitMode | All versions | Either pages or sentences. Pages have a configurable maximum length, but the skill attempts to avoid truncating a sentence, so the actual length might be smaller. A sentence is a string that terminates at sentence-ending punctuation, such as a period, question mark, or exclamation point, assuming the language has sentence-ending punctuation. |
maximumPageLength | All versions | Only applies if textSplitMode is set to pages. When unit is set to characters, this parameter is the maximum page length in characters as measured by String.Length. The minimum value is 300, the maximum is 50000, and the default value is 5000. The algorithm does its best to break the text on sentence boundaries, so the size of each chunk might be slightly less than maximumPageLength. When unit is set to azureOpenAITokens, the maximum page length is the token length limit of the model. For text embedding models, a general recommendation for page length is 512 tokens. |
defaultLanguageCode | All versions | (Optional) One of the following language codes: am, bs, cs, da, de, en, es, et, fr, he, hi, hr, hu, fi, id, is, it, ja, ko, lv, no, nl, pl, pt-PT, pt-BR, ru, sk, sl, sr, sv, tr, ur, zh-Hans. Default is English (en). A few things to consider: |
pageOverlapLength | 2024-07-01 | Only applies if textSplitMode is set to pages. Each page starts with this number of characters or tokens from the end of the previous page. If this parameter is set to 0, there's no overlapping text on successive pages. An example later in this article includes this parameter. |
maximumPagesToTake | 2024-07-01 | Only applies if textSplitMode is set to pages. Number of pages to return. The default is 0, which means to return all pages. Set this value if you need only a subset of pages. An example later in this article includes this parameter. |
unit | 2024-09-01-preview | New. Only applies if textSplitMode is set to pages. Specifies whether to chunk by characters (default) or azureOpenAITokens. Setting the unit affects maximumPageLength and pageOverlapLength. |
azureOpenAITokenizerParameters | 2024-09-01-preview | New. An object providing extra parameters for the azureOpenAITokens unit. encoderModelName is a designated tokenizer used for converting text into tokens, essential for natural language processing (NLP) tasks. Different models use different tokenizers. Valid values include cl100k_base (default), used by GPT-35-Turbo and GPT-4. Other valid values are r50k_base, p50k_base, and p50k_edit. The skill implements the tiktoken library by way of SharpToken and Microsoft.ML.Tokenizers but doesn't support every encoder. For example, there's currently no support for the o200k_base encoding used by GPT-4o. allowedSpecialTokens defines a collection of special tokens that are permitted within the tokenization process. Special tokens are strings that you want to treat uniquely, ensuring they aren't split during tokenization. For example, ["[START]", "[END]"]. For languages in which the tiktoken library doesn't tokenize as expected, it's recommended to chunk by characters instead. |
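The constraints in the parameter table can be checked before a skillset is submitted. The following Python sketch is one way to do that. The `check_split_skill` helper and its messages are hypothetical, the 300 to 50000 range is applied only when chunking by characters (as the table states it), and the service itself might handle these cases differently, for example by ignoring a parameter that doesn't apply.

```python
# Language codes listed for defaultLanguageCode in the table above.
SUPPORTED_LANGUAGE_CODES = {
    "am", "bs", "cs", "da", "de", "en", "es", "et", "fr", "he", "hi",
    "hr", "hu", "fi", "id", "is", "it", "ja", "ko", "lv", "no", "nl",
    "pl", "pt-PT", "pt-BR", "ru", "sk", "sl", "sr", "sv", "tr", "ur",
    "zh-Hans",
}

# Parameters that the table describes as applying only to pages mode.
PAGE_ONLY_PARAMETERS = ("maximumPageLength", "pageOverlapLength",
                        "maximumPagesToTake", "unit")


def check_split_skill(skill: dict) -> list[str]:
    """Return findings for a Text Split skill definition (hypothetical helper)."""
    findings = []

    mode = skill.get("textSplitMode")
    if mode is not None and mode not in ("pages", "sentences"):
        findings.append(f"textSplitMode must be 'pages' or 'sentences', got {mode!r}")
    if mode == "sentences":
        for name in PAGE_ONLY_PARAMETERS:
            if name in skill:
                findings.append(f"{name} only applies when textSplitMode is 'pages'")

    unit = skill.get("unit", "characters")  # characters is the documented default
    if unit not in ("characters", "azureOpenAITokens"):
        findings.append(f"unit must be 'characters' or 'azureOpenAITokens', got {unit!r}")

    length = skill.get("maximumPageLength", 5000)  # 5000 is the documented default
    if mode == "pages" and unit == "characters" and not 300 <= length <= 50000:
        findings.append(f"maximumPageLength must be between 300 and 50000, got {length}")

    language = skill.get("defaultLanguageCode", "en")  # en is the documented default
    if language not in SUPPORTED_LANGUAGE_CODES:
        findings.append(f"defaultLanguageCode {language!r} isn't in the supported list")

    return findings


print(check_split_skill({"textSplitMode": "pages", "maximumPageLength": 100}))
```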
Skill inputs
Parameter name | Description |
---|---|
text | The text to split into substrings. |
languageCode | (Optional) Language code for the document. If you don't know the language of the text inputs (for example, if you're using LanguageDetectionSkill to detect the language), you can omit this parameter. If you set languageCode to a language that isn't in the supported list for defaultLanguageCode, a warning is emitted and the text isn't split. |
Skill outputs
Parameter name | Description |
---|---|
textItems | Output is an array of substrings that were extracted. textItems is the default name of the output. targetName is optional, but if you have multiple Text Split skills, make sure to set targetName so that you don't overwrite the data from the first skill with the second one. If targetName is set, use it in output field mappings or in downstream skills that consume the skill output, such as an embedding skill. |
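To make the effect of targetName concrete, the following Python sketch derives the path that a downstream skill or output field mapping would use to reference the chunks. The `split_output_path` helper is hypothetical, and the sketch assumes that an omitted context falls back to /document.

```python
def split_output_path(skill: dict) -> str:
    """Return the enrichment path for each chunk produced by a Text Split skill."""
    # Assumption: when context is omitted, it falls back to /document.
    context = skill.get("context", "/document").rstrip("/")
    output = next(o for o in skill["outputs"] if o["name"] == "textItems")
    # targetName, when set, replaces the default output name textItems.
    node = output.get("targetName", "textItems")
    return f"{context}/{node}/*"


with_target = {"outputs": [{"name": "textItems", "targetName": "mypages"}]}
without_target = {"outputs": [{"name": "textItems"}]}

print(split_output_path(with_target))
print(split_output_path(without_target))
```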
Sample definition
{
    "name": "SplitSkill",
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "description": "A skill that splits text into chunks",
    "context": "/document",
    "defaultLanguageCode": "en",
    "textSplitMode": "pages",
    "unit": "azureOpenAITokens",
    "azureOpenAITokenizerParameters": {
        "encoderModelName": "cl100k_base",
        "allowedSpecialTokens": [
            "[START]",
            "[END]"
        ]
    },
    "maximumPageLength": 512,
    "inputs": [
        {
            "name": "text",
            "source": "/document/text"
        },
        {
            "name": "languageCode",
            "source": "/document/language"
        }
    ],
    "outputs": [
        {
            "name": "textItems",
            "targetName": "pages"
        }
    ]
}
Sample input
{
    "values": [
        {
            "recordId": "1",
            "data": {
                "text": "This is the loan application for Joe Romero, a Microsoft employee who was born in Chile and who then moved to Australia...",
                "languageCode": "en"
            }
        },
        {
            "recordId": "2",
            "data": {
                "text": "This is the second document, which will be broken into several pages...",
                "languageCode": "en"
            }
        }
    ]
}
Sample output
{
    "values": [
        {
            "recordId": "1",
            "data": {
                "textItems": [
                    "This is the loan...",
                    "In the next section, we continue..."
                ]
            }
        },
        {
            "recordId": "2",
            "data": {
                "textItems": [
                    "This is the second document...",
                    "In the next section of the second doc..."
                ]
            }
        }
    ]
}
This example is for integrated vectorization.
pageOverlapLength: Overlapping text is useful in data chunking scenarios because it preserves continuity between chunks generated from the same document.
maximumPagesToTake: Limits on page intake are useful in vectorization scenarios because they help you stay under the maximum input limits of the embedding models providing the vectorization.
This definition adds pageOverlapLength of 100 characters and maximumPagesToTake of one.
Assuming maximumPageLength is 5,000 characters (the default), "maximumPagesToTake": 1 processes the first 5,000 characters of each source document. This definition sets maximumPageLength to 1,000, so at most the first 1,000 characters of each document are processed.
This example sets textItems to mypages through targetName. Because targetName is set, mypages is the value you should use to select the output from the Text Split skill. Use /document/mypages/* in downstream skills, indexer output field mappings, knowledge store projections, and index projections.
{
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "textSplitMode": "pages",
    "maximumPageLength": 1000,
    "pageOverlapLength": 100,
    "maximumPagesToTake": 1,
    "defaultLanguageCode": "en",
    "inputs": [
        {
            "name": "text",
            "source": "/document/content"
        },
        {
            "name": "languageCode",
            "source": "/document/language"
        }
    ],
    "outputs": [
        {
            "name": "textItems",
            "targetName": "mypages"
        }
    ]
}
Sample input
{
    "values": [
        {
            "recordId": "1",
            "data": {
                "text": "This is the loan application for Joe Romero, a Microsoft employee who was born in Chile and who then moved to Australia...",
                "languageCode": "en"
            }
        },
        {
            "recordId": "2",
            "data": {
                "text": "This is the second document, which will be broken into several sections...",
                "languageCode": "en"
            }
        }
    ]
}
Within each "textItems" array, trailing text from the first item is copied into the beginning of the second item.
{
    "values": [
        {
            "recordId": "1",
            "data": {
                "textItems": [
                    "This is the loan...Here is the overlap part",
                    "Here is the overlap part...In the next section, we continue..."
                ]
            }
        },
        {
            "recordId": "2",
            "data": {
                "textItems": [
                    "This is the second document...Here is the overlap part...",
                    "Here is the overlap part...In the next section of the second doc..."
                ]
            }
        }
    ]
}
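The overlap behavior can be illustrated with a small character-based chunker. The following Python sketch is a simplified stand-in and not the skill's actual algorithm: it prefers to end a page at the last period, question mark, or exclamation point that fits, copies the tail of each page into the next one, and stops after a page limit. The `split_into_pages` function and its parameter names are hypothetical, and the real skill is language-aware in ways this sketch isn't.

```python
def split_into_pages(text: str, maximum_page_length: int = 5000,
                     page_overlap_length: int = 0,
                     maximum_pages_to_take: int = 0) -> list[str]:
    """Split text into overlapping pages, preferring sentence boundaries."""
    if not 0 <= page_overlap_length < maximum_page_length:
        raise ValueError("overlap must be non-negative and smaller than the page length")

    pages = []
    start = 0
    while start < len(text):
        # A limit of 0 means return all pages.
        if maximum_pages_to_take and len(pages) >= maximum_pages_to_take:
            break
        end = min(start + maximum_page_length, len(text))
        if end < len(text):
            # Look for the last sentence-ending mark past the overlap region,
            # so that every page contributes some new text.
            lowest = start + page_overlap_length
            cut = max(text.rfind(mark, lowest, end) for mark in ".?!")
            if cut != -1:
                end = cut + 1
        pages.append(text[start:end])
        if end == len(text):
            break
        # The next page starts with the tail of the current page.
        start = end - page_overlap_length
    return pages


sample = "One. Two. Three. Four. Five."
pages = split_into_pages(sample, maximum_page_length=12, page_overlap_length=4)
for page in pages:
    print(repr(page))
```

Calling the function with `maximum_pages_to_take=1` returns only the first page, which mirrors how a page limit restricts processing to the start of each document.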
If a language isn't supported, a warning is generated.