Sorry to come back to this. It still does not work for our use case, because the answers so far are not concrete enough for us. We would need more concrete help in order to stay with Microsoft Azure. Please try it out yourself and see whether it works for you; if it does not, please let me know as well. The exact code for the index, indexer, and skillset would be very helpful. Thank you!
I think the concrete help is needed on the Azure Portal side. My setup in the Azure Portal:
Azure blob container: The JSON files used as input look like this, with long but still reasonable content (for example, 1-2 A4 pages):
{
"chunk_id": 10,
"content": "LONG CONTENT COMES HERE",
"filepath": "TITLE_10",
"last_updated": "20241127173113",
"title": "TITLE_10"
}
Skillset:
I tried a skillset that sets maximumPageLength to the highest possible value (50000). I checked that all my documents are below this length, so in theory they should not be chunked:
{
"@odata.etag": "TAG",
"name": "skillset-prevent-chunking",
"description": "Skillset to prevent unwanted chunking of documents whose content is below 50,000 characters.",
"skills": [
{
"@odata.type": "#Microsoft.Skills.Text.SplitSkill",
"name": "TextSplitSkill",
"description": "Splits content into large chunks only if needed. With maximumPageLength set high, documents below this length remain unsplit.",
"context": "/document",
"defaultLanguageCode": "en",
"textSplitMode": "pages",
"maximumPageLength": 50000,
"pageOverlapLength": 0,
"maximumPagesToTake": 0,
"unit": "characters",
"inputs": [
{
"name": "text",
"source": "/document/content",
"inputs": []
}
],
"outputs": [
{
"name": "textItems",
"targetName": "unsplit_content"
}
]
}
]
}
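For reference, the length check mentioned above can be reproduced roughly like this. It is only a sketch, assuming the JSON files have been downloaded from the blob container into a local folder; the function name and the folder are hypothetical, and Python's len() counts Unicode code points, which may not match exactly how the skill counts characters.

```python
import json
from pathlib import Path

MAX_PAGE_LENGTH = 50000  # same value as maximumPageLength in the skillset


def find_oversized_documents(folder, limit=MAX_PAGE_LENGTH):
    """Return (filename, length) pairs for JSON files whose "content" exceeds limit."""
    oversized = []
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            document = json.load(handle)
        # A missing "content" field is treated as empty text.
        length = len(document.get("content", ""))
        if length > limit:
            oversized.append((path.name, length))
    return oversized


# Usage (hypothetical local folder holding the downloaded blobs):
# print(find_oversized_documents("downloaded_blobs"))
```

An empty result means no file has content longer than the limit, which is the situation described above.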
Indexer:
{
"@odata.context": "CONTEXT",
"@odata.etag": "\"TAG\"",
"name": "indexer-no-chunking",
"description": null,
"dataSourceName": "new-datasource-switzerland-json",
"skillsetName": "skillset-prevent-chunking",
"targetIndexName": "index-no-chunking",
"disabled": null,
"schedule": null,
"parameters": {
"batchSize": null,
"maxFailedItems": null,
"maxFailedItemsPerBatch": null,
"base64EncodeKeys": null,
"configuration": {
"indexedFileNameExtensions": ".json",
"dataToExtract": "contentAndMetadata",
"parsingMode": "json"
}
},
"fieldMappings": [
{
"sourceFieldName": "filepath",
"targetFieldName": "filepath",
"mappingFunction": null
},
{
"sourceFieldName": "title",
"targetFieldName": "title",
"mappingFunction": null
},
{
"sourceFieldName": "chunk_id",
"targetFieldName": "chunk_id",
"mappingFunction": null
},
{
"sourceFieldName": "last_updated",
"targetFieldName": "last_updated",
"mappingFunction": null
}
],
"outputFieldMappings": [
{
"sourceFieldName": "/document/unsplit_content",
"targetFieldName": "unsplit_content",
"mappingFunction": null
},
{
"sourceFieldName": "/document/chunk_id",
"targetFieldName": "id",
"mappingFunction": null
}
],
"cache": null,
"encryptionKey": null
}
Index:
{
"@odata.etag": "ETAG",
"name": "index-no-chunking",
"fields": [
{
"name": "id",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"stored": true,
"sortable": true,
"facetable": false,
"key": true,
"synonymMaps": []
},
{
"name": "unsplit_content",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"stored": true,
"sortable": false,
"facetable": false,
"key": false,
"synonymMaps": []
},
{
"name": "filepath",
"type": "Edm.String",
"searchable": false,
"filterable": false,
"retrievable": true,
"stored": true,
"sortable": false,
"facetable": false,
"key": false,
"synonymMaps": []
},
{
"name": "title",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"stored": true,
"sortable": false,
"facetable": false,
"key": false,
"synonymMaps": []
},
{
"name": "chunk_id",
"type": "Edm.String",
"searchable": false,
"filterable": false,
"retrievable": true,
"stored": true,
"sortable": false,
"facetable": false,
"key": false,
"synonymMaps": []
},
{
"name": "last_updated",
"type": "Edm.String",
"searchable": false,
"filterable": false,
"retrievable": true,
"stored": true,
"sortable": false,
"facetable": false,
"key": false,
"synonymMaps": []
}
],
"scoringProfiles": [],
"corsOptions": {
"allowedOrigins": [
"*"
]
},
"suggesters": [],
"analyzers": [],
"normalizers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": [],
"similarity": {
"@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
}
}
For your information, the setup in Azure OpenAI Studio is very basic: we only add data with Azure AI Search as the data source and select the Azure AI Search index we created in the Azure Portal (index-no-chunking in this case).
What I observe in the Azure Portal is that the unsplit_content field looks as expected when I search the index (e.g. with * as the query). In the chatbot citations, however, the documents are split (chunked) again. I cannot control the length of these citations, and I do not know where the splitting happens!
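To make the difference measurable, a comparison along these lines could be used. This is only a sketch: the function name is hypothetical, and it assumes the citations have been copied out of the chat response as a list of objects with filepath and content fields, which is an assumption about the response shape rather than something confirmed here.

```python
def compare_citations_to_index(index_documents, citations):
    """Report, per citation, how its length compares with the indexed text.

    index_documents: dicts with "filepath" and "unsplit_content" keys,
        e.g. the results of searching the index with * as the query.
    citations: dicts with "filepath" and "content" keys (assumed shape).
    """
    indexed_lengths = {
        doc["filepath"]: len(doc["unsplit_content"]) for doc in index_documents
    }
    report = []
    for citation in citations:
        filepath = citation.get("filepath")
        indexed = indexed_lengths.get(filepath)  # None if not found in the index
        cited = len(citation.get("content") or "")
        report.append(
            {
                "filepath": filepath,
                "indexed_length": indexed,
                "citation_length": cited,
                "shorter_than_index": indexed is not None and cited < indexed,
            }
        )
    return report
```

Any entry with shorter_than_index set to True would be a citation that carries less text than the corresponding unsplit_content value in the index.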
Best, Tim