Azure OpenAI chat completion with Azure AI Search returns [doc1] in the content of the assistant response
Hello everyone,
The issue:
I am using the Azure OpenAI chat completion API with my own data, retrieved from Azure Cognitive Search (recently renamed Azure AI Search), to build a chatbot application with RAG capabilities.
The problem I am facing is that the answer returned by the chat completion API contains a [doc1] marker regardless of which document is being referenced (see below, and note that I don't have any document called doc1).
This is what I am referring to:
"... Dissertation Fellowship. [doc1]"
This is the full assistant message object from the messages array:
{"index": 1,
"role": "assistant",
"content": "Based on the retrieved document, Gloria Gonzalez is a Ph.D. holder in Spanish (US Hispanic Literature) from the University of Houston. She is an adjunct lecturer at the University of Houston's Department of Hispanic Studies, where she teaches courses such as Mexican-American Literature, Women in Hispanic Literature, and Spanish-American Short Story. She has published several peer-reviewed articles and is the author of the book \"Quixote Reborn: The Wanderer in US Hispanic Literature,\" which is forthcoming from Yale University Press. She has also presented at various conferences, including the Hispanic Storytelling Association Annual Conference and the US Hispanic Literature Annual Conference. Additionally, she has received several honors and awards, including the UH Teaching Awards and the Dissertation Fellowship. [doc1]",
"end_turn": true }
The goal:
My goal is to get the document's name in the answer instead of the [doc1] marker.
Additional information on the chat completion API:
Here is the URI: <my endpoint>/openai/deployments/<my deployment name>/extensions/chat/completions?api-version=2023-06-01-preview
Here is the request body that I use:
{
  "temperature": 0,
  "max_tokens": 1000,
  "top_p": 1.0,
  "dataSources": [
    {
      "type": "AzureCognitiveSearch",
      "parameters": {
        "endpoint": "<my end point>",
        "key": "<my key>",
        "indexName": "<my index name>"
      }
    }
  ],
  "messages": [
    {"role": "user","content": "Who is Gloria"}
  ]
}
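One detail that may matter here: in the response shown further down, the citation's title, filepath, and url are all null. A possible direction, as a sketch only, is to add a fieldsMapping block to the data source parameters so the citation fields point at the SharePoint metadata fields of the index. The fieldsMapping parameter names used here (contentFields, titleField, filepathField) are an assumption based on the REST API reference for the extensions endpoint and should be verified for the API version in use. build_request_body is a hypothetical helper, and the field names come from the index definition further down.

```python
import json


def build_request_body(question, search_endpoint, search_key, index_name):
    """Build the chat completion body with an assumed fieldsMapping block."""
    return {
        "temperature": 0,
        "max_tokens": 1000,
        "top_p": 1.0,
        "dataSources": [
            {
                "type": "AzureCognitiveSearch",
                "parameters": {
                    "endpoint": search_endpoint,
                    "key": search_key,
                    "indexName": index_name,
                    # Assumption: parameter names as read from the REST
                    # reference for the extensions API; verify before use.
                    "fieldsMapping": {
                        "contentFields": ["content"],
                        "titleField": "metadata_spo_item_name",
                        "filepathField": "metadata_spo_item_path",
                    },
                },
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }


if __name__ == "__main__":
    body = build_request_body(
        "Who is Gloria", "<my end point>", "<my key>", "<my index name>"
    )
    # Print the JSON that would be sent as the request body.
    print(json.dumps(body, indent=2))
```

If the mapping is accepted, the citations in the tool message would be expected to carry a title and filepath instead of null, which is what any client-side lookup of the document name needs.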
Here is the complete response that I get:
{
"id": "<>",
"model": "gpt-35-turbo",
"created": 1701778584,
"object": "chat.completion",
"choices": [
{
"index": 0,
"messages": [
{
"index": 0,
"role": "tool",
"content": "{\"citations\": [{\"content\": \"Gloria Gonzalez\\n3204 Windover Way\\nHoustonFemenina Hispánica\\nModern Languages Association\\n\\nGloriaGonzalezCV.docx\", \"id\": null, \"title\": null, \"filepath\": null, \"url\": null, \"metadata\": {\"chunking\": \"orignal document size=580. Scores=0.5272721Org Highlight count=7.\"}, \"chunk_id\": \"0\"}], \"intent\": \"[\\\"Who is Gloria?\\\"]\"}",
"end_turn": false
},
{
"index": 1,
"role": "assistant",
"content": "Based on the retrieved document, Gloria Gonzalez is a Ph.D. holder in Spanish (US Hispanic Literature) from the University of Houston. She is an adjunct lecturer at the University of Houston's Department of Hispanic Studies, where she teaches courses such as Mexican-American Literature, Women in Hispanic Literature, and Spanish-American Short Story. She has published several peer-reviewed articles and is the author of the book \"Quixote Reborn: The Wanderer in US Hispanic Literature,\" which is forthcoming from Yale University Press. She has also presented at various conferences, including the Hispanic Storytelling Association Annual Conference and the US Hispanic Literature Annual Conference. Additionally, she has received several honors and awards, including the UH Teaching Awards and the Dissertation Fellowship. [doc1]",
"end_turn": true
}
]
}
],
"usage": {
"prompt_tokens": 3937,
"completion_tokens": 157,
"total_tokens": 4094
}
}
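The [doc1] marker appears to be a positional reference into the citations list carried by the tool message: [doc1] would be the first citation, [doc2] the second, and so on. Under that assumption, a client-side sketch in Python could swap each marker for the citation's title or filepath, and leave the marker untouched when both are null (as they are in the response above). resolve_citation_markers is a hypothetical helper, not part of any SDK, and the title in the sample data is illustrative only.

```python
import json
import re


def resolve_citation_markers(messages):
    """Replace [docN] markers in the assistant message with citation names.

    Assumes [docN] is a 1-based index into the tool message's citations.
    """
    citations = []
    answer = ""
    for message in messages:
        if message["role"] == "tool":
            # The tool message content is itself a JSON-encoded string.
            citations = json.loads(message["content"]).get("citations", [])
        elif message["role"] == "assistant":
            answer = message["content"]

    def replace(match):
        position = int(match.group(1)) - 1
        if 0 <= position < len(citations):
            citation = citations[position]
            name = citation.get("title") or citation.get("filepath")
            if name:
                return "[" + name + "]"
        # No usable name for this citation: keep the original marker.
        return match.group(0)

    return re.sub(r"\[doc(\d+)\]", replace, answer)


if __name__ == "__main__":
    # Illustrative data: the title is filled in here, unlike in the post.
    sample = [
        {
            "role": "tool",
            "content": json.dumps(
                {"citations": [{"title": "GloriaGonzalezCV.docx", "filepath": None}]}
            ),
        },
        {"role": "assistant", "content": "... Dissertation Fellowship. [doc1]"},
    ]
    print(resolve_citation_markers(sample))
```

The function takes the messages array from choices[0] of the response. With null titles and filepaths it returns the answer unchanged, so it only helps once the citations actually carry a name.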
I am using the following Microsoft documentation:
- "Azure OpenAI Service REST API reference": https://learn.microsoft.com/en-us/azure/ai-services/openai/reference
- "Index data from SharePoint document libraries": https://learn.microsoft.com/en-us/azure/search/search-howto-index-sharepoint-online
Additional information on the Azure Cognitive Search (AI Search) setup:
Index API body:
{
  "name" : "<my sharepoint index name>",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
    { "name": "metadata_spo_item_name", "type": "Edm.String", "key": false, "searchable": true, "filterable": false, "sortable": false, "facetable": false },
    { "name": "metadata_spo_item_path", "type": "Edm.String", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
    { "name": "metadata_spo_item_content_type", "type": "Edm.String", "key": false, "searchable": false, "filterable": true, "sortable": false, "facetable": true },
    { "name": "metadata_spo_item_last_modified", "type": "Edm.DateTimeOffset", "key": false, "searchable": false, "filterable": false, "sortable": true, "facetable": false },
    { "name": "metadata_spo_item_size", "type": "Edm.Int64", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
    { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
  ]
}
Indexer API body:
{
  "name" : "<my sharepoint indexer name>",
  "dataSourceName" : "<my sharepoint datasource name>",
  "targetIndexName" : "<my sharepoint index name>",
  "parameters": {
    "batchSize": null,
    "maxFailedItems": null,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "indexedFileNameExtensions" : ".pdf, .docx, .pptx, .xlsx",
      "excludedFileNameExtensions" : ".png, .jpg",
      "dataToExtract": "contentAndMetadata"
    }
  },
  "schedule" : {"interval" : "PT5M"},
  "fieldMappings" : [
    {
      "sourceFieldName" : "metadata_spo_site_library_item_id",
      "targetFieldName" : "id",
      "mappingFunction" : {
        "name" : "base64Encode"
      }
    }
  ]
}
Final word
Thank you for your help.
Let me know if you need any additional information.
Thank you,
Anis