I tried to upload a document as an Azure OpenAI Studio Chat Playground datasource and it only indexed part of it

Alejandro Erickson 0 Reputation points
2023-08-31T18:32:39.0133333+00:00

I added a datasource in the chat playground of Azure OpenAI Studio with the "upload files" option. I created a new Azure Cognitive Search Resource in the "Basic" tier, enabled vector embedding, and assigned a deployed ada 002 model to embed document chunks.

On the next screen I uploaded a 1.8 MB text file of tab-delimited data and selected "Vector" search (not Vector+simple and not Vector+semantic). It took 5 minutes to chunk and embed the document and seemed to finish successfully. I can now get some answers from the document in the chat.

However, when I browse the index and search for certain things, it's obviously missing a lot of data. The index has 15 documents of about 5k characters each, while the original document has 1.8M characters, so a lot of it is missing. I double checked this by searching the index directly, and indeed many parts are missing.

What happened to the missing parts of the uploaded file and how can I get it to index the whole file? Is the playground simply limited to indexing 15 documents, or am I limited by the "Basic" tier search service?

Your help is appreciated, thanks!


1 answer

  1. ajkuma 27,946 Reputation points Microsoft Employee
    2023-09-01T20:18:04.9466667+00:00

    @Alejandro Erickson ,

    Thanks for the follow-up and sharing the workaround.

    From your description, this looks related to the indexer limits of Azure Cognitive Search. As mentioned in this doc, Azure Cognitive Search limits how much text an indexer extracts from any one document, and the limit depends on the pricing tier. If a document is truncated, a warning appears in the indexer status response.

    For the Basic tier, the limit is 64,000 characters. Your original document has 1.8M characters, far above that limit, so the indexer most likely did not extract all of the text from the document.

    To avoid this issue, you can break documents with large amounts of text into multiple, smaller documents (the workaround you already found), or you can use a higher pricing tier, such as the Standard tier, which has a limit of 4 million characters.

    To index the whole file as a single document, you may need to upgrade to a higher tier that supports larger documents. The Service Limits in Azure Cognitive Search doc lists the limits for each tier.
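
The splitting workaround can be sketched in Python as below. This is only an illustration, not an official tool: the function name `split_by_lines` and the constant `BASIC_TIER_CHAR_LIMIT` are made up for this example, and it assumes that Python's `len()` is a close enough stand-in for how the indexer counts characters. It breaks only at line ends, so rows of tab-delimited data stay intact.

```python
from typing import Iterator

# Basic tier limit quoted in this answer; the name is hypothetical.
BASIC_TIER_CHAR_LIMIT = 64_000


def split_by_lines(text: str, max_chars: int = BASIC_TIER_CHAR_LIMIT) -> Iterator[str]:
    """Yield pieces of `text`, each at most `max_chars` long, cut only at line ends."""
    piece: list[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        if len(line) > max_chars:
            # A single row longer than the limit cannot be kept whole.
            raise ValueError("a single line is longer than max_chars")
        if piece and size + len(line) > max_chars:
            # Adding this line would overflow the current piece, so emit it.
            yield "".join(piece)
            piece, size = [], 0
        piece.append(line)
        size += len(line)
    if piece:
        yield "".join(piece)
```

Each yielded piece could then be written to its own file and uploaded as a separate document. Passing a `max_chars` somewhat below the tier limit leaves some headroom in case the service counts characters differently.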

    Reference (limits as stated in this Azure doc at the time of submitting this answer):

    Indexers limit how much text can be extracted from any one document. This limit depends on the pricing tier: 32,000 characters for Free tier, 64,000 for Basic, 4 million for Standard, 8 million for Standard S2, and 16 million for Standard S3. Text that was truncated won't be indexed. To avoid this warning, try breaking apart documents with large amounts of text into multiple, smaller documents.
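
As a rough worked example of these limits, the short Python sketch below estimates how many pieces a document would need on each tier. The dictionary simply restates the numbers quoted above, which may have changed since; the names `INDEXER_CHAR_LIMITS` and `pieces_needed` are hypothetical.

```python
import math

# Per-document extraction limits as quoted above (characters per document).
INDEXER_CHAR_LIMITS = {
    "Free": 32_000,
    "Basic": 64_000,
    "Standard": 4_000_000,
    "Standard S2": 8_000_000,
    "Standard S3": 16_000_000,
}


def pieces_needed(total_chars: int, tier: str) -> int:
    """Minimum number of documents so that each stays within the tier's limit."""
    return math.ceil(total_chars / INDEXER_CHAR_LIMITS[tier])


# The 1.8M-character file from the question:
print(pieces_needed(1_800_000, "Basic"))     # 29
print(pieces_needed(1_800_000, "Standard"))  # 1
```

So on the Basic tier the file would have to be split into at least 29 documents, while on the Standard tier it fits in one.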
