I tried to upload a document as an Azure OpenAI Studio Chat Playground datasource and it only indexed part of it

Question

Alejandro Erickson 0

I added a datasource in the chat playground of Azure OpenAI Studio with the "upload files" option. I created a new Azure Cognitive Search Resource in the "Basic" tier, enabled vector embedding, and assigned a deployed ada 002 model to embed document chunks.

On the next screen I uploaded a 1.8 MB text file of tab-delimited data, and selected "Vector" search (not Vector+simple and not Vector+semantic). It took 5 minutes to chunk and embed the document and seemed to finish successfully. I can now get some answers from the document in the chat.

However, when I browse the index and search for certain things, it's obviously missing a lot of data. The index has 15 documents of about 5k characters each, while the original document has 1.8M characters, so a lot of it is missing. I double checked this by searching the index directly, and indeed many parts are missing.

What happened to the missing parts of the uploaded file and how can I get it to index the whole file? Is the playground simply limited to indexing 15 documents, or am I limited by the "Basic" tier search service?

Your help is appreciated, thanks!

Alejandro Erickson 0 Reputation points

2023-09-01T18:28:56.4733333+00:00

I found a workaround by dividing the document up into many small files and uploading those instead. It seems to index everything when I do this.
ajkuma 28,036 Reputation points Microsoft Employee Moderator

2023-09-03T04:18:37.19+00:00

Alejandro Erickson, just checking in to see whether you have had a chance to review the previous response. If the answer helped (pointed you in the right direction), please click Accept Answer. Otherwise, please share the requested information or more details so we can help you better.

Answer 1

@Alejandro Erickson ,

Thanks for the follow-up and sharing the workaround.

Based on my understanding of your scenario, the issue appears to be related to the indexer limits of Azure Cognitive Search. As mentioned in this doc, Azure Cognitive Search limits how much text an indexer extracts from any one document, and the limit depends on the pricing tier. A warning will appear in the indexer status response if documents are truncated.

For the Basic tier, the limit is 64,000 characters per document. Since your original document has 1.8M characters, which is far larger than the Basic tier limit, it is likely that the indexer was not able to extract all the text from the document.
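To confirm that truncation is what happened, the indexer status response can be inspected for warnings. The snippet below is only a sketch: it assumes the status JSON has already been fetched and parsed into a dict, that warnings sit under `lastResult.warnings` with a `message` field, and that the truncation warning contains the word "truncated". The helper name `truncation_warnings` and the sample payload are hypothetical and not taken from this thread.

```python
def truncation_warnings(status):
    """Return warning messages from an indexer status dict that mention truncation.

    Assumption: `status` is the parsed JSON of an indexer status response,
    with warnings listed under status["lastResult"]["warnings"].
    """
    last_result = status.get("lastResult") or {}
    warnings = last_result.get("warnings") or []
    messages = [w.get("message", "") for w in warnings]
    # Case-insensitive match on "truncat" covers "Truncated" and "truncation".
    return [m for m in messages if "truncat" in m.lower()]


# Hypothetical sample payload, for illustration only.
sample_status = {
    "lastResult": {
        "status": "success",
        "warnings": [
            {"message": "Truncated extracted text to 64000 characters"},
            {"message": "Some unrelated warning"},
        ],
    }
}

for message in truncation_warnings(sample_status):
    print(message)
```

If the list comes back non-empty for the indexer behind the playground data source, the missing text was most likely cut off at the tier limit rather than lost for some other reason.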

To avoid this issue, you can break documents with large amounts of text into multiple, smaller documents (the workaround you already found), or you can use a higher pricing tier, such as the Standard tier, which has a limit of 4 million characters.

To index the whole file, you may need to upgrade to a higher tier that supports larger documents. You may check the Service Limits in Azure Cognitive Search doc to see the limits for each tier.

Reference (limits as stated in the Azure doc at the time of submitting this answer):

Indexers limit how much text can be extracted from any one document. This limit depends on the pricing tier: 32,000 characters for Free tier, 64,000 for Basic, 4 million for Standard, 8 million for Standard S2, and 16 million for Standard S3. Text that was truncated won't be indexed. To avoid this warning, try breaking apart documents with large amounts of text into multiple, smaller documents.
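The splitting workaround can be scripted rather than done by hand. Below is a minimal sketch, assuming a UTF-8 text file whose rows should stay intact, so it splits only on line boundaries. The tier limits are the ones quoted above; the function names, the `margin` safety factor, and the output file naming are hypothetical choices made for illustration.

```python
from pathlib import Path

# Per-document extraction limits (characters) as quoted in the answer above.
TIER_CHAR_LIMITS = {
    "free": 32_000,
    "basic": 64_000,
    "standard": 4_000_000,
    "standard_s2": 8_000_000,
    "standard_s3": 16_000_000,
}


def split_lines(lines, max_chars):
    """Group whole lines into chunks of at most max_chars characters.

    A single line longer than max_chars is kept whole in its own chunk,
    so that one chunk can exceed the limit.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


def split_file(src, out_dir, tier="basic", margin=0.5):
    """Write src as numbered part files, each below the tier limit times margin."""
    src, out_dir = Path(src), Path(out_dir)
    max_chars = int(TIER_CHAR_LIMITS[tier] * margin)
    out_dir.mkdir(parents=True, exist_ok=True)
    # newline="" preserves the original line endings on read and write.
    with src.open(encoding="utf-8", newline="") as f:
        lines = f.readlines()
    paths = []
    for i, chunk in enumerate(split_lines(lines, max_chars), start=1):
        path = out_dir / f"{src.stem}_part{i:04d}{src.suffix}"
        with path.open("w", encoding="utf-8", newline="") as f:
            f.write(chunk)
        paths.append(path)
    return paths
```

With the defaults, a call such as `split_file("data.txt", "parts")` (hypothetical paths) would cut a 1.8M-character file into roughly 57 files of up to 32,000 characters each, which can then be uploaded together. If the data has a header row, it appears only in the first part; repeating it in every part would need a small extension.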
