Cognitive Search Storage overhead for large fields

Question

Sanjay Chouhan (Microsoft Employee)

Why/how does the storage cost of Cognitive Search blow up when one of the retrieval fields holds a large amount of data?

Example scenario:

PDF and CSV files totaling 100 GB of storage
Search all fields
Retrieval of only 3 fields
One retrieval field has 95% of the data (large field)

For 100 GB worth of files, the Cognitive Search storage overhead nearly doubles in size.

1 answer

Answer 1

When you index data in Azure Cognitive Search, the service builds an inverted index over the searchable fields, mapping each term to the documents that contain it. This index is what lets the service quickly find documents that match a search query. Its size depends on the amount of text that is indexed and on the number of unique terms in that text.
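
To make the idea concrete, here is a toy sketch of an inverted index in Python. It is not how Azure Cognitive Search is implemented internally; the function name `build_inverted_index` and the whitespace tokenizer are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids containing it.

    Toy illustration only: real search engines use analyzers, compression,
    and positional data that this sketch leaves out.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "storage cost of a large field",
    2: "large field storage overhead",
}
index = build_inverted_index(docs)
print(index["large"])  # ids of the documents that contain "large"
print(len(index))      # number of unique terms across all documents
```

Every new unique term adds an entry, and every document containing a term adds to that term's posting list, which is why a field with a lot of text tends to dominate the index size.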

When a field that holds a large amount of data is both searchable and retrievable, it takes up space twice. The original content is stored in the index so that it can be returned in results, and the inverted index also has to store information about every term in the field and every document that contains those terms. If the field is very large, this results in a large number of terms and long lists of documents for those terms.

In your example scenario, you search all fields, so the field that holds 95% of the data is searchable as well as retrievable. Its inverted index will be much larger than the inverted index for the other fields, and it comes on top of the stored content itself. This can result in a significant increase in storage overhead.

To reduce the storage overhead, you can consider reducing the amount of data that is stored in the large field. For example, you could store the data in a separate storage account and keep only a reference to it in the search index. Alternatively, you could split the large field into smaller fields, so that only the parts that need full-text search are marked searchable and only the fields needed for a particular query are retrieved.
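
The first option might look roughly like the sketch below, which shapes a document before it is uploaded to the index. Everything here is hypothetical: the helper `to_index_document`, the field names `preview` and `content_url`, the 200-character preview length, and the placeholder URL are assumptions for illustration, not part of any Azure SDK.

```python
def to_index_document(doc_id, title, body, blob_url, preview_chars=200):
    """Build the document to upload to the search index.

    The full body is assumed to live in external storage at blob_url
    (hypothetical); only a short preview and the reference are kept here.
    """
    return {
        "id": doc_id,
        "title": title,
        "preview": body[:preview_chars],  # short excerpt instead of the full text
        "content_url": blob_url,          # reference to the full content
    }


full_text = "lorem ipsum " * 10_000  # stand-in for a large extracted PDF body
doc = to_index_document(
    "1", "Sample report", full_text, "https://example.invalid/reports/1.pdf"
)
print(len(full_text), len(doc["preview"]))
```

The trade-off is that text left out of the index cannot be matched by full-text search, so this only fits when the large field does not need to be searchable in full.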

I hope this helps! Let me know if you have any other questions.
