Cognitive Search Storage overhead for large fields

Sanjay Chouhan 0 Reputation points Microsoft Employee
2023-05-26T09:11:40.02+00:00

Why/how does the storage cost of Cognitive Search blow up when one of the retrieval fields holds a large amount of data?

Example scenario:

  • PDF and CSV files totaling 100 GB of storage
  • Search all fields
  • Retrieval of only 3 fields
  • One retrieval field has 95% of the data (large field)

For 100 GB worth of files, the Cognitive Search storage nearly doubles in size.

Azure AI Search

1 answer

  1. brtrach-MSFT 17,741 Reputation points Microsoft Employee Moderator
    2023-05-26T23:41:10.3933333+00:00

    When you index data in Azure Cognitive Search, the service creates an inverted index for each searchable field that maps terms to the documents that contain them. The index is used to quickly find documents that match a search query. The size of the index depends on the amount of text that is indexed and the number of unique terms in that text.
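
    As a rough illustration of the idea, here is a toy inverted index in Python. It is only a sketch: the whitespace tokenizer, the `build_inverted_index` helper, and the sample documents are assumptions made for this example, and the service's real analyzers and on-disk format are more involved.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample documents.
docs = {
    1: "storage cost of search",
    2: "search index storage",
    3: "large field storage overhead",
}

index = build_inverted_index(docs)
print(index["storage"])  # [1, 2, 3]
print(index["search"])   # [1, 2]
print(len(index))        # 8 unique terms
```

    Every new unique term adds an entry, and every document containing a term adds to that term's list, which is why the structure grows with the amount and variety of text.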

    When a searchable field contains a large amount of data, the size of the index can increase significantly. This is because the inverted index needs to store information about every term in the field and every document that contains those terms, and it stores this on top of the field content that is kept for retrieval. If the field is very large, this can result in a large number of terms and a large number of term-to-document entries.

    In your example scenario, all fields are searched, so the field that holds 95% of the data will have a much larger inverted index than the other fields. This can result in a significant increase in storage overhead.
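
    To make the imbalance concrete, the sketch below counts the (term, document) pairs a toy inverted index would hold per field. The `count_postings` helper and the generated documents are hypothetical, and the numbers only show the proportion, not real service storage.

```python
def count_postings(docs, field):
    """Count the (term, document) pairs an inverted index would hold for one field."""
    postings = set()
    for doc_id, doc in enumerate(docs):
        for term in doc[field].lower().split():
            postings.add((term, doc_id))
    return len(postings)

# Hypothetical documents: a two-word title and a 200-term body each.
docs = [
    {
        "title": f"report {i}",
        "content": " ".join(f"term{i}_{j}" for j in range(200)),
    }
    for i in range(10)
]

title_postings = count_postings(docs, "title")
content_postings = count_postings(docs, "content")
print(title_postings, content_postings)  # 20 2000
```

    In this toy data the large field accounts for almost all of the index entries, which mirrors a field that holds most of the source text.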

    To reduce the storage overhead, you can consider reducing the amount of data that is stored in the large field. For example, you could store the data in a separate storage account, and only store a reference to the data in the search index. Alternatively, you could split the large field into smaller fields, and only retrieve the fields that are needed for a particular query.
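
    A minimal sketch of the first option, written as the JSON bodies you would send to the REST API. The index name and the field names (`id`, `title`, `summary`, `content_url`) are hypothetical; the attribute names (`key`, `searchable`, `retrievable`, `filterable`, `sortable`, `facetable`) and the `search`, `select`, and `top` request parameters follow the Azure AI Search REST API.

```python
import json

def text_field(name, searchable, key=False):
    """Build a string field definition with extra attributes switched off."""
    return {
        "name": name,
        "type": "Edm.String",
        "key": key,
        "searchable": searchable,
        "retrievable": True,
        # Assumed not needed in this scenario, so disabled explicitly.
        "filterable": False,
        "sortable": False,
        "facetable": False,
    }

# Hypothetical index: the large content stays in blob storage and the index
# keeps only a short searchable summary plus a reference to the blob.
index_definition = {
    "name": "docs-index",
    "fields": [
        text_field("id", searchable=False, key=True),
        text_field("title", searchable=True),
        text_field("summary", searchable=True),
        text_field("content_url", searchable=False),
    ],
}

# Query body that returns only the three fields the application needs.
search_request = {
    "search": "storage overhead",
    "select": "id,title,content_url",
    "top": 10,
}

print(json.dumps(index_definition, indent=2))
print(json.dumps(search_request))
```

    The trade-off is that text kept outside the index is no longer full-text searchable, so this only fits if a shorter searchable field (the assumed `summary` above) is enough for your queries.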

    I hope this helps! Let me know if you have any other questions.
