PDF Documents Not Fully Indexed by Azure Search

SwathiDhanwada-MSFT 18,911 Reputation points
2024-07-31T05:59:13.3+00:00

Why are my PDF documents not fully indexed by Azure Search indexer, even though no errors are raised?

PS - Based on common issues that we have seen from customers and other sources, we are posting these questions to help the Azure community.

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

1 answer

  1. SwathiDhanwada-MSFT 18,911 Reputation points
    2024-07-31T06:00:01.8033333+00:00

    This happens because the Azure AI Search indexer caps how many characters of text it extracts from each document, and the cap depends on the pricing tier: 32,000 characters on the Free tier, 64,000 on the Basic tier, and from 4 million to 16 million on the Standard tiers. If a PDF contains more text than the cap allows, the excess is truncated and never indexed. The indexer run therefore completes without raising an error even though part of the document's content is missing from the index.
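To make the truncation concrete, here is a minimal Python sketch that estimates how much of a document's text would fall past the limit. The limit values are the ones quoted above; the dictionary keys and the `truncated_chars` helper are hypothetical names used only for illustration, and Python's `len()` is assumed to approximate how the service counts characters.

```python
# Extraction limits as quoted in the answer (characters per document).
# The answer gives only a range for Standard tiers, so both ends are listed.
# The key names are illustrative, not official tier identifiers.
EXTRACTION_LIMITS = {
    "free": 32_000,
    "basic": 64_000,
    "standard_min": 4_000_000,
    "standard_max": 16_000_000,
}


def truncated_chars(text: str, limit: int) -> int:
    """Return how many characters of `text` lie beyond `limit` (0 if none)."""
    return max(0, len(text) - limit)


if __name__ == "__main__":
    extracted = "x" * 40_000  # stand-in for text extracted from a PDF
    lost = truncated_chars(extracted, EXTRACTION_LIMITS["free"])
    print(f"{lost} characters would not be indexed on the Free tier")
```

Running a check like this on text you have already extracted locally can show which documents are likely to be affected before you look at the index itself.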

    To address this, split any document with a large amount of text into smaller documents before indexing. Each piece then stays within the character limit of your pricing tier, so all of its text is indexed.
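One way the splitting step might look, as a rough sketch in Python: it assumes you have already extracted the text of the PDF yourself, and it cuts that text into chunks no longer than a chosen limit, preferring to break at whitespace. The `split_text` function and its parameters are hypothetical and not part of any Azure SDK.

```python
def split_text(text: str, max_chars: int) -> list[str]:
    """Split `text` into chunks of at most `max_chars` characters.

    Breaks at the last space or newline inside each window when one exists,
    and falls back to a hard cut otherwise. Joining the chunks reproduces
    the original text exactly.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for the last whitespace character inside the window.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the preceding chunk
        chunks.append(text[start:end])
        start = end
    return chunks


if __name__ == "__main__":
    sample = "word " * 10  # 50 characters
    for i, chunk in enumerate(split_text(sample, 12)):
        print(i, repr(chunk))
```

Each chunk would then be saved and indexed as its own document. Choosing a `max_chars` somewhat below the tier limit leaves headroom in case the service counts characters differently from Python.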

    For further details, you can refer to the Service limits in Azure AI Search documentation.

