How to Search Specific Content in Multi-Page PDF Documents in Azure AI Search

Question

SwathiDhanwada-MSFT 18,996 Moderator

How can I perform a search on specific content within multi-page PDF documents stored in SharePoint using Azure AI Search, and what are the limitations?

PS - Based on common issues that we have seen from customers and other sources, we are posting these questions to help the Azure community.

Answer 1

In Azure AI Search, a multi-page PDF document stored in SharePoint is indexed as a single search document. If a match is found in any searchable field or subfield of that document, the entire document is returned as the result. The service does not support performing searches specifically on subfields of complex types directly, so a query cannot return only the part of the document that matched.

To address this, you can parse the search response in your application and display only the relevant subfields to the user. You can also restructure your indexed documents. One approach is to index each page of the PDF file as a separate document in Azure AI Search, so that a match returns only the page that contains it. The pages can be modelled like this, where each element of the collection represents one page:


{
  "document": [
    { "pagenumber": 1, "content": "..." },
    { "pagenumber": 2, "content": "hello" }
  ]
}
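As a rough illustration of both suggestions, the Python sketch below starts from the structure shown above. It is only a sketch under assumptions: the function names, the `id` and `parent_id` fields, and the `report-p2` key format are hypothetical and not part of Azure AI Search. The first function flattens the pages into one document per page, ready to upload to an index; the second filters pages on the client side after a whole document has been returned.

```python
from typing import Any


def pages_to_documents(parent_id: str, nested: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn one nested PDF document into one flat document per page.

    The "id" and "parent_id" field names are assumptions for this sketch;
    use whatever key field your index actually defines.
    """
    return [
        {
            "id": f"{parent_id}-p{page['pagenumber']}",  # hypothetical key format
            "parent_id": parent_id,
            "pagenumber": page["pagenumber"],
            "content": page["content"],
        }
        for page in nested["document"]
    ]


def matching_pages(nested: dict[str, Any], term: str) -> list[dict[str, Any]]:
    """Client-side filter: keep only pages whose content contains the term."""
    needle = term.casefold()
    return [page for page in nested["document"] if needle in page["content"].casefold()]


if __name__ == "__main__":
    nested = {
        "document": [
            {"pagenumber": 1, "content": "..."},
            {"pagenumber": 2, "content": "hello"},
        ]
    }
    for doc in pages_to_documents("report", nested):
        print(doc)
    print(matching_pages(nested, "hello"))
```

The substring check in `matching_pages` is a simplification; it does not reproduce the analyzers or ranking that the search service applies, so it may select different pages than the ones that caused the match.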

Be aware that Azure AI Search limits the total number of elements across all complex collections in a single document to 3,000.
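If you keep the pages in a complex collection, it may help to check a document against this limit before uploading it. The following Python sketch counts the objects held in collections, including nested ones. The helper names are hypothetical, and the way nested elements are counted here is an assumption about how the service applies the limit, so treat the result as an estimate.

```python
from typing import Any

MAX_COMPLEX_ELEMENTS = 3000  # limit quoted in the answer above


def count_complex_elements(value: Any) -> int:
    """Count objects stored inside collections, including nested collections."""
    if isinstance(value, dict):
        # A plain object is not itself a collection element; look inside it.
        return sum(count_complex_elements(v) for v in value.values())
    if isinstance(value, list):
        # Each object in a collection counts once, plus anything nested in it.
        return sum(
            1 + count_complex_elements(item)
            for item in value
            if isinstance(item, dict)
        )
    return 0


def within_limit(doc: dict[str, Any], limit: int = MAX_COMPLEX_ELEMENTS) -> bool:
    """Return True if the document stays at or under the element limit."""
    return count_complex_elements(doc) <= limit


if __name__ == "__main__":
    big = {"document": [{"pagenumber": n, "content": ""} for n in range(1, 3002)]}
    print(count_complex_elements(big), within_limit(big))
```

A PDF that exceeds the limit under this count would need to be split across several search documents, or indexed with one document per page.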

Resources:

Azure AI Search Complex Data Types

