How to Search Specific Content in Multi-Page PDF Documents in Azure AI Search

SwathiDhanwada-MSFT 18,996 Reputation points Moderator
2024-08-01T07:13:40.9666667+00:00

How can I perform a search on specific content within multi-page PDF documents stored in SharePoint using Azure AI Search, and what are the limitations?

PS - Based on common issues that we have seen from customers and other sources, we are posting these questions to help the Azure community.

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

1 answer

  1. SwathiDhanwada-MSFT 18,996 Reputation points Moderator
    2024-08-01T07:14:28.72+00:00

    In Azure AI Search, a multi-page PDF document stored in SharePoint is indexed as a single search document. If a match is found in any searchable field or subfield of that document, the entire document is returned as a result. However, Azure AI Search does not directly support searches that target only specific subfields of complex types.

    To address this, your application can parse the search response and display only the relevant subfields to the user. You can also restructure your indexed documents. One approach is to index each page of the PDF file as a separate document in Azure AI Search, which simplifies searching because a match then points to a single page. For example, you can structure your documents like this:

    
    {
      "document": [
        { "pagenumber": 1, "content": "..." },
        { "pagenumber": 2, "content": "hello" }
      ]
    }
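
    The two suggestions above (filtering the response in your application, and indexing one document per page) could be sketched in Python roughly as follows. This is only an illustration in plain Python, not Azure SDK code: the field names `id` and `parent_id`, the key format, and both function names are assumptions made for the example, and the page text is assumed to have been extracted from the PDF already.

```python
from typing import Any, Dict, List


def pages_to_documents(parent_id: str, pages: List[str]) -> List[Dict[str, Any]]:
    """Turn the extracted text of each page into its own index document."""
    documents = []
    for number, text in enumerate(pages, start=1):
        documents.append({
            "id": f"{parent_id}-p{number}",  # hypothetical key format
            "parent_id": parent_id,          # hypothetical field linking back to the PDF
            "pagenumber": number,
            "content": text,
        })
    return documents


def matching_pages(result: Dict[str, Any], term: str) -> List[Dict[str, Any]]:
    """Keep only the pages of a returned document whose content mentions the term."""
    needle = term.lower()
    return [
        page for page in result.get("document", [])
        if needle in str(page.get("content", "")).lower()
    ]
```

    The first function prepares per-page documents for upload, and the second is a simple case-insensitive substring check that an application might run on a whole-document result before showing it to the user. It does not reproduce the service's own text analysis.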
    
    

    Be aware that Azure AI Search limits the total number of elements across all complex collections in a single document to 3000.
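
    If a PDF has more pages than the 3000-element limit allows, one option is to split its pages across several index documents before uploading. The sketch below is a minimal illustration of that idea in plain Python. The function name, the `id` format, and the assumption that the page list is the only complex collection in the document are all hypothetical.

```python
from typing import Any, Dict, List

MAX_COMPLEX_ELEMENTS = 3000  # per-document limit mentioned in this answer


def split_pages(parent_id: str,
                pages: List[Dict[str, Any]],
                limit: int = MAX_COMPLEX_ELEMENTS) -> List[Dict[str, Any]]:
    """Split a long page list into several documents, each within the limit."""
    parts = []
    for start in range(0, len(pages), limit):
        parts.append({
            "id": f"{parent_id}-part{start // limit + 1}",  # hypothetical key format
            "document": pages[start:start + limit],
        })
    return parts
```

    With this approach a 7000-page file would produce three index documents. If the pages themselves contained nested complex collections, those elements would also count toward the limit and the chunk size would need to be smaller.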


    Please do not forget to "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.

