Azure AI Search for PDF indexer

Question

Reddy 25

I am trying to build a PDF indexer with the Azure AI Search service. The PDF documents are uploaded from my .NET Core web application and stored in Blob Storage. My goal is to search the indexed PDFs and return only the documents that match the search text. Is there a way to return the matched PDFs themselves instead of plain text? Thanks

Reddy 25 Reputation points

2024-03-01T18:18:58.4633333+00:00

Hello @Grmacjon-MSFT, thank you for your response. The PDFs are roughly 3 MB to 5 MB each. So far I have created the AI Search service and its indexes in the Azure portal, connected to Blob Storage. Now I want to build a solution that pulls the extracted data for a given search text. I am new to this topic, so any code examples of the pull model approach would be very helpful.

Thanks

Vladimir Tsoy 0 Reputation points

2024-03-02T05:22:59.62+00:00

Hi @Grmacjon-MSFT, are the steps the same for larger PDFs of 500 MB or more?

Accepted answer

Answer 1

Hello @Reddy, what is the size of your PDF docs? Azure AI Search returns the extracted text of indexed documents; it has no built-in way to return the entire PDF document. However, you can still search on text and retrieve the original PDFs by combining Azure AI Search with Azure Blob Storage.

Here's one approach:

1. Indexing PDFs and Extracting Text:

Use Azure Blob Storage: Store your uploaded PDFs in Azure Blob Storage.
Create an Azure AI Search index: Define your search index with appropriate fields, including one for the document's blob storage URL.
Use the Azure Blob indexer: This built-in indexer extracts text content from the PDFs in the container and populates the search index. It also exposes each blob's URL as the metadata_storage_path property, which can be stored in the URL field of the index.
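
As a rough illustration of step 1, here is a minimal Python sketch (standard library only) of the JSON definitions an index and a blob indexer might use. The names pdf-index, pdf-indexer, pdf-blobs and the key field id are hypothetical, and the field list is an assumption rather than a required schema. The dictionaries would be sent as JSON bodies to the service's REST endpoints or recreated in the portal; the sketch only builds them.

```python
import json


def build_index(index_name: str) -> dict:
    """Index with a key, the extracted text, and the blob's name and URL."""
    return {
        "name": index_name,
        "fields": [
            # Hypothetical key field; filled from the encoded blob path below.
            {"name": "id", "type": "Edm.String", "key": True},
            # The blob indexer writes the text extracted from the PDF to "content".
            {"name": "content", "type": "Edm.String", "searchable": True},
            {"name": "metadata_storage_name", "type": "Edm.String", "retrievable": True},
            # Plain (un-encoded) blob URL, returned with each search hit.
            {"name": "metadata_storage_path", "type": "Edm.String", "retrievable": True},
        ],
    }


def build_indexer(indexer_name: str, data_source_name: str, index_name: str) -> dict:
    """Blob indexer limited to .pdf files, mapping the blob path twice."""
    return {
        "name": indexer_name,
        "dataSourceName": data_source_name,
        "targetIndexName": index_name,
        "parameters": {"configuration": {"indexedFileNameExtensions": ".pdf"}},
        "fieldMappings": [
            # Document keys cannot contain URL characters, so encode the path.
            {
                "sourceFieldName": "metadata_storage_path",
                "targetFieldName": "id",
                "mappingFunction": {"name": "base64Encode"},
            },
            # Assumption: keep a second, readable copy of the URL for retrieval.
            {
                "sourceFieldName": "metadata_storage_path",
                "targetFieldName": "metadata_storage_path",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_index("pdf-index"), indent=2))
    print(json.dumps(build_indexer("pdf-indexer", "pdf-blobs", "pdf-index"), indent=2))
```

Keeping a readable copy of the blob URL next to the encoded key means the application does not have to decode the key to find the PDF.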

2. Searching and Retrieving PDFs:

Search through Azure AI Search: Run your queries against the indexed text content of your PDFs.
Retrieve matching PDFs: In the search results, you'll receive the blob storage URLs for the matching PDFs.
Access and download PDFs: Use the retrieved blob storage URLs from your .NET Core application to access and download the original PDF documents using the Azure Blob Storage SDK for .NET.
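
As a rough illustration of step 2, here is a minimal Python sketch (standard library only) that queries the Search REST API and pulls the PDF URLs out of the results. It assumes the index has retrievable fields named metadata_storage_name and metadata_storage_path holding the plain blob URL; the service name, index name, and key in the usage comment are placeholders, and the helper names are hypothetical. The same request shape can be sent from any HTTP client, including one in a .NET Core application.

```python
import json
import urllib.request

API_VERSION = "2023-11-01"  # assumption: a stable REST API version


def build_search_request(service_name, index_name, api_key, search_text, top=10):
    """Build the POST request for a full-text query against the index."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexes/{index_name}/docs/search?api-version={API_VERSION}"
    )
    body = {
        "search": search_text,
        # Return only the fields needed to locate the PDF, not the full text.
        "select": "metadata_storage_name,metadata_storage_path",
        "top": top,
    }
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


def extract_pdf_links(response_body: dict) -> list:
    """Turn the search response into a list of {name, url, score} entries."""
    return [
        {
            "name": hit.get("metadata_storage_name"),
            "url": hit.get("metadata_storage_path"),
            "score": hit.get("@search.score"),
        }
        for hit in response_body.get("value", [])
    ]


def search_pdfs(service_name, index_name, api_key, search_text):
    """Run the query over the network and return the matching PDF links."""
    request = build_search_request(service_name, index_name, api_key, search_text)
    with urllib.request.urlopen(request) as response:
        return extract_pdf_links(json.load(response))


# Example call (placeholder values, needs a real service and query key):
# for pdf in search_pdfs("my-service", "pdf-index", "<query-key>", "invoice"):
#     print(pdf["name"], pdf["url"])
```

If the blob container is private, the URL alone is not enough to download the file; the application would still need to authenticate, for example through the Blob Storage SDK or a SAS token.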

Hope that helps.

-Grace

Anonymous

2024-04-26T18:09:45.0233333+00:00

@Grmacjon-MSFT thanks for your response! That makes sense to me. However, does the folder/file structure in Azure Blob Storage matter? Do I need to put all my PDFs under a single directory?
