How to Extract Paragraph Across Multiple PDF files with Azure Cognitive Search

Question

albertoliu1993 0

Hi Experts, I'm using Azure Blob Storage as the data source for my Azure Cognitive Search service, and the blobs are all PDF files. I split each PDF file by page, store each page as a standalone PDF file, and then upload the pages to the Azure Blob Storage container. Because of the structure of the original PDF files, some paragraphs span multiple pages, so after splitting, the content of such a paragraph is spread across multiple PDF files. Is it possible for Azure Cognitive Search to find the most relevant content across multiple PDF files, combine it, and return it as the search results?

SnehaAgrawal-MSFT 22,706 Reputation points Moderator

2023-04-12T06:03:00.4066667+00:00

@albertoliu1993 Just checking whether you have had a chance to see the recent response.

Please let us know if you have any further questions or issues.

Please accept as "Yes" if the answer provided is useful, so that you can help others in the community looking for remediation for similar issues.

2 answers

Answer 1

Elgin Tarot Resolutions 0

I’m sorry to hear you’re having trouble. Yes, it is possible for Azure Cognitive Search to return the most relevant content across multiple PDF files. You can use Azure Blob Storage as the data source for Azure Cognitive Search and upload your PDF files to the blob container. You can also use the Document Extraction skill to extract content from a file within the enrichment pipeline.
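One way to handle the combining step is on the client, after the query returns. With one blob per page, each page is indexed as its own search document, so hits come back per page. The sketch below is only an illustration under stated assumptions: the blob naming scheme `<original name>_page<N>.pdf`, the `name` and `content` fields on each hit, and the helper `combine_page_hits` are all hypothetical and not part of any Azure API.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) blob naming scheme: "<original name>_page<N>.pdf"
PAGE_NAME = re.compile(r"^(?P<doc>.+)_page(?P<page>\d+)\.pdf$")


def combine_page_hits(hits):
    """Group per-page search hits by original document and join them in page order."""
    pages_by_doc = defaultdict(list)
    for hit in hits:
        match = PAGE_NAME.match(hit["name"])
        if match is None:
            continue  # skip blobs that do not follow the assumed naming scheme
        pages_by_doc[match["doc"]].append((int(match["page"]), hit["content"]))
    return {
        doc: " ".join(text for _, text in sorted(pages, key=lambda p: p[0]))
        for doc, pages in pages_by_doc.items()
    }


# Hits as they might come back from a query, in relevance order rather than page order.
hits = [
    {"name": "report_page3.pdf", "content": "continues on the next page."},
    {"name": "report_page2.pdf", "content": "A paragraph that"},
]
combined = combine_page_hits(hits)
print(combined["report"])
```

Encoding the original document name and page number in the blob name (or in blob metadata) at split time is what makes this regrouping possible.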

Answer 2

@albertoliu1993 Thanks for reaching out! Yes, it's possible with Azure Cognitive Search. Azure Cognitive Search provides a skill called Document Extraction that extracts content from a file within the enrichment pipeline. This lets you apply the document extraction step, which normally happens before skillset execution, to files that may be generated by other skills. However, to extract paragraphs across multiple PDF files, you would need to define a skillset that includes the Document Extraction skill together with other skills that help you achieve your goal. The Azure Cognitive Search blob indexer can extract all text from PDF text elements, as well as text from the other supported document formats listed in this document.
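For illustration, a skillset definition containing the Document Extraction skill might look roughly like the sketch below, written as a Python dictionary that serializes to the JSON body of a skillset. The skillset name and description are hypothetical, and this shows only the extraction skill, not the additional skills needed to stitch paragraphs together. Reading `/document/file_data` also assumes the indexer is configured to pass file data to the skillset.

```python
import json

# Sketch of a skillset body; "pdf-pages-skillset" is a hypothetical name.
skillset = {
    "name": "pdf-pages-skillset",
    "description": "Extract text from per-page PDF blobs",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
            "context": "/document",
            "parsingMode": "default",
            "dataToExtract": "contentAndMetadata",
            # The skill reads the raw file and emits its text content.
            "inputs": [{"name": "file_data", "source": "/document/file_data"}],
            "outputs": [{"name": "content", "targetName": "extracted_content"}],
        }
    ],
}

skillset_json = json.dumps(skillset, indent=2)
print(skillset_json)
```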

Also, if you need to extract text from embedded images, you can use the OCR cognitive skill. See: https://learn.microsoft.com/azure/search/cognitive-search-concept-intro I suggest referring to the docs below, which show how to configure an Azure blob indexer to extract content and make it searchable in Azure Cognitive Search.
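As a rough sketch of how these pieces fit together, the OCR skill and a blob indexer that generates normalized images for it could be defined as below. The indexer, data source, index, and skillset names are hypothetical placeholders, and the language code is an assumption; adjust them to your own resources.

```python
import json

# OCR skill: runs once per image that the indexer extracts from each PDF.
ocr_skill = {
    "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
    "context": "/document/normalized_images/*",
    "defaultLanguageCode": "en",  # assumption: English documents
    "detectOrientation": True,
    "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
    "outputs": [{"name": "text", "targetName": "text"}],
}

# Blob indexer: all four names below are hypothetical placeholders.
indexer = {
    "name": "pdf-pages-indexer",
    "dataSourceName": "pdf-pages-datasource",
    "targetIndexName": "pdf-pages-index",
    "skillsetName": "pdf-pages-skillset",
    "parameters": {
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            # Needed so embedded images are available to the OCR skill.
            "imageAction": "generateNormalizedImages",
        }
    },
}

indexer_json = json.dumps(indexer, indent=2)
print(indexer_json)
```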

Please let us know if you have any further questions or issues; we are happy to help.
