How to Extract Paragraphs That Span Multiple PDF Files with Azure Cognitive Search

albertoliu1993 0 Reputation points
2023-04-09T19:22:23.2633333+00:00

Hi Experts, I'm using Azure Blob Storage as the data source for my Azure Cognitive Search service, and the blobs are all PDF files. I split each PDF file by page, store each page as a standalone PDF file, and upload the pages to the Azure Blob Storage container. Because of the structure of the original PDF files, some paragraphs span multiple pages, so after splitting, the content of such a paragraph is spread across multiple PDF files. Is it possible for Azure Cognitive Search to find the most relevant content across multiple PDF files, combine it, and return the combined content as the search result?

Azure AI Search

2 answers

  1. Elgin Tarot Resolutions 0 Reputation points
    2023-04-09T23:02:25.7466667+00:00

    I’m sorry to hear you’re having trouble. Yes, it is possible for Azure Cognitive Search to return the most relevant contents across multiple PDF files. You can use Azure Blob Storage as the data source of your Azure Cognitive Search and upload your PDF files to the Azure Blob Storage container. You can also use the Document Extraction skill to extract content from a file within the enrichment pipeline.


  2. SnehaAgrawal-MSFT 18,871 Reputation points
    2023-04-11T09:09:27.73+00:00

    @albertoliu1993 Thanks for reaching out! Yes, it's possible with Azure Cognitive Search. Azure Cognitive Search provides a skill called Document Extraction that extracts content from a file within the enrichment pipeline. This lets you apply the document extraction step, which normally happens before skillset execution, to files that may be generated by other skills. However, to extract paragraphs that span multiple PDF files, you would need to define a skillset that includes the Document Extraction skill along with other skills that help you achieve your goal. The Azure Cognitive Search blob indexer can extract all text from PDF text elements, as well as text from the other supported document formats listed in the documentation.

    Also, if you need to extract text from embedded images, you can use the OCR cognitive skill. See: https://learn.microsoft.com/azure/search/cognitive-search-concept-intro I also suggest referring to the docs that show how to configure an Azure blob indexer to extract content and make it searchable in Azure Cognitive Search.

    Please let us know if you have any further questions or issues; we're happy to help.
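
Neither answer shows how the per-page results might be combined, so here is a minimal client-side sketch of one possible approach, not a built-in Azure Cognitive Search feature. It assumes each indexed page carries two hypothetical fields, `source_file` (the original PDF's name) and `page_number`, for example written as blob metadata when the pages are split, and that each search hit has been mapped to a plain dict with `content` and `score` keys (the latter taken from the `@search.score` value in the search response). The function and field names are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List


def _combine(source: str, run: List[dict]) -> dict:
    """Join a run of consecutive pages into a single result."""
    return {
        "source_file": source,
        "pages": [p["page_number"] for p in run],
        "content": " ".join(p["content"].strip() for p in run),
        # Keep the best score in the run so ranking still reflects relevance.
        "score": max(p["score"] for p in run),
    }


def stitch_adjacent_pages(hits: List[dict]) -> List[dict]:
    """Merge hits that come from consecutive pages of the same source PDF.

    Each hit is a dict with the hypothetical keys
    'source_file', 'page_number', 'content', and 'score'.
    """
    if not hits:
        return []

    # Group hits by the original (unsplit) document.
    by_source: Dict[str, List[dict]] = defaultdict(list)
    for hit in hits:
        by_source[hit["source_file"]].append(hit)

    merged: List[dict] = []
    for source, pages in by_source.items():
        pages.sort(key=lambda h: h["page_number"])
        run = [pages[0]]
        for page in pages[1:]:
            if page["page_number"] == run[-1]["page_number"] + 1:
                run.append(page)  # consecutive page: extend the current run
            else:
                merged.append(_combine(source, run))
                run = [page]
        merged.append(_combine(source, run))

    # Highest-scoring combined results first.
    merged.sort(key=lambda m: m["score"], reverse=True)
    return merged
```

With this approach, a paragraph that starts on page 3 and ends on page 4 of the same source PDF comes back as one combined result, provided both pages appear in the hit list. Pages that do not match the query on their own would still be missing, so fetching the neighbouring pages by `source_file` and `page_number` before stitching may be needed in practice.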
