I’m sorry to hear you’re having trouble. Yes, it is possible for Azure Cognitive Search to return the most relevant contents across multiple PDF files. You can use Azure Blob Storage as the data source of your Azure Cognitive Search and upload your PDF files to the Azure Blob Storage container. You can also use the Document Extraction skill to extract content from a file within the enrichment pipeline.
How to Extract Paragraph Across Multiple PDF files with Azure Cognitive Search
Hi Experts, I'm using Azure Blob Storage as the data source for my Azure Cognitive Search service, and the blobs are all PDF files. I split the PDF files by page, store each page as a standalone PDF file, and then upload the pages to the Azure Blob Storage container. Because of the structure of the original PDF files, some paragraphs span multiple pages, so after the split the content of such a paragraph is spread across multiple PDF files. Is it possible for Azure Cognitive Search to find the most relevant content across multiple PDF files, combine that content, and return it as the search results?
2 answers
SnehaAgrawal-MSFT 19,281 Reputation points
2023-04-11T09:09:27.73+00:00
@albertoliu1993 Thanks for reaching out! Yes, this is possible with Azure Cognitive Search. It provides a skill called Document Extraction that extracts content from a file within the enrichment pipeline. This lets you apply the document extraction step, which normally happens before skillset execution, to files that may be generated by other skills. However, to extract paragraphs that span multiple PDF files, you would need to define a skillset that includes the Document Extraction skill along with other skills that help you achieve your goal. To extract all text from PDF text elements, you can use the Azure Cognitive Search blob indexer, which extracts text from PDF and other document formats, listed in this document.
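To make the skillset idea more concrete, here is a minimal sketch of what a skillset definition containing a single Document Extraction skill might look like, built as a plain Python dictionary. The skillset name, the description, and the `extracted_content` target name are placeholders I made up for illustration; the skill type, `parsingMode`, `dataToExtract`, and the `file_data` input follow the Document Extraction skill reference as I understand it, so please check them against the current docs before relying on them.

```python
import json


def build_skillset(name: str) -> dict:
    """Return a skillset definition with one Document Extraction skill.

    The name, description, and targetName values are illustrative placeholders.
    """
    return {
        "name": name,
        "description": "Extract text content from per-page PDF blobs",
        "skills": [
            {
                "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
                "context": "/document",
                "parsingMode": "default",
                "dataToExtract": "contentAndMetadata",
                # The skill reads the raw file handed over by the indexer.
                "inputs": [
                    {"name": "file_data", "source": "/document/file_data"}
                ],
                # Hypothetical target name for the extracted text.
                "outputs": [
                    {"name": "content", "targetName": "extracted_content"}
                ],
            }
        ],
    }


skillset = build_skillset("pdf-page-skillset")
print(json.dumps(skillset, indent=2))
```

The resulting JSON would be the request body you send when creating the skillset through the REST API or an SDK. As far as I know, the indexer also has to pass the file through to the skillset (the `allowSkillsetToReadFileData` indexer setting) for `/document/file_data` to be populated. Note that this only extracts content per file; it does not by itself join paragraphs across files.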
Also, if you need to extract text from embedded images, you can use the OCR cognitive skill. See: https://learn.microsoft.com/azure/search/cognitive-search-concept-intro
I suggest referring to the docs that show how to configure an Azure blob indexer to extract content and make it searchable in Azure Cognitive Search.
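On the question of combining content from per-page files, one option is to merge adjacent-page hits on the client after the search returns. The sketch below is only that: client-side post-processing, not a built-in Azure Cognitive Search feature. It assumes a hypothetical blob naming scheme of `<document>_page<N>.pdf` and hit dictionaries with made-up keys `name`, `content`, and `score`; you would map your own index fields (and the search score from the response) onto these.

```python
import re
from collections import defaultdict

# Hypothetical naming scheme for the split files: "<document>_page<N>.pdf"
PAGE_NAME = re.compile(r"^(?P<doc>.+)_page(?P<page>\d+)\.pdf$")


def _join(doc, run):
    """Combine one run of consecutive pages into a single result."""
    return {
        "document": doc,
        "pages": [page for page, _ in run],
        "content": " ".join(hit["content"].strip() for _, hit in run),
        "score": max(hit["score"] for _, hit in run),
    }


def merge_adjacent_pages(hits):
    """Group hits by source document and join runs of consecutive pages.

    Returns merged results sorted by their best score, highest first.
    """
    by_doc = defaultdict(list)
    for hit in hits:
        match = PAGE_NAME.match(hit["name"])
        if match is None:
            continue  # skip blobs that do not follow the naming scheme
        by_doc[match["doc"]].append((int(match["page"]), hit))

    merged = []
    for doc, pages in by_doc.items():
        pages.sort(key=lambda item: item[0])
        run = [pages[0]]
        for current in pages[1:]:
            if current[0] == run[-1][0] + 1:
                run.append(current)  # next page of the same run
            else:
                merged.append(_join(doc, run))
                run = [current]
        merged.append(_join(doc, run))
    return sorted(merged, key=lambda m: m["score"], reverse=True)


hits = [
    {"name": "manual_page4.pdf", "content": "continues on the next page.", "score": 1.2},
    {"name": "manual_page3.pdf", "content": "The paragraph starts here and", "score": 2.5},
    {"name": "guide_page1.pdf", "content": "Unrelated hit.", "score": 0.7},
]
results = merge_adjacent_pages(hits)
print(results[0]["pages"], results[0]["content"])
```

Here the two `manual` hits are joined into one result covering pages 3 and 4, while the `guide` hit stays on its own. This only helps when both pages of a split paragraph actually appear in the result set, so you may need to request more results than you intend to display.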
Please let us know if you have any further questions or issues; we're happy to help.