Hello Arun Srinivasan (Cognizant)
By default, Azure Cognitive Search indexes only the text of a PDF: embedded images are ignored, and tables come through as flattened text without their row and column structure. However, there are a few things you can do to index this content.
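- Use the Document Extraction skill. Within a skillset, the Document Extraction skill can extract text and images from PDFs. The extracted text can be indexed directly; to make the images searchable, pass them to the OCR skill (or another image skill) so that the text they contain can be indexed by Azure Cognitive Search.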
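- Use the Text Merge skill. The Text Merge skill does not extract anything itself: it merges text recognized in images (for example, the output of the OCR skill) back into the document's main text. The resulting single merged field can then be indexed by Azure Cognitive Search.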
- Use an external extraction tool. A number of tools outside Azure Cognitive Search can extract text and images from PDFs. Their output can then be indexed by Azure Cognitive Search.
Here are some of the tools that you can use to extract text and images from PDFs:
- Google Cloud Vision API
- Microsoft Azure Form Recognizer (a Microsoft service rather than a third-party one; it can also extract table structure)
- Amazon Textract
- ABBYY FineReader
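For the built-in skills mentioned above, here is a rough sketch of how a skillset definition might chain Document Extraction, OCR, and Text Merge. It only builds the JSON body as a Python dict and sends nothing. The skillset name, the targetName values (extracted_content, extracted_images, merged_text), and the enrichment paths are illustrative assumptions, so please check them against the skill reference pages before using them. The indexer would also need to be configured to pass file data to the skillset.

```python
import json


def build_skillset(name: str) -> dict:
    """Return a skillset definition (REST API JSON shape) as a plain dict."""
    images = "/document/extracted_images"  # hypothetical targetName, see below
    return {
        "name": name,
        "description": "Extract PDF text and images, OCR the images, merge the text",
        "skills": [
            {
                # Cracks the PDF into text plus normalized images.
                "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
                "context": "/document",
                "parsingMode": "default",
                "dataToExtract": "contentAndMetadata",
                "configuration": {"imageAction": "generateNormalizedImages"},
                "inputs": [{"name": "file_data", "source": "/document/file_data"}],
                "outputs": [
                    {"name": "content", "targetName": "extracted_content"},
                    {"name": "normalized_images", "targetName": "extracted_images"},
                ],
            },
            {
                # Runs OCR once per extracted image.
                "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
                "context": f"{images}/*",
                "inputs": [{"name": "image", "source": f"{images}/*"}],
                "outputs": [{"name": "text", "targetName": "text"}],
            },
            {
                # Inserts the OCR text into the document text at each image offset.
                "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
                "context": "/document",
                "inputs": [
                    {"name": "text", "source": "/document/extracted_content"},
                    {"name": "itemsToInsert", "source": f"{images}/*/text"},
                    {"name": "offsets", "source": f"{images}/*/contentOffset"},
                ],
                "outputs": [{"name": "mergedText", "targetName": "merged_text"}],
            },
        ],
    }


skillset = build_skillset("pdf-skillset")  # hypothetical skillset name
print(json.dumps(skillset, indent=2))
```

The merged_text output is the field you would typically map to the searchable text field of your index.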
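Once you have extracted the text and images from your PDFs, you can index them using Azure Cognitive Search. To do this, create a new index with fields along the following lines. Every index needs exactly one key field; the other fields depend on what you want to search or return:
- Document ID (required; this is the key field and must be a string)
- Text (required for full-text search)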
- Image (optional)
- Table (optional)
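As a minimal sketch, the field list above could translate into an index definition like the following. The field names (id, text, image, table), the choice of Edm.String for every field, and the index name are assumptions for illustration; in particular, how you store an image (for example as a URL or path to the extracted file) is up to you.

```python
import json


def build_index(name: str) -> dict:
    """Return an index definition (REST API JSON shape) as a plain dict."""
    return {
        "name": name,
        "fields": [
            # Exactly one field must be marked as the key.
            {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
            # Main document text, including any merged OCR text.
            {"name": "text", "type": "Edm.String", "searchable": True},
            # Assumption: a reference to the extracted image, returned but not searched.
            {"name": "image", "type": "Edm.String", "searchable": False, "retrievable": True},
            # Assumption: table content serialized to text by your extraction tool.
            {"name": "table", "type": "Edm.String", "searchable": True},
        ],
    }


def key_field(index: dict) -> str:
    """Return the name of the key field, or raise if there is not exactly one."""
    keys = [f["name"] for f in index["fields"] if f.get("key")]
    if len(keys) != 1:
        raise ValueError(f"expected exactly one key field, found {len(keys)}")
    return keys[0]


index = build_index("pdf-index")  # hypothetical index name
print(key_field(index))
print(json.dumps(index, indent=2))
```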
Once you have added these fields to the index, you can start indexing your PDFs. You can either let an indexer with a skillset populate the index, or push the extracted content yourself through the Azure Cognitive Search REST API or one of the SDKs.
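If you push the extracted content yourself, the request could look roughly like this. The sketch only builds the HTTP request and does not send it. The service URL, index name, API key placeholder, api-version value, and document field names are all assumptions that you would replace with your own.

```python
import json
import urllib.request

API_VERSION = "2023-11-01"  # assumption: use the API version your service supports


def build_upload_request(endpoint, index_name, api_key, documents):
    """Build (but do not send) a request that uploads documents to an index."""
    url = (
        f"{endpoint.rstrip('/')}/indexes/{index_name}"
        f"/docs/index?api-version={API_VERSION}"
    )
    # Each document carries an action; mergeOrUpload inserts new documents
    # and updates existing ones that have the same key.
    body = {
        "value": [{"@search.action": "mergeOrUpload", **doc} for doc in documents]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


request = build_upload_request(
    "https://my-service.search.windows.net",  # hypothetical service URL
    "pdf-index",                              # hypothetical index name
    "<admin-api-key>",                        # placeholder, not a real key
    [{"id": "doc-1", "text": "Text extracted from the PDF", "table": "Table text"}],
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```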
I hope this helps! Let me know if you have any other questions.