Azure Cognitive Search + Open AI + Image Content

Question

I'm using Azure Cognitive Search with a blob data source to integrate with OpenAI. It works fine when I upload PDFs, create a search index, and refer to it in the code. But if a PDF contains images (say, a graphic with text on it) or table content (as an image), the search index doesn't seem to pick it up.

Is there something I'm missing, or what is the best solution for this?

Accepted Answer

Hello Arun Srinivasan (Cognizant)

By default, Azure Cognitive Search indexes only the text layer of a PDF; text that exists only inside images (including tables saved as images) is not indexed. However, there are a few things you can do to index this content.

Use the Document Extraction skill. The Document Extraction skill can extract text and images from PDFs. The extracted images still need an OCR step (for example, the built-in OCR skill) before any text in them can be indexed by Azure Cognitive Search.
Use the Text Merge skill. The Text Merge skill can merge the document's text with the text recognized in its extracted images, placing each image's text where the image appeared. This merged content can then be indexed by Azure Cognitive Search.
Use a separate extraction tool. A number of tools, from Microsoft and from third parties, can extract text and images from PDFs. This text can then be indexed by Azure Cognitive Search.
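
The skill-based options above can be sketched as a skillset plus an indexer definition, written here as Python dictionaries that mirror the REST request bodies. This is a minimal sketch, not a complete setup: it relies on the indexer's own image extraction (`imageAction`) rather than a separate Document Extraction skill, the names `merged_text` and `merged_content` are assumptions, and it leaves out the Azure AI services resource that a skillset may need to have attached.

```python
import json


def build_ocr_skillset(name: str) -> dict:
    """Skillset body: OCR each extracted image, then merge the text into the content."""
    ocr_skill = {
        "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
        "context": "/document/normalized_images/*",
        "defaultLanguageCode": "en",
        "detectOrientation": True,
        "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
        "outputs": [{"name": "text", "targetName": "text"}],
    }
    merge_skill = {
        "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
        "context": "/document",
        "insertPreTag": " ",
        "insertPostTag": " ",
        "inputs": [
            {"name": "text", "source": "/document/content"},
            {"name": "itemsToInsert", "source": "/document/normalized_images/*/text"},
            {"name": "offsets", "source": "/document/normalized_images/*/contentOffset"},
        ],
        # "merged_text" is an assumed output name, reused by the indexer below.
        "outputs": [{"name": "mergedText", "targetName": "merged_text"}],
    }
    return {"name": name, "skills": [ocr_skill, merge_skill]}


def build_indexer(name: str, data_source: str, index: str, skillset: str) -> dict:
    """Indexer body: turn on image extraction and map the merged text to an index field."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "parameters": {
            "configuration": {
                "dataToExtract": "contentAndMetadata",
                # Without this setting the indexer does not hand images to the skillset.
                "imageAction": "generateNormalizedImages",
            }
        },
        "outputFieldMappings": [
            # "merged_content" is an assumed field name that must exist in the index.
            {"sourceFieldName": "/document/merged_text", "targetFieldName": "merged_content"}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_ocr_skillset("pdf-ocr-skillset"), indent=2))
    print(json.dumps(build_indexer("pdf-indexer", "blob-ds", "pdf-index", "pdf-ocr-skillset"), indent=2))
```

The two bodies would be sent to the service's skillset and indexer endpoints. The key detail is `imageAction`: if it is left at its default, the OCR skill receives no images.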

Here are some of the tools that you can use to extract text and images from PDFs:

Google Cloud Vision API
Azure Form Recognizer (a Microsoft service, since renamed Azure AI Document Intelligence)
Amazon Textract
ABBYY FineReader

Once you have extracted the text and images from your PDFs, you can index them using Azure Cognitive Search. To do this, you will need to create a new index and add the following fields to the index:

Document ID (required)
Text (required)
Image (optional)
Table (optional)
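
As a rough illustration of the field list above, an index definition could look like the following. This is a sketch under assumptions: the field names (`id`, `content`, `image_text`, `table_text`) are hypothetical, and the image and table fields hold the text recognized in them, since the index stores searchable text rather than the images themselves.

```python
def build_index(name: str) -> dict:
    """Index body with one key field and searchable text fields (names are hypothetical)."""
    return {
        "name": name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
            # Optional fields: text recognized in images and in tables.
            {"name": "image_text", "type": "Edm.String", "searchable": True},
            {"name": "table_text", "type": "Edm.String", "searchable": True},
        ],
    }


def key_field(index: dict) -> str:
    """Return the name of the key field, checking that exactly one is defined."""
    keys = [f["name"] for f in index["fields"] if f.get("key")]
    if len(keys) != 1:
        raise ValueError("an index needs exactly one key field")
    return keys[0]


if __name__ == "__main__":
    index = build_index("pdf-index")
    print(key_field(index), [f["name"] for f in index["fields"]])
```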

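Once you have added these fields to the index, you can start indexing your PDFs. To do this, push the extracted content to the index with the Azure Cognitive Search document indexing API (the Add, Update or Delete Documents REST operation, or the equivalent SDK call).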
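
A minimal sketch of that push step, using only the Python standard library, might look like this. The service name, index name, field names, and the `2023-11-01` API version are assumptions to adjust for your own service; the request is built here but not sent.

```python
import base64
import json
import urllib.request

API_VERSION = "2023-11-01"  # assumption: use a version your service supports


def make_key(blob_path: str) -> str:
    """Encode a blob path into a string that is safe to use as a document key."""
    return base64.urlsafe_b64encode(blob_path.encode("utf-8")).decode("ascii")


def build_batch(documents: list) -> dict:
    """Wrap documents in an indexing batch that inserts or updates each one."""
    return {"value": [{"@search.action": "mergeOrUpload", **doc} for doc in documents]}


def build_request(service: str, index: str, api_key: str, batch: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for the documents endpoint."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/index?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


if __name__ == "__main__":
    # Field names are hypothetical and must match the fields defined in your index.
    docs = [{
        "id": make_key("reports/q1.pdf"),
        "content": "Text layer of the PDF",
        "image_text": "Text recognized in the PDF's images",
    }]
    request = build_request("my-service", "pdf-index", "<admin-key>", build_batch(docs))
    print(request.get_method(), request.full_url)
    # To send it: urllib.request.urlopen(request)
```

The key is base64url-encoded because document keys accept only a limited character set, and blob paths usually contain slashes and dots.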

I hope this helps! Let me know if you have any other questions.

Answer

@Arun Srinivasan I see that there is already an answer provided but I wanted to provide an answer that highlights the technology behind all of this and the limitations that you might be hitting.

It's important to note that Azure Cognitive Search relies on OCR (Optical Character Recognition) to extract text from images, including images embedded in PDFs, and that OCR runs only when image extraction and an OCR skill are configured for the indexer. OCR recognizes text within images and converts it into machine-readable text. However, it has limitations and may not recognize all types of images or tables in PDFs.

According to the Azure Cognitive Search documentation, the OCR feature can recognize the following types of images:

Text in images
Handwriting
Printed text in low-quality images
Text in multiple languages

However, the OCR feature may not be able to recognize the following types of images:

Images with low resolution or poor quality
Images with complex layouts or backgrounds
Images with non-standard fonts or symbols
Tables or charts in PDFs

It's possible that the OCR feature of Azure Cognitive Search is not able to recognize the images or tables in your PDFs, which is why they are not being indexed.

To address this issue, you can try the following solutions:

Use a third-party OCR tool: You can use a third-party OCR tool to extract text from the images and tables in your PDFs and then upload the extracted text to Azure Cognitive Search. Several OCR tools are available, such as Tesseract or ABBYY.

Convert images and tables to text: You can convert the images and tables in your PDFs to text using tools such as Adobe Acrobat or Microsoft Word and then upload the text to Azure Cognitive Search.

Use a custom skill: You can create a custom skill in Azure Cognitive Search that calls a machine learning model to recognize images and tables in your PDFs and extract the text. The skill's output becomes part of the enriched document, which the indexer can then map to a field in the search index.
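
For the custom skill option, the part that tends to trip people up is the request and response envelope: the service posts a `values` array of records and expects one result per `recordId`. The sketch below handles only that envelope. The `extract_text` callable is a hypothetical stand-in for whatever model or OCR service you call, and the `text` output name is an assumption that would have to match the skill definition.

```python
from typing import Callable


def run_custom_skill(request_body: dict, extract_text: Callable[[dict], str]) -> dict:
    """Apply extract_text to every record and return a custom skill response body."""
    results = []
    for record in request_body.get("values", []):
        result = {
            "recordId": record["recordId"],  # must echo the incoming record id
            "data": {},
            "errors": [],
            "warnings": [],
        }
        try:
            # "text" is an assumed output name; it must match the skill's outputs.
            result["data"]["text"] = extract_text(record.get("data", {}))
        except Exception as exc:
            # Report the failure for this record only; other records still succeed.
            result["errors"].append({"message": str(exc)})
        results.append(result)
    return {"values": results}


if __name__ == "__main__":
    def fake_extractor(data: dict) -> str:
        # Hypothetical stand-in: a real skill would run OCR or a table model here.
        return data["caption"].upper()

    body = {"values": [
        {"recordId": "1", "data": {"caption": "quarterly revenue table"}},
        {"recordId": "2", "data": {}},
    ]}
    print(run_custom_skill(body, fake_extractor))
```

This function would sit behind an HTTP endpoint (for example, an Azure Function) that the skillset calls. Keeping the envelope handling separate from the extractor makes it easy to test without a running service.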

Answer

We're building a tool to extract information from PDFs, including text, tables, and images. We've been trying Azure Document Intelligence and Cloud Vision API, but we're facing some challenges. Is there any news or updates from Microsoft on these services or other solutions that could help us?
