Azure Cognitive Search + OpenAI + Image Content

Arun Srinivasan 70 Reputation points
2023-07-12T07:38:20.6433333+00:00

I'm using Azure Cognitive Search with a blob data source to integrate with OpenAI. It works fine when I upload PDFs, create a search index, and refer to it in the code. But if a PDF contains images (say, a graphic with text on it) or table content stored as an image, the search index doesn't seem to pick that content up.

Is there something I'm missing, or what is the best solution for this?

Tags: Azure AI Search, Azure Blob Storage, Azure OpenAI Service, Azure AI Document Intelligence, Azure AI services

Accepted answer
  1. Tech-Hyd-1989 5,761 Reputation points
    2023-07-12T08:35:51.6933333+00:00

    Hello Arun Srinivasan (Cognizant)

    Azure Cognitive Search does not currently index text inside images, or tables stored as images, in PDFs by default. However, there are a few things you can do to index this content.

    • Use the Document Extraction skill. The Document Extraction skill can extract text and images from PDFs. Text inside the extracted images still has to be recognized, for example with the OCR skill, before Azure Cognitive Search can index it.
    • Use the Text Merge skill. The Text Merge skill can merge the text recognized in images back into the document's main text at the position where each image appeared. This merged content can then be indexed by Azure Cognitive Search.
    • Use an external tool. A number of tools outside Azure Cognitive Search can extract text and images from PDFs. The extracted text can then be indexed by Azure Cognitive Search.

    Here are some of the tools that you can use to extract text and images from PDFs:

    • Google Cloud Vision API
    • Azure Form Recognizer (a Microsoft service, now called Azure AI Document Intelligence)
    • Amazon Textract
    • ABBYY FineReader

    Once you have extracted the text and images from your PDFs, you can index them using Azure Cognitive Search. To do this, you will need to create a new index and add the following fields to the index:

    • Document ID (required)
    • Text (required)
    • Image (optional)
    • Table (optional)

    Once you have added these fields to the index, you can start indexing your PDFs. To do this, you will need to use the Azure Cognitive Search indexing API.
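
    As a rough illustration of the fields listed above, the following sketch builds an index definition and a document upload batch as plain Python dictionaries. It does not call the service. The field names (id, text, image_text, table_text) are assumptions, and it assumes the Image and Table fields hold text extracted from images and tables rather than binary data.

```python
import json
from typing import Dict, List


def build_index_definition(index_name: str) -> dict:
    """Build an index payload with a key field and searchable text fields."""
    return {
        "name": index_name,
        "fields": [
            # Document ID: every index needs exactly one key field.
            {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
            {"name": "text", "type": "Edm.String", "searchable": True},
            # Optional fields for text recovered from images and tables.
            {"name": "image_text", "type": "Edm.String", "searchable": True},
            {"name": "table_text", "type": "Edm.String", "searchable": True},
        ],
    }


def build_upload_batch(docs: List[Dict[str, str]]) -> dict:
    """Wrap extracted documents in the batch shape used to push documents."""
    actions = []
    for doc in docs:
        if not doc.get("id"):
            raise ValueError("every document needs a non-empty 'id'")
        # mergeOrUpload updates an existing document or creates a new one.
        actions.append({"@search.action": "mergeOrUpload", **doc})
    return {"value": actions}


if __name__ == "__main__":
    batch = build_upload_batch(
        [{"id": "report-1", "text": "Body text", "image_text": "Text read from a chart"}]
    )
    print(json.dumps(build_index_definition("pdf-index"), indent=2))
    print(json.dumps(batch, indent=2))
```

    Keeping image and table text in separate fields is one possible design. It lets you weight or filter those fields independently, at the cost of losing the position of the text within the document.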

    I hope this helps! Let me know if you have any other questions.


2 additional answers

  1. brtrach-MSFT 15,531 Reputation points Microsoft Employee
    2023-07-13T03:15:17.52+00:00

    @Arun Srinivasan I see that there is already an answer provided but I wanted to provide an answer that highlights the technology behind all of this and the limitations that you might be hitting.

    It's important to note that Azure Cognitive Search relies on OCR (Optical Character Recognition) to extract text from images, including images embedded in PDFs, and that OCR only runs when it is enabled through a skillset. OCR is a technology that recognizes text within images and converts it into machine-readable text. However, OCR has some limitations, and it may not be able to recognize all types of images or tables in PDFs.

    According to the Azure Cognitive Search documentation, the OCR feature can recognize the following types of images:

    • Text in images
    • Handwriting
    • Printed text in low-quality images
    • Text in multiple languages

    However, the OCR feature may not be able to recognize the following types of images:

    • Images with low resolution or poor quality
    • Images with complex layouts or backgrounds
    • Images with non-standard fonts or symbols
    • Tables or charts in PDFs

    It's possible that the OCR feature of Azure Cognitive Search is not able to recognize the images or tables in your PDFs, which is why they are not being indexed.

    To address this issue, you can try the following solutions:

    1. Use a third-party OCR tool: You can use a third-party OCR tool to extract text from images and tables in your PDFs and then upload the extracted text to Azure Cognitive Search. Several OCR tools are available, such as Tesseract or ABBYY.

    2. Convert images and tables to text: You can convert the images and tables in your PDFs to text using tools such as Adobe Acrobat or Microsoft Word and then upload the text to Azure Cognitive Search.

    3. Use a custom skill: You can create a custom skill in Azure Cognitive Search that calls a machine learning model to recognize images and tables in your PDFs and extract the text. Because a custom skill runs inside the indexer's enrichment pipeline, the extracted text can be mapped directly to an index field.
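
    To make the custom skill option more concrete, here is a minimal sketch of the request handling behind a custom Web API skill. The "values" / "recordId" / "data" envelope is the shape the indexer sends and expects back. Everything else is an assumption for illustration: the input name image, the output name text, and the extract_text callable, which stands in for whatever OCR or table model you choose. Hosting the handler (for example, in a web function) is left out.

```python
import base64
from typing import Any, Callable, Dict


def handle_skill_request(
    payload: Dict[str, Any],
    extract_text: Callable[[bytes], str],
) -> Dict[str, Any]:
    """Process one custom-skill batch and return the response envelope.

    `extract_text` is a hypothetical callable that turns raw image bytes
    into text, for example a wrapper around an OCR library.
    """
    results = []
    for record in payload.get("values", []):
        output = {
            # The recordId must be echoed back so results can be matched up.
            "recordId": record.get("recordId"),
            "data": {},
            "errors": [],
            "warnings": [],
        }
        try:
            # Assumes the skill input named 'image' carries base64 image data.
            image_bytes = base64.b64decode(record["data"]["image"]["data"])
            output["data"]["text"] = extract_text(image_bytes)
        except Exception as exc:
            # Report the failure for this record only; other records continue.
            output["errors"].append({"message": f"Could not process record: {exc!r}"})
        results.append(output)
    return {"values": results}


if __name__ == "__main__":
    def fake_ocr(image_bytes: bytes) -> str:
        # Placeholder extractor used only to demonstrate the envelope.
        return f"{len(image_bytes)} bytes received"

    request = {
        "values": [
            {"recordId": "1", "data": {"image": {"data": base64.b64encode(b"demo").decode()}}}
        ]
    }
    print(handle_skill_request(request, fake_ocr))
```

    Passing the extractor in as a parameter keeps the envelope handling separate from the model, so the same handler can be tried with Tesseract, a hosted OCR service, or a table-recognition model.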


  2. Panda, Abhimanyu (Cognizant) 0 Reputation points
    2024-05-23T09:41:24.1533333+00:00

    We're building a tool to extract information from PDFs, including text, tables, and images. We've been trying Azure Document Intelligence and Cloud Vision API, but we're facing some challenges. Is there any news or updates from Microsoft on these services or other solutions that could help us?
