Azure AI Search returning full PDF instead of relevant answer

Question

Sushant Shelake 5

I have connected Azure AI Search to a Blob storage container that holds a PDF document that needs to be crawled. However, when I ask a question in the search, it returns the entire PDF instead of the relevant answers. I need help figuring out how to crawl the PDF document properly.

Grmacjon-MSFT 19,301 Reputation points Moderator

2024-03-26T03:22:06.95+00:00

@Sushant Shelake, can you please share how you connected your AI Search service with your Blob storage? Are you using the Document Extraction skill to extract the content from your PDF?

Sushant Shelake 5 Reputation points

2024-03-26T07:12:56.1233333+00:00

Can you tell me the ideal way to do this?

Sushant Shelake 5 Reputation points

2024-03-26T07:14:58.19+00:00

No, I am not using that skill. I just added the PDF to Blob storage and used "Import data" in AI Search to create the index.
I am new to this, so any suggestions would be helpful.

Answer 1

Grmacjon-MSFT 19,301 Moderator

Hi @Sushant Shelake, apologies for the delayed response.

Here's how to properly crawl your PDF document and retrieve relevant answers:

1. Enable Text Extraction with Blob Indexer:

Since you've already imported your data, the next step is to activate the "Enable Text Extraction" option. This instructs the indexer to extract text content from the PDF using Azure Cognitive Services (specifically Text Analytics).

2. Analyze Text Extraction Settings:

In your Blob indexer configuration, review the "Text Extraction" settings. You can specify custom skills or cognitive services for handling specific file formats like PDF.
By default, Azure AI Search uses a pre-built skill for text extraction. If your PDFs require advanced processing (e.g., handling complex layouts or tables), consider creating a custom skill using Cognitive Services Text Analytics for more granular control.

3. Search with Relevant Fields:

When formulating your search query, target specific fields extracted from the PDF document. These fields might include extracted text content, metadata, or custom properties defined during indexing.
For example, instead of searching the entire document, search for keywords within the extracted text content field: content:"your search term"
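As a rough illustration of step 3, the sketch below builds a request body for the Azure AI Search REST query endpoint (a POST to the index's `docs/search` URL). It assumes the index has a searchable `content` field and a retrievable `metadata_storage_name` field, which are typical when an index is created through "Import data" but are not confirmed in this thread. The helper name `build_search_request` is hypothetical.

```python
import json


def build_search_request(query: str, top: int = 3) -> dict:
    """Build a JSON body for a search request (field names are assumptions)."""
    return {
        "search": query,
        # Match only against the extracted-text field.
        "searchFields": "content",
        # Return the file name instead of the full document text.
        "select": "metadata_storage_name",
        # Ask the service for matching snippets from the content field.
        "highlight": "content",
        # Limit the number of results.
        "top": top,
    }


body = build_search_request("your search term")
print(json.dumps(body, indent=2))
```

With `highlight` set, each result should carry an `@search.highlights` entry containing short matching fragments, which is usually closer to a "relevant answer" than the full `content` value. It still returns snippets from a whole-document record rather than a dedicated passage.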

Hope that helps.

-Grace

Nguyen Thanh Binh 0

Hi @Grmacjon-MSFT, here is my skillset configuration JSON:

{
  "name": "my-skillset",
  "description": "Skillset created from the portal;",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
      "name": "#1",
      "description": null,
      "context": "/document/content",
      "defaultLanguageCode": "en",
      "maxKeyPhraseCount": null,
      "modelVersion": null,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "keyPhrases",
          "targetName": "keyphrases"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
      "name": "#2",
      "description": null,
      "context": "/document",
      "parsingMode": "default",
      "dataToExtract": "contentAndMetadata",
      "inputs": [
        {
          "name": "file_data",
          "source": "/document/file_data"
        }
      ],
      "outputs": [
        {
          "name": "content",
          "targetName": "extracted_content"
        }
      ],
      "configuration": {}
    }
  ],
  "cognitiveServices": null,
  "knowledgeStore": null,
  "indexProjections": null,
  "encryptionKey": null
}

However, after my indexer ran successfully with the new skillset configuration (yes, I reset and re-ran it), the extracted_content field always returns null, while the content field always returns the full text of the PDF document.
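One possible cause, not confirmed anywhere in this thread: the Document Extraction skill reads `/document/file_data`, and the indexer only populates that node when `allowSkillsetToReadFileData` is enabled in its configuration. The enriched output also needs an output field mapping to reach an index field. The sketch below builds such an indexer definition as a Python dictionary. The helper name and the resource names passed to it are hypothetical, and it assumes the index already has an `extracted_content` field.

```python
import json


def build_indexer_definition(name: str, data_source: str, index: str, skillset: str) -> dict:
    """Build an indexer definition that exposes file_data to the skillset."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "parameters": {
            "configuration": {
                "dataToExtract": "contentAndMetadata",
                # Without this flag, /document/file_data is not available to skills.
                "allowSkillsetToReadFileData": True,
            }
        },
        # Route the skill output node into an index field of the same name.
        "outputFieldMappings": [
            {
                "sourceFieldName": "/document/extracted_content",
                "targetFieldName": "extracted_content",
            }
        ],
    }


# All four names below are placeholders, not values from the thread.
indexer = build_indexer_definition("my-indexer", "my-datasource", "my-index", "my-skillset")
print(json.dumps(indexer, indent=2))
```

Even with this in place, `extracted_content` would hold the same full text as `content`, because the blob indexer already extracts text from PDFs on its own. It would explain the null value but not shorten the results.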

What am I doing wrong here? Is there any way to get only the small section of a large PDF that is relevant to the query?
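A common way to get smaller, query-relevant sections is to split the extracted text into chunks during indexing. The sketch below appends a Text Split skill to a skillset shaped like the one above. The skill name `#3`, the target name `pages`, the 2000-character chunk size, and the helper `add_split_skill` are illustrative assumptions, not settings taken from this thread.

```python
import copy
import json


def add_split_skill(skillset: dict, max_length: int = 2000) -> dict:
    """Return a copy of the skillset with a Text Split skill appended."""
    split_skill = {
        "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
        "name": "#3",  # hypothetical name, following the "#1"/"#2" pattern
        "context": "/document",
        "textSplitMode": "pages",
        "maximumPageLength": max_length,  # characters per chunk (assumed value)
        "defaultLanguageCode": "en",
        "inputs": [{"name": "text", "source": "/document/content"}],
        # Chunks are written to /document/pages as an array of strings.
        "outputs": [{"name": "textItems", "targetName": "pages"}],
    }
    updated = copy.deepcopy(skillset)  # leave the caller's dictionary untouched
    updated["skills"].append(split_skill)
    return updated


# Minimal stand-in for the skillset posted above.
original = {"name": "my-skillset", "skills": []}
updated = add_split_skill(original)
print(json.dumps(updated, indent=2))
```

The split alone only produces an array of chunks on the parent document. For each chunk to be returned as its own search result, the chunks still need to be mapped to separate search documents, for example through the `indexProjections` property that is currently null in the skillset. That mapping is not shown here.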

Grmacjon-MSFT 19,301 Reputation points Moderator

2024-05-16T16:16:32.1866667+00:00

Thanks for the additional info. I will look into this and get back to you.

Nguyen Thanh Binh 0 Reputation points

2024-05-31T09:40:38.42+00:00

@Grmacjon-MSFT Hi, any update?

Nguyen Thanh Binh 0 Reputation points

2024-05-31T11:31:21.7566667+00:00

@Grmacjon-MSFT Hi! Is there any update?
