Azure Skill set for reading the PDF file text from the blob storage

Nishanth Sekar 1 Reputation point
2024-01-18T15:24:52.7666667+00:00

I am trying to create a skillset that extracts the text from a PDF stored in Blob Storage, using the Document Extraction skill.

**Should I use the Document Extraction skill to get the PDF content, or how do I vectorise the content of the PDF file?**
Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.
Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

1 answer

  1. Grmacjon-MSFT 18,816 Reputation points
    2024-01-19T00:06:54.9166667+00:00

    Hello @Nishanth S
    The JSON you’re using is for the PdftextextractionSkill, which is a skill for extracting text from PDFs. However, you might want to use the DocumentExtractionSkill instead. It is a more general-purpose skill that can extract content from a variety of file formats, including PDF. Here’s an example of what the JSON for a skillset with a DocumentExtractionSkill might look like:

    {
      "name": "document-extraction-skillset",
      "description": "Skillset for extracting text from documents",
      "skills": [
        {
          "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
          "context": "/document",
          "inputs": [
            {
              "name": "file_data",
              "source": "/document/file_data"
            }
          ],
          "outputs": [
            {
              "name": "content",
              "targetName": "content"
            }
          ]
        }
      ]
    }
    

    In this JSON, the DocumentExtractionSkill is used to extract the content from the document. The file_data input is set to the file_data field of the document, which represents the original file data downloaded from your blob data source.

    Also, make sure to set the allowSkillsetToReadFileData parameter to true on your indexer definition (it goes under the indexer's parameters.configuration section). This creates the path /document/file_data, an object representing the original file data downloaded from your blob data source. Without it, the skill has no file_data input to read.

    Hope that helps. -Grace
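
    For reference, a minimal sketch of an indexer definition with that setting turned on might look like the following. The indexer, data source, and index names here are hypothetical placeholders, not values from the question. The skillsetName is assumed to match the skillset shown above, so adjust all of them to your own resources:

    ```json
    {
      "name": "pdf-blob-indexer",
      "dataSourceName": "my-blob-datasource",
      "targetIndexName": "my-index",
      "skillsetName": "document-extraction-skillset",
      "parameters": {
        "configuration": {
          "allowSkillsetToReadFileData": true
        }
      }
    }
    ```

    Only the allowSkillsetToReadFileData entry is the part the answer depends on. The rest is the usual wiring between the data source, skillset, and index, and your real indexer will likely also need field mappings to get the extracted text into an index field.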
