Document Extraction cognitive skill

The Document Extraction skill extracts content from a file within the enrichment pipeline. Document extraction normally happens before skillset execution; this skill lets you apply the same step to files that are generated by other skills.

Note

This skill isn't bound to Azure AI services and has no Azure AI services key requirement. This skill extracts text and images. Text extraction is free. Image extraction is metered by Azure AI Search. On a free search service, the cost of 20 transactions per indexer per day is absorbed so that you can complete quickstarts, tutorials, and small projects at no charge. For Basic, Standard, and above, image extraction is billable.

@odata.type

Microsoft.Skills.Util.DocumentExtractionSkill

Supported document formats

The DocumentExtractionSkill can extract text from the following document formats:

  • CSV (see Indexing CSV blobs)
  • EML
  • EPUB
  • GZ
  • HTML
  • JSON (see Indexing JSON blobs)
  • KML (XML for geographic representations)
  • Microsoft Office formats: DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG (Outlook emails), XML (both 2003 and 2006 WORD XML)
  • Open Document formats: ODT, ODS, ODP
  • PDF
  • Plain text files (see also Indexing plain text)
  • RTF
  • XML
  • ZIP

Skill parameters

Parameters are case-sensitive.

parsingMode

Allowed values: default, text, json.

  • default: Use for document extraction from files that aren't pure text or JSON. For source files that contain markup (such as PDF, HTML, RTF, and Microsoft Office files), default extracts just the text, minus any markup language or tags. If parsingMode isn't defined explicitly, it's set to default.
  • text: Use if source files are TXT. This parsing mode improves performance on plain text files. If files include markup, this mode preserves the tags in the final output.
  • json: Use to extract structured content from JSON files.

dataToExtract

Allowed values: contentAndMetadata, allMetadata.

  • contentAndMetadata: Extracts all metadata and textual content from each file. If dataToExtract isn't defined explicitly, it's set to contentAndMetadata.
  • allMetadata: Extracts only the metadata properties for the content type (for example, metadata unique to just .png files).

configuration

A dictionary of optional parameters that adjust how the document extraction is performed. The supported configuration properties are described below.
Configuration parameters

imageAction

Allowed values: none, generateNormalizedImages, generateNormalizedImagePerPage.

  • none: Ignores embedded images or image files in the data set. Also use it when the source data doesn't include image files. This is the default.
  • generateNormalizedImages: For OCR and image analysis, has the skill create an array of normalized images as part of document cracking. This action requires that parsingMode is set to default and dataToExtract is set to contentAndMetadata. A normalized image refers to extra processing resulting in uniform image output, sized and rotated to promote consistent rendering when you include images in visual search results (for example, same-size photographs in a graph control as seen in the JFK demo). This information is generated for each image when you use this option.
  • generateNormalizedImagePerPage: Treats PDF files differently: instead of extracting embedded images, each page is rendered as an image and normalized accordingly. Non-PDF file types are treated the same as if generateNormalizedImages were set.

normalizedImageMaxWidth

Allowed values: any integer between 50 and 10000. The maximum width (in pixels) for normalized images generated. The default is 2000.

normalizedImageMaxHeight

Allowed values: any integer between 50 and 10000. The maximum height (in pixels) for normalized images generated. The default is 2000.

Note

The default of 2000 pixels for the normalized images maximum width and height is based on the maximum sizes supported by the OCR skill and the image analysis skill. The OCR skill supports a maximum width and height of 4200 for non-English languages, and 10000 for English. If you increase the maximum limits, processing could fail on larger images depending on your skillset definition and the language of the documents.
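
The rules in the parameter descriptions above can be checked before a skill definition is submitted. The following is a minimal sketch, not part of any Azure SDK: validate_skill is a hypothetical helper, it treats the 50 to 10000 range as inclusive (an assumption), and it applies the parsingMode and dataToExtract requirement to both image-generating actions, although the article states it only for generateNormalizedImages.

```python
PARSING_MODES = {"default", "text", "json"}
DATA_TO_EXTRACT = {"contentAndMetadata", "allMetadata"}
IMAGE_ACTIONS = {"none", "generateNormalizedImages", "generateNormalizedImagePerPage"}


def validate_skill(skill: dict) -> list:
    """Return a list of problems found in a Document Extraction skill definition.

    Hypothetical helper: it encodes only the rules stated in this article.
    """
    problems = []

    # Fall back to the documented defaults when a parameter is omitted.
    parsing_mode = skill.get("parsingMode", "default")
    data_to_extract = skill.get("dataToExtract", "contentAndMetadata")
    config = skill.get("configuration", {})
    image_action = config.get("imageAction", "none")

    if parsing_mode not in PARSING_MODES:
        problems.append(f"unknown parsingMode: {parsing_mode!r}")
    if data_to_extract not in DATA_TO_EXTRACT:
        problems.append(f"unknown dataToExtract: {data_to_extract!r}")

    if image_action not in IMAGE_ACTIONS:
        problems.append(f"unknown imageAction: {image_action!r}")
    elif image_action != "none":
        # Assumption: the requirement stated for generateNormalizedImages
        # is applied to generateNormalizedImagePerPage as well.
        if parsing_mode != "default" or data_to_extract != "contentAndMetadata":
            problems.append(
                "image generation requires parsingMode 'default' "
                "and dataToExtract 'contentAndMetadata'"
            )

    for key in ("normalizedImageMaxWidth", "normalizedImageMaxHeight"):
        value = config.get(key, 2000)
        # bool is a subclass of int in Python, so reject it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 50 <= value <= 10000:
            problems.append(f"{key} must be an integer from 50 to 10000, got {value!r}")

    return problems


# A definition that combines text parsing with image generation is flagged.
print(validate_skill({
    "parsingMode": "text",
    "configuration": {"imageAction": "generateNormalizedImages"},
}))
```

An empty list means the definition passed these checks; it doesn't guarantee that the service accepts the definition.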

Skill inputs

  • file_data: The file that content should be extracted from.

The "file_data" input must be an object defined as:

{
  "$type": "file",
  "data": "BASE64 encoded string of the file"
}

Alternatively, it can be defined as:

{
  "$type": "file",
  "url": "URL to download file",
  "sasToken": "OPTIONAL: SAS token for authentication if the URL provided is for a file in blob storage"
}
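
Both forms of the file reference object can be built with a few lines of code. This is a minimal sketch using only the Python standard library; the function names and the example URL are hypothetical and chosen for illustration.

```python
import base64


def file_reference_from_bytes(content: bytes) -> dict:
    """Build the inline form: the file content as a Base64 string."""
    return {"$type": "file", "data": base64.b64encode(content).decode("ascii")}


def file_reference_from_url(url: str, sas_token=None) -> dict:
    """Build the URL form; sasToken is included only when one is supplied."""
    reference = {"$type": "file", "url": url}
    if sas_token is not None:
        reference["sasToken"] = sas_token
    return reference


print(file_reference_from_bytes(b"hello"))
# {'$type': 'file', 'data': 'aGVsbG8='}

# Hypothetical URL, shown only to illustrate the shape of the object.
print(file_reference_from_url("https://example.com/files/report.pdf"))
```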

The file reference object can be generated in one of three ways:

  • Setting the allowSkillsetToReadFileData parameter on your indexer definition to "true". This creates a path /document/file_data that is an object representing the original file data downloaded from your blob data source. This parameter only applies to files in Blob storage.

  • Setting the imageAction parameter on your indexer definition to a value other than none. This creates an array of images that follows the required convention for input to this skill if passed individually (that is, /document/normalized_images/*).

  • Having a custom skill return a JSON object defined exactly as above. The $type parameter must be set to exactly file, and either the data parameter must be the Base64-encoded byte array of the file content, or the url parameter must be a correctly formatted URL with access to download the file at that location.
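
For the custom skill option, the skill's response has to carry the file reference in exactly the shape shown earlier. The sketch below is one way such a handler might look. It assumes the request and response use the same values, recordId, and data envelope as the sample input and output in this article; the input name text and the output name generated_file are hypothetical.

```python
import base64
import json


def handle_request(payload: dict) -> dict:
    """Hypothetical custom skill handler.

    Wraps each record's text in a file reference object that the
    Document Extraction skill can consume.
    """
    results = []
    for record in payload["values"]:
        # "text" is a hypothetical input name chosen for this sketch.
        raw = record["data"]["text"].encode("utf-8")
        results.append({
            "recordId": record["recordId"],
            "data": {
                # "generated_file" is a hypothetical output name.
                "generated_file": {
                    "$type": "file",
                    "data": base64.b64encode(raw).decode("ascii"),
                }
            },
            "errors": [],
            "warnings": [],
        })
    return {"values": results}


request = {"values": [{"recordId": "1", "data": {"text": "hello"}}]}
print(json.dumps(handle_request(request), indent=2))
```

Under these assumptions, a downstream Document Extraction skill would take the path of the generated_file output as the source for its file_data input.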

Skill outputs

  • content: The textual content of the document.
  • normalized_images: When imageAction is set to a value other than none, the normalized_images field contains an array of images. See Extract text and information from images for more details on the output format.
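
The normalized_images output is typically consumed one image at a time by a downstream skill. As a hedged illustration, the sketch below builds an OCR skill definition that iterates over the image array. The helper ocr_skill_for is hypothetical, and the path passed to it must match wherever your skillset actually writes the images, which depends on the context and targetName you choose.

```python
import json


def ocr_skill_for(images_path: str) -> dict:
    """Build an OCR skill definition that runs once per image under images_path.

    Hypothetical helper; images_path is the enrichment tree path of the
    image array produced by the Document Extraction skill.
    """
    each_image = f"{images_path}/*"
    return {
        "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
        "context": each_image,
        "inputs": [{"name": "image", "source": each_image}],
        "outputs": [{"name": "text", "targetName": "text"}],
    }


# Assumes the images were written to /document/normalized_images.
print(json.dumps(ocr_skill_for("/document/normalized_images"), indent=2))
```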

Sample definition

{
  "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
  "parsingMode": "default",
  "dataToExtract": "contentAndMetadata",
  "configuration": {
    "imageAction": "generateNormalizedImages",
    "normalizedImageMaxWidth": 2000,
    "normalizedImageMaxHeight": 2000
  },
  "context": "/document",
  "inputs": [
    {
      "name": "file_data",
      "source": "/document/file_data"
    }
  ],
  "outputs": [
    {
      "name": "content",
      "targetName": "extracted_content"
    },
    {
      "name": "normalized_images",
      "targetName": "extracted_normalized_images"
    }
  ]
}
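
The sample definition reads /document/file_data, which exists only when allowSkillsetToReadFileData is enabled on the indexer. The sketch below shows the relevant fragment of an indexer definition, assuming the setting is placed under parameters.configuration alongside other indexer configuration settings. The helper and all resource names are hypothetical.

```python
import json


def indexer_definition(name: str, data_source: str, index: str, skillset: str) -> dict:
    """Build an indexer definition that exposes /document/file_data to skills.

    Hypothetical helper; only the allowSkillsetToReadFileData setting is the
    point of this sketch.
    """
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "parameters": {
            "configuration": {
                # Creates /document/file_data for blobs from the data source.
                "allowSkillsetToReadFileData": True,
            }
        },
    }


# Placeholder names, not real resources.
definition = indexer_definition("my-indexer", "my-blob-datasource", "my-index", "my-skillset")
print(json.dumps(definition, indent=2))
```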

Sample input

{
  "values": [
    {
      "recordId": "1",
      "data": {
        "file_data": {
          "$type": "file",
          "data": "aGVsbG8="
        }
      }
    }
  ]
}

Sample output

{
  "values": [
    {
      "recordId": "1",
      "data": {
        "content": "hello",
        "normalized_images": []
      }
    }
  ]
}
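
The sample input and output line up because the data value aGVsbG8= is the Base64 encoding of the plain-text bytes hello. The short sketch below decodes the inline form to confirm this; decode_inline_file is a hypothetical helper. The direct match between decoded bytes and content holds only for plain text like this sample. For formats such as PDF or DOCX, the skill extracts the text from the document structure instead.

```python
import base64


def decode_inline_file(file_reference: dict) -> bytes:
    """Return the raw bytes carried by the inline (Base64) form of a file reference."""
    if file_reference.get("$type") != "file":
        raise ValueError("not a file reference object")
    return base64.b64decode(file_reference["data"])


sample_file_data = {"$type": "file", "data": "aGVsbG8="}
print(decode_inline_file(sample_file_data).decode("utf-8"))
# hello
```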

See also