Document Extraction cognitive skill

08/28/2024

The Document Extraction skill extracts content from a file within the enrichment pipeline. Document extraction normally happens before skillset execution. This skill lets you apply the same step to files that are generated by other skills.

Note

This skill isn't bound to Azure AI services and has no Azure AI services key requirement. This skill extracts text and images. Text extraction is free. Image extraction is metered by Azure AI Search. On a free search service, the cost of 20 transactions per indexer per day is absorbed so that you can complete quickstarts, tutorials, and small projects at no charge. For Basic, Standard, and above, image extraction is billable.

@odata.type

Microsoft.Skills.Util.DocumentExtractionSkill

Supported document formats

The DocumentExtractionSkill can extract text from the following document formats:

CSV (see Indexing CSV blobs)
EML
EPUB
GZ
HTML
JSON (see Indexing JSON blobs)
KML (XML for geographic representations)
Microsoft Office formats: DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG (Outlook emails), XML (both 2003 and 2006 WORD XML)
Open Document formats: ODT, ODS, ODP
PDF
Plain text files (see also Indexing plain text)
RTF
XML
ZIP

Skill parameters

Parameters are case-sensitive.


Inputs	Allowed Values	Description
`parsingMode`	`default` `text` `json`	Set to `default` for document extraction from files that aren't pure text or JSON. For source files that contain markup (such as PDF, HTML, RTF, and Microsoft Office files), use `default` to extract just the text, minus any markup language or tags. If `parsingMode` isn't defined explicitly, it's set to `default`. Set to `text` if source files are TXT. This parsing mode improves performance on plain text files. If files include markup, this mode preserves the tags in the final output. Set to `json` to extract structured content from JSON files.
`dataToExtract`	`contentAndMetadata` `allMetadata`	Set to `contentAndMetadata` to extract all metadata and textual content from each file. If `dataToExtract` isn't defined explicitly, it's set to `contentAndMetadata`. Set to `allMetadata` to extract only the metadata properties for the content type (for example, metadata unique to just .png files).
`configuration`	See below.	A dictionary of optional parameters that adjust how document extraction is performed. See the following table for descriptions of supported configuration properties.


Configuration Parameter	Allowed Values	Description
`imageAction`	`none` `generateNormalizedImages` `generateNormalizedImagePerPage`	Set to `none` to ignore embedded images or image files in the data set, or if the source data doesn't include image files. This is the default. For OCR and image analysis, set to `generateNormalizedImages` to have the skill create an array of normalized images as part of document cracking. This action requires that `parsingMode` is set to `default` and `dataToExtract` is set to `contentAndMetadata`. A normalized image refers to extra processing resulting in uniform image output, sized and rotated to promote consistent rendering when you include images in visual search results (for example, same-size photographs in a graph control as seen in the JFK demo). This information is generated for each image when you use this option. If you set `imageAction` to `generateNormalizedImagePerPage`, PDF files are treated differently: instead of extracting embedded images, each page is rendered as an image and normalized accordingly. Non-PDF file types are treated the same as if `generateNormalizedImages` were set.
`normalizedImageMaxWidth`	Any integer from 50 to 10000	The maximum width (in pixels) for generated normalized images. The default is 2000.
`normalizedImageMaxHeight`	Any integer from 50 to 10000	The maximum height (in pixels) for generated normalized images. The default is 2000.


Note

The default of 2000 pixels for the normalized images maximum width and height is based on the maximum sizes supported by the OCR skill and the image analysis skill. The OCR skill supports a maximum width and height of 4200 for non-English languages, and 10000 for English. If you increase the maximum limits, processing could fail on larger images depending on your skillset definition and the language of the documents.
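
The parameter constraints described in this section can be checked locally before a skillset is submitted. The following is a minimal Python sketch, not part of the service or any SDK: the function name `check_extraction_settings` is hypothetical, the 50 to 10000 range is assumed to be inclusive, and the `parsingMode`/`dataToExtract` requirement is applied only to `generateNormalizedImages` because that is the only value for which this article states it.

```python
# Allowed values as listed in the parameter tables of this article.
PARSING_MODES = {"default", "text", "json"}
DATA_TO_EXTRACT = {"contentAndMetadata", "allMetadata"}
IMAGE_ACTIONS = {"none", "generateNormalizedImages", "generateNormalizedImagePerPage"}


def check_extraction_settings(skill: dict) -> list[str]:
    """Return a list of problems found in a skill definition (hypothetical helper)."""
    problems = []
    # Defaults follow the article: default, contentAndMetadata, none, 2000.
    parsing_mode = skill.get("parsingMode", "default")
    data_to_extract = skill.get("dataToExtract", "contentAndMetadata")
    config = skill.get("configuration", {})
    image_action = config.get("imageAction", "none")

    if parsing_mode not in PARSING_MODES:
        problems.append(f"unknown parsingMode: {parsing_mode!r}")
    if data_to_extract not in DATA_TO_EXTRACT:
        problems.append(f"unknown dataToExtract: {data_to_extract!r}")
    if image_action not in IMAGE_ACTIONS:
        problems.append(f"unknown imageAction: {image_action!r}")

    # Stated in the article for generateNormalizedImages only.
    if image_action == "generateNormalizedImages" and (
        parsing_mode != "default" or data_to_extract != "contentAndMetadata"
    ):
        problems.append(
            "generateNormalizedImages requires parsingMode 'default' "
            "and dataToExtract 'contentAndMetadata'"
        )

    # Assumption: the documented 50-10000 range is inclusive at both ends.
    for key in ("normalizedImageMaxWidth", "normalizedImageMaxHeight"):
        value = config.get(key, 2000)
        if isinstance(value, bool) or not isinstance(value, int) or not 50 <= value <= 10000:
            problems.append(f"{key} must be an integer from 50 to 10000")
    return problems


if __name__ == "__main__":
    bad = {"parsingMode": "text", "configuration": {"imageAction": "generateNormalizedImages"}}
    for problem in check_extraction_settings(bad):
        print(problem)
```

A check like this only catches values that contradict the tables. It doesn't predict whether a downstream OCR or image analysis skill accepts images at the configured size.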

Skill inputs

Input name	Description
`file_data`	The file that content should be extracted from.

The "file_data" input must be an object defined as:

{
  "$type": "file",
  "data": "BASE64 encoded string of the file"
}

Alternatively, it can be defined as:

{
  "$type": "file",
  "url": "URL to download file",
  "sasToken": "OPTIONAL: SAS token for authentication if the URL provided is for a file in blob storage"
}
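
As an illustration of the two shapes above, the following Python sketch builds a file reference object either from raw bytes or from a URL. The helper names `file_reference_from_bytes` and `file_reference_from_url` are hypothetical and exist only for this example. Only the keys `$type`, `data`, `url`, and `sasToken` come from this article.

```python
import base64
from typing import Optional


def file_reference_from_bytes(content: bytes) -> dict:
    """Build the inline form: the file content as a Base64 string."""
    return {"$type": "file", "data": base64.b64encode(content).decode("ascii")}


def file_reference_from_url(url: str, sas_token: Optional[str] = None) -> dict:
    """Build the URL form. sasToken is optional, so it's added only when given."""
    reference = {"$type": "file", "url": url}
    if sas_token is not None:
        reference["sasToken"] = sas_token
    return reference


if __name__ == "__main__":
    print(file_reference_from_bytes(b"hello"))
    # Placeholder URL for illustration only.
    print(file_reference_from_url("https://example.com/report.pdf"))
```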

The file reference object can be generated in one of three ways:

Setting the allowSkillsetToReadFileData parameter on your indexer definition to "true". This creates a path /document/file_data that is an object representing the original file data downloaded from your blob data source. This parameter only applies to files in Blob storage.
Setting the imageAction parameter on your indexer definition to a value other than none. This creates an array of images that follows the required convention for input to this skill if passed individually (that is, /document/normalized_images/*).
Having a custom skill return a JSON object defined exactly as above. The $type parameter must be set to exactly file, and either the data parameter must be the Base64-encoded byte array of the file content, or the url parameter must be a correctly formatted URL with access to download the file at that location.
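
For the custom skill option, the following Python sketch shows one way a skill could wrap file bytes in a response. It assumes the response mirrors the `values`/`recordId`/`data` envelope used by the sample input and output in this article. The function name `custom_skill_response` and the output name `file_data` are hypothetical choices for illustration.

```python
import base64
import json


def custom_skill_response(files: dict[str, bytes], output_name: str = "file_data") -> dict:
    """Wrap raw file bytes, keyed by recordId, in file reference objects."""
    return {
        "values": [
            {
                "recordId": record_id,
                "data": {
                    output_name: {
                        "$type": "file",  # must be exactly "file"
                        "data": base64.b64encode(content).decode("ascii"),
                    }
                },
            }
            for record_id, content in files.items()
        ]
    }


if __name__ == "__main__":
    print(json.dumps(custom_skill_response({"1": b"hello"}), indent=2))
```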

Skill outputs

Output name	Description
`content`	The textual content of the document.
`normalized_images`	When the `imageAction` is set to a value other than `none`, the new normalized_images field contains an array of images. See Extract text and information from images for more details on the output format.

Sample definition

{
  "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
  "parsingMode": "default",
  "dataToExtract": "contentAndMetadata",
  "configuration": {
    "imageAction": "generateNormalizedImages",
    "normalizedImageMaxWidth": 2000,
    "normalizedImageMaxHeight": 2000
  },
  "context": "/document",
  "inputs": [
    {
      "name": "file_data",
      "source": "/document/file_data"
    }
  ],
  "outputs": [
    {
      "name": "content",
      "targetName": "extracted_content"
    },
    {
      "name": "normalized_images",
      "targetName": "extracted_normalized_images"
    }
  ]
}
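
In the definition above, each output is renamed through `targetName`. The following Python sketch computes where downstream skills would likely find those outputs. It assumes the usual skillset convention that an output lands at the skill's `context` path plus its `targetName`, falling back to the output `name` when no `targetName` is given. The function name `output_paths` is hypothetical.

```python
def output_paths(skill: dict) -> dict[str, str]:
    """Map each output name to its assumed path in the enriched document."""
    # Assumption: context defaults to /document when omitted.
    context = skill.get("context", "/document").rstrip("/")
    return {
        output["name"]: f"{context}/{output.get('targetName', output['name'])}"
        for output in skill["outputs"]
    }


sample_skill = {
    "context": "/document",
    "outputs": [
        {"name": "content", "targetName": "extracted_content"},
        {"name": "normalized_images", "targetName": "extracted_normalized_images"},
    ],
}

if __name__ == "__main__":
    for name, path in output_paths(sample_skill).items():
        print(f"{name} -> {path}")
```

Under these assumptions, a later skill that needs the extracted text would reference `/document/extracted_content` rather than `/document/content`.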

Sample input

{
  "values": [
    {
      "recordId": "1",
      "data": {
        "file_data": {
          "$type": "file",
          "data": "aGVsbG8="
        }
      }
    }
  ]
}

Sample output

{
  "values": [
    {
      "recordId": "1",
      "data": {
        "content": "hello",
        "normalized_images": []
      }
    }
  ]
}
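
The `data` value in the sample input is the Base64 encoding of a plain text file containing `hello`, which is why the sample output reports that string as `content`. The following Python sketch decodes the inline payload locally to make the relationship visible. It doesn't call the service, and the helper name `decode_file_data` is hypothetical.

```python
import base64

sample_input = {
    "values": [
        {
            "recordId": "1",
            "data": {"file_data": {"$type": "file", "data": "aGVsbG8="}},
        }
    ]
}


def decode_file_data(request: dict) -> dict[str, bytes]:
    """Return the raw bytes of each inline file reference, keyed by recordId."""
    decoded = {}
    for record in request["values"]:
        file_reference = record["data"]["file_data"]
        # Only the inline form carries bytes; URL references are skipped here.
        if "data" in file_reference:
            decoded[record["recordId"]] = base64.b64decode(file_reference["data"])
    return decoded


if __name__ == "__main__":
    print(decode_file_data(sample_input))  # {'1': b'hello'}
```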
