Azure Form Recognizer layout model

This article applies to: Form Recognizer v3.0 and the earlier version, Form Recognizer v2.1.

The Form Recognizer layout model is an advanced machine-learning-based document analysis API available in the Form Recognizer cloud. It enables you to take documents in various formats and return structured data representations of them. It combines an enhanced version of our powerful Optical Character Recognition (OCR) capabilities with deep learning models to extract text, tables, selection marks, and document structure.

Document layout analysis

Document structure layout analysis is the process of analyzing a document to extract regions of interest and their inter-relationships. The goal is to extract text and structural elements from the page to build better semantic understanding models. There are two types of roles that text plays in a document layout:

  • Geometric roles: Text, tables, and selection marks are examples of geometric roles.
  • Logical roles: Titles, headings, and footers are examples of logical roles.

The following illustration shows the typical components in an image of a sample page.

Illustration of document layout example.

Sample form processed with Form Recognizer Studio

Screenshot of sample newspaper page processed using Form Recognizer Studio.

Development options

Form Recognizer v3.0 supports the following tools:

Feature: Layout model
Model ID: prebuilt-layout

Sample document processed with Form Recognizer Sample Labeling tool layout model:

Screenshot of a document processed with the layout model.

Input requirements

  • For best results, provide one clear photo or high-quality scan per document.

  • Supported file formats:

    Model               PDF    Image (JPEG/JPG, PNG, BMP, TIFF)    Microsoft Office (Word DOCX, Excel XLS, PowerPoint PPT, HTML)
    Read                ✓      ✓                                   ✓ (REST API version 2022-06-30-preview)
    Layout              ✓      ✓                                   ✗
    General Document    ✓      ✓                                   ✗
    Prebuilt            ✓      ✓                                   ✗
    Custom              ✓      ✓                                   ✗

    ✱ Microsoft Office files are currently not supported for other models or versions.

  • For PDF and TIFF, up to 2000 pages can be processed (with a free tier subscription, only the first two pages are processed).

  • The file size for analyzing documents must be less than 500 MB for paid (S0) tier and 4 MB for free (F0) tier.

  • Image dimensions must be between 50 x 50 pixels and 10,000 x 10,000 pixels.

  • PDF dimensions are up to 17 x 17 inches, corresponding to Legal or A3 paper size, or smaller.

  • If your PDFs are password-locked, you must remove the lock before submission.

  • The minimum height of the text to be extracted is 12 pixels for a 1024 x 768 pixel image. This dimension corresponds to about 8-point text at 150 dots per inch (DPI).

  • For custom model training, the maximum number of pages for training data is 500 for the custom template model and 50,000 for the custom neural model.

  • For custom extraction model training, the total size of training data is 50 MB for the template model and 1 GB for the neural model.

  • For custom classification model training, the total size of training data is 1 GB with a maximum of 10,000 pages.

For Form Recognizer v2.1, the following input requirements apply instead:

  • Supported file formats: JPEG, PNG, PDF, and TIFF
  • For PDF and TIFF, up to 2000 pages are processed. For free tier subscribers, only the first two pages are processed.
  • The file size must be less than 50 MB, and image dimensions must be at least 50 x 50 pixels and at most 10,000 x 10,000 pixels.

Try layout extraction

See how data, including text, tables, table headers, selection marks, and structure information, is extracted from documents using Form Recognizer. You need the following resources:

  • An Azure subscription—you can create one for free

  • A Form Recognizer instance in the Azure portal. You can use the free pricing tier (F0) to try the service. After your resource deploys, select Go to resource to get your key and endpoint.

Screenshot: keys and endpoint location in the Azure portal.

Form Recognizer Studio

Note

Form Recognizer Studio is available with the v3.0 API.

Sample form processed with Form Recognizer Studio

Screenshot: Layout processing a newspaper page in Form Recognizer Studio.

  1. On the Form Recognizer Studio home page, select Layout.

  2. You can analyze the sample document or select the + Add button to upload your own sample.

  3. Select the Analyze button:

    Screenshot: analyze layout menu.

Form Recognizer Sample Labeling tool

  1. Navigate to the Form Recognizer sample tool.

  2. On the sample tool home page, select Use Layout to get text, tables and selection marks.

    Screenshot of connection settings for the Form Recognizer layout process.

  3. In the Form recognizer service endpoint field, paste the endpoint that you obtained with your Form Recognizer subscription.

  4. In the key field, paste the key you obtained from your Form Recognizer resource.

  5. In the Source field, select URL from the dropdown menu. You can use our sample document.

  6. Select Run Layout. The Form Recognizer Sample Labeling tool calls the Analyze Layout API and analyzes the document.

    Screenshot: Layout dropdown window.

  7. View the results: see the highlighted extracted text, the detected selection marks, and the detected tables.

    Screenshot of connection settings for the Form Recognizer Sample Labeling tool.

Supported document types

Model     Images    PDF    TIFF
Layout    ✓         ✓      ✓

Supported languages and locales

See Language Support for a complete list of supported handwritten and printed languages.

Data extraction

Starting with v3.0 GA, the Layout model extracts paragraphs and additional structure information such as titles, section headings, page headers, page footers, page numbers, and footnotes from the document page. These structural elements are examples of the logical roles described in the previous section. This capability is supported for PDF documents and images (JPG, PNG, BMP, TIFF).

Model     Text    Selection marks    Tables    Paragraphs    Logical roles
Layout    ✓       ✓                  ✓         ✓             ✓

Supported logical roles for paragraphs: The paragraph roles are best used with unstructured documents. Paragraph roles help analyze the structure of the extracted content for better semantic search and analysis.

  • title
  • sectionHeading
  • footnote
  • pageHeader
  • pageFooter
  • pageNumber

Data extraction support

Model     Text    Tables    Selection marks
Layout    ✓       ✓         ✓

Form Recognizer v2.1 supports the following tools:

Feature: Layout API

Model extraction

The layout model extracts text, selection marks, tables, paragraphs, and paragraph types (roles) from your documents.
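
The rest of this section looks at each of these output objects. To see them end to end, the following is a minimal Python sketch using the azure-ai-formrecognizer SDK (version 3.2 or later), not an official sample; the endpoint, key, and file name are placeholders you replace with your own values, and the later snippets in this section assume the result object it produces.

# Minimal sketch: analyze a local document with the prebuilt layout model.
# Placeholders: the endpoint, key, and "sample.pdf" are assumptions, not real values.
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"
key = "<your-key>"

client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

with open("sample.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", f)

result = poller.result()
print(f"Pages: {len(result.pages)}, tables: {len(result.tables or [])}, paragraphs: {len(result.paragraphs or [])}")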

Paragraph extraction

The Layout model extracts all identified blocks of text in the paragraphs collection as a top-level object under analyzeResults. Each entry in this collection represents a text block and includes the extracted text as content and the bounding polygon coordinates. The span information points to the text fragment within the top-level content property that contains the full text from the document.

"paragraphs": [
    {
        "spans": [],
        "boundingRegions": [],
        "content": "While healthcare is still in the early stages of its Al journey, we are seeing pharmaceutical and other life sciences organizations making major investments in Al and related technologies.\" TOM LAWRY | National Director for Al, Health and Life Sciences | Microsoft"
    }
]

Paragraph roles

The new machine-learning based page object detection extracts logical roles like titles, section headings, page headers, page footers, and more. The Form Recognizer Layout model assigns certain text blocks in the paragraphs collection with their specialized role or type predicted by the model. They're best used with unstructured documents to help understand the layout of the extracted content for a richer semantic analysis. The following paragraph roles are supported:

Predicted role Description
title The main heading(s) in the page
sectionHeading One or more subheading(s) on the page
footnote Text near the bottom of the page
pageHeader Text near the top edge of the page
pageFooter Text near the bottom edge of the page
pageNumber Page number
{
    "paragraphs": [
                {
                    "spans": [],
                    "boundingRegions": [],
                    "role": "title",
                    "content": "NEWS TODAY"
                },
                {
                    "spans": [],
                    "boundingRegions": [],
                    "role": "sectionHeading",
                    "content": "Mirjam Nilsson"
                }
    ]
}
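
As a rough illustration, assuming the result object from the SDK sketch earlier in this section, the paragraphs and their predicted roles can be read like this:

# Sketch: list extracted paragraphs and any predicted roles.
# "result" is the AnalyzeResult from the earlier layout call (an assumption).
for paragraph in result.paragraphs or []:
    role = paragraph.role or "paragraph"  # role is None for plain body text
    print(f"[{role}] {paragraph.content}")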

Pages extraction

The pages collection is the first object you see in the service response.

"pages": [
    {
        "pageNumber": 1,
        "angle": 0,
        "width": 915,
        "height": 1190,
        "unit": "pixel",
        "words": [],
        "lines": [],
        "spans": [],
        "kind": "document"
    }
]

Text lines and words extraction

The document layout model in Form Recognizer extracts print and handwritten style text as lines and words. The model outputs bounding polygon coordinates and confidence for the extracted words. The styles collection includes any handwritten style for lines if detected along with the spans pointing to the associated text. This feature applies to supported handwritten languages.

"words": [
    {
        "content": "While",
        "polygon": [],
        "confidence": 0.997,
        "span": {}
    }
],
"lines": [
    {
        "content": "While healthcare is still in the early stages of its Al journey, we",
        "polygon": [],
        "spans": [],
    }
]
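
Assuming the same result object from the earlier SDK sketch, the per-page lines and words (with word-level confidence) can be walked like this:

# Sketch: iterate pages, lines, and words from the layout result.
for page in result.pages:
    print(f"Page {page.page_number}: {page.width} x {page.height} {page.unit}")
    for line in page.lines:
        print(f"  line: {line.content}")
    for word in page.words:
        print(f"  word: {word.content} (confidence {word.confidence:.2f})")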

Selection marks extraction

The Layout model also extracts selection marks from documents. Extracted selection marks appear within the pages collection for each page. They include the bounding polygon, confidence, and selection state (selected/unselected). Any associated text, if extracted, is also included as the starting index (offset) and length that reference the top-level content property containing the full text from the document.

{
    "selectionMarks": [
        {
            "state": "unselected",
            "polygon": [],
            "confidence": 0.995,
            "span": {
                "offset": 1421,
                "length": 12
            }
        }
    ]
}
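
Again assuming the result object from the earlier SDK sketch, selection marks are available per page:

# Sketch: read selection marks with their state and confidence.
for page in result.pages:
    for mark in page.selection_marks or []:
        print(f"Page {page.page_number} selection mark: {mark.state} (confidence {mark.confidence:.2f})")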

Extract tables from documents and images

Extracting tables is a key requirement for processing documents containing large volumes of data typically formatted as tables. The Layout model extracts tables in the pageResults section of the JSON output. Extracted table information includes the number of columns and rows, row span, and column span. Each cell is output with its bounding polygon, along with information about whether it's recognized as a columnHeader or not. The model supports extracting tables that are rotated. Each table cell contains the row and column index and bounding polygon coordinates. For the cell text, the model outputs the span information containing the starting index (offset) and the length within the top-level content that contains the full text from the document.

{
    "tables": [
        {
            "rowCount": 9,
            "columnCount": 4,
            "cells": [
                {
                    "kind": "columnHeader",
                    "rowIndex": 0,
                    "columnIndex": 0,
                    "columnSpan": 4,
                    "content": "(In millions, except earnings per share)",
                    "boundingRegions": [],
                    "spans": []
                }
            ]
        }
    ]
}
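
As an informal illustration with the result object from the earlier SDK sketch, tables and their cells (including column headers) can be read like this:

# Sketch: iterate extracted tables and cells.
for i, table in enumerate(result.tables or []):
    print(f"Table {i}: {table.row_count} rows x {table.column_count} columns")
    for cell in table.cells:
        header = " (columnHeader)" if cell.kind == "columnHeader" else ""
        print(f"  [{cell.row_index}, {cell.column_index}]{header}: {cell.content}")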

Handwritten style for text lines (Latin languages only)

The response includes classifying whether each text line is of handwriting style or not, along with a confidence score. This feature is only supported for Latin languages. The following JSON snippet shows an example.

"styles": [
    {
        "confidence": 0.95,
        "spans": [
            {
                "offset": 509,
                "length": 24
            }
        ],
        "isHandwritten": true
    }
]
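
Assuming the same result object as in the earlier SDK sketch, the styles collection can be mapped back to the full document text through the span offsets:

# Sketch: find handwritten spans and print the corresponding text.
for style in result.styles or []:
    if style.is_handwritten:
        for span in style.spans:
            text = result.content[span.offset : span.offset + span.length]
            print(f"Handwritten (confidence {style.confidence:.2f}): {text}")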

Annotations extraction

The Layout model extracts annotations in documents, such as checks and crosses. The response includes the kind of annotation, along with a confidence score and bounding polygon.

{
  "pages": [
    {
      "annotations": [
        {
          "kind": "cross",
          "polygon": [...],
          "confidence": 1
        }
      ]
    }
  ]
}

Barcode extraction

The Layout model extracts all identified barcodes in the barcodes collection as a top-level object under content. Inside content, detected barcodes are represented as :barcode:. Each entry in this collection represents a barcode and includes the barcode type as kind and the embedded barcode content as value, along with its polygon coordinates. Initially, barcodes appear at the end of each page.

Supported barcode types

Barcode type
QR Code
Code 39
Code 128
UPC (UPC-A & UPC-E)
PDF417

Note

The confidence score is hard-coded for the 2023-02-28 public preview.

"content": ":barcode:",
  "pages": [
    {
      "pageNumber": 1,
      "barcodes": [
        {
          "kind": "QRCode",
          "value": "http://test.com/",
          "span": { ... },
          "polygon": [...],
          "confidence": 1
        }
      ]
    }
  ]

Extract selected pages from documents

For large multi-page documents, use the pages query parameter to indicate specific page numbers or page ranges for text extraction.
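
With the azure-ai-formrecognizer SDK, the same restriction can be expressed through the pages keyword argument, which maps to the pages query parameter. This is a sketch with placeholder file and page values, assuming the client created in the earlier layout sketch:

# Sketch: analyze only pages 3-6 of a large document.
with open("large-document.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", f, pages="3-6")
result = poller.result()
print(f"Analyzed {len(result.pages)} pages")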

Natural reading order output (Latin only)

You can specify the order in which the text lines are output with the readingOrder query parameter. Use natural for a more human-friendly reading order output as shown in the following example. This feature is only supported for Latin languages.

Screenshot of layout model reading order processing.

Select page numbers or ranges for text extraction

For large multi-page documents, use the pages query parameter to indicate specific page numbers or page ranges for text extraction. The following example shows a document with 10 pages, with text extracted for both cases - all pages (1-10) and selected pages (3-6).

Screenshot of the layout model selected pages output.

The Get Analyze Layout Result operation

The second step is to call the Get Analyze Layout Result operation. This operation takes as input the Result ID that the Analyze Layout operation created. It returns a JSON response that contains a status field with the following possible values.

Field     Type      Possible values
status    string    notStarted: The analysis operation hasn't started.
                    running: The analysis operation is in progress.
                    failed: The analysis operation has failed.
                    succeeded: The analysis operation has succeeded.

Call this operation iteratively until it returns the succeeded value. Use an interval of 3 to 5 seconds to avoid exceeding the requests per second (RPS) rate.

When the status field has the succeeded value, the JSON response includes the extracted layout, text, tables, and selection marks. The extracted data includes extracted text lines and words, bounding boxes, text appearance with handwritten indication, tables, and selection marks with selected/unselected indicated.
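
The following is a rough Python sketch of this two-step v2.1 flow using the requests library; the endpoint, key, and file name are placeholders, and error handling is omitted:

# Sketch: submit a document to Analyze Layout (v2.1), then poll
# Get Analyze Layout Result until the operation completes.
import time
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com"
key = "<your-key>"

with open("sample.pdf", "rb") as f:
    post = requests.post(
        f"{endpoint}/formrecognizer/v2.1/layout/analyze",
        headers={"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/pdf"},
        data=f,
    )
post.raise_for_status()
result_url = post.headers["Operation-Location"]  # URL containing the Result ID

# Poll every 3-5 seconds to stay under the requests-per-second limit.
while True:
    result = requests.get(result_url, headers={"Ocp-Apim-Subscription-Key": key}).json()
    if result["status"] in ("succeeded", "failed"):
        break
    time.sleep(3)

if result["status"] == "succeeded":
    layout = result["analyzeResult"]  # contains readResults and pageResults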

Handwritten classification for text lines (Latin only)

The response includes classifying whether each text line is of handwriting style or not, along with a confidence score. This feature is only supported for Latin languages. The following example shows the handwritten classification for the text in the image.

Screenshot of layout model handwriting classification process.

Sample JSON output

The response to the Get Analyze Layout Result operation is a structured representation of the document with all the information extracted. See the sample document file and its structured output, the sample layout output, for an example.

The JSON output has two parts:

  • readResults node contains all of the recognized text and selection marks. The text presentation hierarchy is page, then line, then individual words.
  • pageResults node contains the tables and cells extracted with their bounding boxes, confidence, and a reference to the lines and words in "readResults".
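
As an informal illustration, assuming the layout object from the v2.1 polling sketch above, both nodes can be walked like this:

# Sketch: print extracted text lines and table dimensions from the v2.1 output.
for page in layout["readResults"]:
    for line in page["lines"]:
        print(line["text"])

for page in layout["pageResults"]:
    for table in page["tables"]:
        print(f"Table with {table['rows']} rows and {table['columns']} columns")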

Example Output

Text

Layout API extracts text from documents and images with multiple text angles and colors. It accepts photos of documents, faxes, printed and/or handwritten (English only) text, and mixed modes. Text is extracted with information provided on lines, words, bounding boxes, confidence scores, and style (handwritten or other). All the text information is included in the readResults section of the JSON output.

Tables with headers

Layout API extracts tables in the pageResults section of the JSON output. Documents can be scanned, photographed, or digitized. Tables can be complex with merged cells or columns, with or without borders, and with odd angles. Extracted table information includes the number of columns and rows, row span, and column span. Each cell is output with its bounding box, along with information about whether it's recognized as part of a header or not. Predicted header cells can span multiple rows and aren't necessarily the first rows in a table, and header detection also works with rotated tables. Each table cell also includes the full text, with references to the individual words in the readResults section.

Tables example

Selection marks

Layout API also extracts selection marks from documents. Extracted selection marks include the bounding box, confidence, and state (selected/unselected). Selection mark information is extracted in the readResults section of the JSON output.

Migration guide

Next steps