Form Recognizer read (OCR) model

This article applies to: ✔️ Form Recognizer v3.0.

Note

For extracting text from external images like labels, street signs, and posters, use the Computer Vision v4.0 preview Read feature. It's optimized for general, non-document images and has a performance-enhanced synchronous API that makes it easier to embed OCR in your user experience scenarios.

Form Recognizer v3.0's Read Optical Character Recognition (OCR) model runs at a higher resolution than Computer Vision Read and extracts print and handwritten text from PDF documents and scanned images. It also includes preview support for extracting text from Microsoft Word, Excel, PowerPoint, and HTML documents. It detects paragraphs, text lines, words, locations, and languages. The Read model is the underlying OCR engine for other Form Recognizer prebuilt models like Layout, General Document, Invoice, Receipt, and Identity (ID) document, as well as for custom models.

What is OCR for documents?

Optical Character Recognition (OCR) for documents is optimized for large text-heavy documents in multiple file formats and global languages. It includes features like higher-resolution scanning of document images for better handling of smaller and dense text; paragraph detection; and fillable form management. OCR capabilities also include advanced scenarios like single character boxes and accurate extraction of key fields commonly found in invoices, receipts, and other prebuilt scenarios.

Read OCR supported document types

Note

  • Only API version 2022-06-30-preview supports Microsoft Word, Excel, PowerPoint, and HTML file formats in addition to all other document types supported by the GA versions.
  • For the preview of Office and HTML file formats, the Read API ignores the pages parameter and extracts all pages by default. Each embedded image counts as 1 page unit, and each worksheet, slide, or block of up to 3,000 characters counts as 1 page unit.
| Model | Images | PDF | TIFF | Word | Excel | PowerPoint | HTML |
| prebuilt-read | GA (2022-08-31) | GA (2022-08-31) | GA (2022-08-31) | Preview (2022-06-30-preview) | Preview (2022-06-30-preview) | Preview (2022-06-30-preview) | Preview (2022-06-30-preview) |

Data extraction

| Model | Text | Language detection |
| prebuilt-read | ✓ | ✓ |

Development options

Form Recognizer v3.0 supports the following resources:

| Model | Resources | Model ID |
| Read model | Form Recognizer Studio, REST API, C# SDK, Python SDK, Java SDK, JavaScript SDK | prebuilt-read |

Try OCR in Form Recognizer

Try extracting text from forms and documents using the Form Recognizer Studio. You need the following assets:

  • An Azure subscription—you can create one for free

  • A Form Recognizer instance in the Azure portal. You can use the free pricing tier (F0) to try the service. After your resource deploys, select Go to resource to get your key and endpoint.

Screenshot: keys and endpoint location in the Azure portal.
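
With your key and endpoint in hand, you can also call the Read model from code instead of the Studio. The following is a minimal sketch, assuming the azure-ai-formrecognizer Python SDK (version 3.2 or later); the endpoint, key, and file name are placeholders, not values from this article.

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Placeholders: replace with the key and endpoint from your own resource.
endpoint = "https://<your-resource>.cognitiveservices.azure.com/"
key = "<your-key>"

client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))

# Analyze a local file with the prebuilt Read model ("sample.pdf" is a placeholder).
with open("sample.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-read", document=f)
result = poller.result()

print(result.content)  # the full extracted text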

Form Recognizer Studio

Note

Currently, Form Recognizer Studio doesn't support Microsoft Word, Excel, PowerPoint, or HTML file formats with the Read model in v3.0.

Sample form processed with Form Recognizer Studio

Screenshot: Read processing in Form Recognizer Studio.

  1. On the Form Recognizer Studio home page, select Read.

  2. You can analyze the sample document or select the + Add button to upload your own sample.

  3. Select the Analyze button:

    Screenshot: analyze read menu.

Input requirements

  • For best results, provide one clear photo or high-quality scan per document.

  • Supported file formats:

    | Model | PDF | Image: JPEG/JPG, PNG, BMP, and TIFF | Microsoft Office: Word (DOCX), Excel (XLS), PowerPoint (PPT), and HTML |
    | Read (REST API version 2022-06-30-preview) | ✓ | ✓ | ✓ |
    | Layout | ✓ | ✓ | ✱ |
    | General Document | ✓ | ✓ | ✱ |
    | Prebuilt | ✓ | ✓ | ✱ |
    | Custom | ✓ | ✓ | ✱ |

    ✱ Microsoft Office files are currently not supported for other models or versions.

  • For PDF and TIFF, up to 2000 pages can be processed (with a free tier subscription, only the first two pages are processed).

  • The file size for analyzing documents must be less than 500 MB for the paid (S0) tier and 4 MB for the free (F0) tier.

  • Image dimensions must be between 50 x 50 pixels and 10,000 x 10,000 pixels.

  • PDF dimensions are up to 17 x 17 inches, corresponding to Legal or A3 paper size, or smaller.

  • If your PDFs are password-locked, you must remove the lock before submission.

  • The minimum height of the text to be extracted is 12 pixels for a 1024 x 768 pixel image. This dimension corresponds to about 8-point text at 150 dots per inch (DPI).

  • For custom model training, the maximum number of pages for training data is 500 for the custom template model and 50,000 for the custom neural model.

  • For custom extraction model training, the total size of training data is 50 MB for the template model and 1 GB for the neural model.

  • For custom classification model training, the total size of training data is 1 GB with a maximum of 10,000 pages.

Supported languages and locales

Form Recognizer v3.0 supports several languages for the Read OCR model. See Language Support for a complete list of supported handwritten and printed languages.

Data detection and extraction

Microsoft Office and HTML text extraction

Use the parameter api-version=2022-06-30-preview with the REST API, or the corresponding SDKs of that API version, to preview text extraction from Microsoft Word, Excel, PowerPoint, and HTML files. The following illustration shows extraction of the digital text in a Word document along with text from its embedded images, which is obtained by running OCR on the images.

Screenshot of a Microsoft Word document extracted by Form Recognizer Read OCR.
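
As a sketch of what that call can look like, the following uses the REST API directly with api-version=2022-06-30-preview; the endpoint, key, and file name are placeholders. The analyze operation is asynchronous, so the result is fetched by polling the Operation-Location URL returned by the service.

import time
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
key = "<your-key>"  # placeholder

url = f"{endpoint}/formrecognizer/documentModels/prebuilt-read:analyze"
params = {"api-version": "2022-06-30-preview"}
headers = {
    "Ocp-Apim-Subscription-Key": key,
    "Content-Type": "application/octet-stream",
}

with open("report.docx", "rb") as f:  # placeholder Word document
    response = requests.post(url, params=params, headers=headers, data=f)
response.raise_for_status()

# Poll the operation URL until the analysis succeeds or fails.
operation_url = response.headers["Operation-Location"]
while True:
    poll = requests.get(operation_url, headers={"Ocp-Apim-Subscription-Key": key}).json()
    if poll["status"] in ("succeeded", "failed"):
        break
    time.sleep(1)

print(poll["analyzeResult"]["content"][:500])  # first 500 characters of extracted text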

The page units in the model output are computed as shown:

| File format | Computed page unit | Total pages |
| Word | Up to 3,000 characters = 1 page unit; each embedded image = 1 page unit | Total pages of up to 3,000 characters each + total embedded images |
| Excel | Each worksheet = 1 page unit; each embedded image = 1 page unit | Total worksheets + total images |
| PowerPoint | Each slide = 1 page unit; each embedded image = 1 page unit | Total slides + total images |
| HTML | Up to 3,000 characters = 1 page unit; embedded or linked images not supported | Total pages of up to 3,000 characters each |
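
For example, a Word document with 7,000 characters of text and two embedded images counts as ceil(7,000 / 3,000) + 2 = 3 + 2 = 5 page units.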

Barcode extraction

The Read OCR model extracts all identified barcodes into the barcodes collection, a top-level object under content. Inside content, each detected barcode is represented as :barcode:. Each entry in the collection represents one barcode and includes the barcode type as kind and the embedded barcode content as value, along with its polygon coordinates. Initially, barcodes appear at the end of each page. The confidence is hard-coded for the public preview (2023-02-28) release.

Supported barcode types

| Barcode type | Example |
| QR Code | Screenshot of the QR Code. |
| Code 39 | Screenshot of the Code 39. |
| Code 128 | Screenshot of the Code 128. |
| UPC (UPC-A & UPC-E) | Screenshot of the UPC. |
| PDF417 | Screenshot of the PDF417. |
"content": ":barcode:",
  "pages": [
    {
      "pageNumber": 1,
      "barcodes": [
        {
          "kind": "QRCode",
          "value": "http://test.com/",
          "span": { ... },
          "polygon": [...],
          "confidence": 1
        }
      ]
    }
  ]
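
A minimal sketch of consuming this output, assuming analyze_result is the parsed analyzeResult object from the raw JSON response shown above:

def list_barcodes(analyze_result: dict) -> None:
    # Walk each page's barcodes collection (shape as in the snippet above).
    for page in analyze_result.get("pages", []):
        for barcode in page.get("barcodes", []):
            print(f"page {page['pageNumber']}: {barcode['kind']} -> "
                  f"{barcode['value']} (confidence {barcode['confidence']})")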

Paragraphs extraction

The Read OCR model in Form Recognizer extracts all identified blocks of text in the paragraphs collection as a top-level object under analyzeResults. Each entry in this collection represents a text block and includes the extracted text as content and the bounding polygon coordinates. The span information points to the text fragment within the top-level content property that contains the full text from the document.

"paragraphs": [
    {
        "spans": [],
        "boundingRegions": [],
        "content": "While healthcare is still in the early stages of its Al journey, we are seeing pharmaceutical and other life sciences organizations making major investments in Al and related technologies.\" TOM LAWRY | National Director for Al, Health and Life Sciences | Microsoft"
    }
]
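
A short sketch of iterating this collection with the azure-ai-formrecognizer Python SDK, assuming result is the AnalyzeResult returned by begin_analyze_document(...).result():

from azure.ai.formrecognizer import AnalyzeResult

def print_paragraphs(result: AnalyzeResult) -> None:
    for paragraph in result.paragraphs:
        # Each paragraph carries its text plus the page regions it occupies.
        pages = [region.page_number for region in paragraph.bounding_regions]
        print(f"(pages {pages}) {paragraph.content}")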

Language detection

The Read OCR model in Form Recognizer adds language detection as a new feature for text lines. Read predicts the detected primary language for each text line along with the confidence in the languages collection under analyzeResult.

"languages": [
    {
        "spans": [
            {
                "offset": 0,
                "length": 131
            }
        ],
        "locale": "en",
        "confidence": 0.7
    }
]
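
A sketch, under the same assumptions, that maps each detected language span back to the text it covers by slicing the top-level content property:

from azure.ai.formrecognizer import AnalyzeResult

def print_languages(result: AnalyzeResult) -> None:
    for language in result.languages:
        for span in language.spans:
            # Spans index into the full document text in result.content.
            snippet = result.content[span.offset : span.offset + span.length]
            print(f"{language.locale} ({language.confidence:.2f}): {snippet[:60]!r}")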

Extract pages from documents

The page units in the model output are computed as shown:

| File format | Computed page unit | Total pages |
| Images | Each image = 1 page unit | Total images |
| PDF | Each page in the PDF = 1 page unit | Total pages in the PDF |
| TIFF | Each image in the TIFF = 1 page unit | Total images in the TIFF |

"pages": [
    {
        "pageNumber": 1,
        "angle": 0,
        "width": 915,
        "height": 1190,
        "unit": "pixel",
        "words": [],
        "lines": [],
        "spans": [],
        "kind": "document"
    }
]
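
A sketch summarizing the per-page metadata with the Python SDK, again assuming result is the AnalyzeResult from the Read model:

from azure.ai.formrecognizer import AnalyzeResult

def print_page_summary(result: AnalyzeResult) -> None:
    for page in result.pages:
        print(f"page {page.page_number}: {page.width} x {page.height} {page.unit}, "
              f"rotated {page.angle} degrees, "
              f"{len(page.lines)} lines, {len(page.words)} words")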

Extract text lines and words

The Read OCR model extracts print and handwritten style text as lines and words. The model outputs bounding polygon coordinates and a confidence score for each extracted word. If handwritten style is detected, the styles collection includes it for the relevant lines, along with the spans pointing to the associated text. This feature applies to supported handwritten languages.

For the preview of Microsoft Word, Excel, PowerPoint, and HTML file support, Read extracts all embedded text as is. For any embedded images, it runs OCR on the images to extract text and appends the text from each image as an added entry to the pages collection. These added entries include the extracted text lines and words, their bounding polygons, confidences, and spans pointing to the associated text.

"words": [
    {
        "content": "While",
        "polygon": [],
        "confidence": 0.997,
        "span": {}
    },
],
"lines": [
    {
        "content": "While healthcare is still in the early stages of its Al journey, we",
        "polygon": [],
        "spans": [],
    }
]
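
A sketch that prints each extracted line and flags words below a chosen confidence threshold; result is assumed to be the AnalyzeResult from the Read model, and the 0.9 threshold is arbitrary:

from azure.ai.formrecognizer import AnalyzeResult

def print_lines_and_words(result: AnalyzeResult, min_confidence: float = 0.9) -> None:
    for page in result.pages:
        for line in page.lines:
            print(line.content)
        # Flag words the model was less sure about for manual review.
        for word in page.words:
            if word.confidence < min_confidence:
                print(f"  low confidence: {word.content!r} ({word.confidence:.2f})")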

Select page(s) for text extraction

For large multi-page PDF documents, use the pages query parameter to indicate specific page numbers or page ranges for text extraction.

Note

For the preview of Microsoft Word, Excel, PowerPoint, and HTML file support, the Read API ignores the pages parameter and extracts all pages by default.
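
With the Python SDK, page selection is a sketch like the following, where pages="1-3,5" analyzes pages 1 through 3 and page 5 only; the endpoint, key, and file name are placeholders:

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

client = DocumentAnalysisClient("<endpoint>", AzureKeyCredential("<key>"))  # placeholders

with open("large.pdf", "rb") as f:  # placeholder multi-page PDF
    poller = client.begin_analyze_document("prebuilt-read", document=f, pages="1-3,5")
result = poller.result()
print(f"analyzed {len(result.pages)} pages")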

Handwritten style for text lines (Latin languages only)

The response classifies whether each text line is in handwritten style, along with a confidence score. This feature is only supported for Latin languages. The following JSON snippet shows an example.

"styles": [
{
    "confidence": 0.95,
    "spans": [
    {
        "offset": 509,
        "length": 24
    }
    "isHandwritten": true
    ]
}
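
A sketch that uses the spans in the styles collection to pull the handwritten fragments out of the top-level content, assuming result is the AnalyzeResult from the Read model:

from azure.ai.formrecognizer import AnalyzeResult

def handwritten_fragments(result: AnalyzeResult) -> list:
    fragments = []
    for style in result.styles:
        if style.is_handwritten:
            for span in style.spans:
                fragments.append(result.content[span.offset : span.offset + span.length])
    return fragments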

Next steps

Complete a Form Recognizer quickstart:

Explore our REST API: