Form Recognizer read (OCR) model
This article applies to: Form Recognizer v3.0.
Note
For extracting text from external images like labels, street signs, and posters, use the Computer Vision v4.0 preview Read feature optimized for general, non-document images with a performance-enhanced synchronous API that makes it easier to embed OCR in your user experience scenarios.
Form Recognizer v3.0's Read Optical Character Recognition (OCR) model runs at a higher resolution than Computer Vision Read and extracts print and handwritten text from PDF documents and scanned images. It also includes preview support for extracting text from Microsoft Word, Excel, PowerPoint, and HTML documents. It detects paragraphs, text lines, words, locations, and languages. The Read model is the underlying OCR engine for other Form Recognizer prebuilt models like Layout, General Document, Invoice, Receipt, and Identity (ID) document, in addition to custom models.
What is OCR for documents?
Optical Character Recognition (OCR) for documents is optimized for large text-heavy documents in multiple file formats and global languages. It includes features like higher-resolution scanning of document images for better handling of smaller and dense text; paragraph detection; and fillable form management. OCR capabilities also include advanced scenarios like single character boxes and accurate extraction of key fields commonly found in invoices, receipts, and other prebuilt scenarios.
Read OCR supported document types
Note
- Only API Version 2022-06-30-preview supports Microsoft Word, Excel, PowerPoint, and HTML file formats in addition to all other document types supported by the GA versions.
- For the preview of Office and HTML file formats, the Read API ignores the pages parameter and extracts all pages by default. Each embedded image counts as 1 page unit, and each worksheet, slide, and page (up to 3,000 characters) counts as 1 page unit.
Model | PDF | Images | TIFF | Word | Excel | PowerPoint | HTML |
---|---|---|---|---|---|---|---|
prebuilt-read | GA (2022-08-31) | GA (2022-08-31) | GA (2022-08-31) | Preview (2022-06-30-preview) | Preview (2022-06-30-preview) | Preview (2022-06-30-preview) | Preview (2022-06-30-preview) |
Data extraction
Model | Text | Language detection |
---|---|---|
prebuilt-read | ✓ | ✓ |
Development options
Form Recognizer v3.0 supports the following resources:
Model | Resources | Model ID |
---|---|---|
Read model | | prebuilt-read |
Try OCR in Form Recognizer
Try extracting text from forms and documents using the Form Recognizer Studio. You need the following assets:
An Azure subscription—you can create one for free
A Form Recognizer instance in the Azure portal. You can use the free pricing tier (F0) to try the service. After your resource deploys, select Go to resource to get your key and endpoint.
Form Recognizer Studio
Note
Currently, Form Recognizer Studio doesn't support Microsoft Word, Excel, PowerPoint, and HTML file formats in the Read version v3.0.
Sample form processed with Form Recognizer Studio
On the Form Recognizer Studio home page, select Read.
You can analyze the sample document or select the + Add button to upload your own sample.
Select the Analyze button.
Input requirements
For best results, provide one clear photo or high-quality scan per document.
Supported file formats:
Model | PDF | Image: JPEG/JPG, PNG, BMP, and TIFF | Microsoft Office: Word (DOCX), Excel (XLS), PowerPoint (PPT), and HTML |
---|---|---|---|
Read | ✔ | ✔ | ✱ REST API version 2022-06-30-preview |
Layout | ✔ | ✔ | |
General Document | ✔ | ✔ | |
Prebuilt | ✔ | ✔ | |
Custom | ✔ | ✔ | |

✱ Microsoft Office files are currently not supported for other models or versions.
For PDF and TIFF, up to 2000 pages can be processed (with a free tier subscription, only the first two pages are processed).
The file size for analyzing documents must be less than 500 MB for paid (S0) tier and 4 MB for free (F0) tier.
Image dimensions must be between 50 x 50 pixels and 10,000 x 10,000 pixels.
PDF dimensions are up to 17 x 17 inches, corresponding to Legal or A3 paper size, or smaller.
If your PDFs are password-locked, you must remove the lock before submission.
The minimum height of the text to be extracted is 12 pixels for a 1024 x 768 pixel image. This dimension corresponds to about 8-point text at 150 dots per inch (DPI).
For custom model training, the maximum number of pages for training data is 500 for the custom template model and 50,000 for the custom neural model.
For custom extraction model training, the total size of training data is 50 MB for the template model and 1 GB for the neural model.
For custom classification model training, the total size of training data is 1 GB with a maximum of 10,000 pages.
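The size and dimension limits above can be checked on the client before a document is submitted. The following is a minimal sketch, not part of the service or its SDKs: the function name check_image is hypothetical, and it assumes that 1 MB means 1024 x 1024 bytes, which the service may count differently.

```python
# Limits taken from the input requirements listed above.
# Assumption: 1 MB = 1024 * 1024 bytes.
MAX_BYTES = {"S0": 500 * 1024 * 1024, "F0": 4 * 1024 * 1024}
MIN_SIDE_PX, MAX_SIDE_PX = 50, 10_000


def check_image(size_bytes: int, width_px: int, height_px: int, tier: str = "S0") -> list[str]:
    """Return a list of problems; an empty list means the image passes these checks."""
    problems = []
    # The file size must be strictly less than the tier limit.
    if size_bytes >= MAX_BYTES[tier]:
        problems.append(f"file must be smaller than {MAX_BYTES[tier]} bytes on tier {tier}")
    # Each side must fall within the supported pixel range.
    for name, side in (("width", width_px), ("height", height_px)):
        if not MIN_SIDE_PX <= side <= MAX_SIDE_PX:
            problems.append(f"{name} of {side} px is outside {MIN_SIDE_PX}-{MAX_SIDE_PX} px")
    return problems
```

For example, check_image(5 * 1024 * 1024, 40, 768, tier="F0") reports two problems: the file exceeds the free tier limit and the width is below 50 pixels.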
Supported languages and locales
Form Recognizer v3.0 supports several languages for the Read OCR model. See our Language Support page for a complete list of supported handwritten and printed languages.
Data detection and extraction
Microsoft Office and HTML text extraction
Use the parameter api-version=2022-06-30-preview when using the REST API or the corresponding SDKs of that API version to preview text extraction from Microsoft Word, Excel, PowerPoint, and HTML files. The following illustration shows extraction of the digital text and text from the images embedded in the Word document by running OCR on the images.
The page units in the model output are computed as shown:
File format | Computed page unit | Total pages |
---|---|---|
Word | Up to 3,000 characters = 1 page unit, Each embedded image = 1 page unit | Total pages of up to 3,000 characters each + Total embedded images |
Excel | Each worksheet = 1 page unit, Each embedded image = 1 page unit | Total worksheets + Total images |
PowerPoint | Each slide = 1 page unit, Each embedded image = 1 page unit | Total slides + Total images |
HTML | Up to 3,000 characters = 1 page unit, embedded or linked images not supported | Total pages of up to 3,000 characters each |
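The arithmetic in the table can be sketched as a small helper. This is an illustration only: the function names are hypothetical, and it assumes that a partial block of fewer than 3,000 characters rounds up to one full page unit, which the table implies but doesn't state outright.

```python
import math

CHARS_PER_PAGE_UNIT = 3000  # "up to 3,000 characters = 1 page unit"


def text_page_units(char_count: int) -> int:
    """Number of page units for plain text (assumption: partial blocks round up)."""
    return math.ceil(char_count / CHARS_PER_PAGE_UNIT)


def page_units(file_format: str, *, chars: int = 0, sheets: int = 0,
               slides: int = 0, images: int = 0) -> int:
    """Estimate total page units for an Office or HTML file, following the table above."""
    fmt = file_format.lower()
    if fmt == "word":
        return text_page_units(chars) + images
    if fmt == "excel":
        return sheets + images
    if fmt == "powerpoint":
        return slides + images
    if fmt == "html":
        # Embedded or linked images are not supported for HTML, so they add nothing.
        return text_page_units(chars)
    raise ValueError(f"unsupported file format: {file_format}")
```

Under these assumptions, a Word file with 7,000 characters and two embedded images comes to 3 + 2 = 5 page units.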
Barcode extraction
The Read OCR model extracts all identified barcodes into the barcodes collection of each entry in pages. Inside the top-level content, detected barcodes are represented as :barcode:. Each entry in the barcodes collection represents a barcode and includes the barcode type as kind and the embedded barcode content as value, along with its polygon coordinates. Initially, barcodes appear at the end of each page. Here, the confidence is hard-coded for the public preview (2023-02-28) release.
Supported barcode types
Barcode Type |
---|
QR Code |
Code 39 |
Code 128 |
UPC (UPC-A & UPC-E) |
PDF417 |
"content": ":barcode:",
"pages": [
{
"pageNumber": 1,
"barcodes": [
{
"kind": "QRCode",
"value": "http://test.com/",
"span": { ... },
"polygon": [...],
"confidence": 1
}
]
}
]
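As a rough sketch of how a client might read this structure, the following walks an already-parsed response and collects each barcode with its page number. The function name list_barcodes is hypothetical, and the sample dictionary only mirrors the fields shown in the JSON above.

```python
def list_barcodes(analyze_result: dict) -> list[tuple[int, str, str]]:
    """Collect (page number, kind, value) for every barcode in the result."""
    found = []
    for page in analyze_result.get("pages", []):
        # Pages without barcodes may omit the collection, so default to empty.
        for barcode in page.get("barcodes", []):
            found.append((page["pageNumber"], barcode["kind"], barcode["value"]))
    return found


# Sample shaped like the JSON snippet above (spans and polygons omitted).
sample = {
    "content": ":barcode:",
    "pages": [
        {
            "pageNumber": 1,
            "barcodes": [
                {"kind": "QRCode", "value": "http://test.com/", "confidence": 1}
            ],
        }
    ],
}

print(list_barcodes(sample))  # [(1, 'QRCode', 'http://test.com/')]
```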
Paragraphs extraction
The Read OCR model in Form Recognizer extracts all identified blocks of text in the paragraphs collection as a top-level object under analyzeResult. Each entry in this collection represents a text block and includes the extracted text as content and the bounding polygon coordinates. The spans information points to the text fragment within the top-level content property that contains the full text from the document.
"paragraphs": [
{
"spans": [],
"boundingRegions": [],
"content": "While healthcare is still in the early stages of its Al journey, we are seeing pharmaceutical and other life sciences organizations making major investments in Al and related technologies.\" TOM LAWRY | National Director for Al, Health and Life Sciences | Microsoft"
}
]
Language detection
The Read OCR model in Form Recognizer adds language detection as a new feature for text lines. For each text line, Read predicts the primary language and reports it, along with a confidence score, in the languages collection under analyzeResult.
"languages": [
{
"spans": [
{
"offset": 0,
"length": 131
}
],
"locale": "en",
"confidence": 0.7
}
]
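Each language entry points into the top-level content through its spans. The following sketch groups the matching text fragments by locale. The function name text_by_locale and the sample text are hypothetical, and the code assumes that span offsets and lengths can be used directly as Python string indexes. That may not hold for text outside the Basic Multilingual Plane, depending on how the service counts characters.

```python
def text_by_locale(analyze_result: dict) -> dict[str, list[str]]:
    """Group the text fragments referenced by language spans under their locale."""
    content = analyze_result["content"]
    grouped: dict[str, list[str]] = {}
    for language in analyze_result.get("languages", []):
        for span in language["spans"]:
            start = span["offset"]
            # Assumption: offsets and lengths index Python string characters.
            fragment = content[start:start + span["length"]]
            grouped.setdefault(language["locale"], []).append(fragment)
    return grouped


# Hypothetical result with two detected languages.
sample = {
    "content": "Hello world. Bonjour le monde.",
    "languages": [
        {"spans": [{"offset": 0, "length": 12}], "locale": "en", "confidence": 0.9},
        {"spans": [{"offset": 13, "length": 17}], "locale": "fr", "confidence": 0.8},
    ],
}

print(text_by_locale(sample))
```

The same slicing applies to the spans of paragraphs, lines, and words, since they all refer to the same content string.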
Extract pages from documents
The page units in the model output are computed as shown:
File format | Computed page unit | Total pages |
---|---|---|
Images | Each image = 1 page unit | Total images |
PDF | Each page in the PDF = 1 page unit | Total pages in the PDF |
TIFF | Each image in the TIFF = 1 page unit | Total images in the TIFF |
"pages": [
{
"pageNumber": 1,
"angle": 0,
"width": 915,
"height": 1190,
"unit": "pixel",
"words": [],
"lines": [],
"spans": [],
"kind": "document"
}
]
Extract text lines and words
The Read OCR model extracts print and handwritten style text as lines and words. The model outputs bounding polygon coordinates and confidence for the extracted words. The styles collection includes any handwritten style for lines, if detected, along with the spans pointing to the associated text. This feature applies to supported handwritten languages.
For the preview of Microsoft Word, Excel, PowerPoint, and HTML file support, Read extracts all embedded text as is. For any embedded images, it runs OCR on the images to extract text and appends the text from each image as an added entry to the pages collection. These added entries include the extracted text lines and words, their bounding polygons, confidences, and the spans pointing to the associated text.
"words": [
{
"content": "While",
"polygon": [],
"confidence": 0.997,
"span": {}
}
],
"lines": [
{
"content": "While healthcare is still in the early stages of its Al journey, we",
"polygon": [],
"spans": []
}
]
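Because each word carries its own confidence, a client can flag words that may need human review. The following is a minimal sketch under stated assumptions: the function name low_confidence_words, the 0.9 threshold, and the second sample word are all hypothetical rather than service recommendations.

```python
def low_confidence_words(page: dict, threshold: float = 0.9) -> list[str]:
    """Return the content of words whose confidence falls below the threshold."""
    return [
        word["content"]
        for word in page.get("words", [])
        if word["confidence"] < threshold
    ]


# Sample page shaped like the JSON above (polygons and spans omitted).
sample_page = {
    "pageNumber": 1,
    "words": [
        {"content": "While", "confidence": 0.997},
        {"content": "healthcare", "confidence": 0.62},  # hypothetical low score
    ],
}

print(low_confidence_words(sample_page))  # ['healthcare']
```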
Select page(s) for text extraction
For large multi-page PDF documents, use the pages query parameter to indicate specific page numbers or page ranges for text extraction.
Note
For the preview of Microsoft Word, Excel, PowerPoint, and HTML file support, the Read API ignores the pages parameter and extracts all pages by default.
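For PDF input, the pages value travels as a query parameter on the analyze request. The following sketch only builds the request URL and sends nothing. The helper name build_analyze_url and the example endpoint are hypothetical, and the path and the range syntax such as 1-3,5 are assumptions to verify against the REST API reference for your API version.

```python
from typing import Optional
from urllib.parse import urlencode


def build_analyze_url(endpoint: str, pages: Optional[str] = None,
                      api_version: str = "2022-08-31") -> str:
    """Build an analyze URL for the prebuilt-read model (assumed path layout)."""
    query = {"api-version": api_version}
    if pages:
        # Assumed format: comma-separated page numbers and ranges, such as "1-3,5".
        query["pages"] = pages
    base = endpoint.rstrip("/")
    # Keep commas readable instead of percent-encoding them.
    return f"{base}/formrecognizer/documentModels/prebuilt-read:analyze?{urlencode(query, safe=',')}"


url = build_analyze_url("https://example.cognitiveservices.azure.com/", pages="1-3,5")
print(url)
```

Leaving pages unset omits the parameter, which requests every page of the document.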
Handwritten style for text lines (Latin languages only)
The response classifies whether each text line is in a handwritten style, along with a confidence score. This feature is only supported for Latin languages. The following JSON snippet shows an example.
"styles": [
{
"confidence": 0.95,
"spans": [
{
"offset": 509,
"length": 24
}
],
"isHandwritten": true
}
]
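To recover the handwritten text itself, a client can follow each style's spans back into the top-level content. The following is a sketch with hypothetical names and sample data. It assumes isHandwritten is a property of each style entry and that span offsets can be used as Python string indexes.

```python
def handwritten_fragments(analyze_result: dict, min_confidence: float = 0.5) -> list[str]:
    """Return the text fragments that the styles collection marks as handwritten."""
    content = analyze_result["content"]
    fragments = []
    for style in analyze_result.get("styles", []):
        # Skip styles that are not handwritten or fall below the chosen confidence.
        if not style.get("isHandwritten") or style.get("confidence", 0) < min_confidence:
            continue
        for span in style["spans"]:
            start = span["offset"]
            fragments.append(content[start:start + span["length"]])
    return fragments


# Hypothetical result: the last 14 characters are handwritten.
sample = {
    "content": "Printed header. Signed by hand",
    "styles": [
        {"confidence": 0.95, "spans": [{"offset": 16, "length": 14}], "isHandwritten": True}
    ],
}

print(handwritten_fragments(sample))  # ['Signed by hand']
```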
Next steps
Complete a Form Recognizer quickstart:
Explore our REST API: