Learn about optical character recognition in Microsoft Purview

Optical character recognition (OCR) scanning enables Microsoft Purview to scan content in images for sensitive information. OCR scanning is an optional feature that is first enabled at the tenant level. Once it's enabled, you select the locations where you want to scan images. Image scanning is available for Exchange, SharePoint, OneDrive, Teams, and Windows devices. Once the OCR settings are configured, your existing policies for data loss prevention (DLP), records management, and insider risk management (IRM) are applied to images as well as text-based content. For example, say that you've configured the DLP condition "Content contains sensitive information" and included a data classifier such as the "Credit Card" sensitive information type (SIT). In this case, Microsoft Purview scans for credit card numbers in both text and images at all of the chosen locations.

Workflow at a glance

Phase | What's needed
Phase 1: Create Azure subscription if needed | If your organization doesn't already have an Azure pay-as-you-go subscription for your tenant, your Global admin needs to start by creating an Azure account.
Phase 2: Set up pay-as-you-go billing to enable OCR | Your Global or SharePoint admin must follow the instructions in Set up Microsoft Syntex billing in Azure to add a subscription for OCR.
Phase 3: Configure OCR scanning settings | The Compliance admin for your organization configures the OCR settings for your tenant.

Phase 1: Prerequisites

To use OCR scanning, your organization's Global admin needs to verify that an Azure pay-as-you-go subscription is in place. If not, they need to set one up, following the instructions in Create your initial Azure subscriptions.

Phase 2: Configure billing

When you enable OCR, all sensitive information types and trainable classifiers can detect characters that are in images.

Because it's an optional feature, your Global admin must set up pay-as-you-go billing to enable OCR. Refer to the instructions in Set up Microsoft Syntex billing in Azure to add a subscription for OCR.

Note

Once billing information is entered in Microsoft Syntex, your Compliance admin can configure OCR in Microsoft Purview, without any additional setup or licensing requirements.

You can find OCR pay-as-you-go pricing information on the Set up Microsoft Syntex billing in Azure page.

Charges

The charge for using OCR is $1.00 for every 1,000 items scanned. Each image scanned counts as one transaction. This means that stand-alone images (JPEG, JPG, PNG, BMP, or TIFF) each count as a single transaction. It also means that each page in a PDF file is charged separately. For example, if there are 10 pages in a PDF file, an OCR scan of the PDF file counts as 10 separate scans.
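
The counting rule above can be sketched as a small calculation. This is an illustrative sketch, not an official pricing tool: the function names are hypothetical, and pro-rating the $1.00 per 1,000 items rate linearly for partial thousands is an assumption.

```python
from decimal import Decimal

# Rate stated in this article: $1.00 for every 1,000 items scanned.
PRICE_PER_1000_ITEMS = Decimal("1.00")


def ocr_transactions(standalone_images: int, pdf_page_counts: list[int]) -> int:
    """Count transactions: one per stand-alone image, one per PDF page."""
    return standalone_images + sum(pdf_page_counts)


def estimated_charge(transactions: int) -> Decimal:
    """Estimate the charge in dollars.

    Assumption: the rate is pro-rated linearly for partial thousands.
    """
    charge = PRICE_PER_1000_ITEMS * transactions / 1000
    return charge.quantize(Decimal("0.01"))


# 2,500 stand-alone images plus one 10-page PDF.
total = ocr_transactions(2500, [10])
print(total, estimated_charge(total))
```

In this example, the 10-page PDF contributes 10 transactions, so the total is 2,510 transactions.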

Note

To reduce your OCR costs, charges for scanning each unique image are incurred only once.

Small images, such as logos and signatures, that are sent in email via Microsoft Exchange are scanned and billed only once per unique image across all users of the tenant. For all subsequent instances, the results of the previous scan are reused.

Additionally, each scanned image can be used in any number of policies across data loss prevention, insider risk management, auto-labeling, and records management at no additional charge.
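
To illustrate the effect of billing each unique image only once, the sketch below counts billable scans in a batch of images. How Microsoft Purview decides that two images are the same is not described here, so treating byte-identical content (compared by SHA-256 hash) as "the same image" is purely an assumption, as is the function name.

```python
import hashlib


def billable_scans(images: list[bytes]) -> int:
    """Count scans that would be billed if each unique image is billed once.

    Assumption: two images are "the same" when their bytes are identical.
    """
    seen: set[str] = set()
    billable = 0
    for data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            # First time this content is seen: it is scanned and billed.
            seen.add(digest)
            billable += 1
        # Otherwise the earlier scan result is reused at no charge.
    return billable


# The same logo attached to three emails, plus one signature image.
print(billable_scans([b"logo", b"logo", b"signature", b"logo"]))
```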

Important

For information about the Adobe requirements for using Microsoft Purview Data Loss Prevention (DLP) features with PDF files, see this article from Adobe: Microsoft Purview Information Protection Support in Acrobat.

To view your bill, follow the instructions described in Monitor your Microsoft Syntex pay-as-you-go usage.

Estimate your bill

When you first start using OCR, limit usage to just a few people and applicable workloads. After a short while, you can view your bill in Azure and see the usage statistics and charges for each day. From there, you can extrapolate the costs for your full set of users. In addition, you can use the "workload" tag in Azure cost management to see the breakdown of usage per workload.
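
One simple way to do that extrapolation is sketched below. It assumes that cost scales linearly with the number of users and that the pilot group is representative, which may not hold in practice; the function and parameter names are hypothetical, and the daily charges are made-up sample values rather than real billing data.

```python
def extrapolate_monthly_cost(
    pilot_daily_charges: list[float],
    pilot_users: int,
    total_users: int,
    days_in_month: int = 30,
) -> float:
    """Scale the average daily pilot charge to the full user population.

    Assumption: every user generates roughly the same OCR volume.
    """
    if not pilot_daily_charges or pilot_users <= 0:
        raise ValueError("need at least one daily charge and one pilot user")
    average_daily = sum(pilot_daily_charges) / len(pilot_daily_charges)
    per_user_daily = average_daily / pilot_users
    return round(per_user_daily * total_users * days_in_month, 2)


# Sample values: three days of charges (in dollars) for a 20-user pilot,
# extrapolated to 2,000 users over a 30-day month.
print(extrapolate_monthly_cost([0.40, 0.60, 0.50], pilot_users=20, total_users=2000))
```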

Phase 3: Configure your OCR settings

  1. In the Microsoft Purview compliance portal, go to Settings.
  2. Select Optical character recognition (OCR) to enter your OCR configuration settings.
  3. Select the locations where you wish to scan images.
  4. Select the distribution groups that you want included or excluded from OCR scans.
  5. Choose Done.

Supported locations and solutions are listed in the table below.

Permissions

The account you use to create and deploy policies must be a member of one of these role groups:

  • Compliance administrator
  • Compliance data administrator
  • Global administrator
  • Information Protection
  • Information Protection Admin

Supported Locations and Solutions

Location | Supported solutions
Exchange | Data loss prevention; Information protection: Auto-labeling policies; Records management: Auto-apply retention label policies (1)
SharePoint sites | Data loss prevention; Insider risk management (2); Records management: Auto-apply retention label policies (1)
OneDrive accounts | Data loss prevention; Records management: Auto-apply retention label policies (1)
Teams chat and channel messages | Data loss prevention; Insider risk management (2)
Devices | Data loss prevention; Insider risk management (2)

(1) Supports keywords and sensitive information types.
(2) Considers sensitive information types and trainable classifiers present in images for risk scoring.
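
For planning purposes, the location and solution pairs listed above can be captured as a simple lookup. This is only a convenience sketch of the table in this article, not an API exposed by Microsoft Purview; the names `SUPPORTED_SOLUTIONS` and `locations_for` are hypothetical.

```python
# Location -> solutions that can act on OCR results, as listed in this article.
SUPPORTED_SOLUTIONS: dict[str, tuple[str, ...]] = {
    "Exchange": (
        "Data loss prevention",
        "Information protection: Auto-labeling policies",
        "Records management: Auto-apply retention label policies",
    ),
    "SharePoint sites": (
        "Data loss prevention",
        "Insider risk management",
        "Records management: Auto-apply retention label policies",
    ),
    "OneDrive accounts": (
        "Data loss prevention",
        "Records management: Auto-apply retention label policies",
    ),
    "Teams chat and channel messages": (
        "Data loss prevention",
        "Insider risk management",
    ),
    "Devices": (
        "Data loss prevention",
        "Insider risk management",
    ),
}


def locations_for(solution_prefix: str) -> list[str]:
    """Return the locations where a solution (matched by prefix) is supported."""
    return [
        location
        for location, solutions in SUPPORTED_SOLUTIONS.items()
        if any(s.startswith(solution_prefix) for s in solutions)
    ]


print(locations_for("Insider risk management"))
```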


What file types are supported?

This functionality supports scanning images in the following file types, with the noted requirements:

Supported file types: JPEG, JPG, PNG, BMP, TIFF, and PDF (image only)

Image requirements:

  • File sizes: Image files must be no larger than 20 MB for Exchange and Teams. For SharePoint, OneDrive, and Windows endpoints, the maximum image file size is 50 MB.
  • Image resolution: Image resolution must be at least 50 x 50 pixels and not larger than 16,000 x 16,000 pixels.
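
The requirements above can be expressed as a pre-check that you might run over a file inventory before estimating OCR volume. This is a rough sketch with several assumptions: the function and constant names are hypothetical, 1 MB is taken as 1,048,576 bytes, the resolution limits are applied to each side independently, only the extensions named above are accepted, and the check cannot tell whether a PDF is image-only.

```python
from pathlib import PurePath

MB = 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes
SUPPORTED_EXTENSIONS = {".jpeg", ".jpg", ".png", ".bmp", ".tiff", ".pdf"}
MAX_SIZE_MB = {
    "Exchange": 20,
    "Teams": 20,
    "SharePoint": 50,
    "OneDrive": 50,
    "Devices": 50,
}
MIN_SIDE, MAX_SIDE = 50, 16_000  # pixels


def ocr_eligibility_problems(
    filename: str, size_bytes: int, width: int, height: int, location: str
) -> list[str]:
    """Return the reasons a file falls outside the stated limits (empty if none)."""
    problems = []
    if PurePath(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported file type")
    if size_bytes > MAX_SIZE_MB[location] * MB:
        problems.append("file too large for location")
    if width < MIN_SIDE or height < MIN_SIDE:
        problems.append("resolution below 50 x 50")
    if width > MAX_SIDE or height > MAX_SIDE:
        problems.append("resolution above 16,000 x 16,000")
    return problems


# A 30 MB PNG is too large for Exchange but within the SharePoint limit.
print(ocr_eligibility_problems("scan.PNG", 30 * MB, 800, 600, "Exchange"))
print(ocr_eligibility_problems("scan.PNG", 30 * MB, 800, 600, "SharePoint"))
```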

Important

  • Only images uploaded after OCR has been enabled are scanned.
  • Both incoming email (email from users outside the organization) and outgoing email (email sent from users inside the organization) are subject to OCR scanning. To restrict OCR scans to outgoing emails only, change the OCR settings from the default scope of All distribution groups to the specific distribution group(s) and specify the internal distribution groups that you want OCR to scan. For information on changing this configuration, see Phase 3: Configure your OCR settings.
  • Data loss prevention policy tips are not supported for images in Exchange.
  • If you exclude a path in the endpoint data loss prevention settings, OCR will not scan images in those folders.
  • When OCR is turned on for Windows devices, the devices start sending messages to the cloud for scanning. The default bandwidth limit is 1024 MB of data per device per day. OCR stops scanning images once this daily limit is reached. If you want to continue scanning images, you can increase the bandwidth limit.
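
To get a feel for the per-device daily bandwidth limit, the following sketch estimates how many images from a day's queue would be sent before the limit is reached. The exact cutoff behavior is not documented here, so stopping at the first image that would push the total over the limit is an assumption, and the function name is hypothetical.

```python
DEFAULT_DAILY_LIMIT_MB = 1024  # default limit per device per day, from this article


def images_sent_before_limit(
    image_sizes_mb: list[float], daily_limit_mb: float = DEFAULT_DAILY_LIMIT_MB
) -> int:
    """Count how many images, taken in order, fit within the daily limit.

    Assumption: scanning stops at the first image that would exceed the limit.
    """
    used = 0.0
    sent = 0
    for size in image_sizes_mb:
        if used + size > daily_limit_mb:
            break
        used += size
        sent += 1
    return sent


# Three 400 MB uploads: the third would exceed the default 1024 MB limit.
print(images_sent_before_limit([400, 400, 400]))
# Raising the limit to 2048 MB lets all three through.
print(images_sent_before_limit([400, 400, 400], daily_limit_mb=2048))
```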

What languages are supported?

OCR scanning supports more than 150 languages.
