Identify sensitive data with optical character recognition (preview)

Optical character recognition (OCR) scanning enables Microsoft Purview to scan content in images for sensitive information. The feature is multilingual and supports more than 150 languages. OCR is optional: you first turn it on at the tenant level, and then choose the locations where you want to scan images. Image scanning is available for Exchange, SharePoint, OneDrive, Teams, and Windows devices.

Once OCR is set up, your existing policies for data loss prevention (DLP), records management, and insider risk management (IRM) apply to images as well as text-based content. For example, suppose you've configured the DLP condition "content contains sensitive information" and included a data classifier such as the "Credit Card" sensitive information type (SIT). In that case, Microsoft Purview scans for credit card numbers in both text and images at all chosen locations.
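
To make the credit card example more concrete, the following Python sketch shows how a card-number pattern could be matched against text after it has been extracted, whether from a document body or from an image by OCR. It is an illustration only: the regular expression, the Luhn checksum step, and the function names are assumptions made for this example and are not a description of how the Microsoft Purview classifier works.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit-only candidates found in text that pass the Luhn check."""
    found = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found


# The same function applies to body text and to text recovered from an image.
ocr_text = "Card: 4111 1111 1111 1111 exp 12/27"
print(find_card_numbers(ocr_text))
```

The point of the sketch is that detection logic does not need to change once OCR has turned an image into text, which is why existing policies can cover both kinds of content.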

Workflow at a glance

Phase | What's needed
Phase 1: Create an Azure subscription if needed | If your organization doesn't already have an Azure pay-as-you-go subscription for your tenant, your Global admin needs to start by creating an Azure account.
Phase 2: Set up pay-as-you-go billing to enable OCR | Your Global or SharePoint admin must follow the instructions in Set up Microsoft Syntex billing in Azure to add a subscription for OCR.
Phase 3: Configure OCR scanning settings | The Compliance admin for your organization configures the OCR settings for your tenant.

Phase 1: Prerequisites

To use OCR scanning, your organization's Global admin needs to verify that an Azure pay-as-you-go subscription is in place. If there isn't one, they need to set it up by following the instructions in Create your initial Azure subscriptions.

Phase 2: Configure billing

When you enable OCR, all sensitive information types and trainable classifiers can detect characters that are in images.

Because it's an optional feature, your Global admin must set up pay-as-you-go billing to enable OCR. For the setup instructions, see Set up Microsoft Syntex billing in Azure.

Phase 3: Configure your OCR settings

  1. In the Microsoft Purview compliance portal, go to Settings.
  2. Select Optical character recognition (OCR) (preview) to enter your OCR configuration settings.
  3. Select the locations where you want to scan images. Then, for each location and solution, define the OCR scope (users, groups, or sites). Supported locations and solutions are discussed later in this unit.

OCR settings generally take effect about an hour after being turned on.

Permissions

The account you use to create and deploy policies must be a member of one of these role groups:

  • Compliance administrator
  • Compliance data administrator
  • Information Protection
  • Information Protection Admin

Supported locations and solutions

Location | Supported solutions
Exchange | Data loss prevention (1); Information protection: Auto-labeling policies (1); Records management: Auto-apply retention label policies (2)
SharePoint sites | Data loss prevention; Insider risk management (3); Records management: Auto-apply retention label policies (2)
OneDrive accounts | Data loss prevention; Records management: Auto-apply retention label policies (2)
Teams chat and channel messages | Data loss prevention; Insider risk management (3)
Devices | Data loss prevention; Insider risk management (3)

(1) Supports outgoing emails only.
(2) Supports keywords and sensitive information types.
(3) Considers sensitive information types and trainable classifiers present in images for risk scoring.
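
When planning a rollout, it can help to treat the table above as a lookup. The Python sketch below transcribes the location and solution pairs into a dictionary and adds two small helper functions. The constant and function names are hypothetical, and the footnoted restrictions (for example, outgoing emails only for Exchange) are not modeled.

```python
# Solution names as they appear in the table above.
DLP = "Data loss prevention"
AUTO_LABELING = "Information protection: Auto-labeling policies"
RETENTION = "Records management: Auto-apply retention label policies"
IRM = "Insider risk management"

# Location -> solutions that can act on OCR results at that location.
OCR_SUPPORT = {
    "Exchange": {DLP, AUTO_LABELING, RETENTION},
    "SharePoint sites": {DLP, IRM, RETENTION},
    "OneDrive accounts": {DLP, RETENTION},
    "Teams chat and channel messages": {DLP, IRM},
    "Devices": {DLP, IRM},
}


def is_supported(location: str, solution: str) -> bool:
    """Return True if the table lists the solution for the location."""
    return solution in OCR_SUPPORT.get(location, set())


def locations_for(solution: str) -> list[str]:
    """Return the locations (sorted by name) where the solution is listed."""
    return sorted(loc for loc, sols in OCR_SUPPORT.items() if solution in sols)


print(is_supported("OneDrive accounts", IRM))
print(locations_for(IRM))
```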

Supported file types

This functionality supports scanning images in the following file types, with the noted requirements:

Supported file types: JPEG, JPG, PNG, BMP, TIFF, and PDF (image only)

Image requirements:

  • File sizes: Image files must be no larger than 20 MB for Exchange and Teams. For SharePoint, OneDrive, and Windows endpoints, the maximum image file size is 50 MB.
  • Image resolution: Image resolution must be at least 50 x 50 pixels and no larger than 16,000 x 16,000 pixels.
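
The requirements above can be expressed as a simple pre-check, for example to estimate which images in an inventory are likely to be scanned. The following Python sketch is a minimal illustration under stated assumptions: the names are hypothetical, 1 MB is taken as 1,048,576 bytes, the resolution rule is read as applying to each side of the image, and a PDF is not inspected to confirm that it is image only. It also does not model the location-specific restrictions listed under Limitations.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes

SUPPORTED_EXTENSIONS = {"jpeg", "jpg", "png", "bmp", "tiff", "pdf"}

# Maximum image file size per location, in MB, from the requirements above.
MAX_SIZE_MB = {
    "Exchange": 20,
    "Teams": 20,
    "SharePoint": 50,
    "OneDrive": 50,
    "Windows endpoints": 50,
}

MIN_SIDE_PX = 50
MAX_SIDE_PX = 16_000


@dataclass
class ImageInfo:
    extension: str
    size_bytes: int
    width: int
    height: int


def ocr_ineligibility_reasons(image: ImageInfo, location: str) -> list[str]:
    """Return the reasons an image fails the stated requirements (empty if none).

    Raises KeyError if the location is not one of the keys in MAX_SIZE_MB.
    """
    reasons = []
    if image.extension.lower().lstrip(".") not in SUPPORTED_EXTENSIONS:
        reasons.append("unsupported file type")
    if image.size_bytes > MAX_SIZE_MB[location] * MB:
        reasons.append("file too large")
    shortest = min(image.width, image.height)
    longest = max(image.width, image.height)
    if shortest < MIN_SIDE_PX or longest > MAX_SIDE_PX:
        reasons.append("resolution out of range")
    return reasons


scan = ImageInfo(".jpg", 30 * MB, 800, 600)
print(ocr_ineligibility_reasons(scan, "Teams"))
print(ocr_ineligibility_reasons(scan, "SharePoint"))
```

The same 30 MB image fails the check for Teams but passes for SharePoint, which reflects the different size limits per location.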

Limitations

  • Only images with machine-typed text are supported.
  • Only images uploaded after OCR has been enabled are scanned.
  • Only stand-alone images are scanned.
  • SharePoint and OneDrive support only the following file types: JPEG, JPG, PNG, and BMP.
  • Data loss prevention policy tips aren't supported for images in Exchange.
  • Scanning images in compressed/archive files isn't supported.
  • If you exclude a path in the endpoint data loss prevention settings, OCR doesn't scan images in those folders.
  • When OCR is turned on for Windows devices, the devices start sending images to the cloud for scanning. The default bandwidth limit is 1024 MB of data per device per day. OCR stops scanning images once this daily limit is reached. If you want to continue scanning images, you can increase the bandwidth limit.
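
To get a feel for the daily bandwidth limit on Windows devices, the Python sketch below estimates how many images a device could send before reaching it. This is back-of-the-envelope arithmetic, not a description of how Microsoft Purview meters traffic: the function name is hypothetical, and the sketch assumes an image is sent only if it fits entirely within the remaining daily budget.

```python
DEFAULT_DAILY_LIMIT_MB = 1024  # default per-device daily limit stated above


def images_sent_before_limit(image_sizes_mb, daily_limit_mb=DEFAULT_DAILY_LIMIT_MB):
    """Return (count, used_mb) for the images sent, in order, within the limit.

    Assumption: scanning stops at the first image that would push the
    cumulative total over the daily limit.
    """
    used_mb = 0.0
    count = 0
    for size_mb in image_sizes_mb:
        if used_mb + size_mb > daily_limit_mb:
            break
        used_mb += size_mb
        count += 1
    return count, used_mb


# Example: a device that produces 25 images of 50 MB each in one day.
print(images_sent_before_limit([50] * 25))
# Example: the same workload after raising the limit to 2048 MB.
print(images_sent_before_limit([50] * 25, daily_limit_mb=2048))
```

Under these assumptions, a device producing many large images reaches the default limit quickly, which is the situation where increasing the bandwidth limit is worth considering.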

Help protect sensitive images with Microsoft Purview interactive guide

Use this interactive guide to learn how to help protect sensitive images with Microsoft Purview. In this guide, you learn how to configure OCR to detect sensitive information in images.
