A Microsoft Purview trainable classifier is a tool you can train to recognize various types of content by giving it samples to look at. After you train it, you can use it to identify items for application of Office sensitivity labels, Communication compliance policies, and retention label policies.
Two steps are required for implementing a custom trainable classifier:
Provide two sets of sample data (selected by humans).
A set that contains only items that belong in the category.
A set that contains only items that do not belong in the category.
Test the classifier's ability to detect matches.
This article explains how to create and test a custom classifier.
Prerequisites
Licensing requirements
Classifiers are a Microsoft 365 E5 or E5 Compliance feature. You must have one of these subscriptions to use them.
Permissions
To use classifiers in the following scenarios, you need the following permissions:
Scenario: Required role permissions
Retention label policy: Record Management, Retention Management
Sensitivity label policy: Security Administrator, Compliance Administrator, Compliance Data Administrator
To ensure that your trainable classifier can independently and accurately identify that an item belongs to a particular category of content, you must present it with many samples of the type of content that is in the category. This feeding of samples to the trainable classifier is known as seeding. A human must be the one to select seed content, and that content must include two sets of data: one that contains only items that strongly represent the content the classifier is designed to detect (positive samples) and a second set of items that clearly don't belong (negative samples).
Training a classifier requires at least 50 positive samples (up to 500) and at least 150 negative samples (up to 1500). The more samples you provide, the more accurate the classifier's predictions will be. The trainable classifier processes up to the 2000 most recently created samples (by file created date/time stamp).
Tip
For best results, have at least 200 items in your test sample set, including at least 50 positive examples and at least 150 negative examples.
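The sample limits above can be checked before any content is uploaded. The following is a minimal Python sketch, not part of Microsoft Purview: the `Sample` record, `check_seed_counts`, and `most_recent` are hypothetical names introduced here for illustration, and the sketch assumes you have already listed your seed files with their created timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Limits taken from the text above.
MIN_POSITIVE, MAX_POSITIVE = 50, 500
MIN_NEGATIVE, MAX_NEGATIVE = 150, 1500
MAX_PROCESSED = 2000  # only the most recently created samples are processed


@dataclass
class Sample:
    """Hypothetical record for one seed file; not a Purview type."""
    name: str
    created: datetime
    positive: bool


def check_seed_counts(samples):
    """Return a list of problems with the seed set (empty if there are none)."""
    positives = sum(1 for s in samples if s.positive)
    negatives = len(samples) - positives
    problems = []
    if positives < MIN_POSITIVE:
        problems.append(f"need at least {MIN_POSITIVE} positive samples, have {positives}")
    if positives > MAX_POSITIVE:
        problems.append(f"at most {MAX_POSITIVE} positive samples allowed, have {positives}")
    if negatives < MIN_NEGATIVE:
        problems.append(f"need at least {MIN_NEGATIVE} negative samples, have {negatives}")
    if negatives > MAX_NEGATIVE:
        problems.append(f"at most {MAX_NEGATIVE} negative samples allowed, have {negatives}")
    return problems


def most_recent(samples, limit=MAX_PROCESSED):
    """Keep the `limit` most recently created samples, newest first."""
    return sorted(samples, key=lambda s: s.created, reverse=True)[:limit]


# Example: 40 positive and 150 negative samples with made-up timestamps.
base = datetime(2024, 1, 1)
seed = [Sample(f"pos-{i}", base + timedelta(minutes=i), True) for i in range(40)]
seed += [Sample(f"neg-{i}", base + timedelta(minutes=100 + i), False) for i in range(150)]
problems = check_seed_counts(seed)
```

In this example the positive set is too small, so `problems` contains one message. `most_recent` mirrors the created-date cutoff described above only approximately; how Purview breaks ties between equal timestamps is not stated in this article.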
How to create a trainable classifier
In preview: The following process automates the testing of trainable classifiers and shortens the creation workflow from 12 days to two days. (In some cases, the process can take only a few hours.)
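Collect a first set of seed content (from 50 to 500 items) that strongly represents the content you want the classifier to detect.
Collect a second set of seed content (from 150 to 1500 items) that represents data that doesn't belong in the category.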
Place the positive and negative seed content in separate SharePoint folders. Each folder must be dedicated to holding only the seed content. Make note of the site, library, and folder URL for each set.
Tip
If you create a new SharePoint site and folder for your seed data, allow at least an hour for that location to be indexed before creating the trainable classifier that will use that seed data.
Add the source of your positive examples: select the SharePoint site, library, and folder URL for the seed content that should be detected by the classifier and then choose Next.
Add the source of your negative examples: select the SharePoint site, library, and folder URL for the seed content that should be ignored by the classifier and then choose Next.
Review the settings and choose Create trainable classifier.
Within 24 hours, the trainable classifier processes the seed data and builds a prediction model. The classifier status is In progress while it processes the seed data. When the classifier finishes processing the seed data, the status changes to Training is complete and items have been tested.
Once training is complete and items have been (automatically) tested, publish the classifier by choosing Publish for use.
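The folder requirements in the steps above (separate, dedicated folders and an indexing wait for new locations) can be sanity-checked before you start the wizard. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, the `contoso.sharepoint.com` URLs are placeholders, and treating a nested folder as "not separate" is this sketch's interpretation of "dedicated", not a documented Purview rule.

```python
from datetime import datetime, timedelta
from urllib.parse import urlsplit

INDEXING_WAIT = timedelta(hours=1)  # "at least an hour" per the tip above


def _folder_key(url):
    """Normalize a folder URL to (host, path segments) for comparison."""
    parts = urlsplit(url)
    segments = tuple(seg.lower() for seg in parts.path.split("/") if seg)
    return parts.netloc.lower(), segments


def folders_are_separate(positive_url, negative_url):
    """True if neither folder is the same as, or nested inside, the other."""
    host_a, path_a = _folder_key(positive_url)
    host_b, path_b = _folder_key(negative_url)
    if host_a != host_b:
        return True
    # Compare the common-length prefix; equal prefixes mean same or nested.
    shorter = min(len(path_a), len(path_b))
    return path_a[:shorter] != path_b[:shorter]


def earliest_creation_time(folder_created_at):
    """Earliest time to create the classifier for a newly created folder."""
    return folder_created_at + INDEXING_WAIT


# Placeholder URLs for illustration only.
POSITIVE = "https://contoso.sharepoint.com/sites/seed/Shared%20Documents/positive"
NEGATIVE = "https://contoso.sharepoint.com/sites/seed/Shared%20Documents/negative"
separate = folders_are_separate(POSITIVE, NEGATIVE)
ready_at = earliest_creation_time(datetime(2024, 1, 1, 9, 0))
```

The one-hour value is the minimum the tip gives; indexing can take longer, so `ready_at` is a lower bound rather than a guarantee.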
After the trainable classifier has processed enough positive and negative samples to build a prediction model, you need to test the predictions it makes. Once all of the data is processed, go through the results manually and mark each prediction as correct, incorrect, or not sure. Microsoft uses this feedback in aggregate to improve the prediction model.
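If you keep your own record of the manual review, a simple tally shows how the classifier is doing. The following Python sketch is illustrative only: `summarize_review` is a hypothetical helper, the verdict strings are labels chosen here, and the accuracy figure (correct predictions divided by predictions you were sure about) is an assumption of this sketch, not the metric Purview reports.

```python
from collections import Counter

VERDICTS = ("correct", "incorrect", "not sure")


def summarize_review(verdicts):
    """Tally manual review verdicts and compute accuracy over decided items."""
    counts = Counter(verdicts)
    unknown = set(counts) - set(VERDICTS)
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    # "not sure" items are excluded from the accuracy denominator.
    decided = counts["correct"] + counts["incorrect"]
    accuracy = counts["correct"] / decided if decided else None
    return {
        "correct": counts["correct"],
        "incorrect": counts["incorrect"],
        "not sure": counts["not sure"],
        "accuracy": accuracy,
    }


# Example review of 13 predictions.
review = ["correct"] * 8 + ["incorrect"] * 2 + ["not sure"] * 3
summary = summarize_review(review)
```

Leaving "not sure" items out of the denominator keeps uncertain reviews from counting for or against the classifier; a large "not sure" count is itself a signal that the seed content may not represent the category clearly.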