Get started with trainable classifiers

A Microsoft Purview trainable classifier is a tool you can train to recognize various types of content by giving it samples to look at. Once trained, you can use it to identify items for the application of Office sensitivity labels, communication compliance policies, and retention label policies.

Two steps are required for implementing a custom trainable classifier:

  1. Provide two sets of sample data (selected by humans).
    1. A set that contains only items that belong in the category.
    2. A set that contains only items that do not belong in the category.
  2. Test the classifier's ability to detect matches.

This article explains how to create and test a custom classifier.

To learn more about the different types of classifiers, see Learn about trainable classifiers.

Tip

If you're not an E5 customer, use the 90-day Microsoft Purview solutions trial to explore how additional Purview capabilities can help your organization manage data security and compliance needs. Start now at the Microsoft Purview compliance portal trials hub. Learn details about signing up and trial terms.

Prerequisites

Licensing requirements

Classifiers are a feature in Microsoft 365 E3 and E5 Compliance. You must have one of these subscriptions to make use of them.

Permissions

To use classifiers in the following scenarios, you need these role permissions:

| Scenario | Required role permissions |
| --- | --- |
| Retention label policy | Record Management, Retention Management |
| Sensitivity label policy | Security Administrator, Compliance Administrator, Compliance Data Administrator |
| Communication compliance policy | Insider Risk Management Administrator, Supervisory Review Administrator |

Important

By default, only the user who creates a custom classifier can train and review predictions made by that classifier.

Prepare for a custom trainable classifier

It's helpful to understand what's involved in creating a custom trainable classifier before you dive in.

Overall workflow

To understand more about the overall workflow of creating custom trainable classifiers, see the process flow for creating custom trainable classifiers.

Seed content

To ensure that your trainable classifier can independently and accurately identify that an item belongs to a particular category of content, you must present it with many samples of the type of content that is in the category. This feeding of samples to the trainable classifier is known as seeding. A human must be the one to select seed content, and that content must include two sets of data: one that contains only items that strongly represent the content the classifier is designed to detect (positive samples) and a second set of items that clearly don't belong (negative samples).

At least 50 positive samples (up to 500) and at least 150 negative samples (up to 1500) are required to train a classifier. The more samples you provide, the more accurate the classifier's predictions will be. The trainable classifier processes up to the 2000 most recently created samples (by file created date/time stamp).
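If you stage your seed files locally before uploading them to SharePoint, a small script can confirm that each set falls within these limits. The following Python sketch is illustrative only: the folder names `seed/positive` and `seed/negative` are assumptions, the modification time is used as a stand-in for the created date/time stamp, and the actual selection of the most recent samples is done by the classifier itself.

```python
from pathlib import Path

# Limits stated above for seeding a custom trainable classifier.
MIN_POSITIVE, MAX_POSITIVE = 50, 500
MIN_NEGATIVE, MAX_NEGATIVE = 150, 1500
MAX_PROCESSED = 2000  # the classifier processes only the most recently created samples

def check_seed_set(folder: str, minimum: int, maximum: int) -> list[Path]:
    """Return the sample files in a local staging folder, newest first,
    and warn if the count falls outside the recommended range."""
    files = sorted(
        (p for p in Path(folder).iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,  # approximation of the created timestamp
        reverse=True,
    )
    if not minimum <= len(files) <= maximum:
        print(f"Warning: {folder} has {len(files)} items; expected {minimum}-{maximum}.")
    return files[:MAX_PROCESSED]

positive = check_seed_set("seed/positive", MIN_POSITIVE, MAX_POSITIVE)
negative = check_seed_set("seed/negative", MIN_NEGATIVE, MAX_NEGATIVE)
```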

Tip

For best results, have at least 200 items in your test sample set, including at least 50 positive examples and at least 150 negative examples.

How to create a trainable classifier

Select the appropriate tab for the portal you're using. To learn more about the Microsoft Purview portal, see Microsoft Purview portal. To learn more about the Compliance portal, see Microsoft Purview compliance portal.

In preview: The following process automates the testing of trainable classifiers and shortens the creation workflow from 12 days to two days. (In some cases, the process can take only a few hours.)

  1. Collect between 50 and 500 seed content items that strongly represent the data you want the classifier to positively identify as being in the category. For a list of supported file types, see Default crawled file name extensions and parsed file types in SharePoint Server.

  2. Collect a second set of seed content (from 150 to 1500 items) that represents data that doesn't belong in the category.

  3. Place the positive and negative seed content in separate SharePoint folders. Each folder must be dedicated to holding only the seed content. Make note of the site, library, and folder URL for each set. (For one way to sanity-check your local staging folders before uploading, see the sketch after these steps.)

    Tip

    If you create a new SharePoint site and folder for your seed data, allow at least an hour for that location to be indexed before creating the trainable classifier that will use that seed data.

  4. Sign in to either the Microsoft Purview portal or the Microsoft Purview compliance portal with either Compliance admin or Security admin role access and navigate to Data loss prevention > Data classification > Classifiers.

  5. Choose the Trainable classifiers tab.

  6. Choose Create trainable classifier.

  7. Add the source of your positive examples: select the SharePoint site, library, and folder URL for the seed content that should be detected by the classifier and then choose Next.

  8. Add the source of your negative examples: select the SharePoint site, library, and folder URL for the seed content that should be ignored by the classifier and then choose Next.

  9. Review the settings and choose Create trainable classifier.

  10. Within 24 hours, the trainable classifier processes the seed data and builds a prediction model. The classifier status is In progress while it processes the seed data. When the classifier is finished processing the seed data, the status changes to Training is complete and items have been tested.

  11. Once training is complete and items have been (automatically) tested, publish the classifier by choosing Publish for use.
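Before you upload your seed sets (steps 1-3), you may want to confirm that each local staging folder holds only supported sample files. The Python sketch below is only an illustration: the folder names are assumptions, and the extension list is a hypothetical subset; the authoritative list of supported file types is the SharePoint article linked in step 1.

```python
import shutil
from pathlib import Path

# Hypothetical subset of commonly supported extensions; check the
# "Default crawled file name extensions and parsed file types" article
# for the authoritative list.
SUPPORTED = {".docx", ".pdf", ".pptx", ".xlsx", ".txt"}

def stage_seed_folder(source: str, destination: str) -> None:
    """Copy only supported sample files into a dedicated staging folder,
    so the folder you upload to SharePoint holds nothing but seed content."""
    dest = Path(destination)
    dest.mkdir(parents=True, exist_ok=True)
    skipped = 0
    for item in Path(source).iterdir():
        if item.is_file() and item.suffix.lower() in SUPPORTED:
            shutil.copy2(item, dest / item.name)
        else:
            skipped += 1
    staged = sum(1 for p in dest.iterdir() if p.is_file())
    print(f"{destination}: staged {staged} files, skipped {skipped}.")

# Stage the two sets described in steps 1-3 before uploading them
# to their dedicated SharePoint folders.
stage_seed_folder("raw/in_category", "seed/positive")
stage_seed_folder("raw/not_in_category", "seed/negative")
```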

Once published, your classifier is available as a condition in Office auto-labeling with sensitivity labels, in auto-apply retention label policies based on a condition, and in communication compliance.

Test your classifier

Once the trainable classifier processes enough positive and negative samples to build a prediction model, you need to test the predictions it makes. After all of the data is processed, go through the results manually and mark each prediction as correct, incorrect, or not sure. Microsoft uses this feedback in aggregate to improve the prediction model.
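If you keep a record of your review decisions for your own tracking, a quick tally shows how the classifier performed on the tested items. This Python sketch is illustrative; the verdict labels mirror the correct, incorrect, and not sure choices described above, and the sample data is made up.

```python
from collections import Counter

# Made-up review verdicts for tested items, mirroring the choices you make
# when reviewing the classifier's predictions.
verdicts = ["correct", "correct", "incorrect", "not sure", "correct", "correct"]

counts = Counter(verdicts)
decided = counts["correct"] + counts["incorrect"]
accuracy = counts["correct"] / decided if decided else 0.0

print(f"Reviewed: {len(verdicts)} items")
print(f"Correct: {counts['correct']}, Incorrect: {counts['incorrect']}, Not sure: {counts['not sure']}")
print(f"Share of decided predictions marked correct: {accuracy:.0%}")
```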

See also