Explore trainable classifiers

Organizations classify and label content so they can protect and properly handle it. Classifying and labeling content is the starting place for the information protection discipline. Microsoft 365 has three ways to classify content:

  • Manually. Manual classification requires human judgment and action. Users and admins apply labels to content as they encounter it. You can use either the preexisting labels and sensitive information types or custom ones that you create. You can then protect the content and manage its disposition.
  • Automated pattern-matching. This category of classification mechanisms includes finding content by:
    • Keywords or metadata values (keyword query language).
    • Using previously identified patterns of sensitive information like social security, credit card, or bank account numbers.
    • Recognizing an item because it's a variation on a template (document fingerprinting, which a later unit in this training covers).
    • Using the presence of exact strings (exact data match).
  • Trainable classifiers. A Microsoft 365 trainable classifier is a tool that an organization can "train" to recognize various types of content. Microsoft 365 includes an extensive list of predefined classifiers. Organizations can also create their own custom classifiers. You train a classifier by giving it samples to look at. Once you train a classifier, the organization can use it to identify items for application of Office sensitivity labels, Communication compliance policies, and retention label policies.
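To make the contrast with trainable classifiers concrete, the following minimal sketch shows what automated pattern matching amounts to: a regular expression finds candidate credit card numbers, and a Luhn checksum filters out digit runs that only look like card numbers. The function names and the regular expression are illustrative assumptions. They aren't how Microsoft 365 sensitive information types are implemented.

```python
import re
from typing import List

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return candidate card numbers in the text that pass the Luhn check."""
    matches = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            matches.append(digits)
    return matches
```

A rule like this looks only at elements inside the item. It can't tell whether the item as a whole is, for example, a contract or a resume, which is the gap that trainable classifiers address.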

This unit examines the use of trainable classifiers.

Trainable classifiers

To begin using trainable classifiers in Microsoft Purview, you can first initiate a scanning process. This process analyzes your company's data and identifies patterns the system can use to train the classifier. After the system scans your data, it identifies common themes and patterns. The system can then create rules for the trainable classifier using this information. This process helps to ensure the trainable classifier is accurate and effective in identifying and categorizing data. Once the scanning process finishes, you can train the trainable classifier using the identified patterns and rules. Once you finish training the classifier, you can apply it to new data to automatically classify it.

Warning

It can take 7 to 14 days for scanning to complete. If you don't want to run the scanning process to create a custom trainable classifier for your organization, you can use Microsoft Purview's built-in classifiers.

The first time you access the Trainable classifiers page in the Microsoft Purview compliance portal, the following dialog box appears.

Screenshot of the dialog box that appears the first time you access the Trainable classifiers page in the Microsoft Purview compliance portal.

Creating a custom trainable classifier first involves giving it samples that you manually picked and that positively match the category. Then, after the trainable classifier tool processes those samples, you test the classifier's ability to predict by giving it a mix of positive and negative samples. This unit examines how to create and train a custom classifier. It also examines how to improve the performance of custom trainable classifiers and pretrained classifiers over their lifetime through retraining.

This classification method works well on content that manual or automated pattern-matching methods can't easily identify. It identifies an item based on what the item is, not on elements that are in the item (pattern matching). A classifier learns how to identify a type of content by looking at hundreds of examples of that content type.

Note

You can view trainable classifiers in the Content explorer tool by expanding Trainable Classifiers in the filters panel. The trainable classifiers automatically display the number of incidents found in SharePoint, Teams, and OneDrive, without requiring any labeling. If you don't want to use this feature, you must file a request with Microsoft Support to disable out-of-the-box classification. Doing so disables the scanning of your sensitive and labeled content before you create labeling policies.

Classifiers are available to use as a condition for:

  • Office autolabeling with sensitivity labels
  • Automatically applying a retention label policy based on a condition
  • Communication compliance

Note

Classifiers only work with items that aren't encrypted.

There are two types of trainable classifiers:

  • Pretrained classifiers. Microsoft created and pretrained multiple classifiers that you can start using without training them. These classifiers appear with the status of Ready to use.
  • Custom trainable classifiers. If an organization has classification needs that extend beyond what the pretrained classifiers cover, it can create and train its own classifiers.

The following sections examine these classifier types.

Pretrained classifiers

Microsoft 365 comes with multiple pretrained classifiers:

  • Adult, Racy, and Gory. Detects images of these types. The images must be between 50 kilobytes (KB) and 4 megabytes (MB) in size. They must also be greater than 50 x 50 pixels in height x width dimensions. The system supports scanning and detection for Exchange Online email messages and Microsoft Teams channels and chats.

  • Agreements. This classifier detects content related to legal agreements. For example, statements of work, loan and lease agreements, and employment and noncompete agreements.

  • Customer Complaints. The customer complaints classifier detects feedback and complaints made about your organization's products or services. This classifier can help you meet regulatory requirements on the detection and triage of complaints, like the Consumer Financial Protection Bureau and Food and Drug Administration requirements.

  • Discrimination. This classifier detects explicit discriminatory language and is sensitive to discriminatory language against the African American/Black communities when compared to other communities.

  • Finance. This classifier detects content in corporate finance, accounting, economy, banking, and investment categories.

  • Harassment. This classifier detects a specific category of offensive language text items. These items must relate to offensive conduct that targets one or multiple individuals based on the following traits: race, ethnicity, religion, national origin, gender, sexual orientation, age, disability.

  • Healthcare. This classifier detects content in medical and healthcare administration aspects. For example, medical services, diagnoses, treatment, claims, and so on.

  • Human Resources (HR). This classifier detects content in human resources related categories. For example, recruitment, interviewing, hiring, training, evaluating, warning, and termination.

  • Intellectual Property (IP). This classifier detects content in intellectual property-related categories such as trade secrets and similar confidential information.

  • Information Technology (IT). This classifier detects content in Information Technology and Cybersecurity categories. For example, network settings, information security, hardware, and software.

  • Legal Affairs. This classifier detects content in legal affairs-related categories. For example, litigation, legal process, legal obligation, legal terminology, law, and legislation.

  • Procurement. This classifier detects content in categories of bidding, quoting, purchasing, and paying for supply of goods and services.

  • Profanity. This classifier detects a specific category of offensive language text items that contain expressions that embarrass most people.

  • Resumes. This classifier detects .docx, .pdf, .rtf, and .txt items that are textual accounts of an applicant's personal, educational, and professional qualifications, work experience, and other personally identifying information.

  • Source Code. This classifier detects items that contain a set of instructions and statements written in the top 25 used computer programming languages on GitHub: ActionScript, C, C#, C++, Clojure, CoffeeScript, Go, Haskell, Java, JavaScript, Lua, MATLAB, Objective-C, Perl, PHP, Python, R, Ruby, Scala, Shell, Swift, TeX, Vim Script.

    Note

    The Source Code classifier detects when the bulk of the text is source code. It doesn't detect source code text interspersed with plain text.

  • Tax. This classifier detects tax-related content such as tax planning, tax forms, tax filing, and tax regulations.

  • Threat. This classifier detects a specific category of offensive language text items related to threats to commit violence or do physical harm or damage to a person or property.

These trainable classifiers appear in the Microsoft Purview compliance portal. In the navigation pane, select Data classification. On the Data classification page, select the Trainable classifiers tab. View the classifiers with the status of Ready to use.
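The size and dimension limits listed for the Adult, Racy, and Gory classifiers can be expressed as a simple eligibility check. The following minimal sketch assumes binary units (1 KB = 1,024 bytes), inclusive size bounds, and that "greater than 50 x 50 pixels" means both height and width exceed 50. The function and constant names are hypothetical, and the exact boundary behavior in Microsoft 365 might differ.

```python
KB = 1024          # assumption: binary kilobytes
MB = 1024 * KB     # assumption: binary megabytes

MIN_BYTES = 50 * KB
MAX_BYTES = 4 * MB
MIN_SIDE_PIXELS = 50


def image_in_scope(size_bytes: int, height: int, width: int) -> bool:
    """Return True if an image falls within the stated size and dimension limits."""
    size_ok = MIN_BYTES <= size_bytes <= MAX_BYTES
    # Both sides must be strictly greater than 50 pixels (assumed interpretation).
    dimensions_ok = height > MIN_SIDE_PIXELS and width > MIN_SIDE_PIXELS
    return size_ok and dimensions_ok
```

For example, a 200-KB image of 600 x 800 pixels passes the check, while a 10-KB thumbnail of the same dimensions doesn't.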

Custom classifiers

For some organizations, the pretrained classifiers don't meet their data classification needs. In this situation, an organization can create and train its own classifiers. There's more work involved with creating a custom classifier, but an organization can tailor it to better fit its needs. The high-level steps involved in creating a custom classifier include:

  1. You start creating a custom trainable classifier by feeding it examples that are definitely in the category.
  2. Once the classifier processes those examples, you test it by giving it a mix of both matching and nonmatching examples.
  3. The classifier then makes predictions as to whether any given item falls into the category you're building.
  4. You then confirm its results, sorting out the true positives, true negatives, false positives, and false negatives to help increase the accuracy of its predictions.
  5. Once you're satisfied with the test results, you deploy the classifier by publishing it.

When you publish the classifier, it sorts through items in locations like SharePoint Online, Exchange, and OneDrive, and classifies the content. After you publish the classifier, you can continue to train it using a feedback process that's similar to the initial training process.
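The review in step 4 produces four counts: true positives, true negatives, false positives, and false negatives. The following minimal sketch shows how those counts translate into precision, recall, and accuracy, which are standard machine-learning measures. The `ReviewTally` class is a hypothetical helper for illustration. This unit doesn't describe how Microsoft Purview calculates or reports these values.

```python
from dataclasses import dataclass


@dataclass
class ReviewTally:
    """Counts gathered while manually reviewing a classifier's predictions."""
    true_positives: int
    true_negatives: int
    false_positives: int
    false_negatives: int

    def precision(self) -> float:
        """Share of items predicted as matches that really are matches."""
        predicted_positive = self.true_positives + self.false_positives
        return self.true_positives / predicted_positive if predicted_positive else 0.0

    def recall(self) -> float:
        """Share of real matches that the classifier found."""
        actual_positive = self.true_positives + self.false_negatives
        return self.true_positives / actual_positive if actual_positive else 0.0

    def accuracy(self) -> float:
        """Share of all predictions that were correct."""
        total = (self.true_positives + self.true_negatives
                 + self.false_positives + self.false_negatives)
        correct = self.true_positives + self.true_negatives
        return correct / total if total else 0.0
```

For example, `ReviewTally(90, 85, 15, 10)` describes a 200-item review with 175 correct predictions, so its accuracy is 0.875 and its recall is 0.9. Low precision suggests too many false positives, and low recall suggests the classifier is missing items that belong in the category.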

For example, you could create trainable classifiers for:

  • Legal documents. For example, attorney client privilege, closing sets, and statements of work.
  • Strategic business documents. For example, press releases, merger and acquisition deals, business or marketing plans, intellectual property, patents, and design docs.
  • Pricing information. For example, invoices, price quotes, work orders, and bidding documents.
  • Financial information. For example, organizational investments, and quarterly or annual results.

Prepare for a custom trainable classifier

Before diving in, it's helpful to understand the components involved in creating a custom trainable classifier. The following sections examine each of these components.

Timeline

The following diagram displays a timeline that reflects a sample deployment of trainable classifiers.

Diagram showing the timeline for creating a sample deployment of trainable classifiers.

Tip

Trainable classifiers require a one-time opt-in. It takes 12 days for Microsoft 365 to complete a baseline evaluation of an organization's content. A Microsoft 365 Global administrator must start the opt-in process.

Overall workflow

To understand more about the overall workflow of creating custom trainable classifiers, see Process flow for creating custom trainable classifiers.

Seed content

Microsoft Purview uses trainable classifiers to independently and accurately identify an item as being in a particular category of content. To create a trainable classifier, an organization must first present it with many samples of the type of content that belongs in the category. Seeding is the process of feeding samples to the trainable classifier. An organization must select the seed content that it wants to use to represent the category of content.

Tip

You must have at least 50 positive samples, with a maximum of 500 samples. The trainable classifier processes up to the 500 most recently created samples (by file created date/time stamp). The more samples you provide, the more accurate the predictions the classifier makes.
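The seeding limits can be illustrated with a short sketch. It assumes each sample is a hypothetical pair of file name and created timestamp, rejects sets with fewer than 50 samples, and keeps only the 500 most recently created ones. It mirrors the rule described in the tip and isn't the service's actual selection logic.

```python
from datetime import datetime
from typing import List, Tuple

MIN_SEED_SAMPLES = 50
MAX_SEED_SAMPLES = 500


def select_seed_samples(samples: List[Tuple[str, datetime]]) -> List[str]:
    """Return the names of the samples that would be processed, newest first."""
    if len(samples) < MIN_SEED_SAMPLES:
        raise ValueError(
            f"{len(samples)} samples provided; at least {MIN_SEED_SAMPLES} are required"
        )
    # Sort by created timestamp, newest first, and keep at most 500.
    newest_first = sorted(samples, key=lambda item: item[1], reverse=True)
    return [name for name, _ in newest_first[:MAX_SEED_SAMPLES]]
```

If you place 600 files in the seed location, this sketch drops the 100 oldest ones, so it's worth confirming that your most representative samples aren't also your oldest.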

Testing content

Once the trainable classifier processes enough positive samples to build a prediction model, the organization must test the predictions the classifier makes. You should test with different data than the initial seed data you first provided. The testing should verify whether the classifier can correctly distinguish between items that match the category and items that don't. Testing should begin by selecting another, hopefully larger, set of manually selected content, known as the test sample. It should consist of samples that fall into the category and samples that don't.

Once the classifier processes this test sample, you must manually review the results. When doing so, you should mark each prediction as correct, incorrect, or not sure. The trainable classifier uses this feedback to improve its prediction model.

Tip

For best results, have at least 200 items in your test sample. It should include an even distribution of positive and negative matches.
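Before you submit a test sample, you can check it against these two recommendations. The following minimal sketch assumes that you track each test item as `True` (positive match) or `False` (negative match). The 10 percent tolerance for an "even" distribution is an assumption chosen for illustration, not a documented threshold.

```python
from typing import List

RECOMMENDED_MIN_ITEMS = 200
MAX_IMBALANCE = 0.10  # assumption: allow the positive share to range from 40% to 60%


def check_test_sample(labels: List[bool]) -> List[str]:
    """Return a list of problems found in the test sample (empty if none)."""
    problems = []
    total = len(labels)
    if total < RECOMMENDED_MIN_ITEMS:
        problems.append(
            f"only {total} items; at least {RECOMMENDED_MIN_ITEMS} recommended"
        )
    if total:
        # True counts as 1, so the sum is the number of positive items.
        positive_share = sum(labels) / total
        if abs(positive_share - 0.5) > MAX_IMBALANCE:
            problems.append(
                f"positive share is {positive_share:.0%}; aim for about 50%"
            )
    return problems
```

A sample of 100 positive and 100 negative items returns an empty list, while a sample of 180 positive and 20 negative items is flagged as unbalanced.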