Test an exact data match sensitive information type

After you create your exact data match (EDM) sensitive information type (SIT), verify that your sensitive information table has finished uploading and indexing, and then wait an hour. You can then use the Test function in the Sensitive information types section of the Microsoft Purview compliance portal to check whether the SIT detects the information you want to protect.

Note

Changes in an existing EDM SIT can take some time to propagate across the system. If you are making changes to an EDM SIT in order to troubleshoot detection issues, make sure to wait at least one hour after making those changes before using the Test function to validate their impact.

Regardless of the method you use for testing, the test results will include matches for both the specific EDM SIT and for the primary elements that are configured for that EDM SIT.

Methods for testing your EDM SIT

There are two methods that you can use to test your EDM SIT.

| Method | Available in New EDM experience | Available in Classic EDM experience |
| --- | --- | --- |
| Sensitive information type (SIT) method | Yes | Yes |
| EDM classifiers method | Yes | No |

Note

If you are using the Classic EDM experience, you must use the SIT method.

Test your EDM SIT with the sensitive information types method

To test an EDM SIT with the sensitive information types method, take the following steps.

  1. Sign in to the Microsoft Purview portal and go to Information Protection > Classifiers > Sensitive info types.

  2. Select your EDM SIT from the list and then select the Test icon.

  3. In the flyout pane, upload a file that contains data you want to detect. For example, create a file that contains a subset of the rows in your sensitive information table. If you used the configurable match feature in your schema to define ignored delimiters, make sure the sample file includes examples with and without those delimiters.

  4. Choose Test.

  5. After the file has been uploaded and scanned, check for matches to your EDM SIT.

  6. If the Test function in the SIT detects a match, verify that the SIT isn't trimming it or extracting the matched item incorrectly. Common issues include SITs that:

    • Extract only a substring of the full string that should be detected
    • Pick up only the first word in a multi-word string
    • Include extra symbols or characters in the extraction

For details about using regular expressions, see the Regular Expression Language - Quick Reference.
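
The three extraction issues listed in step 6 can be checked mechanically once you have copied the matched text out of the test results. The following is a minimal Python sketch, not part of the product. The function name `classify_extraction` and the sample value pairs are hypothetical and only illustrate the comparison.

```python
def classify_extraction(expected: str, extracted: str) -> str:
    """Compare the text a SIT extracted with the value you expected it to match."""
    if extracted == expected:
        return "exact"
    words = expected.split()
    # Only the first word of a multi-word value was picked up.
    if len(words) > 1 and extracted == words[0]:
        return "first word only"
    # The extraction is a strict part of the expected value.
    if extracted and extracted in expected:
        return "substring"
    # The expected value is present, but surrounded by extra characters.
    if expected and expected in extracted:
        return "extra characters"
    return "mismatch"


# Hypothetical pairs: (value in the sensitive information table,
# text shown as the match in the test results).
checks = [
    ("Mary Ann Smith", "Mary Ann Smith"),
    ("Mary Ann Smith", "Mary"),
    ("user@contoso.co.jp", "user@contoso.co"),
    ("12345678", "12345678,"),
]

for expected, extracted in checks:
    result = classify_extraction(expected, extracted)
    print(f"{expected!r} -> {extracted!r}: {result}")
```

Any result other than "exact" suggests that the pattern behind the SIT needs adjusting before the EDM SIT can match the hashed table values.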

Test your EDM SIT with the EDM classifiers method

  1. Sign in to the Microsoft Purview portal and go to Information Protection > Classifiers > EDM classifiers.

  2. Make sure that the New EDM experience toggle is set to On.

  3. Select your EDM SIT from the list and then select the Test icon.

  4. Upload a file that contains data you want to detect. For example, create a file that contains a subset of the rows in your sensitive information table. If you used the configurable match feature in your schema to define ignored delimiters, make sure your sample file includes examples with and without those delimiters.

  5. After the file has been uploaded and scanned, check for matches to your EDM SIT.

  6. If the Test function in the SIT detects a match, verify that the SIT isn't trimming it or extracting the matched item incorrectly. Common issues include SITs that:

    • Extract only a substring of the full string that should be detected
    • Pick up only the first word in a multi-word string
    • Include extra symbols or characters in the extraction
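
The sample file described in the upload step can be generated from a few rows of your sensitive information table. The following is a minimal Python sketch under stated assumptions: the column names, the sample rows, the ignored delimiters (hyphen and space), and the output file name are all made up for illustration, and the helper functions are hypothetical. Substitute the delimiters that your own schema defines.

```python
import csv
import io
import os
import tempfile

# Hypothetical ignored delimiters, standing in for those defined in your schema.
IGNORED_DELIMITERS = ["-", " "]


def strip_delimiters(value: str, delimiters=IGNORED_DELIMITERS) -> str:
    """Return the value with every ignored delimiter removed."""
    for delimiter in delimiters:
        value = value.replace(delimiter, "")
    return value


def build_test_lines(rows, primary_column, limit=5):
    """Build test lines from the first `limit` rows.

    Each row yields its primary value as stored and, when that differs,
    a second line with the ignored delimiters removed.
    """
    lines = []
    for row in rows[:limit]:
        original = row[primary_column]
        variants = [original]
        stripped = strip_delimiters(original)
        if stripped != original:
            variants.append(stripped)
        others = " ".join(v for k, v in row.items() if k != primary_column)
        for variant in variants:
            lines.append(f"{variant} {others}".strip())
    return lines


# Made-up rows in the shape of a sensitive information table.
SAMPLE_CSV = """PatientID,LastName,City
123-45-678,Smith,Seattle
987 65 432,Jones,Redmond
55512345,Lee,Tacoma
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
test_lines = build_test_lines(rows, "PatientID")

# Write the file to the temp directory; any writable location works.
output_path = os.path.join(tempfile.gettempdir(), "edm_test_sample.txt")
with open(output_path, "w", encoding="utf-8") as f:
    f.write("\n".join(test_lines) + "\n")
```

Use only test or already-approved data in such a file, because it contains the sensitive values in clear text.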

Test your EDM SIT using PowerShell

To test using PowerShell, run the following cmdlet:

Test-DataClassification -ClassificationNames "[Your EDM sensitive info type]" -TextToClassify "[your own text to scan for matches]"

Note

When you create or edit an EDM sensitive information type, or the primary SIT on which an EDM type is based, all new content (as well as content that is modified after you make changes to the SITs) is crawled for content that matches the new definitions. However, pre-existing content won't be crawled until it is modified or re-indexed.

To force re-crawling of existing content in a SharePoint site or library, or in OneDrive, follow the instructions in Manually request crawling and re-indexing of a site, a library or a list.

Test your EDM SIT with information protection policies

You can see where your EDM SIT is being used, and how accurate it is in production, by using it in policies:

  1. Create an auto-labeling policy and run it in simulation mode.

  2. Add some content that will trigger the EDM SIT, along with content that won't trigger the EDM SIT, to a location that your policy is monitoring.

  3. Open the Items to review tab to check the matches.

  4. Tune your policies as appropriate.

Once you're satisfied with the results of your testing and tuning, your EDM-based custom SIT is ready for use in information protection policies.

Troubleshooting tips

If your EDM SIT doesn't detect any matches in your data, the following tips might help you diagnose the problem.

Issue: No matches found.
Troubleshooting tip: Confirm that your sensitive data was uploaded correctly by using the commands explained in Hash and upload the sensitive information source table for exact data match sensitive information types.

Issue: No matches found.
Troubleshooting tip: Test the SIT you used when you configured the primary element in each of your patterns. This test verifies whether the SIT can match the examples in the item. Using an incorrectly defined SIT as the classification element of an EDM SIT is the most common cause of detection failures in EDM.

Issue: The SIT you selected for a primary element in the EDM type doesn't find a match in the item, or finds fewer matches than you expected.
Troubleshooting tip: Confirm that the SIT supports the separators and delimiters that occur in the content. Be sure to include the ignored delimiters defined in your schema.

Issue: The SIT associated with your primary element finds matches in your content, but the EDM SIT doesn't.
Troubleshooting tip:
  • Check whether your regular expressions capture whitespace at the start or end of an item you want to detect. For instance, look for patterns that begin or end with \s. If whitespace is captured, the extracted string won't match the hashed value in the data table. Instead, use a word boundary, such as \b.
  • Check your regular expressions to ensure that they capture the entire string you want to detect, not just a substring. For example, consider this pattern for email addresses: \b[a-zA-Z]{2,30}@[a-zA-Z]{2,20}\.[a-zA-Z]{2,3}\b. This pattern correctly matches user@contoso.com, but captures only user@contoso.co from user@contoso.co.jp.

Issue: An EDM SIT with primary elements, but no defined secondary elements, detects items but doesn't detect matches (or detects fewer matches than expected) when both primary and secondary elements are required.
Troubleshooting tip: If values in a column used for secondary evidence contain spaces, commas, or other word separators (that is, they aren't single words), there are two ways to handle them:
  1. Select the multi-token matching option.
  2. Associate the values with a SIT that uses either a regular expression designed to detect multi-word strings that follow the desired pattern (for example, a fixed number of consecutive words that start with an uppercase character), or a keyword dictionary that lists all of the unique values in that column. For example, if there's an additional evidence column for a person's city of residence, you can create a list with all the unique city names from the table and then use it to create a dictionary-based sensitive information type. Use this SIT as the classification element for the corresponding column in your EDM SIT by exporting and editing the EDM SIT definition in XML. For more information, see Create a rule package manually.

Issue: The SIT Test function doesn't detect any matches at all.
Troubleshooting tip: Check whether the SIT you selected requires additional keywords or other validations. For built-in SITs, see Sensitive information type entity definitions to determine the minimum requirements for matching each type.

Issue: The Test function works, but your SharePoint or OneDrive items aren't being detected in DLP or auto-labeling rules.
Troubleshooting tip: Verify that the documents you expect to find matches in actually show up in content explorer. Matches are only detected in content that is created or modified after changes to the SIT are applied. So, if expected matches don't appear, re-crawl the sites and libraries for any pre-existing items. For details on re-crawling SharePoint and OneDrive, see Manually request crawling and re-indexing of a site, a library or a list.

Issue: DLP or auto-labeling rules that require multiple matches don't trigger.
Troubleshooting tip: Make sure that the proximity requirements for both your EDM SIT and the base SITs are met. For example, if the maximum distance between the primary element and supporting keywords is 300 characters, but the keywords are only present in the first row of a long table, only the first few rows of matching values are likely to meet the proximity requirements. Modify your SIT definitions to support more relaxed proximity rules, or use the Anywhere in the document option for the additional evidence conditions.

Issue: Detection of an EDM SIT is inconsistent or erratic.
Troubleshooting tip: Make sure that the SIT you used as the base for the primary element in your EDM SIT isn't detecting unnecessary content. Using a SIT that matches too much unrelated content, such as any word, any number, or all email addresses, can cause the service to ignore relevant matches. In content explorer, check the number of content items that match the sensitive type you used for your primary elements.

To estimate whether the SIT is matching too much content:
  1. Divide the number of content items that match the SIT in content explorer by the number of days since the sensitive type was created.
  2. If the number of matches per day is in the range of hundreds of thousands or more, it's possible that the primary SIT is too broad.
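
The estimate above is simple arithmetic, sketched here in Python. The item count and dates are made-up values, the function names are hypothetical, and the threshold of 100,000 matches per day is only one reading of "hundreds of thousands or more", not a documented limit.

```python
from datetime import date


def matches_per_day(matched_items: int, created: date, today: date) -> float:
    """Average number of matching items per day since the SIT was created."""
    days = max((today - created).days, 1)  # avoid dividing by zero on day one
    return matched_items / days


def looks_too_broad(rate: float, threshold: float = 100_000) -> bool:
    """Flag a primary SIT whose daily match rate reaches the assumed threshold."""
    return rate >= threshold


# Hypothetical figures: 45 million matching items, SIT created 90 days earlier.
rate = matches_per_day(45_000_000, date(2024, 1, 1), date(2024, 3, 31))
print(f"{rate:,.0f} matches per day; too broad: {looks_too_broad(rate)}")
```

With these sample figures, the rate works out to 500,000 matches per day, which falls in the range where the primary SIT is worth narrowing.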

For recommendations and best practices on selecting the right sensitive information type for an EDM SIT, see Learn about exact data match based sensitive information types.
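
The two regular expression checks in the troubleshooting tips can be reproduced locally. The following minimal sketch uses Python's `re` module as a stand-in for the service's regular expression engine, which is an assumption. The sample text and table value are made up, the email pattern is the one from the tips with the literal dot escaped, and the broader pattern is only one possible fix, not a recommended definition. A plain set lookup stands in for the comparison against hashed table values.

```python
import re

# Stand-in for the values in the sensitive information table.
table_values = {"12345678"}
text = "Patient ID 12345678 was admitted."

# A pattern bounded by \s captures the surrounding whitespace as well.
with_whitespace = re.search(r"\s\d{8}\s", text).group()
# A pattern bounded by \b captures only the digits.
with_boundary = re.search(r"\b\d{8}\b", text).group()

print(repr(with_whitespace), with_whitespace in table_values)
print(repr(with_boundary), with_boundary in table_values)

# The email pattern from the tips allows only one domain suffix.
narrow = r"\b[a-zA-Z]{2,30}@[a-zA-Z]{2,20}\.[a-zA-Z]{2,3}\b"
# One possible broadening (an assumption): allow one or more suffixes.
broad = r"\b[a-zA-Z]{2,30}@[a-zA-Z]{2,20}(?:\.[a-zA-Z]{2,3})+\b"

for address in ("user@contoso.com", "user@contoso.co.jp"):
    narrow_match = re.search(narrow, address).group()
    broad_match = re.search(broad, address).group()
    print(f"{address}: narrow={narrow_match!r} broad={broad_match!r}")
```

The whitespace-bounded pattern yields a string that isn't in the table, while the word-boundary pattern yields one that is. The narrow email pattern stops at user@contoso.co for the second address, so the extracted value could never equal the full address stored in the table.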