Test an exact data match sensitive information type

After you create your exact data match (EDM) sensitive information type (SIT), verify that your sensitive information table has finished uploading and indexing, and then wait one hour. You can then use the test function in the sensitive information types section of the Compliance center to confirm that the SIT detects the information you want it to detect.

Note

Changes in an already created EDM SIT can take some time to propagate across the system. If you are making changes in an EDM sensitive information type for troubleshooting detection issues, make sure to wait at least one hour after making those changes before using the test function to validate their impact.

Regardless of which method you use for testing, the test results will include both matches for the specific EDM SIT and matches for the primary elements that are configured for that EDM SIT.

Test your EDM SIT in the Compliance center

In the new experience, there are two ways to access the EDM SIT test functionality in the Microsoft Purview compliance portal: through the Sensitive Information Types path or through the EDM classifiers path.

In the classic experience, testing your EDM SITs requires working through the Sensitive Information Types path.

Test your EDM SIT through the Sensitive Information Types path

To test an EDM SIT through the Sensitive Information Types path, take the following steps.

  1. In the Compliance center, go to Data classification > Classifiers > Sensitive Information Types.

  2. Select your EDM SIT from the list and then select the Test icon.

  3. In the flyout pane, upload a file that contains data you want to detect. For example, create a file that contains a subset of the rows in your sensitive information table. If you used the configurable match feature in your schema to define ignored delimiters, make sure the file includes examples with and without those delimiters.

  4. Choose Test.

  5. After the file has been uploaded and scanned, check for matches to your EDM SIT.

  6. If the Test function in the SIT detects a match, validate that it isn't trimming the match or extracting it incorrectly. Examples of incorrect extraction include capturing only a substring of the full string it's supposed to detect, picking up only the first word in a multi-word string, or including extra symbols or characters in the extraction. For regular expression syntax, see Regular Expression Language - Quick Reference.
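
One way to prepare the file described in step 3 is to script it. The following is a minimal PowerShell sketch, not a product feature. The sample rows, the AccountNumber and LastName column names, the ignored delimiters, and the output file name are all hypothetical; substitute rows and delimiters from your own table and schema.

```powershell
# Hypothetical rows standing in for a small subset of your sensitive information table.
$rows = @(
    [pscustomobject]@{ AccountNumber = '1234-5678-90'; LastName = 'Contoso' }
    [pscustomobject]@{ AccountNumber = '2468 1357 90'; LastName = 'Fabrikam' }
)

# Assumption: these are the delimiters your schema lists as ignored.
$ignoredDelimiters = @('-', ' ')

$lines = foreach ($row in $rows) {
    # Strip every ignored delimiter to get the undelimited form of the value.
    $plain = $row.AccountNumber
    foreach ($delimiter in $ignoredDelimiters) {
        $plain = $plain.Replace($delimiter, '')
    }
    # Emit one sentence with the delimiters and one without.
    "Account $($row.AccountNumber) belongs to $($row.LastName)."
    "Account $plain belongs to $($row.LastName)."
}

# Write the test file to the temp folder so it can be uploaded in the Test flyout pane.
$testFile = Join-Path ([System.IO.Path]::GetTempPath()) 'edm-test-sample.txt'
$lines | Set-Content -Path $testFile
```

If you build the file from real rows, it contains real sensitive values, so delete it when you finish testing.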

Test your EDM SIT through the EDM Classifiers path

  1. In the Compliance center, go to Data classification > Classifiers > EDM Classifiers.

  2. Make sure that the New EDM experience toggle is set to On.

  3. Select your EDM SIT from the list and then select the Test icon.

  4. Upload a file that contains data you want to detect. For example, create a file that contains a subset of the rows in your sensitive information table. If you used the configurable match feature in your schema to define ignored delimiters, make sure the file includes examples with and without those delimiters.

  5. After the file has been uploaded and scanned, check for matches to your EDM SIT.

  6. If the Test function in the SIT detects a match, validate that it isn't trimming the match or extracting it incorrectly. Examples of incorrect extraction include capturing only a substring of the full string it's supposed to detect, picking up only the first word in a multi-word string, or including extra symbols or characters in the extraction. For regular expression syntax, see Regular Expression Language - Quick Reference.
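
The validation in step 6 can be made more systematic by comparing each value the test reports against the full value you expected. The sketch below is illustrative only: the Get-ExtractionIssue function and the sample values are hypothetical, and it assumes you copy the extracted values by hand from the test results.

```powershell
# Hypothetical helper: describe how an extracted match differs from the expected value.
function Get-ExtractionIssue {
    param(
        [Parameter(Mandatory)] [string] $Expected,
        [Parameter(Mandatory)] [string] $Extracted
    )
    # Comparisons are case-sensitive because the extracted text should match exactly.
    if ($Extracted -ceq $Expected)      { return 'OK' }
    if ($Expected.Contains($Extracted)) { return 'Trimmed: only part of the value was extracted' }
    if ($Extracted.Contains($Expected)) { return 'Extra characters were included in the extraction' }
    return 'Mismatch: the extracted value differs from the expected value'
}

# Exact extraction.
Get-ExtractionIssue -Expected 'Contoso Health Plan' -Extracted 'Contoso Health Plan'
# Only the first word of a multi-word string was picked up.
Get-ExtractionIssue -Expected 'Contoso Health Plan' -Extracted 'Contoso'
# Surrounding whitespace and punctuation were captured.
Get-ExtractionIssue -Expected 'AB123456' -Extracted ' AB123456,'
```

The second and third results point to a pattern in the primary element SIT that captures too little or too much, which is worth fixing before you rely on the EDM SIT in policies.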

Test your EDM SIT using PowerShell

To test using PowerShell, use the following PowerShell cmdlet:

Test-DataClassification -ClassificationNames "[Your EDM sensitive info type]" -TextToClassify "[your own text to scan for matches]"
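
As a fuller sketch, the following assumes that the ExchangeOnlineManagement module is installed and that your account has permission to connect to Security & Compliance PowerShell. The SIT name and sample text are placeholders. Reading the ClassificationResults property follows the cmdlet's documented examples; confirm the exact shape of the output in your own tenant.

```powershell
# Assumption: the ExchangeOnlineManagement module is already installed.
Import-Module ExchangeOnlineManagement

# Connect to Security & Compliance PowerShell (prompts for credentials).
Connect-IPPSSession

# Placeholders: replace with your EDM SIT name and text that should match.
$sitName = '[Your EDM sensitive info type]'
$text    = '[your own text to scan for matches]'

# Run the classification and keep the result object for inspection.
$result = Test-DataClassification -ClassificationNames $sitName -TextToClassify $text

# List the classification results returned for the text.
$result.ClassificationResults
```

Storing the result in a variable lets you inspect it repeatedly without resubmitting the text.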

Note

When you create or edit an EDM sensitive information type, or the primary SIT on which an EDM type is based, all new content, as well as content that is modified after the changes to the SITs, is crawled for text that matches the new definitions. However, preexisting content isn't crawled until it's modified or reindexed.

To force re-crawling of existing content in a SharePoint site or library or in OneDrive, follow the instructions in Manually request crawling and re-indexing of a site, a library or a list.

Test your EDM SIT with information protection policies

You can see where your EDM SIT is being used and how accurate it is in production by using it in policies:

  1. Create an auto-labeling policy and run it in simulation mode.

  2. Add some content that will trigger the EDM SIT and some content that won't trigger the EDM SIT to a location that your policy is monitoring.

  3. Open the Items to review tab to check the matches.

  4. Tune your policies as appropriate.

Once you're satisfied with the results of your testing and tuning, your EDM-based custom SIT is ready for use in information protection policies.

Troubleshooting tips

If you don't find any matches, here are some troubleshooting tips.

Issue: No matches found.
Troubleshooting tip: Confirm that your sensitive data was uploaded correctly by using the commands explained in Hash and upload the sensitive information source table for exact data match sensitive information types.

Issue: No matches found.
Troubleshooting tip: Test the SIT you used when you configured the primary element in each of your patterns. This confirms that the SIT can match the examples in the item. Using an incorrectly defined SIT as the classification element of an EDM sensitive information type is the most common cause of detection failures in EDM.

Issue: The SIT you selected for a primary element in the EDM type doesn't find a match in the item, or finds fewer matches than you expected.
Troubleshooting tip: Check that the SIT supports the separators and delimiters that are in the content. Be sure to include the ignored delimiters defined in your schema.

Issue: The primary element SIT finds matches in an item, but the EDM SIT doesn't.
Troubleshooting tips:
- Check your regular expressions for leading or trailing patterns that capture whitespace, like \s. The captured whitespace becomes part of the extracted value, so it won't match the hashed value in the data table. Use a word boundary like \b instead.
- Check your regular expressions to ensure that they capture the whole string you want to capture, not just a substring. For example, this pattern for email addresses, \b[a-zA-Z]{2,30}@[a-zA-Z]{2,20}\.[a-zA-Z]{2,3}\b, correctly matches user@contoso.com but captures only user@contoso.co from user@contoso.co.jp.

Issue: An EDM SIT with only primary elements defined detects items, but detects nothing (or fewer matches than expected) when both primary and secondary elements are required.
Troubleshooting tip: If the values in a column used for secondary evidence aren't single words, that is, they contain spaces, commas, or other word separators, associate the column with a SIT that uses either a regular expression designed to detect multi-word strings that follow the desired pattern (for example, a fixed number of consecutive words that start with an uppercase character) or a keyword dictionary that lists all of the unique values in that column. For example, if there's an additional evidence column for a person's city of residence, you can create a list of all the unique city names from the table and use it to create a dictionary-based sensitive information type. Use this SIT as the classification element for the corresponding column in your EDM sensitive info type by exporting and editing the EDM SIT definition in XML. See Create a rule package manually.

Issue: The SIT test function doesn't detect any matches at all.
Troubleshooting tip: Check whether the SIT you selected includes requirements for additional keywords or other validations. For the built-in SITs, see Sensitive information type entity definitions to verify the minimum requirements for matching each type.

Issue: The Test functionality works, but your SharePoint or OneDrive items aren't being detected in DLP or auto-labeling rules.
Troubleshooting tip: Check whether the documents you expect to match show up in Content Explorer. If they aren't there, remember that only content created or modified after the changes to the sensitive information type shows as matches. You have to recrawl the sites and libraries for preexisting items to show up. See Manually request crawling and re-indexing of a site, a library or a list for details on recrawling SharePoint and OneDrive.

Issue: DLP or auto-labeling rules that require multiple matches don't trigger.
Troubleshooting tip: Check that the proximity requirements for both your EDM type and the base sensitive information types are met. For example, if the maximum distance between the primary element and supporting keywords is 300 characters, but the keywords are only present in the first row of a long table, only the first few rows of matching values are likely to meet the proximity requirements. Modify your SIT definitions to support more relaxed proximity rules, or use the anywhere in the document option for the additional evidence conditions.

Issue: Detection of an EDM type is inconsistent or erratic.
Troubleshooting tip: Check that the sensitive information type you used as the base for the primary element in your EDM type isn't detecting unnecessary content. Using a SIT that matches too much unrelated content, like any word, any number, or all email addresses, might cause the service to saturate and ignore relevant matches. Check the number of content items in Content Explorer that match the sensitive information type you used for your primary elements.
To estimate whether the SIT is matching too much content:
- Divide the number of content items in Content Explorer by the number of days since the sensitive information type was created.
- If the number of matches per day is in the range of hundreds of thousands or millions, it's possible that the primary SIT is too broad. See Learn about exact data match based sensitive information types for recommendations and best practices on selecting the right sensitive information type for an EDM type.
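
The two regular expression tips above can be checked locally before you change a SIT. The following sketch uses the .NET regex engine available in PowerShell, which might not behave identically to the service in every case. The ID format, the sample sentences, and the broader email pattern are illustrative assumptions, not recommended definitions.

```powershell
# Hypothetical ID format: two uppercase letters followed by six digits.
$text = 'Policy ID AB123456 was issued.'

# \s captures the surrounding whitespace as part of the match.
$withWhitespace = [regex]::Match($text, '\s[A-Z]{2}\d{6}\s').Value
# \b asserts a word boundary without consuming any characters.
$withBoundary = [regex]::Match($text, '\b[A-Z]{2}\d{6}\b').Value

"[$withWhitespace]"   # [ AB123456 ]
"[$withBoundary]"     # [AB123456]

# An email pattern that allows only one domain suffix truncates longer domains.
$mail   = 'Send it to user@contoso.co.jp today.'
$narrow = '\b[a-zA-Z]{2,30}@[a-zA-Z]{2,20}\.[a-zA-Z]{2,3}\b'
# Assumption: repeating the suffix group is one possible way to capture the whole address.
$broad  = '\b[a-zA-Z]{2,30}@[a-zA-Z]{2,20}(\.[a-zA-Z]{2,3})+\b'

$narrowMatch = [regex]::Match($mail, $narrow).Value
$broadMatch  = [regex]::Match($mail, $broad).Value

$narrowMatch   # user@contoso.co
$broadMatch    # user@contoso.co.jp
```

In both cases the extracted value is what gets compared against the hashed table entries, so a match that includes stray whitespace or stops short of the full string won't correspond to any row.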