Learn about using regular expressions (regex) in data loss prevention policies
A regular expression, commonly referred to as a regex, is a sequence of characters that defines a search pattern. Regular expressions are primarily used for pattern matching in strings; for example, in "find and replace" operations. You can use a regex in Microsoft Purview Data Loss Prevention (DLP) to define patterns that help you identify and classify sensitive data, or to help detect patterns in content. The most common regex uses in Microsoft Purview DLP are:
Using the SubjectOrBodyMatchesPatterns condition in a DLP rule
This article describes common issues that occur when working with regular expressions and how you can resolve them.
Potential validation issue when using a regex with DLP
Basic units of the pattern, such as literal characters, digits, whitespace, and punctuation marks, can be represented by themselves or by special symbols called metacharacters, such as \d for any digit, \s for any whitespace, or \. for a literal dot.
The basic units, when combined with quantifiers, specify how many times they can or must occur in a match. The quantifier * means zero or more, + means one or more, ? means zero or one, and {n,m} means between n and m times. For example, \d+ means one or more digits, \s? means optional whitespace, and a{3,5} means between three and five instances of the literal character a.
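The metacharacters and quantifiers described above can be tried out with a short script. This is a minimal sketch that uses Python's built-in re module purely for illustration; it is not the engine that Microsoft Purview DLP runs, and the helper name fully_matches is hypothetical.

```python
import re

# Metacharacters: \d is any digit, \s is any whitespace, \. is a literal dot.
VERSION = re.compile(r"\d\.\d")          # digit, literal dot, digit

# Quantifiers control how many times the preceding unit may occur.
DIGITS_PLUS = re.compile(r"\d+")         # one or more digits
OPTIONAL_SPACE = re.compile(r"a\s?b")    # zero or one whitespace between a and b
A_THREE_TO_FIVE = re.compile(r"a{3,5}")  # three to five instances of "a"

def fully_matches(pattern, text):
    """Hypothetical helper: True when the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None

print(fully_matches(VERSION, "3.7"))             # escaped dot matches "."
print(fully_matches(VERSION, "3x7"))             # escaped dot does not match "x"
print(fully_matches(DIGITS_PLUS, "2024"))
print(fully_matches(OPTIONAL_SPACE, "a b"))
print(fully_matches(A_THREE_TO_FIVE, "aaaaaa"))  # six is outside {3,5}
```

Using fullmatch here keeps each check strict: the quantifier has to account for the entire input rather than a substring of it.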
A regex can use either a positive lookbehind or a negative lookbehind. A lookbehind checks whether there's a match before a certain position in the input string, without including the actual characters in the match. A positive lookbehind matches when the lookbehind pattern is present, while a negative lookbehind matches when the lookbehind pattern is not present.
Consider this example: (?<=^|\s|_). This example shows a lookbehind that includes three possibilities:
1. ^ asserts the position. In this case, it requires the pattern matching to begin at the start of the line.
2. \s matches any whitespace character.
3. _ matches the literal underscore character ( _ ).
In the previous example, possibilities #2 and #3 will each match a single character. However, possibility #1 only indicates where the matching should start. It will not produce results with respect to any character matches.
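The lookbehind behavior can be illustrated with a short sketch in Python's re module, used here only as a stand-in and not as the engine DLP runs. One assumption to flag: Python rejects (?<=^|\s|_) because every alternative inside a single lookbehind must have the same width, so the sketch splits the three possibilities into an equivalent form. The \d+ suffix and the function names are hypothetical and added only so there is something to match.

```python
import re

# Python-compatible equivalent of (?<=^|\s|_): start of text, or a position
# preceded by whitespace, or a position preceded by an underscore.
BOUNDARY = r"(?:^|(?<=\s)|(?<=_))"
NUMBER_AFTER_BOUNDARY = re.compile(BOUNDARY + r"\d+")

def find_numbers(text):
    """Return digit runs that begin at the start of the text, after whitespace, or after '_'."""
    return NUMBER_AFTER_BOUNDARY.findall(text)

# Negative lookbehind: match "abc" only when it is NOT preceded by an underscore.
NOT_AFTER_UNDERSCORE = re.compile(r"(?<!_)abc")

def abc_without_underscore(text):
    """True when "abc" occurs somewhere without an underscore directly before it."""
    return NOT_AFTER_UNDERSCORE.search(text) is not None

# The whitespace and underscore are checked but never included in the results;
# "789" is skipped because it is preceded by "x".
print(find_numbers("123 abc_456 x789"))
print(abc_without_underscore("xabc"), abc_without_underscore("_abc"))
```

The returned matches contain only digits, which shows that the lookbehind constrains where a match can start without consuming any characters itself.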
Take a second example, ^\d+$. This regex will only detect a string composed entirely of digits, from start to end.
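A quick way to see the effect of the two anchors is the sketch below, again using Python's re module for illustration only. The function name is_all_digits is hypothetical.

```python
import re

# ^ anchors the match at the start and $ at the end, so every character
# in between must be a digit.
ALL_DIGITS = re.compile(r"^\d+$")

def is_all_digits(text):
    """True when the text consists of digits from start to end."""
    return ALL_DIGITS.search(text) is not None

print(is_all_digits("12345"))   # digits only
print(is_all_digits("123 45"))  # contains a space
print(is_all_digits("abc123"))  # starts with letters
```

Note that in Python specifically, $ also matches just before a single trailing newline; other regex engines may treat line endings differently.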
How to get extracted text
A regex is matched on the extracted text of the content, rather than on the content itself. So, even when the pattern appears to be on the content, it might not match when evaluating a DLP policy.
To ensure that you capture the appropriate matches, take the following steps:
Use the Test-TextExtraction cmdlet to get the extracted text, which will consist of a stream of strings.
Next, use the extracted text for matching the regular expression.
How to verify sensitive information type detection
To verify sensitive information type (SIT) detection, take the extracted text and run the Test-DataClassification cmdlet on it. The results of running the cmdlet indicate whether there are any SIT matches for the regex.
While it seems that the regex will detect a match with this mail item, the extracted text looks like this:
Regex Test Email ABC123
As you can see, the extracted text begins with the content in the subject line of the email, rather than with the content in the body of the email. However, the inclusion of the assertion character, ^, at the start of the regex requires that the ABC... string must be at the start of the extracted text for a match to be detected.
To resolve this issue, you can change the regex to ABC\d.
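The effect of the anchor on the extracted text can be reproduced with a short sketch in Python's re module, used only for illustration. The anchored pattern ^ABC\d is an assumption about the form of the original regex, since only its leading ^ and the ABC prefix are described here; the extracted text and the revised pattern ABC\d are taken from the example.

```python
import re

# Extracted text from the example: the subject line comes before the body.
extracted_text = "Regex Test Email ABC123"

# Assumed form of the original regex (anchored to the start of the text).
anchored = re.compile(r"^ABC\d")
# Revised regex from the resolution step (no anchor).
unanchored = re.compile(r"ABC\d")

anchored_match = anchored.search(extracted_text)
unanchored_match = unanchored.search(extracted_text)

# The anchored pattern fails because the text starts with "Regex", not "ABC".
print(anchored_match)
# The unanchored pattern finds "ABC" followed by one digit later in the text.
print(unanchored_match.group())
```

Removing the anchor lets the pattern match wherever ABC followed by a digit appears in the extracted stream, regardless of what the subject line contributes ahead of it.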