Learn about using regular expressions (regex) in data loss prevention policies

01/21/2025

A regular expression, commonly referred to as a regex, is a sequence of characters that defines a search pattern. Regular expressions are primarily used for matching patterns in strings; for example, in "find and replace" operations. You can use a regex in Microsoft Purview Data Loss Prevention (DLP) to define patterns that help you identify and classify sensitive data, or to help detect patterns in content. The most common regex uses in Microsoft Purview DLP are:

Defining custom sensitive information types.
Using the SubjectOrBodyMatchesPatterns condition in a DLP rule.

This article describes common issues that occur when working with regular expressions and how you can resolve them.

Potential validation issue when using a regex with DLP

Basic units of the pattern, such as literal characters, digits, whitespace, and punctuation marks, can be represented by themselves or by special symbols called metacharacters, such as \d for any digit, \s for any whitespace, or \. for a literal dot.
The basic units, when combined with quantifiers, specify how many times they can or must occur in a match. For example, * means zero or more, + means one or more, ? means zero or one, and {n,m} means between n and m times. So \d+ means one or more digits, \s? means optional whitespace, and a{3,5} means between three and five instances of the literal character a.
A regex can use either a positive lookbehind or a negative lookbehind. A lookbehind checks whether there's a match before a certain position in the input string, without including the actual characters in the match. A positive lookbehind, written (?<=...), matches when the lookbehind pattern is present, while a negative lookbehind, written (?<!...), matches when the lookbehind pattern is not present.
Consider this example: (?<=^|\s|_). This example shows a lookbehind that includes three possibilities:
1. ^ asserts the position. In this case, it requires the pattern matching to begin at the start of the line.
2. \s detects any whitespace characters as a match.
3. _ matches the literal underscore character ( _ ).
In the previous example, possibilities #2 and #3 will each match a single character. However, possibility #1 only indicates where the matching should start. It will not produce results with respect to any character matches.
Take a second example, ^\d+$. This regex will only detect a string composed entirely of digits, from start to end.
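
The building blocks described above can be tried locally before a pattern goes into a policy. The following is a minimal sketch, assuming PowerShell, whose -match operator and [regex] class use the .NET regex engine. The sample strings and variable names are hypothetical, and DLP evaluation might not behave identically in every case.

```powershell
# Quantifiers: \d+ needs at least one digit; a{3,5} needs three to five 'a' characters.
$hasDigits = 'Order 12345' -match '\d+'     # digits are present
$tooFewAs  = 'aa'          -match 'a{3,5}'  # only two 'a' characters
$enoughAs  = 'aaaa'        -match 'a{3,5}'  # four 'a' characters

# Positive lookbehind: the position just before 'ABC' must be the start of the
# line, a whitespace character, or an underscore.
$pattern = '(?<=^|\s|_)ABC\d'
$afterUnderscore = 'ref_ABC1' -match $pattern
$afterLetter     = 'xABC1'    -match $pattern

# The character checked by the lookbehind is not part of the matched text.
$matchedText = [regex]::Match('ref_ABC1', $pattern).Value

# Anchored pattern: the whole string must consist of digits.
$allDigits = '20250121'   -match '^\d+$'
$mixed     = '2025-01-21' -match '^\d+$'
```

In this sketch, $afterUnderscore is True and $afterLetter is False, and $matchedText holds ABC1 without the leading underscore, which illustrates that a lookbehind constrains where a match can start without adding characters to it.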

How to get extracted text

A regex is matched on the extracted text of the content, rather than on the content itself. So, even when a pattern appears to be present in the content, it might not match when a DLP policy is evaluated.

To ensure that you capture the appropriate matches, take the following steps:

Use the Test-TextExtraction cmdlet to get the extracted text, which will consist of a stream of strings.
Next, use the extracted text for matching the regular expression.

For instance:

PowerShell

$data = ([System.IO.File]::ReadAllBytes('<FilePath>'))
$tr = Test-TextExtraction -FileData $data
$tr.ExtractedResults.ExtractedStreamText | Format-List

How to verify sensitive information type detection

To verify sensitive information type (SIT) detection, run the Test-DataClassification cmdlet on the text you just extracted. The results indicate whether there are any SIT matches for the regex.

For example:

PowerShell

$textStream = $tr.ExtractedResults.ExtractedStreamText | Out-String
$result = Test-DataClassification -TextToClassify $textStream
$result.ClassificationResults | Format-List

Example of using a regex in a DLP policy rule

In this example, we will block email that contains strings starting with ABC followed by a number.

Regex used: ^ABC\d

DLP rule sample: New-DlpComplianceRule -Name "Rule_00" -Policy "Policy_00" -SubjectOrBodyMatchesPatterns "^ABC\d" -BlockAccess $True

Sample email

(Image: sample email for regex matching)

While it seems that the regex will detect a match with this mail item, the extracted text looks like this:

Regex Test Email ABC123

As you can see, the extracted text begins with the content in the subject line of the email, rather than with the content in the body of the email. However, the inclusion of the assertion character, ^, at the start of the regex requires that the ABC... string must be at the start of the extracted text for a match to be detected.

To resolve this issue, you can change the regex to ABC\d.
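
The effect of the anchor can be reproduced locally against the extracted text shown above. This is a minimal sketch, assuming that the .NET regex engine behind PowerShell's -cmatch operator behaves like DLP evaluation for these simple patterns. $extractedText is a hypothetical variable holding the extracted text exactly as it appears in this example.

```powershell
# Extracted text from the example: subject content first, then body content.
$extractedText = 'Regex Test Email ABC123'

# Anchored pattern: requires 'ABC' followed by a digit at the very start of the text.
$anchoredMatch = $extractedText -cmatch '^ABC\d'

# Unanchored pattern: allows 'ABC' followed by a digit anywhere in the text.
$unanchoredMatch = $extractedText -cmatch 'ABC\d'

"Anchored:   $anchoredMatch"
"Unanchored: $unanchoredMatch"
```

The anchored pattern fails because the text starts with the subject content, while the unanchored pattern finds ABC1 later in the text. Running a candidate pattern against the real extracted text in this way is a quick check before you update the rule.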
