What regex can I use to detect a UPN sent in email or in a shared document in OneDrive/SharePoint?

curious7 271 Reputation points
2025-02-18T12:53:45.2066667+00:00

In Microsoft Purview Information Protection I need to create a REGEX for a sensitive info type that will detect if a UPN is being sent in email or shared with external users in a document.

I created a primary element with the following patterns for single-level domains (e.g. user@localhost) and two-level domains (e.g. ******@domain.com):
Single level: <?\w+?.?\w+@\w+>?

Two level: <?\w+?.?\w+@\w+.\w+>?

I added a secondary element that must match at least one domain from our domain list (a keyword list).

I then added another secondary element that must not match the following regex, because I don't want to match the angle-bracketed form that appears when replying to an email, such as "<******@domain.com>":
Single level: <\w+?.?\w+@\w+>

Two level: <\w+?.?\w+@\w+.\w+>

I also added these additional checks, because I don't want to catch an email address in the format "<******@domain.com>" when replying to an email:

  • "not start with" - "<"
  • "not ends with" - ">"

But when a user replies to an external user, the SIT still catches the UPN inside the less-than and greater-than signs in a string such as "<******@domain.com>". That string appears in every email reply to an external user, so I don't want the SIT to catch it.

What am I doing wrong, and how can I achieve this? This SIT will be used in a DLP policy.

Microsoft Security | Microsoft Purview

Accepted answer
  1. Ganesh Gurram 7,295 Reputation points Microsoft External Staff Moderator
    2025-02-18T18:34:39.4066667+00:00

    Hi @curious7

    What am I doing wrong and how can I achieve this?

    You're on the right track but need to refine your regex and logic to avoid false positives, especially in email reply scenarios. Here’s a refined approach to help you achieve this:

    Basic UPN Detection - To match standard UPNs in email format, we can use the following regex pattern: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b

    This pattern matches typical email addresses (UPNs) of the form ******@domain.com. Because it requires at least one dot in the domain, it does not match single-label UPNs such as user@localhost.
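
    As a quick way to try the basic pattern outside Purview, here is a minimal Python sketch. It assumes Python's re module is close enough to Purview's regex engine for these constructs, which should be confirmed with Purview's own test feature. The names UPN_PATTERN and find_upns and the sample addresses are illustrative only. The top-level-domain class is written as [A-Za-z], since a "|" inside a character class matches a literal pipe rather than meaning "or".

```python
import re

# Basic UPN/email-shaped pattern. Assumption: Python's `re` is used here only
# as a stand-in for trying the pattern; Purview's engine may differ.
UPN_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def find_upns(text: str) -> list[str]:
    """Return every UPN-shaped substring found in text."""
    return UPN_PATTERN.findall(text)


# Illustrative sample text (made-up addresses).
sample = "Contact alice.smith@contoso.com or bob@fabrikam.org; user@localhost is skipped."
print(find_upns(sample))
```

    The single-label address user@localhost is not returned, because the pattern requires a dot followed by at least two letters after the "@".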

    Avoid Matching Within Angle Brackets (< >) - To ensure that we don’t capture email addresses within angle brackets (such as when replying to emails), we can add a negative lookbehind and a negative lookahead assertion: (?<!<)\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b(?!>)

    Here:

    (?<!<) ensures that the email is not preceded by a < (i.e., not in reply format).

    (?!>) ensures that the email is not followed by a > (i.e., not in a quoted reply format).

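    This prevents matching a simple bracketed address like <******@domain.com>. Because a regex engine may still start or stop a match partway through an address, a bracketed address with a dot in the local part and a subdomain (for example <first.last@mail.domain.com>) can still produce a partial match, so those cases are worth testing explicitly.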
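
    To see how the lookarounds behave on reply-style text, the following Python sketch may help. BRACKET_AWARE is the pattern above. TIGHTER is a hypothetical stricter variant that is not part of this answer and has not been verified in Purview. The helper first_match and the sample strings are illustrative. Python's re is again only an assumed stand-in for Purview's engine.

```python
import re

# Pattern from the answer above (TLD class written as [A-Za-z]).
BRACKET_AWARE = re.compile(
    r"(?<!<)\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b(?!>)"
)

# Hypothetical tighter variant (an assumption, not part of the answer):
# the lookbehind also rejects a start in the middle of a local part, and
# the lookahead rejects a stop in the middle of a domain that ends in ">".
TIGHTER = re.compile(
    r"(?<![<A-Za-z0-9._%+-])[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    r"(?![A-Za-z0-9.-]*>)"
)


def first_match(pattern: re.Pattern, text: str):
    """Return the first matched substring, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


# Illustrative inputs: a plain address, a bracketed address, and a bracketed
# address with a dotted local part and a subdomain.
cases = [
    "Send it to alice@contoso.com today",
    "From: Alice <alice@contoso.com>",
    "From: Alice <alice.smith@mail.contoso.com>",
]
for text in cases:
    print(repr(text), first_match(BRACKET_AWARE, text), first_match(TIGHTER, text))
```

    In this sketch the first pattern skips the simple bracketed address but still returns a fragment of the third input, while the tighter variant returns nothing for either bracketed input. Whether the tighter variant behaves the same way inside a SIT needs to be checked in Purview itself.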

    Domain Whitelist Filtering - If you need to restrict matches to specific domains, you can extend the regex to only capture emails from your allowed domain list:

    (?<!<)\b[A-Za-z0-9._%+-]+@(yourdomain1\.com|yourdomain2\.org|yourdomain3\.edu)\b(?!>)

    This will match UPNs only from the specified domains (replace with your actual domains) and still avoid false positives from email replies.
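
    If the domain list is long, the alternation can be generated rather than typed by hand. This is a rough Python sketch under stated assumptions: ALLOWED_DOMAINS holds the placeholder domains from the pattern above, build_allowlist_pattern is a hypothetical helper, and the case-insensitive flag is an addition that the pattern above does not have. The resulting pattern text could then be pasted into the SIT after checking it in Purview.

```python
import re

# Placeholder domains taken from the example pattern; replace with real ones.
ALLOWED_DOMAINS = ["yourdomain1.com", "yourdomain2.org", "yourdomain3.edu"]


def build_allowlist_pattern(domains: list[str]) -> re.Pattern:
    """Build the bracket-aware pattern restricted to the given domains."""
    # re.escape turns "." into "\." so each dot matches only a literal dot.
    alternation = "|".join(re.escape(d) for d in domains)
    return re.compile(
        rf"(?<!<)\b[A-Za-z0-9._%+-]+@(?:{alternation})\b(?!>)",
        re.IGNORECASE,  # assumption: domains should match regardless of case
    )


pattern = build_allowlist_pattern(ALLOWED_DOMAINS)
print(pattern.pattern)

for text in [
    "ping bob@yourdomain2.org now",   # allowed domain, not bracketed
    "ping bob@other.com now",         # domain not in the list
    "<bob@yourdomain1.com>",          # allowed domain, but bracketed
]:
    m = pattern.search(text)
    print(repr(text), "->", m.group(0) if m else None)
```

    The group is written as non-capturing, (?:...), so that tools returning captured groups still report the whole address.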

    Implementation in Microsoft Purview - In Microsoft Purview, you can use the above regex patterns in your sensitive information types (SITs) within DLP policies to identify UPNs being sent via email or shared documents. Additionally, if domain filtering is needed, you can either handle it within the regex (as shown above) or use a separate condition in the policy for more flexibility.

    Testing and Validation - It's important to test the regex with various email formats to ensure it matches the desired UPNs and excludes those in reply format. This will help you fine-tune the solution based on your exact needs.

    Hope this helps. Do let us know if you have any further queries.
