Purview confidence levels with SSN

Question

M.Hath 1

I'm using Purview for DLP and testing the medium-confidence policy for U.S. Social Security numbers (SSNs).

It worked well enough overall, but some test numbers were allowed through without an alert.

888-88-8888
123456789
987654321
All were tested with variations of dashes, spaces, and no separators, in either the email body or an attachment.

Some were caught, but most were not. These were not caught:
888-88-8888
888888888
123-45-6789
123 45 6789
123456789

As mentioned above, we have the policy set at Medium:

- Func_unformatted_ssn finds SSNs with pre-2011 strong formatting that are unformatted as nine consecutive digits (ddddddddd).

A DLP policy has medium confidence that it's detected this type of sensitive information if, within a proximity of 300 characters:

- The function Func_unformatted_ssn finds content that matches the pattern.
- A keyword from Keyword_ssn is found.

Do I need to change the policy to Low or High confidence to catch these other SSN test values?

M.Hath 1 Reputation point

2025-05-07T19:47:56.0033333+00:00

I was afraid I would have to adjust to low confidence.
Ganesh Gurram 7,295 Reputation points Microsoft External Staff Moderator

2025-05-07T20:50:12.2733333+00:00

@M.Hath

Yes, that’s understandable. Adjusting to low confidence can raise concerns about increased false positives. However, given how Microsoft Purview DLP evaluates SSNs, medium confidence requires a match on the unformatted SSN pattern plus a keyword such as “SSN” or “Social Security” within 300 characters, as documented here.

If your test data lacks those keywords, even valid-looking SSNs won’t trigger the policy at medium confidence.

I hope this information helps.

Thank you!
Ganesh Gurram 7,295 Reputation points Microsoft External Staff Moderator

2025-05-08T17:35:33.09+00:00

@M.Hath

Just checking in to see whether the answer below helped. If it answers your query, please click Accept Answer and Yes for “Was this answer helpful”. If you have any further questions, do let us know.

Answer 1

@M.Hath

Based on Microsoft's official documentation for sensitive information types (SITs), detection of U.S. Social Security numbers (SSNs) in Microsoft Purview DLP depends heavily on the confidence level and on specific formatting and proximity conditions.

Why your SSNs may not be detected under Medium confidence

Your DLP policy uses the Medium confidence level, which requires both of the following:
- A match to Func_unformatted_ssn (i.e., nine consecutive digits).
- A keyword from Keyword_ssn within 300 characters of the number.

So, formats like 123456789 or 123-45-6789 will not be detected under Medium confidence unless a keyword like “SSN” or “Social Security” appears nearby in the body or attachment.
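
To make the medium-confidence rule concrete, here is a rough Python sketch of the idea: a nine-digit run only counts when a keyword sits within 300 characters of it. This is an illustration only, not Purview's implementation. The keyword tuple, the regex, and the way the window is measured are all simplifying assumptions, and the name medium_confidence_hits is made up for this example.

```python
import re

# Assumed, simplified stand-ins. Purview's real Keyword_ssn list and
# Func_unformatted_ssn matcher are NOT reproduced here.
KEYWORDS = ("ssn", "social security")
NINE_DIGITS = re.compile(r"(?<!\d)\d{9}(?!\d)")  # exactly nine digits in a row
PROXIMITY = 300  # characters on either side of the match (assumed window shape)


def medium_confidence_hits(text: str) -> list[str]:
    """Return nine-digit runs that have an assumed keyword within PROXIMITY characters."""
    lowered = text.lower()
    hits = []
    for match in NINE_DIGITS.finditer(text):
        window = lowered[max(0, match.start() - PROXIMITY): match.end() + PROXIMITY]
        if any(keyword in window for keyword in KEYWORDS):
            hits.append(match.group())
    return hits


print(medium_confidence_hits("Please process 123456789 today."))  # no keyword nearby
print(medium_confidence_hits("Employee SSN: 123456789"))          # keyword nearby
```

Under these assumptions, the first message produces no hits and the second one does, which mirrors the behavior described in the question: test numbers sent without supporting keywords pass through unalerted.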

Also, numbers like 888-88-8888 and 987-65-4321 are commonly used test SSNs and may be deliberately excluded from detection by Microsoft’s default patterns.
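
As a partial illustration of why a number can be rejected on structure alone, the sketch below applies the well-known Social Security Administration rules: the area number (first three digits) is never 000, 666, or 900 to 999, the group (middle two digits) is never 00, and the serial (last four digits) is never 0000. Whether Purview's "strong formatting" check applies exactly these rules, or further exclusions, is an assumption not confirmed in this thread. The function name passes_basic_ssa_rules is hypothetical.

```python
import re


def passes_basic_ssa_rules(candidate: str) -> bool:
    """Check a candidate against basic SSA structural rules (not Purview's actual logic)."""
    digits = re.sub(r"[\s-]", "", candidate)  # drop dashes and whitespace
    if not re.fullmatch(r"\d{9}", digits):
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area.startswith("9"):
        return False  # area numbers 000, 666, and 900-999 are never issued
    if group == "00" or serial == "0000":
        return False  # all-zero group or serial is never issued
    return True


for value in ("987654321", "888-88-8888", "123 45 6789"):
    print(value, passes_basic_ssa_rules(value))
```

With only these rules, 987654321 fails because its area number starts with 9, while 888-88-8888 and 123-45-6789 pass. So these basic rules alone would not account for the other misses; the missing keyword, or exclusions not shown here, would have to.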

Options for better detection:

If you want to detect SSNs without needing nearby keywords, consider changing the policy to Low confidence. This level may detect SSNs based only on their pattern match, but it can result in more false positives.

If you were considering High confidence for stricter detection, it won’t help in this case: it requires more conditions to be met (e.g., stricter pattern + keyword).

For more control, you can also create a custom sensitive information type (SIT) to match specific SSN formats and decide whether to include or exclude proximity keyword requirements.
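
Before building a custom SIT, it can help to try a candidate pattern against your test values locally. The Python sketch below does that with a hypothetical regex covering dashed, spaced, and unseparated forms. The pattern is an assumption for illustration, not a Microsoft-supplied one, and you would need to check which regex constructs Purview accepts (for example, lookbehind) before reusing it in a SIT definition.

```python
import re

# Hypothetical candidate pattern: three, two, and four digits, each pair
# separated by an optional dash or space, not embedded in a longer digit run.
CANDIDATE = re.compile(r"(?<!\d)\d{3}[- ]?\d{2}[- ]?\d{4}(?!\d)")

# The values reported as missed in the question.
TEST_VALUES = ["888-88-8888", "888888888", "123-45-6789", "123 45 6789", "123456789"]

for value in TEST_VALUES:
    matched = CANDIDATE.search(f"ref {value} end") is not None
    print(value, matched)
```

A pattern this loose matches every nine-digit number, so pairing it with a keyword list or additional checks is the usual way to keep false positives down.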

Note: Your current policy works as designed but misses some test data due to formatting and keyword proximity conditions at the Medium level. Switching to Low confidence or defining a custom SIT would help in detecting more variations of SSNs.

I hope this information helps.

Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

Ganesh Gurram 7,295 Reputation points Microsoft External Staff Moderator

2025-05-12T08:19:56.07+00:00

@M.Hath

Following up to see whether the provided solution helped. If it answers your query, please click Accept Answer and Yes for “Was this answer helpful”. If you have any further questions, do let us know.
