Purview confidence levels with SSN

M.Hath 1 Reputation point
2025-05-07T19:11:13.4566667+00:00

I'm using Purview for DLP and testing the Medium confidence policy for SSNs.

It seemed to work well enough, but I found that some test numbers were allowed through without raising an alert.

888-88-8888
123456789
987654321
All using variations of dashes, spaces, no spaces - either in email body or as attachment.

Some were caught but most were not. These are the ones that were not caught:
888-88-8888
888888888
123-45-6789
123 45 6789
123456789

As mentioned above, we have policy set at Medium:

  • Func_unformatted_ssn finds SSNs with pre-2011 strong formatting that are unformatted as nine consecutive digits (ddddddddd)

A DLP policy has medium confidence that it's detected this type of sensitive information if, within a proximity of 300 characters:

  • The function Func_unformatted_ssn finds content that matches the pattern.
  • A keyword from Keyword_ssn is found.

Do I need to change the policy to Low or High confidence to catch these other SSN test values?

Microsoft Purview
A Microsoft data governance service that helps manage and govern on-premises, multicloud, and software-as-a-service data. Previously known as Azure Purview.

1 answer

  1. Ganesh Gurram 7,025 Reputation points Microsoft External Staff Moderator
    2025-05-07T19:40:29.1866667+00:00

    @M.Hath

    Based on Microsoft's official documentation for sensitive information types (SITs), detection of U.S. Social Security numbers (SSNs) in Microsoft Purview DLP depends on the confidence level and on specific formatting and proximity conditions.

    Why your SSNs may not be detected under Medium confidence

    Your DLP policy uses the Medium confidence level, which requires both a match from Func_unformatted_ssn (i.e., nine consecutive digits) and a keyword from Keyword_ssn within 300 characters of the number.

    So, formats like 123456789 or 123-45-6789 will not be detected under Medium confidence unless a keyword like “SSN” or “Social Security” appears nearby in the body or attachment.
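To make the keyword-proximity condition concrete, here is a minimal Python sketch of the rule as described above: a nine-digit match only counts if a keyword sits within 300 characters of it. This is an illustration, not Purview's actual implementation. The `KEYWORDS` tuple is a hypothetical stand-in for Keyword_ssn, the function name is made up, and the sketch skips the "pre-2011 strong formatting" checks entirely.

```python
import re

# Hypothetical stand-in for Keyword_ssn; the real list is defined by Microsoft.
KEYWORDS = ("ssn", "social security")
PROXIMITY = 300  # characters, per the Medium confidence description

# Nine consecutive digits that are not part of a longer run of digits.
NINE_DIGITS = re.compile(r"(?<!\d)\d{9}(?!\d)")


def medium_confidence_hits(text: str) -> list[str]:
    """Return nine-digit matches that have a keyword within PROXIMITY characters."""
    hits = []
    for match in NINE_DIGITS.finditer(text):
        # Look at the text up to PROXIMITY characters on either side of the match.
        start = max(0, match.start() - PROXIMITY)
        end = min(len(text), match.end() + PROXIMITY)
        window = text[start:end].lower()
        if any(keyword in window for keyword in KEYWORDS):
            hits.append(match.group())
    return hits


if __name__ == "__main__":
    print(medium_confidence_hits("Employee SSN: 123456789"))  # keyword nearby
    print(medium_confidence_hits("Order number 123456789"))   # no keyword
```

Running test messages through a check like this before sending them can help separate "no keyword nearby" misses from misses caused by the number itself.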

    Also, numbers like 888-88-8888 and 987-65-4321 are commonly used test SSNs and may be deliberately excluded from detection by Microsoft’s default patterns.

    Options for better detection:

    If you want to detect SSNs without needing nearby keywords, consider changing the policy to Low confidence. This level may detect SSNs based only on their pattern match, but it can result in more false positives.

    Raising the level to High confidence won’t help in this case, because it requires more conditions to be met (e.g., a stricter pattern plus a keyword).

    For more control, you can also create a custom sensitive information type (SIT) to match specific SSN formats and decide whether to include or exclude proximity keyword requirements.
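If you go the custom SIT route, it can help to prototype the pattern against your test values first. The Python sketch below assumes you want to match nine digits written with dashes, spaces, or no separators, as in the question. The regex and the function name are illustrative assumptions; whether Purview's regex engine accepts this exact syntax (lookarounds and the backreference) would need to be verified before using it in a custom SIT.

```python
import re

# ddd-dd-dddd, ddd dd dddd, or ddddddddd. The backreference \1 requires the
# same separator (dash, space, or none) in both positions.
SSN_CANDIDATE = re.compile(r"(?<!\d)\d{3}([- ]?)\d{2}\1\d{4}(?!\d)")


def find_ssn_candidates(text: str) -> list[str]:
    """Return every substring that looks like a nine-digit SSN candidate."""
    return [match.group(0) for match in SSN_CANDIDATE.finditer(text)]


if __name__ == "__main__":
    samples = ["888-88-8888", "888888888", "123-45-6789", "123 45 6789", "123456789"]
    for sample in samples:
        print(sample, find_ssn_candidates(sample))
```

A pattern this loose matches any nine-digit value, so expect more false positives unless you also keep a keyword requirement.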

    Note: Your current policy works as designed but misses some test data due to formatting and keyword proximity conditions at the Medium level. Switching to Low confidence or defining a custom SIT would help in detecting more variations of SSNs.

    I hope this information helps.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues. 

