MS Purview Classifications with Regex

Question

Pedro Simões

I am working with custom classification rules in Microsoft Purview and would like to clarify how regex-based column and data patterns are evaluated during scans. I have several related questions and observations.

1. Column Pattern Regex Evaluation

When a custom classification rule is applied during a Purview scan, does the column pattern regex evaluate against:

the column name, or
the column displayName, or
both?

This distinction is important because I am working with data sources where the column name and display name differ.

2. Minimum Match Threshold and Sparse Columns

The minimum Data Pattern Match Threshold appears to be 1%. Is there any supported way to classify a column when the actual data match percentage is below 1%?

For example, if a column contains 999 NULL values and only 1 populated value, the effective match rate would be far below 1%. Even with a very permissive data pattern such as .*, it seems that the column would not be classified because of the threshold limitation. Is there any workaround or alternative approach for handling very sparse columns? I am asking because I want to classify PII, and even if only 1 value in 1,000 is populated, it is important to classify the column as PII.

3. Unexpected Classification Behavior with Column Patterns

I expected a column named iss_gendername, from a Dataverse table, to be classified by a custom rule configured as follows:

Column pattern: (?i)^.*gender.*

Data pattern: .*

Minimum match threshold: 1%

However, this column was not classified during the scan.

Notably:

The rule passes the “Test classification rule” feature when using a separate manually built CSV file containing a column named iss_gendername.

Another column named Gender was classified by the same rule.

I observed that both columns share the same Fully Qualified Name (FQN) in Purview.

This raised the question of whether having multiple columns with the same FQN can prevent one of them from being classified or if it has anything to do with the way this column exists inside Dataverse.

Any clarification on these behaviors, particularly around regex evaluation targets, minimum match thresholds, and FQN handling, would be greatly appreciated.

Thank you.

1 answer

Answer 1

Hi Pedro Simões,
It sounds like you have several detailed questions regarding custom classification rules in Microsoft Purview. Let’s break down your inquiries:

Column Pattern Regex Evaluation: When applying custom classification rules, the regex evaluates against both the column name and the column display name. If they differ, both are checked to ensure maximum accuracy.

Minimum Match Threshold and Sparse Columns: You're correct that the minimum threshold for a data pattern match is typically set at 1%. Unfortunately, if the match rate falls below this threshold, the column cannot be classified automatically. A workaround could be to ensure that the scanned data contains more populated rows, since columns made up mostly of NULL values are less likely to be classified. Alternatively, you can apply the classification manually when even a single populated value makes the column sensitive, as is the case for PII.
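To make the arithmetic from the question concrete, here is a minimal Python sketch of the 999-NULL example. The helper `match_rate` and its `count_nulls` flag are hypothetical and only illustrate the percentages involved; whether the Purview scanner actually counts NULL values in the denominator is an assumption that is not confirmed here.

```python
import re
from typing import Optional, Sequence


def match_rate(values: Sequence[Optional[str]], data_pattern: str,
               count_nulls: bool = True) -> float:
    """Return the percentage of values that fully match data_pattern.

    Hypothetical helper for illustration only; it does not reproduce
    the Purview scanner's internal logic.
    """
    regex = re.compile(data_pattern)
    populated = [v for v in values if v is not None]
    matches = sum(1 for v in populated if regex.fullmatch(v))
    # Assumption: NULLs may or may not be part of the denominator.
    denominator = len(values) if count_nulls else len(populated)
    return 100.0 * matches / denominator if denominator else 0.0


# 999 NULL values and a single populated value, as in the question.
column = [None] * 999 + ["female"]

rate_with_nulls = match_rate(column, r".*")                        # 1 of 1000 rows
rate_without_nulls = match_rate(column, r".*", count_nulls=False)  # 1 of 1 populated row

print(f"NULLs counted:  {rate_with_nulls:.1f}%")
print(f"NULLs excluded: {rate_without_nulls:.1f}%")
```

If NULLs are counted, the single match works out to 0.1%, which is below a 1% minimum no matter how permissive the data pattern is.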

Unexpected Classification Behavior: The fact that iss_gendername was not classified may be related to the column sharing the same Fully Qualified Name (FQN) with another column. Duplicate FQNs can affect classification, because Purview may not be able to distinguish between the two columns. It's worth checking how these columns are defined within Dataverse, as that may affect how they are recognized and classified during scans.

Here's a quick recap of steps you might consider:

Double-check the regex patterns to ensure they are accurate and correctly formatted.
Review the data in your columns to ensure they meet the distinct value requirements.
Inspect if your scan rule set includes all necessary custom classifications.
If issues persist, consider classifying columns manually where the appropriate classification is already known.
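For the first step in the recap, the column pattern from the question can be checked locally. The sketch below uses Python's `re` module, which is an assumption: the regex engine used by Purview scans may behave differently. The name `iss_name` is a hypothetical negative example that does not appear in the question.

```python
import re

# Column pattern taken from the question's custom classification rule.
COLUMN_PATTERN = r"(?i)^.*gender.*"

# The two column names from the question, plus a hypothetical
# non-matching name for contrast.
names = ["iss_gendername", "Gender", "iss_name"]

# Map each candidate column name to whether the pattern matches it.
results = {name: bool(re.search(COLUMN_PATTERN, name)) for name in names}

for name, matched in results.items():
    print(f"{name}: {'match' if matched else 'no match'}")
```

Under this engine the pattern matches both iss_gendername and Gender, which is consistent with the rule passing "Test classification rule" and suggests the cause lies elsewhere than the regex syntax.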

If you have any further questions or need clarification on any of these points, feel free to ask!

Hope this helps!
