Ask: I need assistance in creating a functional Sensitive Info Type (SIT) to prevent the unintended or unauthorized sharing of South African ID numbers. I’ve tried using the existing SIT, but it doesn't detect the ID numbers during testing. Additionally, I attempted to create a custom SIT using a regular expression, and although I've found some expressions that are typically used to validate South African ID numbers, they aren't working as expected. The regular expressions work fine on https://regex101.com, but when I apply them in the SIT configuration, they fail.
**Solution:** After further investigation, I discovered that Microsoft Purview uses a subset of .NET regex, so some features behave differently than they do on regex101. For instance, I had to avoid `\b` (word boundary) and instead use `(?<!\d)` as the opener and `(?!\d)` as the closer for the regex.
Here’s a comparison of the original regex we use on Mimecast and the modified version for Purview:
Original Regex (Mimecast):
`\b(([0-57-9]\d(0[1-9]|1[012]))|(61-9)|(601[012]))(0[1-9]|[12][0-9]|3[01])[ -]?\d\d\d\d[ -]?\d\d\d\b`
Modified Regex (Purview):
`(?<!\d)(([0-57-9]\d(0[1-9]|1[0-2]))|(61-9)|(601[0-2]))(0[1-9]|[12][0-9]|3[01])[ -]?\d{4}[ -]?\d{3}(?!\d)`
The modified regex worked in my SIT configuration and distinguishes between valid and invalid ID numbers in my testing.
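For anyone who wants to sanity-check the pattern outside Purview first, here is a minimal sketch in Python. It assumes Python's `re` module treats the lookarounds the same way for these inputs, which is not a guarantee about Purview's engine. The pattern is copied verbatim from the modified regex above; the helper name `find_ids` and the sample strings are made up for illustration and are not real ID numbers.

```python
import re

# Modified (Purview) pattern, copied verbatim from the post.
# (?<!\d) and (?!\d) stand in for \b: no digit may touch either end of the match.
PURVIEW_PATTERN = re.compile(
    r"(?<!\d)(([0-57-9]\d(0[1-9]|1[0-2]))|(61-9)|(601[0-2]))"
    r"(0[1-9]|[12][0-9]|3[01])[ -]?\d{4}[ -]?\d{3}(?!\d)"
)


def find_ids(text):
    """Hypothetical helper: return every substring the pattern matches."""
    return [m.group(0) for m in PURVIEW_PATTERN.finditer(text)]


# Illustrative inputs only.
samples = [
    "ID: 8001015009087.",   # 13 digits, plausible YYMMDD prefix
    "800101 5009 087",      # same digits with space separators
    "18001015009087",       # 14 digits: an extra leading digit
    "8013015009087",        # month field is 13
]

for s in samples:
    print(repr(s), "->", find_ids(s))
```

The first two samples should match and the last two should not. Note that the lookarounds only check for adjacent digits, so a number glued to a letter or underscore still matches, which `\b` would not allow. The regex also checks format only, so a checksum test would need a separate step.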
