Create a keyword dictionary

03/27/2024

Microsoft Purview can identify, monitor, and protect your sensitive items. Identifying sensitive items sometimes requires looking for keywords, particularly when identifying generic content (such as healthcare-related communication) or inappropriate or explicit language. Although you can create keyword lists when you create custom sensitive information types, keyword lists are limited in size and, if you create them in PowerShell, require modifying XML to create or edit them.

In contrast, keyword dictionaries make keywords simpler to manage and work at a much larger scale, supporting up to 1 MB of terms (post-compression) in the dictionary. Keyword dictionaries can also support any language. The tenant limit is also 1 MB after compression, which means that all dictionaries combined across a tenant can have close to one million characters.

Tip

If you're not an E5 customer, use the 90-day Microsoft Purview solutions trial to explore how additional Purview capabilities can help your organization manage data security and compliance needs. Start now at the Microsoft Purview compliance portal trials hub. Learn details about signing up and trial terms.

Keyword dictionary limits

You can create up to 50 sensitive information types (SITs) per tenant that are based on keyword dictionaries. To find out how many keyword dictionary-based SITs you have in your tenant, follow the procedures in Connect to the Security & Compliance PowerShell to connect to your tenant and then run this PowerShell script:

# Temporary file that holds the exported rule package XML
$rawFile = $env:TEMP + "\rule.xml"

# @() guarantees an array, so .Count and indexing behave the same for 0, 1, or many dictionaries
$kd = @(Get-DlpKeywordDictionary)
$ruleCollections = Get-DlpSensitiveInformationTypeRulePackage

# Write the serialized rule packages to disk, then read them back as UTF-16 text
[System.IO.File]::WriteAllBytes($rawFile, $ruleCollections.SerializedClassificationRuleCollection)
$unicodeEncoding = New-Object System.Text.UnicodeEncoding
$fileContent = [System.IO.File]::ReadAllText($rawFile, $unicodeEncoding)

$count = 0
if ($kd.Count -gt 0)
{
    # Each piece after the first starts inside one Entity element, that is, one SIT
    $entities = $fileContent -split "Entity id"
    for ($j = 1; $j -lt $entities.Count; $j++)
    {
        for ($i = 0; $i -lt $kd.Count; $i++)
        {
            # $found is used instead of $Matches, which is an automatic variable
            $found = Select-String -InputObject $entities[$j] -Pattern $kd[$i].Identity -AllMatches
            $count = $found.Matches.Count + $count
            # Stop at the first dictionary that this entity references
            if ($found.Matches.Count -gt 0) { break }
        }
    }
}

# With no keyword dictionaries in the tenant, the total stays at 0
Write-Output "Total Keyword Dictionary SIT:"
$count

Remove-Item $rawFile

Basic steps to creating a keyword dictionary

Most commonly, you compile the keywords for your dictionary in a file, such as a .csv or .txt list. You can upload the dictionary file into a SIT during creation or editing, or import the keywords via a PowerShell cmdlet. Alternatively, you can start from an existing keyword dictionary. Lastly, you can enter keywords manually in the Add keyword dictionary dialog. Whichever method you use, you follow the same core steps, which are described in the sections below.

Create a keyword dictionary using the Microsoft Purview portal or the Microsoft Compliance portal

Use these steps to create or import keywords for a custom dictionary:

The steps below use the Microsoft Purview portal. To learn more about the Microsoft Purview portal, see Microsoft Purview portal. To learn more about the Compliance portal, see Microsoft Purview compliance portal.

Sign in to the Microsoft Purview portal and go to Information Protection > Classifiers > Sensitive info types.
Select + Create sensitive info type and then enter a Name and Description for your sensitive info type. Choose Next.
On the Define patterns for this sensitive info type page, choose + Create pattern.
In the New pattern window, select a Confidence level.
Choose Add a Primary element and select Keyword dictionary.
On the Add a keyword dictionary flyout, you can do one of the following:
1. Upload a dictionary file in TXT or CSV format.
2. Choose from existing dictionaries.
3. Create a new dictionary by entering keywords manually and giving it a name.
Still in the New pattern window, for Character proximity, specify how far away (in number of characters) any supporting elements can be from the primary element and still be detected. The closer the primary and supporting elements are to each other, the more likely the detected content is going to be what you're looking for.
Add the Supporting elements you wish to use to increase the accuracy of detecting what you're looking for.
Add any Additional checks and then choose Create.
Choose Next to continue creating your sensitive information type. When you are finished, choose Done.

Create a keyword dictionary from a file using PowerShell

Often when you need to create a large dictionary, it's so you can use keywords from a file or a list exported from some other source. In the example that follows, you'll create a keyword dictionary containing a list of diseases to screen in external email. To begin, you'll need to connect to Security & Compliance PowerShell.

Copy your keywords into a text file and make sure that each keyword is on a separate line.
Save the text file with Unicode encoding. In Notepad, go to File > Save As > Encoding > Unicode (labeled UTF-16 LE in recent versions of Notepad).

Read the file into a variable by running this cmdlet:

$fileData = [System.IO.File]::ReadAllBytes('<filename>')

Create the dictionary by running this cmdlet:

New-DlpKeywordDictionary -Name <name> -Description <description> -FileData $fileData
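The steps above can be combined into one script. The following is a minimal sketch, not a definitive implementation: the file name diseases.txt, the three sample keywords, the description text, and the temp-folder location are assumptions chosen for illustration. The dictionary is created only if the New-DlpKeywordDictionary cmdlet is available in the current session, that is, after you connect to Security & Compliance PowerShell.

```powershell
# Sample keywords (illustrative values); each one ends up on its own line
$keywords = @('abasia', 'abiotrophy', 'ablation')

# Hypothetical file location; adjust to where you keep your keyword list
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'diseases.txt'

# Save the list with Unicode (UTF-16 LE) encoding, one keyword per line
[System.IO.File]::WriteAllLines($path, [string[]]$keywords, [System.Text.Encoding]::Unicode)

# Read the file back as a byte array, as in the step above
$fileData = [System.IO.File]::ReadAllBytes($path)

# Create the dictionary only when the cmdlet is available (connected session)
if (Get-Command New-DlpKeywordDictionary -ErrorAction SilentlyContinue) {
    New-DlpKeywordDictionary -Name 'Diseases' -Description 'Names of diseases and injuries' -FileData $fileData
}
else {
    Write-Warning 'New-DlpKeywordDictionary not found. Connect to Security & Compliance PowerShell first.'
}
```

Writing the file from the script rather than from a text editor keeps the encoding explicit, so the bytes passed to -FileData do not depend on editor defaults.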

Using keyword dictionaries in custom sensitive information types and DLP policies

Keyword dictionaries can be used as part of the match requirements for a custom sensitive information type, or as a sensitive information type themselves. Both require you to create a custom sensitive information type. Follow the instructions in the linked article to create a sensitive information type. Once you have the XML, you'll need the GUID identifier of the dictionary so that you can reference the dictionary in the XML.

<Entity id="9e5382d0-1b6a-42fd-820e-44e0d3b15b6e" patternsProximity="300" recommendedConfidence="75">
    <Pattern confidenceLevel="75">
        <IdMatch idRef=". . ."/>
    </Pattern>
</Entity>

To get the identity of your dictionary, run this command and copy the Identity property value:

Get-DlpKeywordDictionary -Name "Diseases"

The output of the command looks like this:

RunspaceId : 138e55e7-ea1e-4f7a-b824-79f2c4252255
Identity : 8d2d44b0-91f4-41f2-94e0-21c1c5b5fc9f
Name : Diseases
Description : Names of diseases and injuries from ICD-10-CM lexicon
KeywordDictionary : aarskog's syndrome, abandonment, abasia, abderhalden-kaufmann-lignac, abdominalgia, abduction contracture, abetalipo proteinemia, abiotrophy, ablatio, ablation, ablepharia,abocclusion, abolition, aborter, abortion, abortus, aboulomania, abrami's disease, abramo
IsValid : True
ObjectState : Unchanged

Paste the identity value into the XML for your custom sensitive information type as the idRef. Next, upload the XML file. Your dictionary now appears in your list of sensitive information types and you can use it right in your policy, specifying how many keywords are required to match.

<Entity id="d333c6c2-5f4c-4131-9433-db3ef72a89e8" patternsProximity="300" recommendedConfidence="85">
    <Pattern confidenceLevel="85">
        <IdMatch idRef="8d2d44b0-91f4-41f2-94e0-21c1c5b5fc9f" />
    </Pattern>
</Entity>
<LocalizedStrings>
    <Resource idRef="d333c6c2-5f4c-4131-9433-db3ef72a89e8">
        <Name default="true" langcode="en-us">Diseases</Name>
        <Description default="true" langcode="en-us">Detects various diseases</Description>
    </Resource>
</LocalizedStrings>
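If you prefer not to paste the identity by hand, the idRef can also be set from a script. The following is a sketch under stated assumptions: it works on a stand-alone Entity fragment with an empty idRef placeholder and no XML namespace, and it uses the sample identity value from the output above. A complete rule package typically declares an XML namespace, in which case the XPath lookup would need a namespace manager.

```powershell
# Identity value copied from the Get-DlpKeywordDictionary output shown earlier.
# In a connected session you could read it with:
#   (Get-DlpKeywordDictionary -Name 'Diseases').Identity
$dictionaryId = '8d2d44b0-91f4-41f2-94e0-21c1c5b5fc9f'

# Entity fragment with an empty idRef placeholder (no XML namespace, for illustration)
[xml]$fragment = @'
<Entity id="d333c6c2-5f4c-4131-9433-db3ef72a89e8" patternsProximity="300" recommendedConfidence="85">
  <Pattern confidenceLevel="85">
    <IdMatch idRef="" />
  </Pattern>
</Entity>
'@

# Locate the IdMatch element and point it at the dictionary
$idMatch = $fragment.SelectSingleNode('//IdMatch')
$idMatch.SetAttribute('idRef', [string]$dictionaryId)

# Show the updated fragment
$fragment.OuterXml
```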

Note

Microsoft 365 Information Protection supports double-byte character set languages for:

Chinese (simplified)
Chinese (traditional)
Korean
Japanese

This support is available for sensitive information types. For more information, see Information protection support for double byte character sets release notes (preview).

Tip

To detect patterns containing Chinese/Japanese characters and single byte characters or to detect patterns containing Chinese/Japanese and English, define two variants of the keyword or regex.

For example, to detect a keyword like "机密的document", use two variants of the keyword: one with a space between the Chinese and English text and another without a space. So, the keywords to add to the SIT are "机密的 document" and "机密的document". Similarly, to detect the phrase "東京オリンピック2020", use two variants: "東京オリンピック 2020" and "東京オリンピック2020".
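Generating both variants by hand is error-prone for long lists, so the rule above can be sketched as a small helper. This is an illustration only: the function name Get-KeywordVariant is hypothetical, and the sketch assumes that "Chinese/Japanese characters" means the CJK Unified Ideographs, Hiragana, and Katakana Unicode blocks and that "single byte characters" means ASCII letters and digits.

```powershell
function Get-KeywordVariant {
    param([Parameter(Mandatory)][string]$Keyword)

    # Assumed character sets (see the lead-in): CJK/kana on one side, ASCII letters and digits on the other
    $cjk   = '[\p{IsCJKUnifiedIdeographs}\p{IsHiragana}\p{IsKatakana}]'
    $ascii = '[A-Za-z0-9]'

    # Variant without a space: remove spaces that sit directly on a CJK/ASCII boundary
    $compact = [regex]::Replace($Keyword, "(?<=$cjk) +(?=$ascii)|(?<=$ascii) +(?=$cjk)", '')

    # Variant with a space: insert one space at every CJK/ASCII boundary
    $spaced = [regex]::Replace($compact, "(?<=$cjk)(?=$ascii)|(?<=$ascii)(?=$cjk)", ' ')

    # Keywords without a boundary produce a single variant
    @($compact, $spaced) | Select-Object -Unique
}

# Example calls using the keywords from the text above
Get-KeywordVariant '机密的document'
Get-KeywordVariant '東京オリンピック2020'
```

A keyword with no boundary between the two character sets, such as "Highly confidential", comes back unchanged as a single entry.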

If the list of keywords or phrases contains both Chinese/Japanese/double byte characters and other words (for instance, stand-alone English words), you should create two dictionaries or keyword lists: one for keywords containing Chinese/Japanese/double byte characters and another for English words.

For example, if you want to create a keyword dictionary/list with three phrases "Highly confidential", "機密性が高い" and "机密的document", then you should create two keyword lists:
1. Highly confidential
2. 機密性が高い, 机密的document, and 机密的 document

When you create a regex that uses a double byte hyphen or a double byte period, make sure to escape both characters, as you would escape a hyphen or period in a regex. Here is a sample regex for reference:

(?<!\d)([4][0-9]{3}[\-?\－\t]*[0-9]{4})
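To see the escaping in action, the following sketch tests a simplified variant of the sample regex against a few strings. It is an illustration under assumptions: the separator class here contains only an ASCII hyphen, an escaped double byte (full-width) hyphen (U+FF0D), and a tab, the capture group is closed, and the sample strings are made up. The full-width hyphen is built from its code point so that the script does not depend on file encoding.

```powershell
# Build the double byte (full-width) hyphen from its code point, U+FF0D
$fullWidthHyphen = [char]0xFF0D

# Separator class: ASCII hyphen, escaped full-width hyphen, or tab
$pattern = '(?<!\d)([4][0-9]{3}[\-\' + $fullWidthHyphen + '\t]*[0-9]{4})'

# Made-up samples: ASCII hyphen, full-width hyphen, and a string that does not start with 4
$samples = @(
    '4123-4567'
    ('4123' + $fullWidthHyphen + '4567')
    '5123-4567'
)

foreach ($s in $samples) {
    # Prints each sample followed by True or False
    '{0} -> {1}' -f $s, ($s -match $pattern)
}
```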

We recommend using a string match instead of a word match in a keyword list.
