Export source data for exact data match based sensitive information types

12/11/2023



The sensitive data table is a text file containing rows of values against which you compare the content in your documents to identify sensitive data. These values might be personally identifiable information, product records, or other sensitive data in text form that you want to detect in your content and protect.

Once you export the data in your table (in one of the supported formats), you can create an EDM schema.

Defining your EDM sensitive type

When you define your EDM sensitive type, one of the most critical decisions is to define which fields are your primary fields. Primary fields need to follow a detectable pattern and be defined as searchable fields (columns) in your EDM schema. Secondary fields don't need to follow any pattern since they'll be compared against all the text surrounding matches to the primary fields.

Use these rules to help you decide which columns you should use as primary fields:

If you must detect sensitive data based on the presence of a single value matching a field in your sensitive data table, regardless of the presence of any other sensitive data surrounding it, that column must be defined as a primary element for an EDM SIT.
If multiple combinations of different fields in your sensitive data table must be detected in content, identify the columns that are common to most such combinations and designate them as primary elements. Designate combinations of the other fields as secondary elements.
If a column you want to use as a primary element doesn't follow a detectable pattern (for example, it can contain any text string), or follows a pattern that would appear somewhere in a large percentage of documents or emails, choose other, better-structured columns as primary elements.

For example, suppose you have the columns full name, date of birth, account number, and Social Security number. Even if the first and last names are common to the different combinations of data you want to detect, names don't follow an easily identifiable pattern and might be difficult to define as a sensitive information type, for several reasons:

Some names might not start with an uppercase character.
Some might consist of two, three, or more words.
Some might contain numbers or other non-alphabetical characters.

Dates of birth can be identified more easily but, since every email and most documents will contain at least one date, a DateOfBirth field is also not a good candidate. Instead, use fields such as Social Security numbers and account numbers, which are good candidates for primary fields.
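
As a rough illustration of this reasoning, the following Python sketch measures how consistently each column's values follow a candidate pattern. The regular expressions, the column names, and the sample rows are hypothetical assumptions made for this example; they are not the definitions used by any built-in sensitive information type.

```python
import re

# Hypothetical candidate patterns, chosen only for illustration.
CANDIDATE_PATTERNS = {
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "AccountNumber": re.compile(r"\d{10}"),
    "FullName": re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+"),
}


def pattern_coverage(rows, patterns):
    """Return, per column, the fraction of values that fully match its pattern."""
    coverage = {}
    for column, pattern in patterns.items():
        values = [row[column] for row in rows]
        matched = sum(1 for value in values if pattern.fullmatch(value))
        coverage[column] = matched / len(values) if values else 0.0
    return coverage


# Synthetic sample rows (made up for this sketch).
sample_rows = [
    {"FullName": "Ana Maria de la Cruz", "SSN": "123-45-6789", "AccountNumber": "0012345678"},
    {"FullName": "bell hooks", "SSN": "987-65-4320", "AccountNumber": "0098765432"},
    {"FullName": "John Smith", "SSN": "555-12-3456", "AccountNumber": "0055512345"},
]

coverage = pattern_coverage(sample_rows, CANDIDATE_PATTERNS)
for column, fraction in coverage.items():
    print(f"{column}: {fraction:.0%} of values match the candidate pattern")
```

In this sample, every SSN and account number matches its pattern, while only one of the three names does. Note that a check like this only measures how regular a column is. It doesn't tell you how often the same pattern appears in ordinary content, which is the reason a date of birth column remains a poor primary field even though dates are easy to match.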

Sample file templates

To make selecting your primary fields easier, we've put together sample file templates for several industry verticals.

These are comma-separated values (.csv) files that have the most commonly used values across those industry verticals as column headers, along with Microsoft-generated synthetic values in the rows. The column headers suggest the most relevant fields, so use them to help you decide on your primary fields. As a best practice, export only the source data that is required.

To learn how to use the sample file templates, see How to use the sample file templates.

Save sensitive data in .csv, .tsv, or pipe-separated format

Identify the sensitive information you want to use. Export the data to an app such as Microsoft Excel and save the file as a text file in one of the following formats: .csv (comma-separated values), .tsv (tab-separated values), or pipe-separated (|). The .tsv format is recommended when your data values might include commas, such as street addresses. The data file can include:
- Up to 100 million rows of sensitive data
- Up to 32 columns (fields) per data source
- Up to 10 columns (fields) marked as searchable
Structure the sensitive data in the .csv or .tsv file such that the first row includes the names of the fields used for EDM-based classification. In your file you might have field names such as "ssn", "birthdate", "firstname", "lastname". The column header names can't include spaces or underscores. For example, the sample .csv file that we use in this article is named PatientRecords.csv, and its columns include PatientID, MRN, LastName, FirstName, SSN, and more.
Pay attention to the format of the sensitive data fields, in particular fields that might contain commas in their content. For example, a street address that contains the value "Seattle, WA" would be parsed as two separate fields if the .csv format is selected. To avoid this, use the .tsv format or surround the comma-containing values with double quotes in the sensitive data table. If comma-containing values also contain spaces, you need to create a custom SIT that matches the corresponding format, for example, a SIT that detects multi-word strings with commas and spaces in them.
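
The formatting guidance above can be sketched in Python with the standard csv module. This is a minimal example under stated assumptions, not an official export tool: the function names check_header and export_table, the Address column, and the sample row are hypothetical. The sketch checks the column-count and header-name rules stated above, then serializes the same table as .tsv and as .csv so you can see how a value that contains commas is handled in each format.

```python
import csv
import io

MAX_COLUMNS = 32     # per-data-source column limit stated above
MAX_SEARCHABLE = 10  # searchable-column limit stated above


def check_header(header, searchable):
    """Return a list of problems found in the column headers (empty if none)."""
    problems = []
    if len(header) > MAX_COLUMNS:
        problems.append(f"{len(header)} columns exceeds the limit of {MAX_COLUMNS}")
    if len(searchable) > MAX_SEARCHABLE:
        problems.append(
            f"{len(searchable)} searchable columns exceeds the limit of {MAX_SEARCHABLE}"
        )
    for name in header:
        if " " in name or "_" in name:
            problems.append(f"column name {name!r} contains a space or underscore")
    for name in searchable:
        if name not in header:
            problems.append(f"searchable column {name!r} is not in the header")
    return problems


def export_table(header, rows, delimiter="\t"):
    """Serialize the table as text; fields containing the delimiter are double-quoted."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical header and synthetic row, for illustration only.
header = ["PatientID", "LastName", "FirstName", "SSN", "Address"]
rows = [["1001", "Smith", "John", "123-45-6789", "123 Main St, Seattle, WA"]]

header_problems = check_header(header, searchable=["PatientID", "SSN"])
tsv_text = export_table(header, rows, delimiter="\t")
csv_text = export_table(header, rows, delimiter=",")

print(header_problems)
print(csv_text)
```

With the tab delimiter, the address is written unchanged because it contains no tabs. With the comma delimiter, the writer wraps the address in double quotes so it stays a single field. To write an actual file instead of an in-memory string, pass a file opened with newline="" to csv.writer, as the Python csv documentation recommends.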

Next step

For new experience: Create EDM SIT sample file for the new experience

For classic experience: Create the schema for exact data match based sensitive information types
