

Create an EDM SIT sample file (New experience)

Creating an exact data match (EDM) based sensitive information type (SIT) and making it available is a multi-phase process. EDM SITs can be used in Microsoft Purview data loss prevention policies, eDiscovery, and certain content governance tasks.

Tip

If you're not an E5 customer, use the 90-day Microsoft Purview solutions trial to explore how additional Purview capabilities can help your organization manage data security and compliance needs. Start now at the Microsoft Purview compliance portal trials hub. Learn details about signing up and trial terms.

Applies to

  • New experience

To create an EDM SIT using the classic experience, see Create EDM SIT classic experience.

Before you begin

Formatting the sample file

The system extracts the column names from the sample file to create the schema and recommends base SITs to map the sample field data to. The sample file must be formatted identically to your source sensitive information table file and should contain synthetic values that are representative of your actual data. You can save the file in .csv (comma-separated values), .tsv (tab-separated values), or pipe-separated (|) format, but the format should match that of your actual source sensitive information table file. The .tsv format is recommended when your data values include commas, such as street addresses.

  • Use about 10-20 rows of data to ensure that the system has enough samples to work with.
  • Field values that contain commas must be enclosed in double quotation marks (").
  • The first row must be the header row and contain column names.
  • The file must contain at least one row of data.
  • Each row of data must contain the correct number of fields, corresponding to the headers.
  • The sample file can contain up to 32 columns.
  • The sample file can't exceed 2.5 MB in size.
  • Column (field) names must start with a letter, be at least three characters long, and consist only of alphanumeric characters (A-Z, a-z, 0-9). They can't include spaces, underscores, or other special characters.
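The structural rules in the list above can be checked before you upload a file. The following Python sketch isn't part of Microsoft Purview; the function name check_sample_text is hypothetical, and the size limit assumes that 1 MB means 1024 × 1024 bytes. It only checks the rules listed here and doesn't judge whether the values are representative.

```python
import csv
import io
import re

MAX_COLUMNS = 32
MAX_BYTES = int(2.5 * 1024 * 1024)  # assumption: 1 MB = 1024 * 1024 bytes
# Starts with a letter, alphanumeric only, at least three characters in total.
COLUMN_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]{2,}")


def check_sample_text(text, delimiter=","):
    """Return a list of rule violations found in the sample file text."""
    problems = []
    if len(text.encode("utf-8")) > MAX_BYTES:
        problems.append("file is larger than 2.5 MB")

    # newline="" lets the csv module handle line endings inside quoted fields.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return problems + ["file is empty; a header row is required"]

    header, data = rows[0], rows[1:]
    if len(header) > MAX_COLUMNS:
        problems.append(f"{len(header)} columns; the limit is {MAX_COLUMNS}")
    for name in header:
        if not COLUMN_NAME.fullmatch(name):
            problems.append(f"invalid column name: {name!r}")
    if not data:
        problems.append("no data rows; at least one is required")
    # Row numbers count non-blank rows, with the header as row 1.
    for number, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {number} has {len(row)} fields, expected {len(header)}"
            )
    return problems


sample = "FirstName\tLastName\tPatientNumber\nEric\tSolomon\t987-65-4321\n"
print(check_sample_text(sample, delimiter="\t"))
```

An empty list means that no violations of these rules were found. Pass delimiter="|" for pipe-separated files.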

For example, if your actual data uses tab delimited (.tsv) format and looks like this:

Image showing a tab-separated table with four columns and three rows of artificial data.

Then your sample file must have the same column headers but use synthetic values for the rows, like this:

FirstName LastName PatientNumber CreditCardNumber
Eric Solomon 987-65-4321 9000000000000000
Lisa Taylor 123-45-6789 500000000000000
Andre Lawson 234-56-7890 200000000000000
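If you prefer to generate the sample file with a script, the following Python sketch writes the synthetic rows shown above in tab-separated form. It assumes the columns in the example are separated by tabs; the function name build_sample_tsv is hypothetical.

```python
import csv
import io

HEADER = ["FirstName", "LastName", "PatientNumber", "CreditCardNumber"]
ROWS = [
    ["Eric", "Solomon", "987-65-4321", "9000000000000000"],
    ["Lisa", "Taylor", "123-45-6789", "500000000000000"],
    ["Andre", "Lawson", "234-56-7890", "200000000000000"],
]


def build_sample_tsv(header, rows):
    """Return the header and rows as tab-separated text."""
    buffer = io.StringIO()
    # The csv module quotes any value that contains the delimiter.
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


sample_text = build_sample_tsv(HEADER, ROWS)
print(sample_text)
```

To save the result, write sample_text to a file with the .tsv extension. Change the delimiter argument if your source sensitive information table file uses commas or pipes instead.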

How to use the sample file templates

If you're in the U.S. Healthcare, U.S. Financial Services, or U.S. Insurance industry verticals, you can start with the following sample file templates to speed up the sample file creation process. These files contain the most commonly used column headers across the respective industries as well as synthetic values in the fields.

To use these templates:

  1. Download the sample file template for your industry.
  2. Compare the column headers in the template to your actual source data and pick the ones you want to use as primary fields in your customized sample file.
  3. Compare the formatting of your actual source data with the formatting of the synthetic values. Change the formatting of the synthetic values to match the formatting of your source data values.
  4. Save your customized sample file to use when you create the EDM SIT schema and rule package.
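Steps 2 and 3 are comparisons that can be partly scripted. The following Python sketch is one possible approach, not a Purview feature. The function names and the example headers and values are hypothetical. It lists the template headers that also occur in your source data, and flags columns where a template value and a source value have different formats.

```python
import re


def shape(value):
    """Reduce a value to its format: each digit becomes 9, each run of letters becomes A."""
    digits_masked = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]+", "A", digits_masked)


def shared_headers(template_header, source_header):
    """Return the template column names that also appear in the source data."""
    source = set(source_header)
    return [name for name in template_header if name in source]


def format_mismatches(template_row, source_row):
    """Return the columns whose values differ in format between two rows (dicts)."""
    return [
        name
        for name in template_row
        if name in source_row
        and shape(template_row[name]) != shape(source_row[name])
    ]


# Hypothetical template and source values, for illustration only.
template_row = {"LastName": "Solomon", "PatientNumber": "987654321"}
source_row = {"LastName": "Taylor", "PatientNumber": "123-45-6789"}

print(shared_headers(["FirstName", "LastName", "PatientNumber"], list(source_row)))
print(format_mismatches(template_row, source_row))
```

Here PatientNumber is flagged because the template value has no hyphens while the source value does, so the synthetic value needs to be reformatted to match the source format.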

Tip

When working in the new experience, you have the option to upload a sample file or enter the sample file values manually. We recommend creating the sample file.

Next step