Hash and upload the sensitive information source table for exact data match sensitive information types

This article shows you how to hash and upload your sensitive information source table.

Tip

If you're not an E5 customer, use the 90-day Microsoft Purview solutions trial to explore how additional Purview capabilities can help your organization manage data security and compliance needs. Start now at the Microsoft Purview compliance portal trials hub. Learn details about signing up and trial terms.

Hash and upload the sensitive information source table

In this phase, you:

  1. Set up a custom security group and user account.
  2. Set up the EDM Upload Agent tool.
  3. Use the EDM Upload Agent tool to hash the sensitive information source table with a salt value, and then upload it.

The hashing and uploading can be done using one computer or you can separate the hash step from the upload step for greater security.

If you want to hash and upload from one computer, you need to do it from a computer that can directly connect to your Microsoft 365 tenant. This requires that your clear-text sensitive information source table file is on that computer for hashing.

If you don't want to expose your clear-text sensitive information source table file on the direct access computer, you can hash it on a computer that's in a secure location. Then, you can copy the hash file and the salt file to a computer that can connect directly to your Microsoft 365 tenant for upload. In the separated hash and upload scenario, you'll need the EDMUploadAgent on both computers.

Important

If you used the Exact Data Match schema and sensitive information type wizard to create your schema file, you must download the schema for this procedure if you haven't already done so. See Export of the EDM schema file in XML format.

Note

If your organization has set up Customer Key for Microsoft 365 at the tenant level, exact data match uses its encryption functionality automatically. This is available only to E5 licensed tenants in the Commercial cloud.

Best practices

Separate the processes of hashing and uploading the sensitive data so you can more easily isolate any issues in the process.

Once in production, keep the two steps separate in most cases. Performing the hashing process on an isolated computer and then transferring the file for upload to an internet-facing computer ensures that the actual data is never available in clear text form on a computer that could have been compromised due to its connection to the Internet.

Ensure your sensitive data table doesn't have formatting issues

Before you hash and upload your sensitive data, search it for special characters that may cause problems when the content is parsed. You can check that the table is in a format suitable for EDM by running the EDM Upload Agent with the following syntax:

EdmUploadAgent.exe /ValidateData /DataFile [data file] /Schema [schema file]

If the tool indicates a mismatch in number of columns, it might be due to the presence of commas or quote characters within values in the table that are being confused with column delimiters. Unless they're surrounding a whole value, single and double quotes can cause the tool to misidentify where an individual column starts or ends.

If you find single or double quote characters surrounding full values, you can leave them as they are.

If you find single quote characters or commas inside a value, for example the person's name Tom O'Neil or the city 's-Gravenhage (which starts with an apostrophe character), you need to modify the data export process used to generate the sensitive information table so that it surrounds such columns with double quotes.

If double quote characters are found inside values, it might be preferable to use the Tab-delimited format for the table, which is less susceptible to such issues.
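
As a rough pre-check before running /ValidateData, the checks described above can be sketched in Python. This is a minimal sketch and not part of the EDM Upload Agent: the function name find_suspect_rows is hypothetical, and it assumes the first row is a header and that double-quoted values follow ordinary CSV rules. It reports rows whose column count differs from the header and rows that contain a double quote character inside a value.

```python
import csv
import io


def find_suspect_rows(text, delimiter=","):
    """Return (row_number, reason) pairs for rows that may break column parsing."""
    issues = []
    expected = None
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        if expected is None:
            expected = len(row)  # assumption: the first row is the header
        elif len(row) != expected:
            issues.append((row_number, f"expected {expected} columns, found {len(row)}"))
        # A double quote that survives parsing sits inside a value, not around it.
        if any('"' in value for value in row):
            issues.append((row_number, "double quote inside a value"))
    return issues


if __name__ == "__main__":
    sample = (
        "Name,City,SSN\n"
        "\"Tom O'Neil\",Seattle,123\n"       # quoted value: parses cleanly
        "Bob Ray,Portland, OR,789\n"         # unquoted comma: extra column
        "Joe \"JJ\" Smith,Boise,321\n"       # double quote inside a value
    )
    for row_number, reason in find_suspect_rows(sample):
        print(f"row {row_number}: {reason}")
```

Rows flagged for a column mismatch are candidates for double-quoting in the export process; rows flagged for embedded double quotes are a hint that the Tab-delimited format may be the safer choice.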

Prerequisites

  • a work or school account for Microsoft 365 to add to the EDM_DataUploaders security group
  • a Windows 10, Windows Server 2016 with .NET version 4.6.2, or a Windows Server 2019 machine for running the EDMUploadAgent
  • a directory on your upload machine for the following:
    • EDM Upload Agent
    • your sensitive item file in .csv, .tsv or pipe (|) format, PatientRecords.csv in our examples
    • the output hash and salt files created in this procedure
    • the datastore name from the edm.xml file; in this example, it's PatientRecords

Important

  1. If you're using Windows Server 2016 or earlier, you must also install Visual C++ before you install the EDM Upload Agent.

  2. Install the EDM Upload Agent in a custom folder so you don't need administrator permissions. If you install it into the default (Program Files), administrator permissions are required.

Set up the security group and user account

  1. As a global administrator, go to the admin center using the appropriate link for your subscription and create a security group called EDM_DataUploaders.

  2. Add one or more users to the EDM_DataUploaders security group. (These users manage the database of sensitive information.)

Hash and upload from one computer

This computer must have direct access to your Microsoft 365 tenant.

Note

Before you begin this procedure, make sure that you are a member of the EDM_DataUploaders security group.

Tip

Optionally, you can run a validation against your sensitive information source table file to check it for errors before uploading by running:

EdmUploadAgent.exe /ValidateData /DataFile [data file] /Schema [schema file]

For more information on all the parameters that EdmUploadAgent.exe supports, run:

EdmUploadAgent.exe /?

  1. Create a working directory for the EDMUploadAgent. For example, C:\EDM\Data. Place the PatientRecords.csv file there.

  2. Download and install the appropriate EDM Upload Agent for your subscription into the directory you created in step 1.

    • Commercial + GCC - Most commercial customers should use this option.
    • GCC-High - This option is specifically for high-security government cloud subscribers.
    • DoD - This option is specifically for United States Department of Defense cloud customers.

    Note

    The EDMUploadAgent at the above links has been updated to automatically add a salt value to the hashed data. Alternatively, you can provide your own salt value. Once you have used this version, you will not be able to use the previous version of the EDMUploadAgent.

    You can upload data with the EDMUploadAgent to any given data store up to five times per day.

  3. To authorize the EDM Upload Agent, open a Command Prompt window as an administrator, switch to the C:\EDM\Data directory, and then run the following command:

    EdmUploadAgent.exe /Authorize

    Important

    You must run the EdmUploadAgent from the folder where it's installed, and indicate the full path to your data files.

  4. Sign in with your work or school account for Microsoft 365 that was added to the EDM_DataUploaders security group. Your tenant information is extracted from the user account to make the connection.

    IMPORTANT: If you used the Exact Data Match schema and sensitive information type wizard to create your schema, you must download it for use in this procedure if you haven't already. Run this command in a Command Prompt window:

    EdmUploadAgent.exe /SaveSchema /DataStoreName <schema name> /OutputDir <path to output folder>
    
  5. To hash and upload the sensitive data, run the following command in a Command Prompt window:

    EdmUploadAgent.exe /UploadData /DataStoreName [DS Name] /DataFile [data file] /HashLocation [hash file location] /Schema [Schema file] /ColumnSeparator ["{Tab}"|"|"] /AllowedBadLinesPercentage [value]
    

    Note

    The default format for the sensitive data file is comma-separated values. You can specify a tab-separated file by indicating the "{Tab}" option with the /ColumnSeparator parameter, or you can specify a pipe-separated file by indicating the "|" option.

    Example: EdmUploadAgent.exe /UploadData /DataStoreName PatientRecords /DataFile C:\Edm\Data\PatientRecords.csv /HashLocation C:\Edm\Hash /Schema edm.xml /AllowedBadLinesPercentage 5
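
If you script this step, the /UploadData command line from step 5 can be assembled programmatically. The following is a minimal Python sketch under stated assumptions: the helper names build_upload_data_command and run_from_agent_folder are hypothetical, only the parameters shown above are used, and the agent is assumed to be installed in the working directory from step 1. The sketch passes the working directory to subprocess so that the agent runs from the folder where it's installed.

```python
import subprocess
from pathlib import PureWindowsPath


def build_upload_data_command(agent_dir, data_store, data_file, hash_location,
                              schema_file, column_separator=None,
                              allowed_bad_lines_percentage=None):
    """Build the argument list for EdmUploadAgent.exe /UploadData."""
    exe = str(PureWindowsPath(agent_dir) / "EdmUploadAgent.exe")
    command = [
        exe, "/UploadData",
        "/DataStoreName", data_store,
        "/DataFile", data_file,          # use the full path to the data file
        "/HashLocation", hash_location,
        "/Schema", schema_file,
    ]
    if column_separator is not None:
        # "{Tab}" or "|"; omit for the default comma-separated format
        command += ["/ColumnSeparator", column_separator]
    if allowed_bad_lines_percentage is not None:
        command += ["/AllowedBadLinesPercentage", str(allowed_bad_lines_percentage)]
    return command


def run_from_agent_folder(command, agent_dir):
    """Run the agent with its install folder as the current directory."""
    # The meaning of the agent's exit codes is not covered here, so the
    # result is returned for the caller to inspect rather than checked.
    return subprocess.run(command, cwd=agent_dir)


if __name__ == "__main__":
    cmd = build_upload_data_command(
        r"C:\EDM\Data", "PatientRecords",
        r"C:\EDM\Data\PatientRecords.csv", r"C:\EDM\Hash",
        r"C:\EDM\Data\edm.xml", allowed_bad_lines_percentage=5,
    )
    print(" ".join(cmd))
```

Building the arguments as a list avoids hand-quoting paths; run the /Authorize step interactively first, because this sketch doesn't handle sign-in.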

EDM and double-byte character set languages

Exact data match supports double-byte characters, such as those used in Chinese, Japanese, and Korean. However, it does not support string matches for corroborative evidence encoded as double-byte characters. Neither does it match multi-token CJK text detected in the classified content, unless globalization for EDM has been enabled as described below. In all cases, a SIT must be mapped to any multi-token text, both for the primary field and for corroborative evidence fields.

Important

To invoke exact data matching for double-byte characters, you need to take the following steps:

  1. Create an EDM Sensitive Information Type (SIT) that’s intended to match on the double-byte character set language, such as Japanese kanji.

  2. Ensure you have downloaded and installed version 17.01.0495.0 (or later) of the EDM Upload Agent.

  3. Update the EDMUploadAgent.exe.config file’s globalization parameter to true: <add key="IsGlobalizationEnabled" value="true" />

  4. Hash and upload a source table with the data to be matched.
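
The configuration change in step 3 can also be scripted. The following Python sketch is illustrative only: it assumes that EDMUploadAgent.exe.config follows the usual .NET layout of a configuration root element with an appSettings child, which this article doesn't confirm, and the function name enable_globalization is hypothetical. Back up the file before you rewrite it, because the standard library XML parser drops comments.

```python
import xml.etree.ElementTree as ET


def enable_globalization(config_xml):
    """Return the config XML text with IsGlobalizationEnabled set to true."""
    root = ET.fromstring(config_xml)
    # Assumption: settings live under <configuration><appSettings>.
    app_settings = root.find("appSettings")
    if app_settings is None:
        app_settings = ET.SubElement(root, "appSettings")
    for entry in app_settings.findall("add"):
        # Tolerate stray whitespace around the key name.
        if (entry.get("key") or "").strip() == "IsGlobalizationEnabled":
            entry.set("key", "IsGlobalizationEnabled")
            entry.set("value", "true")
            break
    else:
        # No existing entry was found, so add one.
        ET.SubElement(app_settings, "add",
                      {"key": "IsGlobalizationEnabled", "value": "true"})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<configuration><appSettings>"
        '<add key="IsGlobalizationEnabled" value="false" />'
        "</appSettings></configuration>"
    )
    print(enable_globalization(sample))
```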

Separate hash and upload

Perform the hash on a computer in a secure environment. You must have the EDMUploadAgent installed on both computers.

OPTIONAL: If you used the Exact Data Match schema and sensitive information type wizard to create your schema and you haven't already downloaded it, run the following command in a Command Prompt window to download the file in XML format:

EdmUploadAgent.exe /SaveSchema /DataStoreName <schema name> /OutputDir <path to output folder>

  1. On the computer in the secure environment, run the following command in a Command Prompt window:

    EdmUploadAgent.exe /CreateHash /DataFile [data file] /HashLocation [hash file location] /Schema [Schema file] /AllowedBadLinesPercentage [value]
    

    For example:

    EdmUploadAgent.exe /CreateHash /DataFile C:\Edm\Data\PatientRecords.csv /HashLocation C:\Edm\Hash /Schema edm.xml /AllowedBadLinesPercentage 5
    

    Note

    The default format for the sensitive data file is comma-separated values. You can specify a tab-separated file by indicating the "{Tab}" option with the /ColumnSeparator parameter, or you can specify a pipe-separated file by indicating the "|" option.

    If you didn't specify the /Salt <saltvalue> option, this command outputs a hashed file and a salt file with these extensions:

    • .EdmHash
    • .EdmSalt

  2. Copy these files in a secure fashion to the computer you use to upload your sensitive information source table file (PatientRecords) to your tenant.

  3. To authorize the EDM Upload Agent, open a Command Prompt window as an administrator, switch to the C:\EDM\Data directory, and then run the following command:

    EdmUploadAgent.exe /Authorize
    

    Important

    You must run the EdmUploadAgent from the folder where it's installed, and indicate the full path to your data files.

  4. Sign in with your work or school account for Microsoft 365 that was added to the EDM_DataUploaders security group. Your tenant information is extracted from the user account to make the connection.

  5. To upload the hashed data, run the following command in Windows Command Prompt:

    EdmUploadAgent.exe /UploadHash /DataStoreName <DataStoreName> /HashFile <HashedSourceFilePath> /ColumnSeparator ["{Tab}"|"|"]
    

    For example:

    EdmUploadAgent.exe /UploadHash /DataStoreName PatientRecords /HashFile C:\Edm\Hash\PatientRecords.EdmHash
    
  6. To verify that your sensitive data has been uploaded, run the following command in a Command Prompt window:

    EdmUploadAgent.exe /GetDataStore
    

    You see a list of data stores and when they were last updated.

  7. To see all the data uploads to a particular data store, run the following command in a Command Prompt window:

    EdmUploadAgent.exe /GetSession /DataStoreName <DataStoreName>
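
When you copy the .EdmHash and .EdmSalt files between the two computers in step 2, it can help to confirm that they arrived unchanged before you upload. The following Python sketch records SHA-256 digests on the secure computer and compares them on the upload computer. It is an illustrative integrity check only: it is unrelated to the hashing that the EDM Upload Agent performs, it doesn't secure the transfer itself, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path

# Extensions of the files produced by /CreateHash, compared case-insensitively.
TRANSFER_SUFFIXES = {".edmhash", ".edmsalt"}


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_transfer_files(folder):
    """Map each .EdmHash/.EdmSalt file name in folder to its digest."""
    return {
        path.name: sha256_of(path)
        for path in sorted(Path(folder).iterdir())
        if path.is_file() and path.suffix.lower() in TRANSFER_SUFFIXES
    }


def verify_transfer(expected, folder):
    """Return names of files that are missing or differ from the expected digests."""
    actual = digest_transfer_files(folder)
    return sorted(name for name, digest in expected.items()
                  if actual.get(name) != digest)
```

Run digest_transfer_files on the folder used for /HashLocation on the secure computer, carry the resulting dictionary across by a separate channel, and call verify_transfer on the upload computer; an empty list means every file matches.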
    

Note

To automate the hash and upload process after you have created it the first time, see Refresh your exact data match sensitive information source table file.
