Parsing a field within a CSV

Question

Hey team,

I am using ADF to process a CSV file. the CSV file has the following format

ProductID,ProductCat,Details

1,"Category1","""Cat1"",9,""Y"",""test1@yahoo.com"","

2,"Category2","""Cat2"",5,""N"",""test2@gmail.com"","

IF you save this as a csv and open in Excel , you would see that the third field Details is really a concatenation of a number of sub-fields.

The Detail field itself contains a CSV text with " as the text qualifier. if you notice the 4th attribute within Details is the email field.

We have now been asked if we could mask the emails with in the Details field.

Is there any way to achieve this in ADF please?

Please guide.

Answer

Yes, you can achieve email masking in the Details field within your CSV file using Azure Data Factory (ADF). You can accomplish this by using a combination of ADF components and expressions.

Here's a high-level approach to masking the emails in the Details field:

Create Create a new pipeline in ADF.
Add a "Copy Data" activity to read the CSV file from the source datastore (e.g., Blob Storage or Azure Data Lake Storage).
In the "Source" tab of the "Copy Data" activity, configure the source dataset with your CSV file format and settings.
Add a "Derived Column" transformation in the Mapping Data Flows.
Use the "split()" function in the "Derived Column" transformation to split the Details field using a comma (,) as the delimiter.
Use the "replace()" function to mask the email addresses in the corresponding sub-field (4th attribute within Details). You can use a regular expression to find the email addresses and replace them with the masked value.
Use the "concat()" function to reconstruct the Details field after masking the email addresses.
In the "Sink" tab of the "Copy Data" activity, configure the output dataset where you want to write the processed CSV file.
Publish the pipeline and execute it.

Here's a sample expression to be used in the "Derived Column" transformation for masking the email addresses within the Details field:

concat(split(Details, ',')[0], ',',
       split(Details, ',')[1], ',',
       split(Details, ',')[2], ',',
       replace(split(Details, ',')[3], regex_match(split(Details, ',')[3], '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'), '***MASKED***'), ',',
       split(Details, ',')[4])

This example assumes that the Details field has exactly five sub-fields, and the email address is in the fourth sub-field. Adjust the expression as needed to match the actual structure of your Details field.

Answer

@MS Prog ,

Thankyou for using Microsoft Q&A platform and thanks for posting your question here.

As I understand your query, you are trying to mask the mailid before '@'. Please let me know if that's not the case.

You can try this approach:

Use no delimiter as column delimiter in the source dataset so that the whole dataset will be treated as the single column
Now split the data by using @ symbol and mask the data before @ and after " symbol by using replace and substring function . Use column pattern to replace all characters by * in order to mask the data.
Finally concat the data by using concat function.

Find more details of all functions here: https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expression-functions

Hope it helps . Let me know if you are stuck anywhere so that I can try to repro and produce detailed steps. Kindly accept the answer if its helpful. Thanks.

Parsing a field within a CSV

2 answers