Share via

Azure Data Factory: Flattening/normalizing a cloumn from CSV file using Azure Data Factory activity

Utkarsh Sharma 41 Reputation points
2020-07-03T11:56:56.683+00:00

HI,
I have pulled a csv file from one of our source using ADF and there is one column called "attributes" which contains multiple fields (in the form of key value pairs). Now I want to expand that column into different fields (columns). Below is the sample of that:

leadId activityDate activityTypeId campaignId primaryAttributeValue attributes
1234 2020-06-22T00:00:44Z 46 33686 Mail {"Description":"Clicked: https://stepuptostepout.com/","Source":"Lead action","Date":"2020-06-21 19:00:44"}
5678 2020-06-22T00:01:54Z 13 33128 SMS {"Reason":"Changed","New Value":110,"Old Value":null,"Source":"Marketo Flow Action"}

Here the attributes column have different Key-value pairs and I want them in different column so that I can store them in Azure SQL Database:

attributes
{"Reason":"Changed","New Value":110,"Old Value":null,"Source":"Marketo"}

I want them as:
Reason New Value Old Value Source
Changed 110 null Marketo

I am using Azure Data Factory. Please help!

As the keys are not uniform, also if there is one key (say 'Source') present for one lead ID it might not be present/missing in the other leadId, making this more complicated.
Thus, we can have a separate table for 'attribute' field with lead ID, AttributeKey, AttributeValue as columns (we can join this with our main table using LeadID). The Attribute table will look like:

LeadID AttributeKey AttributeValue
5678 Reason Changed
5678 New Value 110
5678 Old Value null
5678 Source Marketo

Can you help me I can I achieve this using ADF?

Azure Data Factory
Azure Data Factory

An Azure service for ingesting, preparing, and transforming data at scale.


Answer accepted by question author

MartinJaffer-MSFT 26,161 Reputation points
2020-07-08T00:39:43.723+00:00

@UtkarshSharma-7440 I have some good news and some mixed news.

I have managed to accomplish this challenge using Mapping DataFlow.

I used the 3 lines of data from your original post. You said it was a csv, but the provided data had spaces as separator, not commas (except inside the json). I used your data as-is, with the spaces, not replacing with commas. I think I can tweak it for commas too, but one thing at a time.

The path I took to accomplish the task is complex. If you were to use a different column delimitor, instead of space or comma, we could do this in a simpler way. I.e. if it looked like
5678|2020-06-22T00:01:54Z|13|33128|SMS|{"Reason":"Changed","New Value":110,"Old Value":null,"Source":"Marketo Flow Action"}
If this were the case, I would try a 2-step process, first reading in as delimited text, and outputting as JSON, then reading in as JSON, and using the copy activity cross-apply feature (only available when source is complex like JSON and sink is flat/tabular).

As, is, I leveraged the strange behavior of Data Factory to make this work. I am relying on the JSON to be flat, not nested.
My input dataset is a delimited text. The quote character is { . The column delimiter is space.
Attached is the Data Flow script. Below is explanation.
11499-flattenembededjsonscript.txt

  1. The high view plan is to ingest the JSON portion as a string.
  2. Then separate the JSON into an array of key-value pairs.
  3. Then to flatten/unroll the array, so each entry goes onto a separate row.
  4. Then to split the key-value pairscolumn, into 2 colums, one for the key, the other for the value.

The first challenge is to ingest the JSON portion as a single column. The JSON portion contains both spaces and commas. To keep the entire JSON section together, I need to use the quote character in the dataset. The " does not work for us, because only the first pair of quotes is obeyed. This means subsequent quote pairs are ignored and will not work for us. However, the { will work as an opening quote character. In pipeline copy activity, this would not help us, because it would keep everything until the next '{'. However, in the DataFlow, for some reason, it does help us. At the time of writing this, DataFlow seems to implicitly end the quote on the row delimiter. This behavior quirk may change in the future. Right now, it results in the JSON portion being kept together, and removing the opening { on each input row. The closing } is kept on.

Your requirements were to only output LeadID, AttributeKey, and AttributeValue. I use a select activity to discard everything excape LeadID and attributes.

After that I use a derived column to remove the trailing } from the JSON portion. Easier to do this now than later.

Now with the JSON cleaned up, I split it with another derived column activity. Arrays are a valid structure in DataFlow, so I can split with regexSplit(attributes,',"') . I can use ," because all key names are quoted. This results in quotes being removed from all keys except the first key.

Now with my arrays, I can flatten, unrolling by attributes.

Now that I have each attribute on its own line, I need to split the pair into key and value. I also need to do some cleanup. There are still stray "s.
To this end, I use ltrim(regexSplit(attributes, '":"?')[1],'"') for the key and rtrim(regexSplit(attributes, '":"?')[2],'"') for the value.
The ltrim and rtrim are for removing stray ". I need to use regexSplit instead of regular split. This is because not all values are quoted. Some values are quoted, like "Changed" but null and 110 are not quoted. The regex "?handles this.

Now all that is left, is mapping and writing to sink.

Thank you for bringing the challenge to me. I hope this helps. Please let me know if you have any questions.

Was this answer helpful?

2 people found this answer helpful.

0 additional answers

Sort by: Most helpful

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.