An Azure service for ingesting, preparing, and transforming data at scale.
@UtkarshSharma-7440 I have some good news and some mixed news.
I have managed to accomplish this challenge using Mapping DataFlow.
I used the 3 lines of data from your original post. You said it was a csv, but the provided data had spaces as separator, not commas (except inside the json). I used your data as-is, with the spaces, not replacing with commas. I think I can tweak it for commas too, but one thing at a time.
The path I took to accomplish the task is complex. If you were to use a different column delimitor, instead of space or comma, we could do this in a simpler way. I.e. if it looked like
5678|2020-06-22T00:01:54Z|13|33128|SMS|{"Reason":"Changed","New Value":110,"Old Value":null,"Source":"Marketo Flow Action"}
If this were the case, I would try a 2-step process, first reading in as delimited text, and outputting as JSON, then reading in as JSON, and using the copy activity cross-apply feature (only available when source is complex like JSON and sink is flat/tabular).
As, is, I leveraged the strange behavior of Data Factory to make this work. I am relying on the JSON to be flat, not nested.
My input dataset is a delimited text. The quote character is { . The column delimiter is space.
Attached is the Data Flow script. Below is explanation.
11499-flattenembededjsonscript.txt
- The high view plan is to ingest the JSON portion as a string.
- Then separate the JSON into an array of key-value pairs.
- Then to flatten/unroll the array, so each entry goes onto a separate row.
- Then to split the key-value pairscolumn, into 2 colums, one for the key, the other for the value.
The first challenge is to ingest the JSON portion as a single column. The JSON portion contains both spaces and commas. To keep the entire JSON section together, I need to use the quote character in the dataset. The " does not work for us, because only the first pair of quotes is obeyed. This means subsequent quote pairs are ignored and will not work for us. However, the { will work as an opening quote character. In pipeline copy activity, this would not help us, because it would keep everything until the next '{'. However, in the DataFlow, for some reason, it does help us. At the time of writing this, DataFlow seems to implicitly end the quote on the row delimiter. This behavior quirk may change in the future. Right now, it results in the JSON portion being kept together, and removing the opening { on each input row. The closing } is kept on.
Your requirements were to only output LeadID, AttributeKey, and AttributeValue. I use a select activity to discard everything excape LeadID and attributes.
After that I use a derived column to remove the trailing } from the JSON portion. Easier to do this now than later.
Now with the JSON cleaned up, I split it with another derived column activity. Arrays are a valid structure in DataFlow, so I can split with regexSplit(attributes,',"') . I can use ," because all key names are quoted. This results in quotes being removed from all keys except the first key.
Now with my arrays, I can flatten, unrolling by attributes.
Now that I have each attribute on its own line, I need to split the pair into key and value. I also need to do some cleanup. There are still stray "s.
To this end, I use ltrim(regexSplit(attributes, '":"?')[1],'"') for the key and rtrim(regexSplit(attributes, '":"?')[2],'"') for the value.
The ltrim and rtrim are for removing stray ". I need to use regexSplit instead of regular split. This is because not all values are quoted. Some values are quoted, like "Changed" but null and 110 are not quoted. The regex "?handles this.
Now all that is left, is mapping and writing to sink.
Thank you for bringing the challenge to me. I hope this helps. Please let me know if you have any questions.