Can Azure Data Factory dynamically combine multiple columns as sub-columns of a new single column?

acham 25 Reputation points
2023-09-15T20:05:12.3633333+00:00

I have a set of data like this:

date|tagColumn1|tagColumn2
2023-01-01|one|two
2023-01-02|foo|
2023-01-03||bar

Is there a dynamic way to create a new single column 'myTags' with a sub-column for each column name that starts with 'tag' and has a value?

I know that I can use derived column modifier in a data flow to statically map sub-columns, but I want to allow schema drift and catch any new tagColumns in the source. Expected result would be something like this:

date|tagColumn1|tagColumn2|myTags
2023-01-01|one|two|{"tagColumn1":"one","tagColumn2":"two"}
2023-01-02|foo||{"tagColumn1":"foo"}
2023-01-03||bar|{"tagColumn2":"bar"}

Can this be done within an ADF data flow? I could then drop the original tag columns in a select modifier and move on.
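To illustrate, here's the behavior I'm after sketched in Python (`collect_tags` is just an illustration of the logic, not anything in ADF):

```python
import json
from typing import Optional

def collect_tags(row: dict, prefix: str = "tag") -> Optional[str]:
    """Gather non-empty columns whose name starts with `prefix` into one JSON string."""
    tags = {k: v for k, v in row.items() if k.startswith(prefix) and v}
    return json.dumps(tags, separators=(",", ":")) if tags else None

rows = [
    {"date": "2023-01-01", "tagColumn1": "one", "tagColumn2": "two"},
    {"date": "2023-01-02", "tagColumn1": "foo", "tagColumn2": ""},
    {"date": "2023-01-03", "tagColumn1": "", "tagColumn2": "bar"},
]
for row in rows:
    row["myTags"] = collect_tags(row)
```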

Azure Data Factory

2 answers

  1. acham 25 Reputation points
    2023-09-20T01:40:23.4666667+00:00

    Poked at the expression builder for a bit and figured out a way to dynamically output columns with a desired name prefix as a single JSON string, or null if none match. It's a bit complicated and can probably be improved, but it works natively in the Azure Data Factory expression builder for a derived column. I'm doing this across multiple derived columns to reduce 280 columns (most empty) down to 5. Just replace the two instances of 'tagColumn/' with the desired filter.

    reduce(
        mapIf(
            mapAssociation(
                keyValues(columnNames(), array(toString(columns()))),
                @(key = #key, value = #value)
            ),
            and(startsWith(#item.key, 'tagColumn/'), not(isNull(#item.value))),
            concat('"', replace(#item.key, 'tagColumn/'), '":"', #item.value, '"')
        ),
        "",
        iif(notEquals(#acc, ''), #acc + ',' + toString(#item), #acc + toString(#item)),
        iif(toString(#result) != '', '{' + toString(#result) + '}', toString(null()))
    )
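    For anyone untangling the expression, here is the same logic in plain Python (a sketch; ADF's mapIf and reduce are emulated with a comprehension and a fold, and the column names are illustrative):

```python
from functools import reduce
from typing import Optional

def my_tags(columns: dict, prefix: str = "tagColumn/") -> Optional[str]:
    # mapIf: keep non-null values whose key starts with the prefix,
    # emitting '"key-without-prefix":"value"' fragments
    parts = [
        f'"{k.replace(prefix, "")}":"{v}"'
        for k, v in columns.items()
        if k.startswith(prefix) and v is not None
    ]
    # reduce: join the fragments with commas, starting from an empty accumulator
    body = reduce(lambda acc, item: acc + "," + item if acc else item, parts, "")
    # final iif: wrap in braces, or return null when nothing matched
    return "{" + body + "}" if body else None
```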
    
    1 person found this answer helpful.

  2. Amira Bedhiafi 34,101 Reputation points Volunteer Moderator
    2023-09-18T11:52:43.96+00:00

    You can use schema drift capabilities to ensure that any new column with the name pattern "tagColumn*" can be accommodated in your Data Flow.

    You can use a series of conditional checks within a derived column transformation: test each known "tagColumn" for a value and build its JSON fragment. Note that each fragment needs a leading comma separator, so the expression strips the stray comma from the first fragment and wraps the result in braces:

     '{' + regexReplace(iif(isNull(tagColumn1), '', ',\"tagColumn1\":\"' + tagColumn1 + '\"') + iif(isNull(tagColumn2), '', ',\"tagColumn2\":\"' + tagColumn2 + '\"'), '^,', '') + '}'
    
    
    

    Or use a UDF that expects a Map where the key is the column name and the value is the column value. It filters out null or empty values, as well as keys that don't start with "tagColumn":

    def constructJSON(tagColumns: Map[String, String]): String = {
      val validEntries = tagColumns.filter { case (key, value) =>
        key.startsWith("tagColumn") && value != null && value != ""
      }
      validEntries.map { case (key, value) => s""""$key":"$value"""" }
        .mkString("{", ",", "}")
    }
    
