Can Azure Data Factory dynamically combine multiple columns as sub-columns of a new single column?

acham 25 Reputation points
2023-09-15T20:05:12.3633333+00:00

I have a set of data like this:

date|tagColumn1|tagColumn2
2023-01-01|one|two
2023-01-02|foo|
2023-01-03||bar

Is there a dynamic way to create a new single column 'myTags' with a sub-column for each column name that starts with 'tag' and has a value?

I know that I can use derived column modifier in a data flow to statically map sub-columns, but I want to allow schema drift and catch any new tagColumns in the source. Expected result would be something like this:

date|tagColumn1|tagColumn2|myTags
2023-01-01|one|two|{"tagColumn1":"one","tagColumn2":"two"}
2023-01-02|foo||{"tagColumn1":"foo"}
2023-01-03||bar|{"tagColumn2":"bar"}

Can this be done within an ADF data flow? I could then drop the original tag columns in a select modifier and move on.
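To illustrate, here's the behavior I'm after sketched in Python (`collect_tags` is just an illustration of the logic, not anything in ADF):

```python
import json
from typing import Optional

def collect_tags(row: dict, prefix: str = "tag") -> Optional[str]:
    """Gather non-empty columns whose name starts with `prefix` into one JSON string."""
    tags = {k: v for k, v in row.items() if k.startswith(prefix) and v}
    return json.dumps(tags, separators=(",", ":")) if tags else None

rows = [
    {"date": "2023-01-01", "tagColumn1": "one", "tagColumn2": "two"},
    {"date": "2023-01-02", "tagColumn1": "foo", "tagColumn2": ""},
    {"date": "2023-01-03", "tagColumn1": "", "tagColumn2": "bar"},
]
for row in rows:
    row["myTags"] = collect_tags(row)
```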

Azure Data Factory

2 answers

  1. acham 25 Reputation points
    2023-09-20T01:40:23.4666667+00:00

    Poked at the expression builder for a bit and figured out a way to dynamically output columns with a desired name prefix as a single JSON string, or null if none match. It's a bit complicated and can probably be improved, but it works natively in the Azure Data Factory expression builder for a derived column. I'm doing this across multiple derived columns to reduce 280 columns (most empty) down to 5. Just replace the two instances of 'tagColumn/' with the desired filter.

    reduce(
        mapIf(
            mapAssociation(
                keyValues(columnNames(), array(toString(columns()))),
                @(key = #key, value = #value)
            ),
            and(startsWith(#item.key, 'tagColumn/'), not(isNull(#item.value))),
            concat('"', replace(#item.key, 'tagColumn/'), '":"', #item.value, '"')
        ),
        "",
        iif(notEquals(#acc, ''), #acc + ',' + toString(#item), #acc + toString(#item)),
        iif(toString(#result) != '', '{' + toString(#result) + '}', toString(null()))
    )
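    For anyone untangling the expression, here is the same logic in plain Python (a sketch; ADF's mapIf and reduce are emulated with a comprehension and a fold, and the column names are illustrative):

```python
from functools import reduce
from typing import Optional

def my_tags(columns: dict, prefix: str = "tagColumn/") -> Optional[str]:
    # mapIf: keep non-null values whose key starts with the prefix,
    # emitting '"key-without-prefix":"value"' fragments
    parts = [
        f'"{k.replace(prefix, "")}":"{v}"'
        for k, v in columns.items()
        if k.startswith(prefix) and v is not None
    ]
    # reduce: join the fragments with commas, starting from an empty accumulator
    body = reduce(lambda acc, item: acc + "," + item if acc else item, parts, "")
    # final iif: wrap in braces, or return null when nothing matched
    return "{" + body + "}" if body else None
```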
    
    1 person found this answer helpful.

  2. Amira Bedhiafi 34,101 Reputation points Volunteer Moderator
    2023-09-18T11:52:43.96+00:00

    You can use schema drift capabilities to ensure that any new column with the name pattern "tagColumn*" can be accommodated in your Data Flow.

    You can use a series of conditional checks within a derived column transformation: test each known "tagColumn" for a value and build its JSON fragment. Note that each fragment needs a leading comma separator, so the expression strips the stray comma from the first fragment and wraps the result in braces:

     '{' + regexReplace(iif(isNull(tagColumn1), '', ',\"tagColumn1\":\"' + tagColumn1 + '\"') + iif(isNull(tagColumn2), '', ',\"tagColumn2\":\"' + tagColumn2 + '\"'), '^,', '') + '}'
    
    
    

    Or use a UDF that expects a Map where the key is the column name and the value is the column value. It filters out null or empty values, as well as keys that don't start with "tagColumn":

    def constructJSON(tagColumns: Map[String, String]): String = {
      val validEntries = tagColumns.filter { case (key, value) =>
        key.startsWith("tagColumn") && value != null && value != ""
      }
      validEntries.map { case (key, value) => s""""$key":"$value"""" }
        .mkString("{", ",", "}")
    }
    
