Flatten a JSON Nested Array in Azure Synapse pipeline

sachin gupta 376 Reputation points
2024-08-29T18:01:41.31+00:00

I am trying to flatten a JSON which has nested array objects with 1:M mapping. Please refer below JSON response.

{
    "result": [
        {
            "data": [
                {
                    "dimensions": [
                        "SYNTHETIC-1"
                    ],
                    "timestamps": [
                        17064756000,
                        17064752000,
                        17064751000,
                        17064750000
                    ],
                    "values": [
                        null,
                        null,
                        100,
                        99.6354
                    ]
                },
                {
                    "dimensions": [
                        "SYNTHETIC-2"
                    ],
                    "timestamps": [
                        17064755000,
                        17064754000,
                        17064753000,
                        17064752000
                    ],
                    "values": [
                        101,
                        null,
                        100,
                        99.6354
                    ]
                }
            ]
        }
    ]
}

The above JSON response is an output from a web activity in Azure Synapse. I have used Data Flow to flatten this JSON and unroll by first 'result' and then by 'data'. This gives me dimensions[], timestamps[] and values[] array, but it is not giving the right output. The output I am getting is below:

dimensions    timestamps    values
SYNTHETIC-1   17064756000   null
SYNTHETIC-1   17064756000   null
SYNTHETIC-1   17064756000   100
SYNTHETIC-1   17064756000   99.6354
SYNTHETIC-1   17064752000   null
SYNTHETIC-1   17064752000   null
SYNTHETIC-1   17064752000   100
SYNTHETIC-1   17064752000   99.6354
SYNTHETIC-1   17064751000   null
SYNTHETIC-1   17064751000   null
SYNTHETIC-1   17064751000   100
SYNTHETIC-1   17064751000   99.6354
SYNTHETIC-1   17064750000   null
SYNTHETIC-1   17064750000   null
SYNTHETIC-1   17064750000   100
SYNTHETIC-1   17064750000   99.6354
SYNTHETIC-2   17064755000   101
SYNTHETIC-2   17064755000   null
SYNTHETIC-2   17064755000   100
SYNTHETIC-2   17064755000   99.6354
SYNTHETIC-2   17064754000   101
SYNTHETIC-2   17064754000   null
SYNTHETIC-2   17064754000   100
SYNTHETIC-2   17064754000   99.6354
SYNTHETIC-2   17064753000   101
SYNTHETIC-2   17064753000   null
SYNTHETIC-2   17064753000   100
SYNTHETIC-2   17064753000   99.6354
SYNTHETIC-2   17064752000   101
SYNTHETIC-2   17064752000   null
SYNTHETIC-2   17064752000   100
SYNTHETIC-2   17064752000   99.6354

The output I am expecting is like below:

dimensions    timestamps    values
SYNTHETIC-1   17064756000   null
SYNTHETIC-1   17064752000   null
SYNTHETIC-1   17064751000   100
SYNTHETIC-1   17064750000   99.6354
SYNTHETIC-2   17064755000   101
SYNTHETIC-2   17064754000   null
SYNTHETIC-2   17064753000   100
SYNTHETIC-2   17064752000   99.6354

Can someone please help me to achieve this?

TIA.

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,875 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Smaran Thoomu 15,040 Reputation points Microsoft Vendor
    2024-08-30T12:11:16.8466667+00:00

    Hi @sachin gupta
    Thanks for the question and using MS Q&A platform.
    I saw the same question has been asked in Stack Overflow thread - https://stackoverflow.com/questions/78929093/flatten-a-json-nested-array-in-azure-synapse-pipeline one of our support engineers has already answered your question.

    Answer Summary (from SO):

    Currently, Dataflow doesn't have any feature to flatten all the nested arrays as per your requirement. In your case, you can try the below workaround using combination of flattens and join transformations.

    For some reason, your timestamps array is giving me only null values when the type is an integer array. So, I have set it to the string array. You need to change this array as per your requirement in the source dataset schema.

    First, use flatten transformation like below to flatten the data array and to get the inner arrays as columns. Use Rule-based mapping for this.

    enter image description here

    It will give the result like below.

    enter image description here

    Then follow the transformations as shown in the above image.

    Use flatten2 transformation to flatten the dimensions array. In flatten, give the dimension array for both Unroll by and Unroll root options. It will flatten this array.

    enter image description here

    Then take a select2 transformation and remove the timestamps column.

    Take a new branch from flatten2 and add another select transformation select1 and here remove the values column.

    Now, add another flatten transformation flatten3 to the select2 and flatten the array column values. For both Unroll by and Unroll root options, give values column only.

    Similarly, add flatten4 to the select1 and here do it for timestamps array column.

    After that, take SurrogateKey transformation surrogateKey1 to the flatten3 and create a key column key1 and give the step and start value as 1.

    enter image description here

    Similarly, do the same for flatten4 transformation. give the same key name as well.

    Now, inner join both surrogate key transformations using join transformation based on the common column key1.

    enter image description here

    Now, it will have required results, use another select transformation and remove the duplicate columns and extra key1 column.

    Now, you will get the desired result.

    enter image description here

    Use the below dataflow script to build your dataflow.

    source(output(
            result as (data as (dimensions as string[], timestamps as string[], values as string[])[])[]
        ),
        allowSchemaDrift: true,
        validateSchema: false,
        ignoreNoFilesFound: false,
        documentForm: 'singleDocument') ~> source1
    source1 foldDown(unroll(result.data, result),
        mapColumn(
            each(result.data,match(true()))
        ),
        skipDuplicateMapInputs: false,
        skipDuplicateMapOutputs: false) ~> flatten1
    flatten1 foldDown(unroll(dimensions, dimensions),
        mapColumn(
            dimensions,
            timestamps,
            values
        ),
        skipDuplicateMapInputs: false,
        skipDuplicateMapOutputs: false) ~> flatten2
    flatten2 select(mapColumn(
            dimensions,
            timestamps
        ),
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> select1
    flatten2 select(mapColumn(
            dimensions,
            values
        ),
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> select2
    select2 foldDown(unroll(values, values),
        mapColumn(
            dimensions,
            values
        ),
        skipDuplicateMapInputs: false,
        skipDuplicateMapOutputs: false) ~> flatten3
    select1 foldDown(unroll(timestamps, timestamps),
        mapColumn(
            dimensions,
            timestamps
        ),
        skipDuplicateMapInputs: false,
        skipDuplicateMapOutputs: false) ~> flatten4
    flatten3 keyGenerate(output(key1 as long),
        startAt: 1L,
        stepValue: 1L) ~> surrogateKey1
    flatten4 keyGenerate(output(key1 as long),
        startAt: 1L,
        stepValue: 1L) ~> surrogateKey2
    surrogateKey1, surrogateKey2 join(surrogateKey1@key1 == surrogateKey2@key1,
        joinType:'inner',
        matchType:'exact',
        ignoreSpaces: false,
        broadcast: 'auto')~> join1
    join1 select(mapColumn(
            dimensions = flatten3@dimensions,
            values,
            dimensions = flatten4@dimensions,
            timestamps
        ),
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> select3
    select3 sink(allowSchemaDrift: true,
        validateSchema: false,
        input(
            result as (data as (dimensions as string[], timestamps as string[], values as string[])[])[]
        ),
        umask: 0022,
        preCommands: [],
        postCommands: [],
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> sink1
    
    
    

    Hope this helps. Do let us know if you any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.