Azure Data Factory - Split array into chunks

Steven Pretswell 1 Reputation point
2020-10-09T13:26:12.043+00:00

Hi there

I'm using Azure Data Factory Data Flow. I have a column that contains around 5000 ids.

I'm looking to merge this into 100 separate rows, each with an array of 50 ids.

I was wondering if you could advise the method of doing this, I can use collect() to merge everything into a single row, however I'm looking for 100 separate rows.

Thanks very much

Steven

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
11,624 questions
{count} votes

2 answers

Sort by: Most helpful
  1. Kiran-MSFT 696 Reputation points Microsoft Employee
    2020-10-13T01:08:02.71+00:00

    Partitioning should be used only in the case of optimization not to do data integration logic.

    A optimal solution would be -

    1. Add a keyGenerate (Surrogate key tx) after the source - Say new key field is 'SrcKey'
    2. Add an aggregate where you group by (SrcKey/50). This will group similar divisors into the same bucket.
    3. Add a collect column in the same aggregator to collect into an array.
    2 people found this answer helpful.

  2. MartinJaffer-MSFT 26,236 Reputation points
    2020-10-12T21:24:19.147+00:00

    @Steven Pretswell I have reproduced and found a solution.

    First I created a csv file with the numbers 1...40, each number on its own line. This I use as source data. I will split these 40 numbers into 4 sets of 10.

    What you described is somewhat like partitioning. In the Data Flow source, I set the partitioning to round-robin, so the partitions would be evenly sized. The contents of each are not sequential. While my source data is the range of numbers 1...40, contiguous (no holes) and ordered, I have no idea what your id's are like. This is why I chose round-robin.
    Step 1
    31814-image.png

    Next, to make the partitions visible, and easier to work with, I added a derived column whose contents are partitionId(column_1).
    Step 2 (optional)
    31764-image.png

    Now I want to group the rows by the partition and do collect like you suggested, but only on the contents of the group, keeping the groups separate. This is accomplished by an aggregation transformation.
    Step 3 (important)
    Group by: column partition, name as partition
    Aggregate: column_1, collect(column_1)
    31708-image.png
    31832-image.png

    Now I preview the output of the aggregation, I see 4 rows, each a different partitionId. There is also an array in each. Peeking into the array I see 10 entries.
    31737-image.png

    This is similar to the state you wish to achieve. Now to clean up, either use a select to discard the partition id, or just exclude it in the sink. Also, now that I know this works, I could clean up some more. Since partitionId was generated by a function, I can removed the derived column, and in the aggregate group by clause, use partitionId() instead of a column.

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.