ADF dataflow aggregate

Question

arkiboys 9,711

Hello,

At present, when aggregating columns, this is what I use:
Aggregate transformation
Group by --> key columns: keycolumn1, keycolumn2, keycolumn3
Aggregates --> Each column that matches --> name != 'keycolumn1' && name != 'keycolumn2' && name != 'keycolumn3'
$$ --> max($$)

Question 1 --> What is the difference between first, last, and max?
Question 2 --> Is this the correct way to aggregate data?

Thank you

Answer accepted by question author

Answer 1

I will split my answer into two parts:

Question 1:

first: returns the first value in the group, in the order the rows arrive in the data stream. It is not necessarily the minimum; it is simply the first value encountered in the incoming dataset's order.

last: returns the last value in the group, in the order the rows arrive or are sorted in the flow. Like first, it depends on row order and implies neither a maximum nor a minimum, just the last value encountered.

max: finds the maximum value in the group for the specified column.
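
To make the difference concrete, here is a minimal plain-Python sketch (not ADF expression code) of what the three functions return for a single group. The sample values and the helper name aggregate_group are made up for illustration.

```python
def aggregate_group(values):
    """Return first, last, and max for one group's values, given in arrival order."""
    return {
        "first": values[0],    # first value encountered
        "last": values[-1],    # last value encountered
        "max": max(values),    # largest value, independent of order
    }

# Values of one column for a single group, in the order the rows arrive.
amounts = [30, 75, 10]
result = aggregate_group(amounts)
print(result)
```

Only max would stay the same if the rows arrived in a different order.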

Question 2:

Based on your description, your current aggregation method looks correct:

You are grouping by the specified key columns (keycolumn1, keycolumn2, keycolumn3), which is standard for any aggregation that involves calculating summaries based on specific categories or groups within your data.

When it comes to aggregating, your approach specifies that each column, except for the key columns themselves, should be aggregated using the max function. This is a common practice if your intention is to find the maximum value of each column within the defined groups.
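
As a rough illustration of what this pattern computes, the following plain-Python sketch groups rows by the three key columns and takes the max of every other column. It is an analogy rather than ADF code; the function name max_per_group and the columns qty and price are hypothetical.

```python
KEY_COLUMNS = ("keycolumn1", "keycolumn2", "keycolumn3")

def max_per_group(rows):
    """Group rows by the key columns and keep the max of every non-key column."""
    groups = {}
    for row in rows:
        key = tuple(row[k] for k in KEY_COLUMNS)
        agg = groups.setdefault(key, {})
        for name, value in row.items():
            if name in KEY_COLUMNS:
                continue  # mirrors the name != 'keycolumnN' match condition
            agg[name] = value if name not in agg else max(agg[name], value)
    return groups

rows = [
    {"keycolumn1": "A", "keycolumn2": 1, "keycolumn3": "x", "qty": 5, "price": 2.5},
    {"keycolumn1": "A", "keycolumn2": 1, "keycolumn3": "x", "qty": 3, "price": 4.0},
    {"keycolumn1": "B", "keycolumn2": 2, "keycolumn3": "y", "qty": 7, "price": 1.0},
]
summary = max_per_group(rows)
print(summary[("A", 1, "x")])
```

Note that each column is maximized independently, so the output row for a group can combine values from different input rows (here qty from the first row and price from the second).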

However, these are my concerns:

If you're using the first or last functions, be aware of how your data is sorted, as these functions depend on the order of data.

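Make sure that the aggregation functions you choose are suitable for the data types of the columns. For example, max is most often applied to numeric data; on other comparable types such as strings or dates it returns the highest value in that type's ordering, which may not be what you intend.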

Using max on many columns can be resource-intensive. If you need summaries for specific analytical purposes, tailor the aggregation function for each column accordingly.
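
The ordering concern can be sketched in plain Python (again an analogy, not ADF code; the sample rows are invented). The same group gives a different first value depending on whether the rows are sorted before aggregating.

```python
# (event_date, status) rows for one group, in arrival order.
rows = [
    ("2024-03-01", "shipped"),
    ("2024-01-15", "created"),
    ("2024-02-10", "paid"),
]

# first/last on the unsorted stream just reflect arrival order.
first_unsorted = rows[0][1]
last_unsorted = rows[-1][1]

# Sorting by date first makes first/last mean "earliest" and "latest".
rows_sorted = sorted(rows)  # ISO dates sort correctly as strings
first_sorted = rows_sorted[0][1]
last_sorted = rows_sorted[-1][1]

print(first_unsorted, first_sorted)
```

If you rely on first or last, sorting upstream on the column that defines "first" makes the intent explicit.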
