Share via

ADF dataflow aggregate

arkiboys 9,711 Reputation points
2024-04-29T06:13:37.35+00:00

Hello,

at present when aggregating a column, this is what I use:
Aggregate transformation,
Group by --> key columns -- keycolumn1, keycolumn2, keycolumn3
Aggregates --> Each column that matches --> name!='keycolumn1'&&name!='keycolumn2'&&name!='keycolumn3'
$$ max($$)

Question 1 --> what is the difference between first, last and max?
Question 2 --> Is this the correct way to aggregate data?

Thank you

Azure Data Factory
Azure Data Factory

An Azure service for ingesting, preparing, and transforming data at scale.

0 comments No comments

Answer accepted by question author

Amira Bedhiafi 42,941 Reputation points MVP Volunteer Moderator
2024-04-29T10:21:54.9133333+00:00

I will split my answer on 2 parts :

Question 1 :

first: returns the first value in the group of data that is being processed, according to the order defined in the data stream. It doesn’t necessarily provide the minimum value; instead, it gives the first encountered value in the incoming dataset's order.

last: returns the last value in the data group, based on the order in which data arrives or is sorted in the flow. It is sensitive (like First) to the order of the data and doesn’t imply a maximum or minimum value but simply the last encountered value.

max: finds the maximum value in the group for the specified column.

Question 2 :

I think your current method for aggregation appears to be correct based on your description:

You are grouping by the specified key columns (keycolumn1, keycolumn2, keycolumn3), which is standard for any aggregation that involves calculating summaries based on specific categories or groups within your data.

When it comes to aggregating, your approach specifies that each column, except for the key columns themselves, should be aggregated using the max function. This is a common practice if your intention is to find the maximum value of each column within the defined groups.

However, these are my concerns :

If you're using the first or last functions, be aware of how your data is sorted, as these functions depend on the order of data.

Make sure that the aggregation functions you choose are suitable for the data types of the columns. For example, max is generally used with numeric data.

If you are using max on many columns can be resource-intensive if you need summaries for specific analytical purposes, tailor your aggregation functions accordingly.

Was this answer helpful?

1 person found this answer helpful.
0 comments No comments

0 additional answers

Sort by: Most helpful

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.