I will split my answer into two parts:
Question 1:
first: returns the first value in the group of data that is being processed, according to the order defined in the data stream. It doesn’t necessarily provide the minimum value; instead, it gives the first encountered value in the incoming dataset's order.
last: returns the last value in the data group, based on the order in which data arrives or is sorted in the flow. It is sensitive (like First) to the order of the data and doesn’t imply a maximum or minimum value but simply the last encountered value.
max: finds the maximum value in the group for the specified column.
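To make the difference concrete, here is a minimal sketch in plain Python (not the data flow expression language) that mimics the three behaviors on a small, made-up dataset. The rows and the names `rows`, `groups`, and `summary` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical rows in arrival order: (group key, value)
rows = [("A", 5), ("A", 9), ("A", 2), ("B", 7), ("B", 1)]

# Collect values per group, preserving arrival order within each group
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# first/last depend on position in the group; max depends only on the values
summary = {
    key: {"first": values[0], "last": values[-1], "max": max(values)}
    for key, values in groups.items()
}

print(summary)
```

For group A, first is 5 and last is 2 because of where those rows sit in the stream, while max is 9 regardless of position.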
Question 2:
Your current aggregation method appears to be correct based on your description:
You are grouping by the specified key columns (keycolumn1, keycolumn2, keycolumn3), which is standard for any aggregation that involves calculating summaries based on specific categories or groups within your data.
When it comes to aggregating, your approach specifies that each column, except for the key columns themselves, should be aggregated using the max function. This is a common practice if your intention is to find the maximum value of each column within the defined groups.
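As a rough illustration of that pattern, the sketch below groups rows by the three key columns and takes the maximum of every other column. It is plain Python meant to mirror the logic, not the data flow configuration itself; the value columns `amount` and `score` and the helper `max_per_group` are hypothetical.

```python
KEY_COLUMNS = ("keycolumn1", "keycolumn2", "keycolumn3")

def max_per_group(rows, key_columns=KEY_COLUMNS):
    """Group dict rows by key_columns and keep the max of every other column."""
    result = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        agg = result.setdefault(key, {})
        for col, val in row.items():
            if col in key_columns:
                continue  # key columns are grouped on, not aggregated
            if col not in agg or val > agg[col]:
                agg[col] = val
    return result

# Hypothetical sample data
sample = [
    {"keycolumn1": "x", "keycolumn2": 1, "keycolumn3": "a", "amount": 10, "score": 3},
    {"keycolumn1": "x", "keycolumn2": 1, "keycolumn3": "a", "amount": 4, "score": 8},
    {"keycolumn1": "y", "keycolumn2": 2, "keycolumn3": "b", "amount": 7, "score": 1},
]

grouped = max_per_group(sample)
print(grouped)
```

Note that each column's maximum is computed independently, so the output row for a group can combine values that came from different input rows.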
However, I have a few concerns:
If you're using the first or last functions, be aware of how your data is sorted, as these functions depend on the order of data.
Make sure that the aggregation functions you choose are suitable for the data types of the columns. For example, max is most meaningful on numeric or date columns; on string columns the result follows string ordering rather than numeric magnitude.
Using max on many columns can be resource-intensive. If you only need summaries for specific analytical purposes, tailor your aggregation functions accordingly.
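To illustrate the point about sort order with first and last, here is a small plain-Python sketch, assuming a hypothetical timestamp field to sort on. It shows that sorting before aggregating changes what first and last return, while max is unaffected.

```python
# Hypothetical rows for one group: (group key, timestamp, value)
rows = [("A", 3, 30), ("A", 1, 10), ("A", 2, 20)]

def first_last_max(group_rows):
    """Return (first, last, max) of the value field in the given row order."""
    values = [value for _, _, value in group_rows]
    return values[0], values[-1], max(values)

# Arrival order vs. explicit sort on the timestamp field
unsorted_result = first_last_max(rows)
sorted_result = first_last_max(sorted(rows, key=lambda r: r[1]))

print(unsorted_result, sorted_result)
```

If first or last matters for your result, applying an explicit sort ahead of the aggregation is one way to make the outcome predictable.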