OVER Expression (U-SQL)
Summary
A windowing expression is an expression whose value is computed by applying for every row a window function to multiple row values of a column (the window defined by the OVER operator) instead of just the column’s value of the row.
Note
The OVER expression is sometimes referred to as an OVER Clause to make it search discoverable.
The OVER operator allows to specify a partition or window of column data that the window functions operate on. The windows provide the ability to access data from previous rows without having to use a self-join and provide both cumulative and sliding windows. It thus provides the ability to compute aggregated values such as moving averages, cumulative aggregates, running totals, or a top N per group results.
It is similar to aggregation over grouping but instead of grouping the whole rowset with GROUP BY and then returning a single aggregated value for each group, the window defines a set of rows for a particular column in the SELECT clause that will be aggregated for each row. In the GROUP BY case, the result will be a row per group, with windowing expressions, the number of rows returned by the SELECT will not be affected by the windows.
Windowing expressions are only supported inside expressions in a SELECT FROM clause. More than one windowing expression can be used in a single query with a single FROM clause. The OVER operator for each function can differ in partitioning and ordering.
A windowing expression can never be used if a GROUP BY clause is present.
Syntax
Windowing_Expression := Window_Function_Call 'OVER' '(' [ Over_Partition_By_Clause ] [ Order_By_Clause ] [ Row_Clause ] ')'.
Remarks
Syntax
Window_Function_Call := Aggregate_Function_Call | Analytic_Function_Call | Ranking_Function_Call.
Window functions can be one of the following:
- Aggregate Functions such as SUM or MAX.
- Analytic Functions such as FIRST_VALUE or LAST_VALUE.
- Ranking Functions such as RANK or ROW_NUMBER.
Aggregation functions applied to a window cannot be used with DISTINCT.
OVER ( … )
The OVER operator that specifies the window with the following components. Note that certain windowing functions have certain requirements with respect to the presence or absence of some if these components. These requirements are explained in the relevant section describing that function.Over_Partition_By_Clause
The OVER operator’s optional partition clause defines the window by partitioning the rowset. The window function is applied to each partition separately and computation restarts for each partition.
Syntax
Over_Partition_By_Clause := 'PARTITION' 'BY' Expression_List.
The data can be partitioned according to a list of scalar expressions. The result type of each of the expression has to return an equality comparable type or an error is raised. Most commonly, the partition expressions refer to the rowset’s columns (as specified by the FROM clause and not the columns from the SELECT clause).
If PARTITION BY is not specified, the function treats all rows of the query result set as a single group.
Order_By_Clause
The OVER operator’s optional ORDER BY clause defines the order of the rows withing each of the windows.
Syntax
Order_By_Clause := 'ORDER' 'BY' Sort_Item_Expression_List.
Sort_Item_Expression_List := Sort_Item_Expression { ',' Sort_Item_Expression }.
Sort_Item_Expression := expression [Sort_Direction].
The syntax and semantics follows the normal ORDER BY clause. For window functions that depend on the order such as the ranking functions, this order by clause specifies the logical order in which the window function calculation is performed.
If the ORDER BY clause is not specified, then the window is not ordered.
Row_Clause
The OVER operator’s optional ROWS clause further limits the rows within a partition by specifying fixed number of rows preceding or following the current row. Preceding and following rows are defined based on the ordering in the ORDER BY clause. Therefore, the ROWS clause requires that the ORDER BY clause be specified.
Syntax
Row_Clause := 'ROWS' Window_Frame_Extent.
If the ROWS clause is not specified but ORDER BY is specified, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is used as the default for the window frame.
The ROWS clause is incompatible with the DISTINCT aggregate option and an error is raised.
- Window_Frame_Extent
The window frame extent specifies the rows specifying a lower bound and an upper bound by either just specifying the preceding rows from the current row or providing a frame between two explicitly specified end points.
Syntax
Window_Frame_Extent := Window_Frame_Preceding | Window_Frame_Between.
For example, ROWS BETWEEN 2 PRECEDING AND CURRENT ROW means that the window of rows that the function operates on is three rows in size, starting with 2 rows preceding until and including the current row.
Window_Frame_Preceding
Defines the window frame relative from the current row. Currently supported are:
Syntax
Window_Frame_Preceding := 'UNBOUNDED' 'PRECEDING' | unsigned_integer_literal 'PRECEDING' | 'CURRENT' 'ROW'.
With the following semantics:
UNBOUNDED PRECEDING
Specifies that the window starts at the first row of the partition. It can only be specified as the window starting point.unsigned_integer_literal PRECEDING
The nonnegative integer literal specifies the number of rows to precede the current row in the partition. If the value is larger than the available number of preceding rows in the partition, then the window starts at the first row of the partition.CURRENT ROW
Specifies that the window starts or ends at the current row. It can be specified as both a starting and ending point.Window_Frame_Between
Specifies the lower (starting) and upper (ending) boundary points of the window.
Syntax
Window_Frame_Between := 'BETWEEN' Window_Frame_Bound 'AND' Window_Frame_Bound.
Window_Frame_Bound := Window_Frame_Preceding | Window_Frame_Following.
The first frame bound specifies the lower, the second frame bound the upper boundary (inclusive). If the upper boundary is smaller than the lower boundary, an error is raised. Note that a frame can be all preceding or all following the current row. The frame bounds are specified as follows:
Window_Frame_Preceding
Specifies that the frame starts or ends before the current row as aboveWindow_Frame_Following
Specifies that the frame starts or ends following the current row:
Syntax
Window_Frame_Following := unsigned_integer_literal 'FOLLOWING' | 'CURRENT' 'ROW'.
With the following semantics:
unsigned_integer_literal FOLLOWING The nonnegative integer literal specifies the number of rows to follow the current row in the partition. If the value is larger than the available number of following rows in the partition, then last row of the partition is chosen as the boundary.
CURRENT ROW Specifies that the window boundary is at the current row.
Tip
U-SQL does currently not support UNBOUNDED FOLLOWING. One can achieve the same result by inverting the ordering and use UNBOUNDED PRECEDING instead.
Examples
Sample Data
It contains two views that we will use for the sample data.
- QueryLog
To access this data use the code below.
querylog = Demo.QueryLog();
data:
Fruit | Quantity | Source |
---|---|---|
Banana | 300 | Image |
Cherry | 300 | Image |
Durian | 500 | Image |
Apple | 100 | Web |
Fig | 200 | Web |
Papay | 200 | Web |
Avocado | 300 | Web |
Cherry | 400 | Web |
Durian | 500 | Web |
- Employees
To access this data use the code below.
employees = Demo.Employees();
data:
EmpID | EmpName | DeptName | DeptID | Salary |
---|---|---|---|---|
1 | Noah | Engineering | 100 | 10000 |
2 | Sophia | Engineering | 100 | 20000 |
3 | Liam | Engineering | 100 | 30000 |
4 | Emma | HR | 200 | 10000 |
5 | Jacob | HR | 200 | 10000 |
6 | Olivia | HR | 200 | 10000 |
7 | Mason | Executive | 300 | 50000 |
8 | Ava | Marketing | 400 | 15000 |
9 | Ethan | Marketing | 400 | 10000 |
Now, let’s walk through windowing functions and aggregations. As we do this the answer to this question will be clear and we’ll also understand the conceptual difference of Windowing Functions from Grouping.
data3 =
SELECT EmpName, SUM(Salary) OVER( ) As SalaryAllDepts
FROM employees;
Note these things:
- We are still doing a SUM() but it is modified by an OVER clause
- The OVER clause is empty – there’s nothing between the parentheses
- The OVER clause defines the “window” – in this case since it is empty the window is the entire set of rows
- Thus you can read SUM(Salary) OVER () as “The sum of Salary across the window of rows”
EmpName | TotalAllDepts |
---|---|
Noah | 165000 |
Sophia | 165000 |
Liam | 165000 |
Emma | 165000 |
Jacob | 165000 |
Olivia | 165000 |
Mason | 165000 |
Ava | 165000 |
Ethan | 165000 |
When we did the first SUM without any GROUP BY we received the correct total salary 165000. But grouping collapsed that into a single row. Here we see there are as many output rows as input rows. More interestingly the 165000 value is clearly calculated in each row but to calculate it data from EVERY row was used.
Another way of thinking about this is that each output row has knowledge of multiple input rows - in this all of the input rows. The “window” is all the rows. Now let’s try an OVER clause that contains something. In this case by using PARTITION BY DeptName our window is based on DeptName.
data4 =
SELECT EmpName, DeptName, SUM(Salary) OVER( PARTITION BY DeptName ) AS SalaryByDept
FROM employees;
result:
EmpName | DeptName | SalaryByDept:double |
---|---|---|
Noah | Engineering | 60000 |
Sophia | Engineering | 60000 |
Liam | Engineering | 60000 |
Mason | Executive | 50000 |
Emma | HR | 30000 |
Jacob | HR | 30000 |
Olivia | HR | 30000 |
Ava | Marketing | 25000 |
Ethan | Marketing | 25000 |
Again, notice that there are the same number of input rows as output rows. However now each row has a Department total - again this is the cause the window for SUM was defined on the DeptName. Key points
- grouping collapses input rows
- with grouping aggregation happens on the entire set of rows or what is specified by the GROUP BY clause
- windowing does not collapse rows
- with windowing aggregations happen on the window defined by the OVER clause – either all rows and the rows defined by the window
- OVER( ) – defines a window of all rows
- OVER( PARTITION BY X ) – defines a window on column X
Now let’s calculate the percentage of salary devoted to each department form the total salary across all departments. We can’t mix Windowing Functions and GROUP BY in the same statement so we’ll build it up like this:
a =
SELECT TOP 1 SUM(Salary) OVER( ) AS SalaryAllDepts
FROM employees;
b =
SELECT DeptName, SUM(Salary) AS SalaryByDept
FROM employees
GROUP BY DeptName;
c =
SELECT DeptName, SalaryByDept, SalaryAllDepts, (SalaryByDept/SalaryAllDepts) AS Percentage
FROM a CROSS JOIN b;
result:
DeptName | SalaryByDept | SalaryAllDepts | Percentage |
---|---|---|---|
Engineering | 60000 | 165000 | 0.363636363636364 |
Executive | 50000 | 165000 | 0.303030303030303 |
HR | 30000 | 165000 | 0.181818181818182 |
Marketing | 25000 | 165000 | 0.151515151515152 |