GROUP BY and HAVING Clauses (U-SQL)
Summary
The optional GROUP BY clause groups the rows based on the provided expression list into groups that then can be aggregated over with the built-in and user-defined aggregators. It offers an optional filter clause with HAVING that will filter the rowset at the group level. In other words, the predicate in the HAVING clause will be applied to the group and will only include the groups for which it evaluates to true.
Syntax
Group_By_Clause := 'GROUP' 'BY' Expression_List [Having_Clause]
Having_Clause := 'HAVING' Boolean_Expression
Remarks
Expression_List
Provides the list of expressions and column references that define the hierarchical grouping. The groups are built from left to right: the left most expression defines the outer group, every subsequent expression defines a group inside the previous group. In the end all groups are flattened into a set of groups on which the aggregations will occur.Note that the SELECT clause in a SELECT expression with a GROUP BY has to contain only the columns or expressions that appear in the
GROUP BY
’s expression list and the aggregation expressions. Otherwise an error is raised. Expressions or column references in the GROUP BY do not have to be in the SELECT clause, even though they are being used to define the groups.Having_Clause
TheHAVING
clause offers an optional filter clause that will filter the rowset at the group level. In other words, the predicate in theHAVING
clause will be applied to the group and will only include the groups for which it evaluates to true.
Examples
- The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
- The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.
The following query uses the GROUP BY clause to provide us with the total number of employees in each department.
@employees = SELECT *
FROM (VALUES
("Rafferty", (int?) 31)
, ("Jones", (int?) 33)
, ("Heisenberg", (int?) 33)
, ("Robinson", (int?) 34)
, ("Smith", (int?) 34)
, ("Williams", (int?) null)) AS E(EmpName, DepID);
@departments = SELECT *
FROM (VALUES
((int) 31, "Sales")
, ((int) 33, "Engineering")
, ((int) 34, "Clerical")
, ((int) 35, "Marketing")) AS D(DepID, DepName);
@rs_group_by =
SELECT d.DepName, COUNT(EmpName) AS TotalEmployees
FROM (SELECT (int?) DepID AS DepID, DepName FROM @departments) AS d
LEFT OUTER JOIN @employees AS e
ON d.DepID == e.DepID
GROUP BY d.DepName;
OUTPUT @rs_group_by
TO "/output/rsGroupBy.csv"
USING Outputters.Csv();
The following query uses the GROUP BY and HAVING clauses to provide us with the number of employees in each department that has at least one employee.
@employees = SELECT *
FROM (VALUES
("Rafferty", (int?) 31)
, ("Jones", (int?) 33)
, ("Heisenberg", (int?) 33)
, ("Robinson", (int?) 34)
, ("Smith", (int?) 34)
, ("Williams", (int?) null)) AS E(EmpName, DepID);
@departments = SELECT *
FROM (VALUES
((int) 31, "Sales")
, ((int) 33, "Engineering")
, ((int) 34, "Clerical")
, ((int) 35, "Marketing")) AS D(DepID, DepName);
@rs_having =
SELECT d.DepName, COUNT(EmpName) AS TotalEmployees
FROM (SELECT (int?) DepID AS DepID, DepName FROM @departments) AS d
LEFT OUTER JOIN @employees AS e
ON d.DepID == e.DepID
GROUP BY d.DepName
HAVING COUNT(EmpName) > 0;
OUTPUT @rs_having
TO "/output/rsHaving.csv"
USING Outputters.Csv();