GROUP BY and HAVING Clauses (U-SQL)

Summary

The optional GROUP BY clause groups the rows based on the provided expression list into groups that then can be aggregated over with the built-in and user-defined aggregators. It offers an optional filter clause with HAVING that will filter the rowset at the group level. In other words, the predicate in the HAVING clause will be applied to the group and will only include the groups for which it evaluates to true.

Syntax

Group_By_Clause :=                                                                                       
    'GROUP' 'BY' Expression_List [Having_Clause]
Having_Clause := 'HAVING' Boolean_Expression

Remarks

  • Expression_List
    Provides the list of expressions and column references that define the hierarchical grouping. The groups are built from left to right: the left most expression defines the outer group, every subsequent expression defines a group inside the previous group. In the end all groups are flattened into a set of groups on which the aggregations will occur.

    Note that the SELECT clause in a SELECT expression with a GROUP BY has to contain only the columns or expressions that appear in the GROUP BY’s expression list and the aggregation expressions. Otherwise an error is raised. Expressions or column references in the GROUP BY do not have to be in the SELECT clause, even though they are being used to define the groups.

  • Having_Clause
    The HAVING clause offers an optional filter clause that will filter the rowset at the group level. In other words, the predicate in the HAVING clause will be applied to the group and will only include the groups for which it evaluates to true.

Examples

  • The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
  • The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.

The following query uses the GROUP BY clause to provide us with the total number of employees in each department.

@employees = SELECT *
               FROM (VALUES   
                      ("Rafferty", (int?) 31)  
                    , ("Jones", (int?) 33)  
                    , ("Heisenberg", (int?) 33)  
                    , ("Robinson", (int?) 34)  
                    , ("Smith", (int?) 34)  
                    , ("Williams", (int?) null)) AS E(EmpName, DepID);  
                      
@departments = SELECT *  
                FROM (VALUES  
                       ((int) 31, "Sales")  
                     , ((int) 33, "Engineering")  
                     , ((int) 34, "Clerical")  
                     , ((int) 35, "Marketing")) AS D(DepID, DepName);  
  
@rs_group_by =   
    SELECT d.DepName, COUNT(EmpName) AS TotalEmployees  
    FROM (SELECT (int?) DepID AS DepID, DepName FROM @departments) AS d   
          LEFT OUTER JOIN @employees AS e  
         ON d.DepID == e.DepID  
    GROUP BY d.DepName;  
  
OUTPUT @rs_group_by   
TO "/output/rsGroupBy.csv"  
USING Outputters.Csv();

The following query uses the GROUP BY and HAVING clauses to provide us with the number of employees in each department that has at least one employee.

@employees = SELECT *  
               FROM (VALUES   
                      ("Rafferty", (int?) 31)  
                    , ("Jones", (int?) 33)  
                    , ("Heisenberg", (int?) 33)  
                    , ("Robinson", (int?) 34)  
                    , ("Smith", (int?) 34)  
                    , ("Williams", (int?) null)) AS E(EmpName, DepID);  
                      
@departments = SELECT *  
                FROM (VALUES  
                       ((int) 31, "Sales")  
                     , ((int) 33, "Engineering")  
                     , ((int) 34, "Clerical")  
                     , ((int) 35, "Marketing")) AS D(DepID, DepName);  
  
@rs_having =   
    SELECT d.DepName, COUNT(EmpName) AS TotalEmployees  
    FROM (SELECT (int?) DepID AS DepID, DepName FROM @departments) AS d   
          LEFT OUTER JOIN @employees AS e  
         ON d.DepID == e.DepID  
    GROUP BY d.DepName  
    HAVING COUNT(EmpName) > 0;  
  
OUTPUT @rs_having   
TO "/output/rsHaving.csv"  
USING Outputters.Csv();

See Also