U-SQL Scripts

Summary

The main unit of a U-SQL "program" is a U-SQL script. A script consists of a sequence of U-SQL statements.

Syntax

Script :=                                                                                                
     Statement_List.

Remarks

  • Statement_List
    The main body of a script is a sequence of statements:

Syntax

  Statement_List :=                                                                                   
       { [Statement] ';' }.
  

Each statement is terminated by a semi-colon (;). Statements can be empty or one of

Syntax

  Statement :=                                                                                        
       Use_Statement  
  |    If_Else_Statement 
  |    Declare_Variable_Statement  
  |    Reference_Assembly_Statement  
  |    Deploy_Resource_Statement 
  |    DDL_Statement  
  |    Query_Statement  
  |    Procedure_Call  
  |    Import_Package_Statement  
  |    DML_Statement  
  |    Output_Statement.
  

For more details on the individual statements follow the links to their definitions.

Note that some statements have restrictions on how they can be combined inside a script. For example, you cannot create a table and read from the same table in the same script, since the compiler requires that any input already physically exists at compile time of the query.

Examples

The following shows an extended sample script that operates on one of the U-SQL sample data sets:

DECLARE @in  string = "/Samples/Data/SearchLog.tsv";  
DECLARE @out string = "/output/result.tsv";  
  
@searchlog =  
    EXTRACT UserId          int,  
            Start           DateTime,  
            Region          string,  
            Query           string,  
            Duration        int?,  
            Urls            string,  
            ClickedUrls     string  
    FROM @in  
    USING Extractors.Tsv();  
  
@rs1 =  
    SELECT Start, Region, Duration  
    FROM @searchlog  
    WHERE Region == "en-gb";  
  
@rs1 =  
    SELECT Start, Region, Duration  
    FROM @rs1  
    WHERE Start >= DateTime.Parse("2012/02/16") AND Start <= DateTime.Parse("2012/02/17");  
  
OUTPUT @rs1     
TO @out  
USING Outputters.Tsv();

The script starts with two DECLARE variable statements that declare the variables @in and @out that contain the input and output path names respectively. Then it assigns the variable @searchlog to the EXTRACT expression which then will be used in the next SELECT expression, which does a filter on the Region column. This SELECT expression will be assigned to the variable @rs1. The @rs1 variable will be further refined with the next SELECT expression that adds a range filter on the Start property before it is written to the output file using an OUTPUT statement.

Hopefully, you have stumbled over the wording "assigned the variable to the statement" in the paragraph above. This wording is calling out an important aspect of U-SQL scripts and the processing of a U-SQL script. While it seems that we assign the resulting rowset of each expression to a variable that we then process in the next step, in fact U-SQL is NOT calculating the result at this point in the script.

Instead each variable you see is actually a name for the expression that it is being assigned to. What this means is that U-SQL is not executing each statement sequentially, but composes bigger and bigger statements until there is a big expression tree that then gets optimized and executed as a whole. In language theory this is also known as lambda expression composition.

As a consequence, a single U-SQL script can contain several such expression trees that will be all executed in parallel without producing intermediate state. This gives the U-SQL execution the ability to optimize expressions across the whole script. For example, if none of the later expressions refer to a column that gets extracted in the beginning, the optimizer may decide to not actually extract the column, or a predicate that gets applied to a later rowset is being pushed to be executed much earlier to reduce the amount of data that gets processed in the intermediate steps.

However, it also means that one cannot perform intermediate investigations of state to determine different workflows inside the script. Such investigations either have to be written within the declarative framework or be delegated to an external workflow that will orchestrate individual scripts that produce intermediate snapshot results in files to help with the workflow orchestration.

The following images show the logical expression tree and the physical job graph of the above script as shown in the Azure DataLake Tool for VisualStudio.

Logical Expression Tree

U-SQL Logical Job Execution Tree

Physical Job Graph

U-SQL Physical Job Execution Tree

See Also