Get started with U-SQL in Azure Data Lake Analytics

Articol
12/20/2023

Important

Azure Data Lake Analytics retired on 29 February 2024. Learn more with this announcement.

For data analytics, your organization can use Azure Synapse Analytics or Microsoft Fabric.

U-SQL is a language that combines declarative SQL with imperative C# to let you process data at any scale. Through the scalable, distributed-query capability of U-SQL, you can efficiently analyze data across relational stores such as Azure SQL Database. With U-SQL, you can process unstructured data by applying schema on read and inserting custom logic and UDFs. Additionally, U-SQL includes extensibility that gives you fine-grained control over how to execute at scale.

Learning resources

The U-SQL Tutorial provides a guided walkthrough of most of the U-SQL language. This document is recommended reading for all developers wanting to learn U-SQL.
For detailed information about the U-SQL language syntax, see the U-SQL Language Reference.
To understand the U-SQL design philosophy, see the Visual Studio blog post Introducing U-SQL – A Language that makes Big Data Processing Easy.

Let's create some data

The following U-SQL script is simple and lets us create a sample data file that we can reference in our other U-SQL scripts. We'll discuss the key concepts that make this script go in the next section.

USE DATABASE master;
USE SCHEMA dbo;
@a  = 
    SELECT * FROM 
        (VALUES
            (399266, "2/15/2012 11:53:16 AM", "en-us", "microsoft", 73, "microsoft.com;xbox.com", "NULL"),
			(382045, "2/15/2012 11:53:18 AM", "en-gb", "azure data lake analytics", 614, "microsoft.com;portal.azure.com", "portal.azure.com"),
			(382045, "2/16/2012 11:53:20 AM", "en-gb", "usql", 74, "microsoft.com;github.com", "NULL"),
			(106479, "2/16/2012 11:53:50 AM", "en-ca", "xbox", 24, "xbox.com;xbox.com/xbox360", "xbox.com/xbox360"),
			(906441, "2/16/2012 11:54:01 AM", "en-us", "machine learning", 1213, "microsoft.com;github.com", "NULL"),
			(304305, "2/16/2012 11:54:03 AM", "en-us", "outlook", 60, "microsoft.com;office.com;live.com","microsoft.com"),
			(460748, "2/16/2012 11:54:04 AM", "en-us", "azure storage", 1270, "microsoft.com;portal.azure.com", "portal.azure.com"),
			(354841, "2/16/2012 11:59:01 AM", "en-us", "azure", 610, "microsoft.com;portal.azure.com", "portal.azure.com"),
			(354068, "2/16/2012 12:00:33 PM", "en-mx", "key vault", 422, "microsoft.com;portal.azure.com", "portal.azure.com"),
			(347413, "2/16/2012 12:11:55 PM", "en-gr", "github", 305, "github.com", "NULL"),
			(840614, "2/16/2012 12:13:56 PM", "en-us", "surface", 1220, "microsoft.com", "NULL"),
			(656666, "2/16/2012 12:15:55 PM", "en-us", "visual studio", 691, "microsoft.com;code.visualstudio.com", "NULL"),
			(951513, "2/16/2012 12:17:00 PM", "en-us", "power bi", 63, "microsoft.com;app.powerbi.com", "powerbi.com"),
			(350350, "2/16/2012 12:18:17 PM", "en-us", "data factory", 30, "microsoft.com;portal.azure.com", "NULL"),
			(641615, "2/16/2012 12:19:55 PM", "en-us", "event hubs", 119, "microsoft.com;portal.azure.com", "NULL")
        ) AS 
              D( UserId, Start, Region, Query, Duration, Urls, ClickedUrls );
OUTPUT @a
    TO "Samples/Data/SearchLog.tsv"
    USING Outputters.Tsv();

Read data from a file

Now that we have some data, let's read it from the file we created.

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

OUTPUT @searchlog
    TO "/output/SearchLog-first-u-sql.csv"
    USING Outputters.Csv();

This script doesn't have any transformation steps. It reads from the source file called SearchLog.tsv, schematizes it, and writes the rowset back into a file called SearchLog-first-u-sql.csv.

Notice the question mark next to the data type in the Duration field. It means that the Duration field could be null.

Key concepts

Rowset variables: Each query expression that produces a rowset can be assigned to a variable. U-SQL follows the T-SQL variable naming pattern (@searchlog, for example) in the script.
The EXTRACT keyword reads data from a file and defines the schema on read. Extractors.Tsv is a built-in U-SQL extractor for tab-separated-value files. You can develop custom extractors.
The OUTPUT writes data from a rowset to a file. Outputters.Csv() is a built-in U-SQL outputter to create a comma-separated-value file. You can develop custom outputters.

File paths

The EXTRACT and OUTPUT statements use file paths. File paths can be absolute or relative:

This following absolute file path refers to a file in a Data Lake Store named mystore:

adl://mystore.azuredatalakestore.net/Samples/Data/SearchLog.tsv

This following file path starts with "/". It refers to a file in the default Data Lake Store account:

/output/SearchLog-first-u-sql.csv

Use scalar variables

You can use scalar variables as well to make your script maintenance easier. The previous U-SQL script can also be written as:

DECLARE @in  string = "/Samples/Data/SearchLog.tsv";
DECLARE @out string = "/output/SearchLog-scalar-variables.csv";
@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM @in
    USING Extractors.Tsv();
OUTPUT @searchlog
    TO @out
    USING Outputters.Csv();

Transform rowsets

Use SELECT to transform rowsets. This script will select the columns Start, Region, and Duration, and will write those rows to a file when Region has a value of "en-gb":

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();
@rs1 =
    SELECT Start, Region, Duration
    FROM @searchlog
WHERE Region == "en-gb";
OUTPUT @rs1
    TO "/output/SearchLog-transform-rowsets.csv"
    USING Outputters.Csv();

The WHERE clause uses a C# Boolean expression. You can use the C# expression language to do your own expressions and functions. You can even perform more complex filtering by combining them with logical conjunctions (ANDs) and disjunctions (ORs).

The following script uses the DateTime.Parse() method and a conjunction. columns Start, Region, and Duration, where Region has a value of "en-gb". Then it checks for values of the Duration column between certain dates and writes those values to a file:

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();
@rs1 =
    SELECT Start, Region, Duration
    FROM @searchlog
WHERE Region == "en-gb";
@rs1 =
    SELECT Start, Region, Duration
    FROM @rs1
    WHERE Start >= DateTime.Parse("2012/02/16") AND Start <= DateTime.Parse("2012/02/17");
OUTPUT @rs1
    TO "/output/SearchLog-transform-datetime.csv"
    USING Outputters.Csv();

Note

The second query is operating on the result of the first rowset, which creates a composite of the two filters. You can also reuse a variable name, and the names are scoped lexically.

Aggregate rowsets

U-SQL gives you the familiar ORDER BY, GROUP BY, and aggregations.

The following query finds the total duration per region, and then displays the top five durations in order.

U-SQL rowsets don't preserve their order for the next query. Thus, to order an output, you need to add ORDER BY to the OUTPUT statement:

DECLARE @outpref string = "/output/Searchlog-aggregation";
DECLARE @out1    string = @outpref+"_agg.csv";
DECLARE @out2    string = @outpref+"_top5agg.csv";
@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();
@rs1 =
    SELECT
        Region,
        SUM(Duration) AS TotalDuration
    FROM @searchlog
GROUP BY Region;
@res =
    SELECT *
    FROM @rs1
    ORDER BY TotalDuration DESC
    FETCH 5 ROWS;
OUTPUT @rs1
    TO @out1
    ORDER BY TotalDuration DESC
    USING Outputters.Csv();
OUTPUT @res
    TO @out2
    ORDER BY TotalDuration DESC
    USING Outputters.Csv();

The U-SQL ORDER BY clause requires using the FETCH clause in a SELECT expression.

The U-SQL HAVING clause can be used to restrict the output to groups that satisfy the HAVING condition:

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();
@res =
    SELECT
        Region,
        SUM(Duration) AS TotalDuration
    FROM @searchlog
    GROUP BY Region
    HAVING SUM(Duration) > 200;
OUTPUT @res
    TO "/output/Searchlog-having.csv"
    ORDER BY TotalDuration DESC
    USING Outputters.Csv();

For advanced aggregation scenarios, see the U-SQL reference documentation for aggregate, analytic, and reference functions.

Partajați prin