CREATE TABLE (U-SQL): Creating a Table with Schema

Summary

U-SQL allows a managed table to be created by specifying a schema. The table will have to have a clustered index specified in order to be able to contain data and the table will be partitioned.

Note

All managed U-SQL tables are currently clustered tables where the cluster information is specified with a clustered index. In particular, other types of tables such as heaps and column store tables are not supported.

When creating a managed U-SQL table with a schema, a table schema has to be provided that contains at least one table column definition.

Syntax

Create_Managed_Table_With_Schema_Statement :=                                                            
    'CREATE' 'TABLE' ['IF' 'NOT' 'EXISTS'] Identifier 
    Table_With_Schema.
Table_With_Schema := '(' { Column_Definition ',' } [ Table_Index Partition_Specification ] { ',' Column_Definition } ')' | '(' { Column_Definition ',' } [ Table_Index ] { ',' Column_Definition } ')' Partition_Specification.

Remarks

  • Identifier
    Specifies the name of the schema. If the Identifier is a three-part identifier, the table will be created in the specified database and schema. If it is a two-part identifier, then the table will be created in the specified schema of the current database context. If the identifier is a simple identifier, then the table will be created in the current database and schema context.

    If a table or other object of the given name already exists in the specified database and schema context or the user has no permissions to create a table, an error is raised.

  • IF NOT EXISTS
    If the optional IF NOT EXISTS is specified, then the statement creates the table if it does not already exist, or succeeds without changes if the table already exists and the user has permission to at least enumerate all existing tables.

  • Table_With_Schema
    A table schema contains at least one column definition (note the above syntax is slightly simplified for readability) and can optionally contain a table index and a partition specification. The table index is optional in the definition of the table and the partition specification can either follow the index specification or can follow the table schema. If the partition specification follows the schema, then the schema needs to contain the index specification.

    Note

    While the index definition is optional as part of the table definition, no data can be inserted into the table until an index has been defined. If the index definition is not part of the table definition, then a CREATE CLUSTERED INDEX statement has to be executed before data can be inserted into the table.

  • Column_Definition
    A column definition is of the form

Syntax

  Column_Definition :=                                                                                
      Quoted_or_Unquoted_Identifier Built_in_Type.
  

Each column has an identifier that can be either a quoted or unquoted identifier which is typed with one of the built-in U-SQL types. Note that there are currently no constraints such as primary key, foreign key, unique etc. supported. Unlike in the case of traditional SQL tables, nullability is part of the type and not a column property.

  • Table_Index
    The table index defines the clustered index of the table. It specifies the name of the index that is local to the table as a quoted or unquoted identifier and by providing a list of columns that determine how and in which order the rows will be ordered. It basically determines how the data will be physically stored in the clustered table.

    The table index syntax looks like

Syntax

  Table_Index :=                                                                                      
      'INDEX' Quoted_or_Unquoted_Identifier   
      'CLUSTERED' '(' Sort_Item_List ')'.
Sort_Item_List := Sort_Item {',' Sort_Item}.
Sort_Item := Quoted_or_Unquoted_Identifier [Sort_Direction].
Sort_Direction := 'ASC' | 'DESC'.
  • Partition_Specification
    The partition specification provides information how the data inside the table is being partitioned and distributed.

    If the table index is being provided, the partition specification has to be provided as well. The syntax looks like:

Syntax

  Partition_Specification :=                                                                          
      [ 'PARTITIONED' ['BY'] '(' Identifier_List ')' ]   
      Distribution_Specification.
Identifier_List := Quoted_or_Unquoted_Identifier {',' Quoted_or_Unquoted_Identifier}.
Distribution_Specification := 'DISTRIBUTED' ['BY'] Distribution_Scheme ['INTO' integer_or_long_literal].
Distribution_Scheme := 'RANGE' '(' Sort_Item_List ')' | 'HASH' '(' Identifier_List ')' | 'DIRECT' 'HASH' '(' Identifier ')' | 'ROUND' 'ROBIN'.

For detailed semantics about partitioning see the section on partitioning.

U-SQL Table Partitions and Distributions

U-SQL Tables can be partitioned in two-levels: the higher-level coarse-grained partitioning into addressable partitions and the lower-level fine grained distribution within a table or a partition.

U-SQL Table Partition

Many use cases around data life cycle management, such as loading of daily or hourly data, require that one can add individual partitions and manage them separately. This is best done with partitioning a table into individually addressable partitions. Each of these partitions have to then explicitly be added with ALTER TABLE ADD PARTITION and can be removed with ALTER TABLE DROP PARTITION. In addition, the query processor will perform partition elimination on supported predicates.

Currently U-SQL only allows partitioning in conjunction with HASH and ROUND ROBIN distributions.

Note that the types of the columns used to specify partitioning schemes have to be comparable. In particular that means that the type cannot be a complex type. Additionally, in order to provide precise and deterministic semantics for partitioning on DateTime values, the values used to partition on columns of DateTime types have to have their DateTimeKind set to DateTimeKind.Utc.

U-SQL Table Distributions

Every table has to provide at least a distribution scheme that will partition the data inside the table (or inside each partition) according to the specified scheme.

Currently U-SQL supports four distribution schemes:

  • RANGE
    Based on a set of ordered columns. Tables distributed by RANGE needs to provide a list of columns that will determine the ranges over which to distribute. The system will determine the bounds for the distributions. Each distribution contains all the rows between those bounds, according to the sort order given and an artificial MAX and MIN value to close the first and last distributions. For any distribution P that has a left boundary value L and a right boundary value R, P will contain the range from L inclusive to R-1. The value R will be in the next distribution. If the data is ordered in ascending order, then L corresponds to the lower bound and R to the upper bound, if the data is ordered in descending order then L corresponds to the high value and R to the low value. The column order is significant in determining the distribution and ASC/DESC optionally specifies the order and is defaulted to ascending.

  • HASH
    Based on a set of columns. Tables distributed by HASH requires a list of columns that will be used to route each row individually to a distributions based on a hash of the columns specified.

  • DIRECT HASH
    Based on single column of an integral type. The DIRECT HASH distribution scheme provides direct control of mapping a row into the distribution by using the value in the integer column as the distribution bucket id.

  • ROUND ROBIN
    ROUND ROBIN assigns rows to distributions individually in round robin fashion without reference to the values they contain. Each distribution should have approximately the same number of rows.

Note that each of the distribution schemes has certain advantages based on the characteristics of the data and the type of queries most frequently run against it. In particular, they should be chosen for distribution sizing, filter predicate selectivity and join comparisons to reduce data reshuffling in case of distribution misalignments.

For example, a range distribution is more likely helping if the query is looking for data in a single distribution using either a range or point query, and a hash distribution is beneficial if point queries are used, but they can introduce data skew if the data is unevenly distributed. Such data skew can hurt query performance if the query needs to query across many distributions. In that case ROUND ROBIN will eliminate most of the data skew and provide better performance.

The INTO clause specifies the number of buckets for the distribution schemes. The value has to be greater than or equal to 2 and less than or equal to 2500. If it is omitted, then U-SQL will give the query processor the flexibility to use as many distribution buckets as appropriate to avoid some costly reshuffling of data during the insertion. The query processor will choose the default number for hash distribution buckets depending on the size of the first data to be inserted into the table in the following way:

U-SQL has a short list of candidate bucket numbers ({ 2, 10, 20, 60, 120, 240, 480 }). It chooses the default from this list based on the estimated data size for the first insert and the upper bound average distribution size limit of 2GB. It uses the smallest number from the list which will produce the average distribution size smaller or equal to the 2GB upper bound.

If the INTO clause is specified, then the data will be distributed into the specified number.

Examples

  • The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
  • The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.

Basic Syntax

CREATE TABLE dbo.Employees
(
    EmpID int,
    EmpName string,
    DeptID int,
    Salary int?,
    StartDate DateTime,
    INDEX clx_EmpID CLUSTERED(EmpID ASC)
)
DISTRIBUTED BY HASH(EmpID);

Basic Syntax - Alternative Method

CREATE TABLE dbo.Employees
(
    EmpID int,
    EmpName string,
    DeptID int,
    Salary int?,
    StartDate DateTime,
    INDEX clx_EmpID CLUSTERED(EmpID ASC) DISTRIBUTED BY HASH(EmpID)
);

Partitioned Table
This examples creates a partitioned table. Data will be distributed over OrderID, CustomerID and partitioned by OrderDate.

CREATE DATABASE IF NOT EXISTS TestReferenceDB;
USE DATABASE TestReferenceDB; 

DROP TABLE IF EXISTS dbo.Orders;
CREATE TABLE dbo.Orders
(
    OrderID int,
    CustomerID int,
    OrderDetailID int,
    OrderTotal double,
    OrderDate DateTime,
    INDEX clx_OrderID_CustomerID CLUSTERED(OrderID, CustomerID ASC)
)
PARTITIONED BY (OrderDate)
DISTRIBUTED BY HASH (OrderID, CustomerID) 
INTO 10;

See Also