CREATE DATA SOURCE (U-SQL)

Summary

U-SQL provides the CREATE DATA SOURCE statement to create a data source. Data Sources give U-SQL the ability to query data from other data sources such as other databases. It contains the connection information as well as information about its capabilities to execute query expressions passed and translated from U-SQL to the its local query engine. For example, U-SQL allows to create a data source on an Azure SQL Database which then allows U-SQL queries to select from tables in that database.

Note

In order for the data source to be accessible in federated queries, the firewall rules of the external data source has to provide access to Azure Data Lake Analytics. See the descriptions under External_Source for more details.

Syntax

Create_Datasource_Statement :=                                                                           
    'CREATE' 'DATA' 'SOURCE' ['IF' 'NOT' 'EXISTS'] Datasource_Name  
    'FROM' External_Source   
    'WITH' '(' Datasource_Options ')'.
Datasource_Name := Quoted_or_Unquoted_Identifier.

Remarks

  • Datasource_Name
    Specifies the name of the data source to be created in the current static database context. If an object of the given name already exists in the database context or the user has no permissions to create a data source, an error is raised.

  • IF NOT EXISTS
    If the optional IF NOT EXISTS is specified, then the statement creates the data source if it does not already exist, or succeeds without changes if the data source already exists and the user has permission to at least enumerate all existing data sources.

  • External_Source
    Specifies the type of data source that gets created. This keyword indicates what type of remote capabilities are available for executing federated queries and implies the options that can be specified in the data source options.

    Currently supported are:

Syntax

  External_Source :=                                                                                  
      'AZURESQLDB' | 'AZURESQLDW' | 'SQLSERVER'.
External_Source Description
AZURESQLDB Specifies that the external source is a Microsoft Azure SQL Database instance.
AZURESQLDW Specifies that the external source is a Microsoft Azure SQL Datawarehouse instance.
SQLSERVER Specifies that the external source is a Microsoft SQL Server instance running in an accessible Microsoft Azure VM. Currently SQL Server 2012 and newer are supported.

In order to provide U-SQL access to query the SQL Server instance, provide access to the resources (open the firewall) to the Azure Datalake Analytics services, as shown.

  • AZURESQLDB and AZURESQLDW
    Configure the SQL Server (not the database) firewall to allow access to Azure Services. The following screenshot shows the option:

    Azure SQL DB Firewall Settings

  • SQLSERVER
    Configure the Azure VM firewall to open the SQL Server endpoint to the IP range specific to your data region where you are running your U-SQL scripts for SQL Server traffic. To identify your region, see the Review Region example further below.

    The current IP ranges are:

    Region IP Ranges Additional IP lists
    Europe North 104.44.91.64/27 Additional Public IPs_Europe North
    Europe West 104.44.93.192/27 Additional Public IPs_Europe West
    US Central 104.44.91.160/27, 40.90.144.0/27, 40.90.154.115/114 Additional Public IPs_US Central
    US East 2 104.44.91.96/27, 40.90.144.64/26, 13.105.23.105, 13.105.23.103, 13.105.23.100 Additional Public IPs_US East 2

    For example, the following screenshot show an example configuration when configuring an Azure VM with SQL Server for the US East 2 region:

    SQL Server Azure VM Firewall Setting

    If, after this configuration, the following error is raised when querying a SQLSERVER data source:

    Error number: 10060, message: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
    

    check the following possible causes:

    1. Make sure that the credential information for the data source object is correct provides access to the SQL Server instance.
    2. Make sure that the data source is correctly configured and that the provided provider string information is correct.
    3. If both a and b are correct, the U-SQL job may be running from a machine that has not been allowed access through the firewall. If the firewall step above was followed and the error still occurs, please contact your Microsoft Support contact to get the new IP ranges.

  • Datasource_Options
    Specifies the options that are being used to connect and interact with the external data source. These options are source specific (although at the moment they all apply to the three supported sources).

Syntax

  Datasource_Options :=                                                                               
      Datasource_Option {',' Datasource_Option}.
Datasource_Option := 'CREDENTIAL' '=' Credential_Identifier 'PROVIDER_STRING' '=' lexical_provider_string. 'REMOTABLE_TYPES' '=' '(' [Remotable_Types] ')'.
Remotable_Types := Simple_Type {',' Simple_Type }.
  • CREDENTIAL (mandatory)
    The CREDENTIAL option has to be specified exactly once and refers to the credential of the given name in the current static database context. If the credential is not specified, or specified more than once, does not exist, or the user has no permissions to use it, an error is raised.

    The credential object contains the information about the data source, the port, the external data source's user id and password that will be used to populate the ADO.Net Provider String's Server, UserID and Password options.

    See New-AzureRmDataLakeAnalyticsCatalogCredential for information on how to create a new Azure Data Lake Analytics catalog credential.

  • PROVIDER_STRING
    The PROVIDER_STRING data source option allows the user to specify additional ADO.Net connection string options besides the ones that are being provided through the CREDENTIAL option. If the Data Source (and its aliases, including Server), UserID/UID or Password/PWD options are specified in the PROVIDER_STRING, an error is raised.

    The PROVIDER_STRING data source option can be specified at most once. If it is left away, the defaults listed below are set. If it has been specified more than once, an error is raised.

    The following table provides the connection string options that can be provided in the connection string and their defaults (see SqlConnection.ConnectionString Property for their meaning):

    Option Default
    App/Application Name Azure Data Lake Analytics
    Current Language/Language N/A
    Initial Catalog/Database N/A
    Type System Version N/A

    In addition, U-SQL sets some of the connection string options for its own processing that cannot be overwritten. If they appear in the PROVIDER_STRING they are overwritten with the default or an error is raised (see table below). These reserved options are:

    Option Enforced Default Error raised?
    ApplicationIntent ReadOnly No
    Connect Timeout/Connection Timeout/Timeout 30 seconds No
    ConnectRetryCount 3 during Compilation; 6 during Execution No
    ConnectRetryInterval 3 to 7 seconds No
    Encrypt True Yes
    Enlist False No
    Integrated Security/Trusted_Connection False Yes
    Persist Security Info/PersistSecurityInfo False No
    TrustServerCertificate False Yes

    Important

    The external source SQLSERVER allows overwriting Encrypt and TrustServerCertificate, but a severe warning is raised and it is strongly advised against overwriting these options.

    Any option that is not listed in the tables above or provided by means of the CREDENTIAL object will be set to its default value according to the SqlConnection.ConnectionString Property documentation.

  • REMOTABLE_TYPES
    The REMOTABLE_TYPES option specifies the list of simple built-in U-SQL types for which the U-SQL query evaluation will push supported expressions to the external data source. For example, if the option specifies int and string as a remotable type but not double then comparisons, some known string expressions and other supported expressions on columns mapped to int and string will be executed remotely, preserving the C# semantics as closely as possible, while the expressions on columns mapped to double will be executed in U-SQL. It can be specified at most once. If it is not specified, then no expressions will be sent to the external data source for remote execution. If it is specified more than once, an error is raised.

Important

CREATE DATA SOURCE itself does not check if the remote source is accessible with the provided credentials and provider settings. Currently the connection is checked with CREATE EXTERNAL TABLE and when selecting from an external data source.

Examples

Review Region
Use the following PowerShell command to confirm the Region associated with your Azure Data Lake Analytics account.

Login-AzureRmAccount;

$DataLakeAnalyticsAccount = "<adla_account>";
(Get-AzureRmDataLakeAnalyticsAccount -Name $DataLakeAnalyticsAccount).Location;

Create Credential
The New-AzureRmDataLakeAnalyticsCatalogCredential cmdlet creates a new credential to use in an Azure Data Lake Analytics catalog for connecting to external data sources. The syntax is illustrated below:

Login-AzureRmAccount;
Set-AzureRMContext -SubscriptionId <SubscriptionId>;

# AZURESQLDB
New-AzureRmDataLakeAnalyticsCatalogCredential -AccountName "<adla_account>" -DatabaseName "TestReferenceDB" -CredentialName "MyAzureSQLDBCredential" -Credential (Get-Credential) -DatabaseHost "<AZURESQLDB>.database.windows.net" -Port 1433;

# AZURESQLDW
New-AzureRmDataLakeAnalyticsCatalogCredential -AccountName "<adla_account>" -DatabaseName "TestReferenceDB" -CredentialName "MyAzureSQLDWCredential" -Credential (Get-Credential) -DatabaseHost "<AZURESQLDW>.database.windows.net" -Port 1433;

# SQLSERVER
New-AzureRmDataLakeAnalyticsCatalogCredential -AccountName "<adla_account>" -DatabaseName "TestReferenceDB" -CredentialName "MySQLServerCredential" -Credential (Get-Credential) -DatabaseHost "<IP>" -Port 1433;

Create Data Source - AZURESQLDB

USE DATABASE TestReferenceDB;

CREATE DATA SOURCE MyAzureSQLDBDataSource
FROM AZURESQLDB
WITH
(
    PROVIDER_STRING = "Database=AdventureWorksLT;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = MyAzureSQLDBCredential,
    REMOTABLE_TYPES = (bool, byte, sbyte, short, ushort, int, uint, long, ulong, decimal, float, double, string, DateTime)
);

Create Data Source - AZURESQLDW

USE DATABASE TestReferenceDB;

CREATE DATA SOURCE IF NOT EXISTS MyAzureSQLDWDataSource
FROM AZURESQLDW
WITH
(
    PROVIDER_STRING = "Database=AdventureWorksDW;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = MyAzureSQLDWCredential,
    REMOTABLE_TYPES = (bool, byte, sbyte, short, ushort, int, uint, long, ulong, decimal, float, double, string, DateTime)
);

Create Data Source - SQLSERVER

USE DATABASE TestReferenceDB;
DROP DATA SOURCE IF EXISTS MySQLServerDataSource;

CREATE DATA SOURCE MySQLServerDataSource
FROM SQLSERVER
WITH
(
    PROVIDER_STRING = "Database=AdventureWorks2016;Trusted_Connection=False;Encrypt=True;TrustServerCertificate=True",
    CREDENTIAL = MySQLServerCredential,
    REMOTABLE_TYPES = (bool, byte, sbyte, short, ushort, int, uint, long, ulong, decimal, float, double, string, DateTime)
);

Query Data Source - Method 1

USE DATABASE TestReferenceDB;

@results =
    SELECT *
    FROM EXTERNAL MyAzureSQLDBDataSource EXECUTE 
        @"SELECT @@SERVERNAME AS serverName, GETDATE() AS dayTime, DB_NAME() AS databaseName, * FROM dbo.BuildVersion WITH (NOLOCK)";

OUTPUT @results
TO "/Output/ReferenceGuide/DDL/DataSources/Query1A.csv"
USING Outputters.Csv(outputHeader: true);


@results =
    SELECT *
    FROM EXTERNAL MyAzureSQLDWDataSource EXECUTE 
        @"SELECT GETDATE() AS dayTime, DB_NAME() AS databaseName, * FROM dbo.AdventureWorksDWBuildVersion WITH (NOLOCK)";

OUTPUT @results
TO "/Output/ReferenceGuide/DDL/DataSources/Query1B.csv"
USING Outputters.Csv(outputHeader: true);


@results =
    SELECT *
    FROM EXTERNAL MySQLServerDataSource EXECUTE 
        @"SELECT @@SERVERNAME AS serverName, GETDATE() AS dayTime, DB_NAME() AS databaseName, * FROM dbo.AWBuildVersion WITH (NOLOCK)";

OUTPUT @results
TO "/Output/ReferenceGuide/DDL/DataSources/Query1C.csv"
USING Outputters.Csv(outputHeader: true);

Query Data Source - Method 2

USE DATABASE TestReferenceDB;

@results =
    SELECT DateTime.Now AS dayTime, *
    FROM EXTERNAL MyAzureSQLDBDataSource LOCATION "dbo.BuildVersion";

OUTPUT @results
TO "/Output/ReferenceGuide/DDL/DataSources/Query2A.csv"
USING Outputters.Csv(outputHeader: true);


@results =
    SELECT DateTime.Now AS dayTime, * 
    FROM EXTERNAL MyAzureSQLDWDataSource LOCATION "dbo.AdventureWorksDWBuildVersion";

OUTPUT @results
TO "/Output/ReferenceGuide/DDL/DataSources/Query2B.csv"
USING Outputters.Csv(outputHeader: true);


@results =
    SELECT *
    FROM EXTERNAL MySQLServerDataSource LOCATION "dbo.AWBuildVersion";

OUTPUT @results
TO "/Output/ReferenceGuide/DDL/DataSources/Query2C.csv"
USING Outputters.Csv(outputHeader: true);

See Also