CREATE DATA SOURCE (U-SQL)
Summary
U-SQL provides the CREATE DATA SOURCE
statement to create a data source. Data Sources give U-SQL the ability to query data from other data sources such as other databases. It contains the connection information as well as information about its capabilities to execute query expressions passed and translated from U-SQL to the its local query engine. For example, U-SQL allows to create a data source on an Azure SQL Database which then allows U-SQL queries to select from tables in that database.
Note
In order for the data source to be accessible in federated queries, the firewall rules of the external data source has to provide access to Azure Data Lake Analytics. See the descriptions under External_Source for more details.
Syntax
Create_Datasource_Statement := 'CREATE' 'DATA' 'SOURCE' ['IF' 'NOT' 'EXISTS'] Datasource_Name 'FROM' External_Source 'WITH' '(' Datasource_Options ')'.
Datasource_Name := Quoted_or_Unquoted_Identifier.
Remarks
Datasource_Name
Specifies the name of the data source to be created in the current static database context. If an object of the given name already exists in the database context or the user has no permissions to create a data source, an error is raised.IF NOT EXISTS
If the optionalIF NOT EXISTS
is specified, then the statement creates the data source if it does not already exist, or succeeds without changes if the data source already exists and the user has permission to at least enumerate all existing data sources.External_Source
Specifies the type of data source that gets created. This keyword indicates what type of remote capabilities are available for executing federated queries and implies the options that can be specified in the data source options.Currently supported are:
Syntax
External_Source := 'AZURESQLDB' | 'AZURESQLDW' | 'SQLSERVER'.
External_Source | Description |
---|---|
AZURESQLDB | Specifies that the external source is a Microsoft Azure SQL Database instance. |
AZURESQLDW | Specifies that the external source is a Microsoft Azure SQL Datawarehouse instance. |
SQLSERVER | Specifies that the external source is a Microsoft SQL Server instance running in an accessible Microsoft Azure VM. Currently SQL Server 2012 and newer are supported. |
In order to provide U-SQL access to query the SQL Server instance, provide access to the resources (open the firewall) to the Azure Datalake Analytics services, as shown.
AZURESQLDB and AZURESQLDW
Configure the SQL Server (not the database) firewall to allow access to Azure Services. The following screenshot shows the option:SQLSERVER
Configure the Azure VM firewall to open the SQL Server endpoint to the IP range specific to your data region where you are running your U-SQL scripts for SQL Server traffic. To identify your region, see the Review Region example further below.The current IP ranges are:
Region IP Ranges Additional IP lists Europe North 104.44.91.64/27 Additional Public IPs_Europe North Europe West 104.44.93.192/27 Additional Public IPs_Europe West US Central 104.44.91.160/27, 40.90.144.0/27, 40.90.154.115/114 Additional Public IPs_US Central US East 2 104.44.91.96/27, 40.90.144.64/26, 13.105.23.105, 13.105.23.103, 13.105.23.100 Additional Public IPs_US East 2 For example, the following screenshot show an example configuration when configuring an Azure VM with SQL Server for the US East 2 region:
If, after this configuration, the following error is raised when querying a SQLSERVER data source:
Error number: 10060, message: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
check the following possible causes:
- Make sure that the credential information for the data source object is correct provides access to the SQL Server instance.
- Make sure that the data source is correctly configured and that the provided provider string information is correct.
- If both a and b are correct, the U-SQL job may be running from a machine that has not been allowed access through the firewall. If the firewall step above was followed and the error still occurs, please contact your Microsoft Support contact to get the new IP ranges.
Datasource_Options
Specifies the options that are being used to connect and interact with the external data source. These options are source specific (although at the moment they all apply to the three supported sources).
Syntax
Datasource_Options := Datasource_Option {',' Datasource_Option}.
Datasource_Option := 'CREDENTIAL' '=' Credential_Identifier 'PROVIDER_STRING' '=' lexical_provider_string. 'REMOTABLE_TYPES' '=' '(' [Remotable_Types] ')'.
Remotable_Types := Simple_Type {',' Simple_Type }.
CREDENTIAL
(mandatory)
TheCREDENTIAL
option has to be specified exactly once and refers to the credential of the given name in the current static database context. If the credential is not specified, or specified more than once, does not exist, or the user has no permissions to use it, an error is raised.The credential object contains the information about the data source, the port, the external data source's user id and password that will be used to populate the ADO.Net Provider String's
Server
,UserID
andPassword
options.See New-AzureRmDataLakeAnalyticsCatalogCredential for information on how to create a new Azure Data Lake Analytics catalog credential.
PROVIDER_STRING
ThePROVIDER_STRING
data source option allows the user to specify additional ADO.Net connection string options besides the ones that are being provided through the CREDENTIAL option. If theData Source
(and its aliases, includingServer
),UserID/UID
orPassword/PWD
options are specified in the PROVIDER_STRING, an error is raised.The
PROVIDER_STRING
data source option can be specified at most once. If it is left away, the defaults listed below are set. If it has been specified more than once, an error is raised.The following table provides the connection string options that can be provided in the connection string and their defaults (see SqlConnection.ConnectionString Property for their meaning):
Option Default App/Application Name Azure Data Lake Analytics Current Language/Language N/A Initial Catalog/Database N/A Type System Version N/A In addition, U-SQL sets some of the connection string options for its own processing that cannot be overwritten. If they appear in the PROVIDER_STRING they are overwritten with the default or an error is raised (see table below). These reserved options are:
Option Enforced Default Error raised? ApplicationIntent ReadOnly No Connect Timeout/Connection Timeout/Timeout 30 seconds No ConnectRetryCount 3 during Compilation; 6 during Execution No ConnectRetryInterval 3 to 7 seconds No Encrypt True Yes Enlist False No Integrated Security/Trusted_Connection False Yes Persist Security Info/PersistSecurityInfo False No TrustServerCertificate False Yes Important
The external source SQLSERVER allows overwriting Encrypt and TrustServerCertificate, but a severe warning is raised and it is strongly advised against overwriting these options.
Any option that is not listed in the tables above or provided by means of the CREDENTIAL object will be set to its default value according to the SqlConnection.ConnectionString Property documentation.
REMOTABLE_TYPES
TheREMOTABLE_TYPES
option specifies the list of simple built-in U-SQL types for which the U-SQL query evaluation will push supported expressions to the external data source. For example, if the option specifiesint
andstring
as a remotable type but notdouble
then comparisons, some known string expressions and other supported expressions on columns mapped toint
andstring
will be executed remotely, preserving the C# semantics as closely as possible, while the expressions on columns mapped to double will be executed in U-SQL. It can be specified at most once. If it is not specified, then no expressions will be sent to the external data source for remote execution. If it is specified more than once, an error is raised.
Important
CREATE DATA SOURCE
itself does not check if the remote source is accessible with the provided credentials and provider settings. Currently the connection is checked with CREATE EXTERNAL TABLE and when selecting from an external data source.
Examples
- The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
Review Region
Use the following PowerShell command to confirm the Region associated with your Azure Data Lake Analytics account.
Login-AzureRmAccount;
$DataLakeAnalyticsAccount = "<adla_account>";
(Get-AzureRmDataLakeAnalyticsAccount -Name $DataLakeAnalyticsAccount).Location;
Create Credential
The New-AzureRmDataLakeAnalyticsCatalogCredential cmdlet creates a new credential to use in an Azure Data Lake Analytics catalog for connecting to external data sources.
The syntax is illustrated below:
Login-AzureRmAccount;
Set-AzureRMContext -SubscriptionId <SubscriptionId>;
# AZURESQLDB
New-AzureRmDataLakeAnalyticsCatalogCredential -AccountName "<adla_account>" -DatabaseName "TestReferenceDB" -CredentialName "MyAzureSQLDBCredential" -Credential (Get-Credential) -DatabaseHost "<AZURESQLDB>.database.windows.net" -Port 1433;
# AZURESQLDW
New-AzureRmDataLakeAnalyticsCatalogCredential -AccountName "<adla_account>" -DatabaseName "TestReferenceDB" -CredentialName "MyAzureSQLDWCredential" -Credential (Get-Credential) -DatabaseHost "<AZURESQLDW>.database.windows.net" -Port 1433;
# SQLSERVER
New-AzureRmDataLakeAnalyticsCatalogCredential -AccountName "<adla_account>" -DatabaseName "TestReferenceDB" -CredentialName "MySQLServerCredential" -Credential (Get-Credential) -DatabaseHost "<IP>" -Port 1433;
Create Data Source - AZURESQLDB
USE DATABASE TestReferenceDB;
CREATE DATA SOURCE MyAzureSQLDBDataSource
FROM AZURESQLDB
WITH
(
PROVIDER_STRING = "Database=AdventureWorksLT;Trusted_Connection=False;Encrypt=True",
CREDENTIAL = MyAzureSQLDBCredential,
REMOTABLE_TYPES = (bool, byte, sbyte, short, ushort, int, uint, long, ulong, decimal, float, double, string, DateTime)
);
Create Data Source - AZURESQLDW
USE DATABASE TestReferenceDB;
CREATE DATA SOURCE IF NOT EXISTS MyAzureSQLDWDataSource
FROM AZURESQLDW
WITH
(
PROVIDER_STRING = "Database=AdventureWorksDW;Trusted_Connection=False;Encrypt=True",
CREDENTIAL = MyAzureSQLDWCredential,
REMOTABLE_TYPES = (bool, byte, sbyte, short, ushort, int, uint, long, ulong, decimal, float, double, string, DateTime)
);
Create Data Source - SQLSERVER
USE DATABASE TestReferenceDB;
DROP DATA SOURCE IF EXISTS MySQLServerDataSource;
CREATE DATA SOURCE MySQLServerDataSource
FROM SQLSERVER
WITH
(
PROVIDER_STRING = "Database=AdventureWorks2016;Trusted_Connection=False;Encrypt=True;TrustServerCertificate=True",
CREDENTIAL = MySQLServerCredential,
REMOTABLE_TYPES = (bool, byte, sbyte, short, ushort, int, uint, long, ulong, decimal, float, double, string, DateTime)
);
Query Data Source - Method 1
USE DATABASE TestReferenceDB;
@results =
SELECT *
FROM EXTERNAL MyAzureSQLDBDataSource EXECUTE
@"SELECT @@SERVERNAME AS serverName, GETDATE() AS dayTime, DB_NAME() AS databaseName, * FROM dbo.BuildVersion WITH (NOLOCK)";
OUTPUT @results
TO "/Output/ReferenceGuide/DDL/DataSources/Query1A.csv"
USING Outputters.Csv(outputHeader: true);
@results =
SELECT *
FROM EXTERNAL MyAzureSQLDWDataSource EXECUTE
@"SELECT GETDATE() AS dayTime, DB_NAME() AS databaseName, * FROM dbo.AdventureWorksDWBuildVersion WITH (NOLOCK)";
OUTPUT @results
TO "/Output/ReferenceGuide/DDL/DataSources/Query1B.csv"
USING Outputters.Csv(outputHeader: true);
@results =
SELECT *
FROM EXTERNAL MySQLServerDataSource EXECUTE
@"SELECT @@SERVERNAME AS serverName, GETDATE() AS dayTime, DB_NAME() AS databaseName, * FROM dbo.AWBuildVersion WITH (NOLOCK)";
OUTPUT @results
TO "/Output/ReferenceGuide/DDL/DataSources/Query1C.csv"
USING Outputters.Csv(outputHeader: true);
Query Data Source - Method 2
USE DATABASE TestReferenceDB;
@results =
SELECT DateTime.Now AS dayTime, *
FROM EXTERNAL MyAzureSQLDBDataSource LOCATION "dbo.BuildVersion";
OUTPUT @results
TO "/Output/ReferenceGuide/DDL/DataSources/Query2A.csv"
USING Outputters.Csv(outputHeader: true);
@results =
SELECT DateTime.Now AS dayTime, *
FROM EXTERNAL MyAzureSQLDWDataSource LOCATION "dbo.AdventureWorksDWBuildVersion";
OUTPUT @results
TO "/Output/ReferenceGuide/DDL/DataSources/Query2B.csv"
USING Outputters.Csv(outputHeader: true);
@results =
SELECT *
FROM EXTERNAL MySQLServerDataSource LOCATION "dbo.AWBuildVersion";
OUTPUT @results
TO "/Output/ReferenceGuide/DDL/DataSources/Query2C.csv"
USING Outputters.Csv(outputHeader: true);