Events
Mar 31, 11 PM - Apr 2, 11 PM
The biggest SQL, Fabric and Power BI learning event. March 31 – April 2. Use code FABINSIDER to save $400.
Register todayThis browser is no longer supported.
Upgrade to Microsoft Edge to take advantage of the latest features, security updates, and technical support.
Applies to:
SQL Server (Windows only)
Azure SQL Database
Azure Synapse Analytics
Analytics Platform System (PDW)
The article explains how to use PolyBase on a SQL Server instance to query external data in Azure Blob Storage.
If you haven't installed PolyBase, see PolyBase installation. The installation article explains the prerequisites.
In SQL Server 2022 (16.x), configure your external data sources to use new connectors when you connect to Azure Storage. The table below summarizes the change:
External Data Source | From | To |
---|---|---|
Azure Blob Storage | wasb[s] | abs |
ADLS Gen 2 | abfs[s] | adls |
First, configure SQL Server PolyBase to use Azure Blob Storage.
Run sp_configure with 'hadoop connectivity' set to an Azure Blob Storage provider. To find the value for providers, see PolyBase Connectivity Configuration. By Default, the Hadoop connectivity is set to 7.
-- Values map to various external data sources.
-- Example: value 7 stands for Hortonworks HDP 2.1 to 2.6 on Linux,
-- 2.1 to 2.3 on Windows Server, and Azure Blob Storage
sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
GO
RECONFIGURE
GO
Restart SQL Server using services.msc. Restarting SQL Server restarts these services:
Restart SQL Server using services.msc. Restarting SQL Server restarts these services:
To query the data in your Hadoop data source, you must define an external table to use in Transact-SQL queries. The following steps describe how to configure the external table.
Create a master key on the database. The master key is required to encrypt the credential secret.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
Create a database scoped credential for Azure Blob Storage; IDENTITY
can be anything as it's not used.
-- IDENTITY: any string (this is not used for authentication to Azure storage).
-- SECRET: your Azure storage account key.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', Secret = '<azure_storage_account_key>';
Create an external data source with CREATE EXTERNAL DATA SOURCE. Note that when connecting to the Azure Storage via the wasb[s]
connector, authentication must be done with a storage account key, not with a shared access signature (SAS).
-- LOCATION: Azure account storage account name and blob container name.
-- CREDENTIAL: The database scoped credential created above.
CREATE EXTERNAL DATA SOURCE AzureStorage with (
TYPE = HADOOP,
LOCATION ='wasbs://<blob_container_name>@<azure_storage_account_name>.blob.core.windows.net',
CREDENTIAL = AzureStorageCredential
);
Create an external file format with CREATE EXTERNAL FILE FORMAT.
-- FORMAT TYPE: Type of format in Hadoop (DELIMITEDTEXT, RCFILE, ORC, PARQUET).
CREATE EXTERNAL FILE FORMAT TextFileFormat WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR ='|',
USE_TYPE_DEFAULT = TRUE))
Create an external table pointing to data stored in Azure storage with CREATE EXTERNAL TABLE. In this example, the external data contains car sensor data; LOCATION
can't be /
but /Demo/
as in this example doesn't need to exist previously.
-- LOCATION: path to file or directory that contains the data (relative to HDFS root).
CREATE EXTERNAL TABLE [dbo].[CarSensor_Data] (
[SensorKey] int NOT NULL,
[CustomerKey] int NOT NULL,
[GeographyKey] int NULL,
[Speed] float NOT NULL,
[YearMeasured] int NOT NULL
)
WITH (LOCATION='/Demo/',
DATA_SOURCE = AzureStorage,
FILE_FORMAT = TextFileFormat
);
Create statistics on an external table.
CREATE STATISTICS StatsForSensors on CarSensor_Data(CustomerKey, Speed)
Create a master key on the database. The master key is required to encrypt the credential secret.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
Create a database scoped credential for Azure Blob Storage using a shared access signature (SAS); IDENTITY
can be anything as it's not used.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
IDENTITY = 'SHARED ACCESS SIGNATURE',
-- Remove ? from the beginning of the SAS token
SECRET = '<azure_shared_access_signature>' ;
Create an external data source with CREATE EXTERNAL DATA SOURCE. Note that when connecting to the Azure Storage via the WASB[s] connector, authentication with a shared access signature (SAS).
-- LOCATION: Azure account storage account name and blob container name.
-- CREDENTIAL: The database scoped credential created above.
CREATE EXTERNAL DATA SOURCE AzureStorage with (
LOCATION ='wasbs://<blob_container_name>@<azure_storage_account_name>.blob.core.windows.net',
CREDENTIAL = AzureStorageCredential
);
Create an external file format with CREATE EXTERNAL FILE FORMAT.
-- FORMAT TYPE: Type of format in Hadoop (DELIMITEDTEXT, RCFILE, ORC, PARQUET).
CREATE EXTERNAL FILE FORMAT TextFileFormat WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR ='|',
USE_TYPE_DEFAULT = TRUE))
Create an external table pointing to data stored in Azure storage with CREATE EXTERNAL TABLE. In this example, the external data contains car sensor data; LOCATION
can't be /
but /Demo/
as in this example doesn't need to exist previously.
-- LOCATION: path to file or directory that contains the data (relative to HDFS root).
CREATE EXTERNAL TABLE [dbo].[CarSensor_Data] (
[SensorKey] int NOT NULL,
[CustomerKey] int NOT NULL,
[GeographyKey] int NULL,
[Speed] float NOT NULL,
[YearMeasured] int NOT NULL
)
WITH (LOCATION='/Demo/',
DATA_SOURCE = AzureStorage,
FILE_FORMAT = TextFileFormat
);
Create statistics on an external table.
CREATE STATISTICS StatsForSensors on CarSensor_Data(CustomerKey, Speed)
There are three functions that PolyBase is suited for:
The following queries provide example with fictional car sensor data.
The following ad hoc query joins relational with Hadoop data. It selects customers who drive faster than 35 mph, and joins to structured customer data stored in SQL Server with car sensor data stored in Hadoop.
SELECT DISTINCT Insured_Customers.FirstName,Insured_Customers.LastName,
Insured_Customers. YearlyIncome, CarSensor_Data.Speed
FROM Insured_Customers, CarSensor_Data
WHERE Insured_Customers.CustomerKey = CarSensor_Data.CustomerKey and CarSensor_Data.Speed > 35
ORDER BY CarSensor_Data.Speed DESC
OPTION (FORCE EXTERNALPUSHDOWN); -- or OPTION (DISABLE EXTERNALPUSHDOWN)
The following query imports external data into SQL Server. This example imports data for fast drivers into SQL Server to do more in-depth analysis. To improve performance, it leverages columnstore technology.
SELECT DISTINCT
Insured_Customers.FirstName, Insured_Customers.LastName,
Insured_Customers.YearlyIncome, Insured_Customers.MaritalStatus
INTO Fast_Customers from Insured_Customers INNER JOIN
(
SELECT * FROM CarSensor_Data where Speed > 35
) AS SensorD
ON Insured_Customers.CustomerKey = SensorD.CustomerKey
ORDER BY YearlyIncome
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FastCustomers ON Fast_Customers;
The following query exports data from SQL Server to Azure Blob Storage. First enable PolyBase export. Then, create an external table for the destination before exporting data to it.
-- Enable INSERT into external table
sp_configure 'allow polybase export', 1;
reconfigure
-- Create an external table.
CREATE EXTERNAL TABLE [dbo].[FastCustomers2009] (
[FirstName] char(25) NOT NULL,
[LastName] char(25) NOT NULL,
[YearlyIncome] float NULL,
[MaritalStatus] char(1) NOT NULL
)
WITH (
LOCATION='/old_data/2009/customerdata',
DATA_SOURCE = HadoopHDP2,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
-- Export data: Move old data to Hadoop while keeping it query-able via an external table.
INSERT INTO dbo.FastCustomer2009
SELECT T.* FROM Insured_Customers T1 JOIN CarSensor_Data T2
ON (T1.CustomerKey = T2.CustomerKey)
WHERE T2.YearMeasured = 2009 and T2.Speed > 40;
PolyBase export with this method may create multiple files.
In SSMS, external tables are displayed in a separate folder External Tables. External data sources and external file formats are in subfolders under External Resources.
For more tutorials on creating external data sources and external tables to a variety of data sources, see PolyBase Transact-SQL reference.
Explore more ways to use and monitor PolyBase in the following articles:
Events
Mar 31, 11 PM - Apr 2, 11 PM
The biggest SQL, Fabric and Power BI learning event. March 31 – April 2. Use code FABINSIDER to save $400.
Register todayTraining
Module
Introduction to SQL Server 2022 data virtualization - Training
Learn about data virtualization, how to use Polybase to access and query external data, and enhanced Polybase features in SQL Server 2022.
Certification
Microsoft Certified: Azure Data Engineer Associate - Certifications
Demonstrate understanding of common data engineering tasks to implement and manage data engineering workloads on Microsoft Azure, using a number of Azure services.
Documentation
Virtualize a delta table with PolyBase - SQL Server
Virtualize a delta table with PolyBase starting with SQL Server 2022.
PolyBase Transact-SQL reference - SQL Server
Use PolyBase to query your external data in Hadoop, Azure Blob Storage, Azure Data Lake Store, SQL Server, Oracle, Teradata, MongoDB, or CSV files.
CREATE EXTERNAL DATA SOURCE (Transact-SQL) - SQL Server
CREATE EXTERNAL DATA SOURCE creates an external data source used to establish connectivity and data virtualization from SQL Server and Azure SQL platforms.