Virtualize delta table with PolyBase
Applies to: SQL Server 2022 (16.x) and later versions
SQL Server 2022 (16.x) can query data directly from a delta table folder. This concept, commonly referred to as data virtualization, allows the data to stay in its original location, but can be queried from a SQL Server instance with T-SQL commands like any other table. This feature uses PolyBase connectors, and minimizes the need for copying data via ETL processes.
In the following example, the delta table folder is stored on Azure Blob Storage and accessed via OPENROWSET or an external table.
For more information on data virtualization, Introducing data virtualization with PolyBase.
Preconfiguration
1. Enable PolyBase in sp_configure
exec sp_configure @configname = 'polybase enabled', @configvalue = 1;
RECONFIGURE;
2. Create a user database
This exercise creates a sample database with default settings and location. You use this empty sample database to work with the data and store the scoped credential. In this example, a new empty database named Delta_demo
is used.
CREATE DATABASE [Delta_demo];
3. Create a master key and database scoped credential
The database master key in the user database is required to encrypt the database scoped credential secret, delta_storage_dsc
. For this example, the delta table resides on Azure Data Lake Storage Gen2.
USE [Delta_demo];
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'password';
CREATE DATABASE SCOPED CREDENTIAL delta_storage_dsc
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = '<SAS Token>';
4. Create external data source
Database scoped credential is used for the external data source. In this example, the delta table resides in Azure Data Lake Storage Gen2, so use prefix adls
and the SHARED ACCESS SIGNATURE
identity method. For more information about the connectors and prefixes, including new settings for SQL Server 2022 (16.x), refer to CREATE EXTERNAL DATA SOURCE.
CREATE EXTERNAL DATA SOURCE Delta_ED
WITH (
LOCATION = 'adls://<container>@<storage_account>.dfs.core.windows.net',
CREDENTIAL = delta_storage_dsc
);
For example, if your storage account is named delta_lake_sample
and the container is named sink
, the code would be:
CREATE EXTERNAL DATA SOURCE Delta_ED
WITH (
LOCATION = 'adls://sink@delta_lake_sample.dfs.core.windows.net',
CREDENTIAL = delta_storage_dsc
);
Use OPENROWSET to access the data
In this example, the Data Table folder is named Contoso
.
Since the external data source Delta_ED
is mapped to a container level. The Contoso
delta table folder is located in a root. To query a file in a folder structure, provide a folder mapping relative to the external data source's LOCATION parameter.
SELECT * FROM OPENROWSET
(
BULK '/Contoso',
FORMAT = 'DELTA',
DATA_SOURCE = 'Delta_ED'
) AS [result];
Query data with an external table
CREATE EXTERNAL TABLE can also be used to virtualize the delta table data in SQL Server. The columns must be defined and strongly typed. While external tables take more effort to create, they also provide additional benefits over querying an external data source with OPENROWSET. You can:
- Strengthen the definition of the data typing for a given column
- Define nullability
- Define COLLATION
- Create statistics for a column to optimize the quality of the query plan
- Create a more granular model within SQL Server for data access to enhance your security model
For more information, see CREATE EXTERNAL TABLE.
For the following example, the same data source is used.
1. Create external file format
To define the file's formatting, an external file format is required. External file formats are also recommended due to reusability. For more information, see CREATE EXTERNAL FILE FORMAT.
CREATE EXTERNAL FILE FORMAT DeltaTableFormat WITH(FORMAT_TYPE = DELTA);
2. Create external table
The delta table files are located at /delta/Delta_yob/
and the external data source for this example is S3-compatible object storage, previously configured under the data source s3_eds
. PolyBase can use the as LOCATION the delta table folder or the absolute file itself, which would be located at delta/Delta_yob/_delta_log/00000000000000000000.json
.
-- Create External Table using delta
CREATE EXTERNAL TABLE extCall_Center_delta (
id INT,
name VARCHAR(200),
dob DATE
)
WITH (
LOCATION = '/delta/Delta_yob/',
FILE_FORMAT = DeltaTableFormat,
DATA_SOURCE = s3_eds
);
GO
Limitations
If you create an external table pointing to partitioned delta table, the column used for partitioning returns NULL
when querying the external table. However, if you use an OPENROWSET
query, the column value returns correctly. To work around this limitation, create a view on the OPENROWSET
query, and then query the view to get the partitioned column values to return correctly.
You may encounter the following errors when querying an external Delta table:
Msg 2571, Level 14, State 3, Line 1
User '<user>' does not have permission to run DBCC TRACEON.
Msg 16513, Level 16, State 0, Line 1
Error reading external metadata.
This can happen because there is a QUERYTRACEON
query hint that can be added to Delta file metadata query and that requires sysadmin
server role to execute. If this occurs, you can resolve the issue by globally enabling trace flag 14073 and this will prevent the query hint from being added.