Virtualize parquet file in a S3-compatible object storage with PolyBase

Статия
06/03/2024

Applies to: SQL Server 2022 (16.x)

SQL Server 2022 (16.x) can virtualize data from parquet files. This process allows the data to stay in its original location, but can be queried from a SQL Server instance with T-SQL commands, like any other table. This feature uses PolyBase connectors, and minimizes the need for extract, transform, and load (ETL) processes.

In the following example, we'll virtualize a parquet file stored on S3-compatible object storage.

For more information on data virtualization, see Introducing data virtualization with PolyBase.

Prerequisites

To use the S3-compatible object storage integration features, you need the following tools and resources:

Install the PolyBase feature for SQL Server.
Install SQL Server Management Studio (SSMS) or Azure Data Studio.
S3-compatible storage.
An S3 bucket created. Buckets cannot be created or configured from SQL Server.
A user (Access Key ID) and the secret (Secret Key ID) and that user is known to you. You will need both to authenticate against the S3 object storage endpoint.
ListBucket permission on S3 user.
ReadOnly permission on S3 user.
TLS must have been configured. It is assumed that all connections will be securely transmitted over HTTPS not HTTP. The endpoint will be validated by a certificate installed on the SQL Server OS Host.

Permission

In order for the proxy user to read the content of an S3 bucket the user needs to be allowed to perform the following actions against the S3 endpoint:

ListBucket;
ReadOnly;

Pre-configuration

Enable PolyBase in sp_configure:

exec sp_configure @configname = 'polybase enabled', @configvalue = 1
;
RECONFIGURE
;
exec sp_configure @configname = 'polybase enabled'
;

Before you create a database scoped credential, the user database must have a master key to protect the credential. For more information, see CREATE MASTER KEY.

Create a Database Scoped Credential

The following sample script creates a database scoped credential s3-dc in the source user database in SQL Server. For more information, see CREATE DATABASE SCOPED CREDENTIAL (Transact-SQL).

IF NOT EXISTS(SELECT * FROM sys.credentials WHERE name = 's3_dc')
BEGIN
 CREATE DATABASE SCOPED CREDENTIAL s3_dc
 WITH IDENTITY = 'S3 Access Key',
 SECRET = '<AccessKeyID>:<SecretKeyID>' ;
END

Verify the new database-scoped credential with sys.database_scoped_credentials (Transact-SQL):

SELECT * FROM sys.database_scoped_credentials;

Create an external data source

The following sample script creates an external data source s3_ds in the source user database in SQL Server. The external data source references the s3_dc database scoped credential. For more information, see CREATE EXTERNAL DATA SOURCE.

CREATE EXTERNAL DATA SOURCE s3_ds
WITH
(   LOCATION = 's3://<ip_address>:<port>/'
,   CREDENTIAL = s3_dc
);
GO

Verify the new external data source with sys.external_data_sources.

SELECT * FROM sys.external_data_sources;

Virtual hosted URLs

Some S3-compatible storage systems (such as Amazon Web Services) utilize virtual_hosted style URLs to implement folder structure in the S3 bucket. Add the following CONNECTION_OPTIONS to allow for creation of external tables pointing to folder locations in the S3 bucket, for example CONNECTION_OPTIONS = '{"s3":{"url_style":"virtual_hosted"}}'.

Without that CONNECTION_OPTIONS setting, when querying external tables pointing to a folder, you might observe the following error:

Msg 13807, Level 16, State 1, Line 23  
Content of directory on path '/<folder_name>/' cannot be listed.

SELECT from a parquet file using OPENROWSET

The following example demonstrates using T-SQL to query a parquet file stored in S3-compatible object storage via OPENROWSET query. For more information, see OPENROWSET (Transact-SQL).

As this is a parquet file, two important things are happening automatically:

SQL Server reads the schema from the file itself, so there is no need to define the table, columns, or data types.
There is no need to declare the type of compression for the file to be read.

SELECT  * 
FROM    OPENROWSET
        (   BULK '/<bucket>/<parquet_folder>'
        ,   FORMAT       = 'PARQUET'
        ,   DATA_SOURCE  = 's3_ds'
        ) AS [cc];

Query S3-compatible object storage via external table

The following example demonstrates using T-SQL to query a parquet file stored in S3-compatible object storage via querying external table. The sample uses a relative path within the external data source.

CREATE EXTERNAL FILE FORMAT ParquetFileFormat WITH(FORMAT_TYPE = PARQUET);
GO

CREATE EXTERNAL TABLE Region (
r_regionkey BIGINT,
r_name CHAR(25),
r_comment VARCHAR(152) )
WITH (LOCATION = '/region/', DATA_SOURCE = 's3_ds', 
FILE_FORMAT = ParquetFileFormat);
GO

SELECT * FROM [Region];

For more information, see:

Limitations

SQL Server queries on an external table backed by S3-compatible storage are limited to 1,000 objects per prefix. This is because S3-compatible object listing is limited to 1,000 object keys per prefix.
For S3-compatible object storage, customers are not allowed to create their access key ID with a : character in it.
The total URL length is limited to 259 characters. This means s3://<hostname>/<objectkey> shouldn't exceed 259 characters. The s3:// counts towards this limit, so the path length cannot exceed 259-5 = 254 characters.
The SQL credential name is limited by 128 characters in UTF-16 format.
The credential name created must contain the bucket name unless this credential is for a new external data source.
Access Key ID and Secret Key ID must only contain alphanumeric values.

Next steps

To learn more about PolyBase, see Overview of SQL Server PolyBase
Configure PolyBase to access external data in S3-compatible object storage

Споделяне чрез