Virtualize parquet file in a S3-compatible object storage with PolyBase
Applies to: SQL Server 2022 (16.x)
SQL Server 2022 (16.x) can virtualize data from parquet files. This process allows the data to stay in its original location, but can be queried from a SQL Server instance with T-SQL commands, like any other table. This feature uses PolyBase connectors, and minimizes the need for ETL processes.
In the example below, we'll virtualize a parquet file stored on S3-compatible object storage.
For more information on data virtualization, see Introducing data virtualization with PolyBase.
To use the S3-compatible object storage integration features, you will need the following tools and resources:
- Install the PolyBase feature for SQL Server.
- Install SQL Server Management Studio (SSMS) or Azure Data Studio.
- S3-compatible storage.
- An S3 bucket created. Buckets cannot be created or configured from SQL Server.
- A user (
Access Key ID) has been configured and the secret (
Secret Key ID) and that user is known to you. You will need both to authenticate against the S3 object storage endpoint.
- ListBucket permission on S3 user.
- ReadOnly permission on S3 user.
- TLS must have been configured. It is assumed that all connections will be securely transmitted over HTTPS not HTTP. The endpoint will be validated by a certificate installed on the SQL Server OS Host.
In order for the proxy user to read the content of an S3 bucket the user will need to be allowed to perform the following actions against the S3 endpoint:
- Enable PolyBase in
exec sp_configure @configname = 'polybase enabled', @configvalue = 1 ; RECONFIGURE ; exec sp_configure @configname = 'polybase enabled' ;
- Before you create a database scoped credential, the user database must have a master key to protect the credential. For more information, see CREATE MASTER KEY.
Create a Database Scoped Credential
The following sample script creates a database scoped credential
s3-dc in the source user database in SQL Server. For more information, see CREATE DATABASE SCOPED CREDENTIAL (Transact-SQL).
IF NOT EXISTS(SELECT * FROM sys.credentials WHERE name = 's3_dc') BEGIN CREATE DATABASE SCOPED CREDENTIAL s3_dc WITH IDENTITY = 'S3 Access Key', SECRET = '<AccessKeyID>:<SecretKeyID>' ; END
Verify the new database-scoped credential with sys.database_scoped_credentials (Transact-SQL):
SELECT * FROM sys.database_scoped_credentials;
Create an external data source
The following sample script creates an external data source
s3_ds in the source user database in SQL Server. The external data source references the
s3_dc database scoped credential. For more information, see CREATE EXTERNAL DATA SOURCE.
CREATE EXTERNAL DATA SOURCE s3_ds WITH ( LOCATION = 's3://<ip_address>:<port>/' , CREDENTIAL = s3_dc ); GO
Verify the new external data source with sys.external_data_sources.
SELECT * FROM sys.external_data_sources;
SELECT from a parquet file using OPENROWSET
The following example demonstrates using T-SQL to query a parquet file stored in S3-compatible object storage via OPENROWSET query. For more information, see OPENROWSET (Transact-SQL).
As this is a parquet file, two important things are happening automatically:
- SQL Server will read the schema from the file itself, so there is no need to define the table, columns, or data types.
- There is no need to declare the type of compression for the file to be read.
SELECT * FROM OPENROWSET ( BULK '/<bucket>/<parquet_folder>' , FORMAT = 'PARQUET' , DATA_SOURCE = 's3_ds' ) AS [cc];
Query S3-compatible object storage via external table
The following example demonstrates using T-SQL to query a parquet file stored in S3-compatible object storage via querying external table. The sample uses a relative path within the external data source.
CREATE EXTERNAL FILE FORMAT ParquetFileFormat WITH(FORMAT_TYPE = PARQUET); GO CREATE EXTERNAL TABLE Region ( r_regionkey BIGINT, r_name CHAR(25), r_comment VARCHAR(152) ) WITH (LOCATION = '/region/', DATA_SOURCE = 's3_ds', FILE_FORMAT = ParquetFileFormat); GO SELECT * FROM [Region];
For more information, see:
- SQL Server queries on an external table backed by S3-compatible storage are limited to 1000 objects per prefix. This is because S3-compatible object listing is limited to 1000 object keys per prefix.
- For S3-compatible object storage, customers are not allowed to create their access key ID with a
:character in it.
- The total URL length is limited to 259 characters. This means
s3://<hostname>/<objectkey>shouldn't exceed 259 characters. The
s3://counts towards this limit, so the path length cannot exceed 259-5 = 254 characters.
- The SQL credential name is limited by 128 characters in UTF-16 format.
- The credential name created must contain the bucket name unless this credential is for a new external data source.
- Access Key ID and Secret Key ID must only contain alphanumeric values.
- To learn more about PolyBase, see Overview of SQL Server PolyBase
- Configure PolyBase to access external data in S3-compatible object storage