Tutorial: Query HDFS in a SQL Server big data cluster
Applies to: SQL Server 2019 (15.x)
Important
The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.
This tutorial demonstrates how to query HDFS data in a SQL Server 2019 big data cluster.
In this tutorial, you learn how to:
- Create an external table pointing to HDFS data in a big data cluster.
- Join this data with high-value data in the master instance.
Tip
If you prefer, you can download and run a script for the commands in this tutorial. For instructions, see the Data virtualization samples on GitHub.
This 7-minute video walks you through querying HDFS data in a big data cluster.
Prerequisites
- Big data tools
- kubectl
- Azure Data Studio
- SQL Server 2019 extension
- Load sample data into your big data cluster
Create an external table to HDFS
The storage pool contains web clickstream data in a CSV file stored in HDFS. Use the following steps to define an external table that can access the data in that file.
In Azure Data Studio, connect to the SQL Server master instance of your big data cluster. For more information, see Connect to the SQL Server master instance.
Double-click on the connection in the Servers window to show the server dashboard for the SQL Server master instance. Select New Query.
Run the following Transact-SQL command to change the context to the Sales database in the master instance.
USE Sales
GO
Define the format of the CSV file to read from HDFS. Press F5 to run the statement.
CREATE EXTERNAL FILE FORMAT csv_file
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS(
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2,
        USE_TYPE_DEFAULT = TRUE)
);
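If you want to confirm that the file format was registered, you can check the sys.external_file_formats catalog view (a minimal sketch; it simply looks up the csv_file format created above).

-- Verify that the csv_file external file format exists in the current database.
SELECT name, format_type
FROM sys.external_file_formats
WHERE name = 'csv_file';
GO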
Create an external data source to the storage pool if it does not already exist.
IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
BEGIN
    CREATE EXTERNAL DATA SOURCE SqlStoragePool
    WITH (LOCATION = 'sqlhdfs://controller-svc/default');
END
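After the statement runs, you can inspect the registered data source with the same catalog view used in the guard above (a minimal sketch):

-- Confirm the storage pool data source and the HDFS location it points to.
SELECT name, location
FROM sys.external_data_sources
WHERE name = 'SqlStoragePool';
GO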
Create an external table that can read the /clickstream_data directory from the storage pool. The SqlStoragePool is accessible from the master instance of a big data cluster.

CREATE EXTERNAL TABLE [web_clickstreams_hdfs]
    ("wcs_click_date_sk" BIGINT,
    "wcs_click_time_sk" BIGINT,
    "wcs_sales_sk" BIGINT,
    "wcs_item_sk" BIGINT,
    "wcs_web_page_sk" BIGINT,
    "wcs_user_sk" BIGINT)
WITH
    (DATA_SOURCE = SqlStoragePool,
    LOCATION = '/clickstream_data',
    FILE_FORMAT = csv_file);
GO
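Before running the join in the next section, it can help to sanity-check that the external table resolves against the CSV file in HDFS (a minimal sketch):

-- Preview a few rows from the HDFS-backed external table.
SELECT TOP (10) *
FROM [dbo].[web_clickstreams_hdfs];
GO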
Query the data
Run the following query to join the HDFS data in the web_clickstreams_hdfs external table with the relational data in the local Sales database.
SELECT
wcs_user_sk,
SUM( CASE WHEN i_category = 'Books' THEN 1 ELSE 0 END) AS book_category_clicks,
SUM( CASE WHEN i_category_id = 1 THEN 1 ELSE 0 END) AS [Home & Kitchen],
SUM( CASE WHEN i_category_id = 2 THEN 1 ELSE 0 END) AS [Music],
SUM( CASE WHEN i_category_id = 3 THEN 1 ELSE 0 END) AS [Books],
SUM( CASE WHEN i_category_id = 4 THEN 1 ELSE 0 END) AS [Clothing & Accessories],
SUM( CASE WHEN i_category_id = 5 THEN 1 ELSE 0 END) AS [Electronics],
SUM( CASE WHEN i_category_id = 6 THEN 1 ELSE 0 END) AS [Tools & Home Improvement],
SUM( CASE WHEN i_category_id = 7 THEN 1 ELSE 0 END) AS [Toys & Games],
SUM( CASE WHEN i_category_id = 8 THEN 1 ELSE 0 END) AS [Movies & TV],
SUM( CASE WHEN i_category_id = 9 THEN 1 ELSE 0 END) AS [Sports & Outdoors]
FROM [dbo].[web_clickstreams_hdfs]
INNER JOIN item it ON (wcs_item_sk = i_item_sk
AND wcs_user_sk IS NOT NULL)
GROUP BY wcs_user_sk;
GO
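If you want to keep the aggregated results in the master instance for later analysis, one option is SELECT ... INTO (a sketch; the table name clickstream_clicks_by_user is illustrative and not part of the sample database):

-- Persist a simplified per-user aggregate into a local table (illustrative name).
SELECT
    wcs_user_sk,
    SUM(CASE WHEN i_category_id = 3 THEN 1 ELSE 0 END) AS [Books]
INTO dbo.clickstream_clicks_by_user
FROM [dbo].[web_clickstreams_hdfs]
INNER JOIN item it ON (wcs_item_sk = i_item_sk
                       AND wcs_user_sk IS NOT NULL)
GROUP BY wcs_user_sk;
GO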
Clean up
Use the following command to remove the external table used in this tutorial.
DROP EXTERNAL TABLE [dbo].[web_clickstreams_hdfs];
GO
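If no other external tables in the database depend on them, you can also drop the file format and data source created earlier (a sketch; skip this if you plan to reuse them in other data virtualization tutorials):

-- Optionally remove the file format and data source when nothing else references them.
DROP EXTERNAL FILE FORMAT csv_file;
IF EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
BEGIN
    DROP EXTERNAL DATA SOURCE SqlStoragePool;
END
GO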
Next steps
Advance to the next article to learn how to query Oracle from a big data cluster.