Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Applies to: ✅ SQL analytics endpoint and Warehouse in Microsoft Fabric
Data clustering in Fabric Data Warehouse organizes data for faster query performance and reduced compute usage. This tutorial walks through the steps to create tables with data clustering, from creating clustered tables to checking their effectiveness.
Prerequisites
- A Microsoft Fabric tenant account with an active subscription.
- Make sure you have a Microsoft Fabric enabled Workspace: Create a workspace.
- Make sure you have already created a Warehouse. To create a new Warehouse, refer to Create a Warehouse in Microsoft Fabric.
- Basic understanding of T-SQL and querying data.
Import sample data
This tutorial uses the NY Taxi sample data set. To import the NY Taxi data into your Warehouse. Use the Load Sample data to Data Warehouse tutorial tutorial.
Create a table with data clustering
For this tutorial, we need two copies of the NYTaxi table: the regular copy of the table as imported from the tutorial, and a copy that uses data clustering. Use the following command to create a new table using CREATE TABLE AS SELECT (CTAS), based on the original NYTaxi table:
CREATE TABLE nyctlc_With_DataClustering
WITH (CLUSTER BY (lpepPickupDatetime))
AS SELECT * FROM nyctlc
Note
The example assumes the table name given to the NY Taxi dataset in the Load Sample data to Data Warehouse tutorial. If you used a different name for your table, adjust the command to replace nyctlc with your table name.
This command creates an exact copy of the original NYTaxi table, but with data clustering on the lpepPickupDatetime column. Next, we use this column for querying.
Query data
Run a query on the NYTaxi table, and repeat the exact same query on the NYTaxi_With_DataClustering table for comparison.
Note
For this analysis, it's beneficial to look at the cold cache performance of both runs – that is, without using the caching features of Fabric Data Warehouse. Therefore, run each query exactly once before you look at the results in Query Insights.
We use a query that is often repeated in the Warehouse. This query computes the average fare amount by year between the dates 2008-12-31 and 2014-06-30:
SELECT
YEAR(lpepPickupDatetime),
AVG(fareAmount) as [Average Fare]
FROM
NYTaxi
WHERE
lpepPickupDatetime BETWEEN '2008-12-31' AND '2014-06-30'
GROUP BY
YEAR(lpepPickupDatetime)
ORDER BY
YEAR(lpepPickupDatetime) DESC
OPTION (LABEL = 'Regular');
Note
The label option used in this query is useful when we compare the query details of the Regular table against the one that uses data clustering later using Query Insights views.
Next, we repeat the exact same query, but on the version of the table that uses data clustering:
SELECT
YEAR(lpepPickupDatetime),
AVG(fareAmount) as [Average Fare]
FROM
NYTaxi_With_DataClustering
WHERE
lpepPickupDatetime BETWEEN '2008-12-31' AND '2014-06-30'
GROUP BY
YEAR(lpepPickupDatetime)
ORDER BY
YEAR(lpepPickupDatetime) DESC
OPTION (LABEL = 'Clustered');
The second query uses the label Clustered to allow us to identify this query later with Query Insights.
Check the effectiveness of data clustering
After setting up clustering, you can assess its effectiveness using Query Insights. Query Insights in Fabric Data Warehouse captures historical query execution data and aggregates it into actionable insights, such as identifying long-running or frequently executed queries.
In this case, we use Query Insights to compare difference in data scanned between the regular and the clustered cases.
Use the following query:
SELECT
label,
submit_time,
row_count,
total_elapsed_time_ms,
allocated_cpu_time_ms,
result_cache_hit,
data_scanned_disk_mb,
data_scanned_memory_mb,
data_scanned_remote_storage_mb,
command
FROM
queryinsights.exec_requests_history
WHERE
command LIKE '%NYTaxi%'
AND label IN ('Regular','Clustered')
ORDER BY
submit_time DESC;
This query fetches details from the exec_requests_history view. For more information, see queryinsights.exec_requests_history (Transact-SQL).
The query filters the results the following ways:
- Fetches only rows that contain the
NYTaxitext in the command name (as was used in the test queries) - Fetches only rows where the label value was either regular or clustered
Note
It might take a few minutes for your query details to become available in Query Insights. If your Query Insights query returns no results, try again after a few minutes.
Running this query, we observe the following results:
Both queries have a row count of 6 and similar submit times. The Clustered query shows total_elapsed_time_ms of 1794, allocated_cpu_time_ms of 1676, and data_scanned_remote_storage_mb of 77.519. The Regular query shows total_elapsed_time_ms of 2651, allocated_cpu_time_ms of 2600, and data_scanned_remote_storage_mb of 177.700. These numbers demonstrate that even though both queries returned the same results, the Clustered version used approximately 36% less CPU time than the Regular version and scanned approximately 56% less data on disk. No cache was used in either query run. These are significant results to help reduce query execution time and consumption usage, and make the lpepPickupDatetime column a strong candidate for data clustering.
Note
This is a small table, with approximately 76 million rows and 2GB of data volume. Even though this query returns only six rows on its aggregation (one for each year in the range), it scans approximately 8.3 million rows in the date range provided before results are aggregated. Actual production data with larger data volumes can provide more significant results. Your results might vary based on the capacity size, cached results, or concurrency during the queries.