評估和更正專用 SQL 集區中的叢集數據行存放區索引健康情況
適用於:Azure Synapse Analytics
本文介紹評估叢集數據行存放區索引 (CCI) 健康情況的稍微不同方法。 請遵循下列各節中的步驟,或透過 Azure Data Studio 在筆記本中執行步驟。
注意事項
嘗試開啟此筆記本之前,請確定已在本機計算機上安裝 Azure Data Studio。 若要安裝它,請移 至瞭解如何安裝 Azure Data Studio。
一般而言,有兩個主要因素會影響CCI的品質:
精簡數據列群組和元數據 - 實際的數據列群組計數接近數據列群組中數據列數目的理想計數。
壓縮的數據列群組 - 資料列群組使用數據行存放區壓縮。
其他條件,例如小型數據表、過度分割的數據表或分割區不足的數據表,可能就是品質不佳或健康情況不佳。 不過,這些條件會更妥善地分類為可在 步驟 4 中評估的設計改進機會。
步驟 1:分析 CCI 健康情況的摘要
使用下列查詢來取得單一計量數據列。
WITH cci_detail AS (
SELECT t.object_id,
rg.partition_number,
COUNT(*) AS total_rowgroup_count,
SUM(CASE WHEN rg.state = 1 THEN 1 END) AS open_rowgroup_count,
CEILING((SUM(rg.[total_rows]) - SUM(rg.deleted_rows))/COUNT(DISTINCT rg.distribution_id)/1048576.) * COUNT(DISTINCT rg.distribution_id) AS [ideal_rowgroup_count],
SUM(rg.size_in_bytes/1024/1024.) AS size_in_mb,
SUM(CASE WHEN rg.state = 1 THEN rg.size_in_bytes END /1024/1024.) AS open_size_in_mb
FROM sys.pdw_nodes_column_store_row_groups rg
JOIN sys.pdw_nodes_tables nt ON rg.object_id = nt.object_id
AND rg.pdw_node_id = nt.pdw_node_id
AND rg.distribution_id = nt.distribution_id
JOIN sys.pdw_table_mappings mp ON nt.name = mp.physical_name
JOIN sys.tables t ON mp.object_id = t.object_id
GROUP BY t.object_id,
rg.partition_number
)
SELECT COUNT(DISTINCT object_id) AS tables_assessed_count,
COUNT(*) AS partitions_assessed_count,
SUM(total_rowgroup_count) AS actual_rowgroup_count,
SUM(ideal_rowgroup_count) AS ideal_rowgroup_count,
SUM(open_rowgroup_count) AS uncompressed_rowgroup_count,
CAST(SUM(size_in_mb) AS DECIMAL(19, 4)) AS actual_size_in_mb,
CAST(SUM(open_size_in_mb) AS DECIMAL(19, 4)) AS uncompressed_size_in_mb,
CAST(((SUM(total_rowgroup_count) - SUM(ideal_rowgroup_count)) / SUM(total_rowgroup_count)) * 100. AS DECIMAL(9, 4)) AS excess_pct,
CAST(((SUM(total_rowgroup_count) - SUM(ideal_rowgroup_count)) / SUM(total_rowgroup_count)) * 1. AS DECIMAL(9, 4)) * SUM(size_in_mb) AS excess_size_in_mb
FROM cci_detail
從結果中,您可以取得專用 SQL 集區的 CCI 健康情況概觀。 這項資訊無法直接採取動作,但可協助您了解維護例程的重要性,以達到理想的狀態。
欄名稱 | 描述 |
---|---|
tables_assessed_count |
CCI 數據表計數 |
partitions_assessed_count |
數據分割計數 注意: 非數據分割數據表會計算為 1。 |
actual_rowgroup_count |
數據列群組的實體計數 |
ideal_rowgroup_count |
適用於數據列數目的計算數據列群組數目 |
uncompressed_rowgroup_count |
包含未壓縮數據的數據列群組數目。 (也稱為:OPEN 數據列) |
actual_size_in_mb |
以 MB 為單位的 CCI 數據實體大小 |
uncompressed_size_in_mb |
以 MB 為單位之未壓縮數據的實體大小 |
excess_pct |
可進一步優化的數據列群組百分比 |
excess_size_in_mb |
來自未優化數據列群組的估計 MB |
步驟 2:分析詳細的 CCI 資訊
下列查詢提供詳細的報表,說明哪些數據表分割區是重建的候選專案。 CCI 詳細數據提供於三個計量中,可協助識別並排定最適合維護的數據表/數據分割的優先順序。 在 子句中 WHERE
設定這些計量的適當臨界值,然後在 子句中 ORDER BY
,使用您最感興趣的計量。 詳細資訊也有助於判斷您的專用 SQL 集區是否受到大量小型分散數據表的影響,這可能會導致 編譯延遲。
注意事項
批註 fnMs_GenerateIndexMaintenanceScript
函式是TVF) (資料表值函式,可產生維護索引的通用腳本。 如果您想要取得結果中的維護腳本,請取消批註第 37 行和第 39 行。 在執行查詢之前,請先使用產生 索引維護腳本 一節中的腳本來建立函式。 執行您從結果取得的維護腳本時,請務必使用適當大小的 資源類別,例如 largerc 或 xlargerc。
資料行名稱 | 品質特性 | 描述 |
---|---|---|
excess_pct |
壓 實度 | 可進一步壓縮的數據列群組百分比 |
excess_size_in_mb |
壓 實度 | 來自未優化數據列群組的估計 MB |
OPEN_rowgroup_size_in_mb |
壓縮 | 索引中未壓縮數據的實際 MB |
WITH cci_info AS(
SELECT t.object_id AS [object_id],
MAX(schema_name(t.schema_id)) AS [schema_name],
MAX(t.name) AS [table_name],
rg.partition_number AS [partition_number],
COUNT(DISTINCT rg.distribution_id) AS [distribution_count],
SUM(rg.size_in_bytes/1024/1024) AS [size_in_mb],
SUM(rg.[total_rows]) AS [row_count_total],
COUNT(*) AS [total_rowgroup_count],
CEILING((SUM(rg.[total_rows]) - SUM(rg.[deleted_rows]))/COUNT(DISTINCT rg.distribution_id)/1048576.) * COUNT(DISTINCT rg.distribution_id) AS [ideal_rowgroup_count],
SUM(CASE WHEN rg.[State] = 1 THEN 1 ELSE 0 END) AS [OPEN_rowgroup_count],
SUM(CASE WHEN rg.[State] = 1 THEN rg.[total_rows] ELSE 0 END) AS [OPEN_rowgroup_rows],
CAST(SUM(CASE WHEN rg.[State] = 1 THEN rg.[size_in_bytes]/1024./1024. ELSE 0 END) AS DECIMAL(19, 4)) AS [OPEN_rowgroup_size_in_mb],
SUM(CASE WHEN rg.[State] = 2 THEN 1 ELSE 0 END) AS [CLOSED_rowgroup_count],
SUM(CASE WHEN rg.[State] = 2 THEN rg.[total_rows] ELSE 0 END) AS [CLOSED_rowgroup_rows],
CAST(SUM(CASE WHEN rg.[State] = 2 THEN rg.[size_in_bytes]/1024./1024. ELSE 0 END) AS DECIMAL(19, 4)) AS [CLOSED_size_in_mb],
SUM(CASE WHEN rg.[State] = 3 THEN 1 ELSE 0 END) AS [COMPRESSED_rowgroup_count],
SUM(CASE WHEN rg.[State] = 3 THEN rg.[total_rows] ELSE 0 END) AS [COMPRESSED_rowgroup_rows],
CAST(SUM(CASE WHEN rg.[State] = 3 THEN rg.[size_in_bytes]/1024./1024. ELSE 0 END) AS DECIMAL(19, 4)) AS [COMPRESSED_size_in_mb],
SUM(CASE WHEN rg.[State] = 3 THEN rg.[deleted_rows] ELSE 0 END) AS [COMPRESSED_rowgroup_rows_DELETED]
FROM sys.[pdw_nodes_column_store_row_groups] rg
JOIN sys.[pdw_nodes_tables] nt ON rg.[object_id] = nt.[object_id]
AND rg.[pdw_node_id] = nt.[pdw_node_id]
AND rg.[distribution_id] = nt.[distribution_id]
JOIN sys.[pdw_table_mappings] mp ON nt.[name] = mp.[physical_name]
JOIN sys.[tables] t ON mp.[object_id] = t.[object_id]
GROUP BY t.object_id,
rg.partition_number
)
, calc_excess AS(
SELECT *,
CAST(((total_rowgroup_count - ideal_rowgroup_count) / total_rowgroup_count) * 100. AS DECIMAL(9, 4)) AS [excess_pct],
CAST(((total_rowgroup_count - ideal_rowgroup_count) / total_rowgroup_count) * 1. AS DECIMAL(9, 4)) * size_in_mb AS [excess_size_in_mb]
FROM cci_info
)
SELECT calc_excess.*
-- , script.*
FROM calc_excess
-- CROSS APPLY dbo.fnMs_GenerateIndexMaintenanceScript(object_id, partition_number) AS script
WHERE -- set your own threshold(s) for the following; 0 is the ideal, but usually not practical
calc_excess.[excess_size_in_mb] > 300
OR calc_excess.excess_pct > 0.1
OR calc_excess.OPEN_rowgroup_size_in_mb > 100
ORDER BY calc_excess.[excess_size_in_mb] DESC;
步驟 3:維護無法改善 CCI 健康情況時該怎麼辦
在資料表/分割區上執行維護可能會導致下列其中一個案例:
excess_pct
或excess_size_in_mb
大於維護之前的大小。- 維護語句失敗,記憶體不足。
一般原因
- 資源不足。
- DWU) (服務等級不足。
- 數據表很大且未分割。
建議的緩和措施
- 變更執行使用者的資源類別或 工作負載群組,以增加維護語句的資源。
- 暫時增加 DWU 層級以執行維護。
- 針對有問題的數據表實作數據分割策略,然後對數據分割執行維護。
步驟 4:檢查設計改進機會
雖然不完整,但下列查詢可協助您找出常見導致CCI效能或維護問題的潛在機會。
商機標題 | 描述 | 建議 |
---|---|---|
小型數據表 | 數據表包含少於 1500 萬個數據列 | 請考慮將索引從 CCI 變更為:
|
數據分割商機或分割不足的數據表 | 計算出的理想數據列群組計數大於 18000 萬 (或 ~18800 萬個數據列) | 實作數據分割策略或變更現有的數據分割策略,將每個分割區的數據列數目減少為小於 188M (每個散發區每個分割區大約三個數據列群組) |
過度分割的數據表 | 數據表包含的最大分割區少於1500萬個數據列 | 考量:
|
WITH cci_info AS (
SELECT t.object_id AS [object_id],
MAX(SCHEMA_NAME(t.schema_id)) AS [schema_name],
MAX(t.name) AS [table_name],
rg.partition_number AS [partition_number],
SUM(rg.[total_rows]) AS [row_count_total],
CEILING((SUM(rg.[total_rows]) - SUM(rg.[deleted_rows]))/COUNT(DISTINCT rg.distribution_id)/1048576.) * COUNT(DISTINCT rg.distribution_id) AS [ideal_rowgroup_count]
FROM sys.[pdw_nodes_column_store_row_groups] rg
JOIN sys.[pdw_nodes_tables] nt ON rg.[object_id] = nt.[object_id]
AND rg.[pdw_node_id] = nt.[pdw_node_id]
AND rg.[distribution_id] = nt.[distribution_id]
JOIN sys.[pdw_table_mappings] mp ON nt.[name] = mp.[physical_name]
JOIN sys.[tables] t ON mp.[object_id] = t.[object_id]
GROUP BY t.object_id,
rg.partition_number
)
SELECT object_id,
MAX(SCHEMA_NAME),
MAX(TABLE_NAME),
COUNT(*) AS number_of_partitions,
MAX(row_count_total) AS max_partition_row_count,
MAX(ideal_rowgroup_count) partition_ideal_row_count,
CASE
-- non-partitioned tables
WHEN COUNT(*) = 1 AND MAX(row_count_total) < 15000000 THEN 'Small table'
WHEN COUNT(*) = 1 AND MAX(ideal_rowgroup_count) > 180 THEN 'Partitioning opportunity'
-- partitioned tables
WHEN COUNT(*) > 1 AND MAX(row_count_total) < 15000000 THEN 'Over-partitioned table'
WHEN COUNT(*) > 1 AND MAX(ideal_rowgroup_count) > 180 THEN 'Under-partitioned table'
END AS warning_category
FROM cci_info
GROUP BY object_id
產生索引維護腳本
執行下列查詢,以在專用 SQL 集區上建立 dbo.fnMs_GenerateIndexMaintenanceScript
函式。 此函式會產生腳本,以三種方式優化您的CCI。 您不僅可以使用此函式來維護CCI,也可以將叢集 (資料列存放區) 索引。
參數
參數名稱 | 必要 | 描述 |
---|---|---|
@object_id |
Y | object_id 要設為目標的數據表 |
@partition_number |
Y | partition_number 從 sys.partitions 到目標。 如果數據表未分割,請指定 1。 |
輸出數據表
欄名稱 | 描述 |
---|---|
rebuild_script |
為指定的數據表/數據分割產生 ALTER INDEX ALL ... REBUILD 語句。 非資料分割堆積會傳回 NULL 。 |
reorganize_script |
為指定的數據表/數據分割產生 ALTER INDEX ALL ... REORGANIZE 語句。 非資料分割堆積會傳回 NULL 。 |
partition_switch_script |
僅適用於數據分割數據表; NULL 如果資料表未分割,或指定了無效的資料分割編號, 則會是 。 如果使用 子句建立 ORDER CCI,則會轉譯它。 |
CREATE FUNCTION dbo.fnMs_GenerateIndexMaintenanceScript (@object_id INT, @partition_number INT = 1)
RETURNS TABLE
AS
RETURN(
WITH base_info AS (
SELECT
t.object_id
, SCHEMA_NAME(t.schema_id) AS [schema_name]
, t.name AS table_name
, i.index_type
, i.index_cols
, i.index_type_desc
, tdp.distribution_policy_desc
, c.name hash_distribution_column_name
FROM sys.tables t
JOIN (
SELECT
i.object_id
, i.index_id
, MAX(i.type) AS index_type
, MAX(CASE WHEN i.type = 5 AND ic.column_store_order_ordinal != 0 THEN ' ORDER ' ELSE '' END)
+ '(' + STRING_AGG(
CASE
WHEN i.type IN (1, 5)
AND (ic.key_ordinal != 0 OR ic.column_store_order_ordinal != 0)
THEN c.name + CASE WHEN ic.is_descending_key = 1 THEN ' DESC' ELSE '' END
END
, ',') WITHIN GROUP(ORDER BY ic.column_store_order_ordinal, ic.key_ordinal) + ')' AS index_cols
, MAX(i.type_desc)
+ CASE
WHEN MAX(i.type) IN (1, 5) THEN ' INDEX'
ELSE ''
END COLLATE SQL_Latin1_General_CP1_CI_AS AS index_type_desc
FROM sys.indexes i
JOIN sys.index_columns ic ON i.object_id = ic.object_id AND i.index_id = ic.index_id
JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
WHERE i.index_id <= 1
GROUP BY i.object_id, i.index_id
) AS i
ON t.object_id = i.object_id
JOIN sys.pdw_table_distribution_properties tdp ON t.object_id = tdp.object_id
LEFT JOIN sys.pdw_column_distribution_properties cdp ON t.object_id = cdp.object_id AND cdp.distribution_ordinal = 1
LEFT JOIN sys.columns c ON cdp.object_id = c.object_id AND cdp.column_id = c.column_id
WHERE t.object_id = @object_id
)
, param_data_type AS (
SELECT
pp.function_id
, typ.name AS data_type_name
, CAST(CASE
WHEN typ.collation_name IS NOT NULL THEN 1
WHEN typ.name LIKE '%date%' THEN 1
WHEN typ.name = 'uniqueidentifier' THEN 1
ELSE 0
END AS BIT) AS use_quotes_on_values_flag
FROM sys.partition_parameters pp
JOIN sys.types typ ON pp.user_type_id = typ.user_type_id
)
, boundary AS (
SELECT
t.object_id
, c.name AS partition_column_name
, pf.boundary_value_on_right
, prv.boundary_id
, prv.boundary_id + CASE WHEN pf.boundary_value_on_right = 1 THEN 1 ELSE 0 END AS [partition_number]
, CASE
WHEN pdt.use_quotes_on_values_flag = 1 THEN '''' + CAST(
CASE pdt.data_type_name
WHEN 'date' THEN CONVERT(char(10), prv.value, 120)
WHEN 'smalldatetime' THEN CONVERT(VARCHAR, prv.value, 120)
WHEN 'datetime' THEN CONVERT(VARCHAR, prv.value, 121)
WHEN 'datetime2' THEN CONVERT(VARCHAR, prv.value, 121)
ELSE prv.value
END
AS VARCHAR(32)) + ''''
ELSE CAST(prv.value AS VARCHAR(32))
END AS boundary_value
FROM sys.tables t
JOIN sys.indexes i ON t.object_id = i.object_id AND i.index_id <= 1
JOIN sys.index_columns ic ON i.object_id = ic.object_id AND i.index_id = ic.index_id AND ic.partition_ordinal = 1
JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
JOIN sys.partition_schemes ps ON i.data_space_id = ps.data_space_id
JOIN sys.partition_functions pf ON ps.function_id = pf.function_id
JOIN param_data_type pdt ON pf.function_id = pdt.function_id
JOIN sys.partition_range_values prv ON pf.function_id = prv.function_id
WHERE t.object_id = @object_id
)
, partition_clause AS (
SELECT
object_id
, COUNT(*) - 1 -- should always be the 2nd to last partition in stage table
+ CASE WHEN MAX([partition_number]) = @partition_number THEN 1 ELSE 0 END -- except when last partition
AS [source_partition_number]
, 'WHERE ' + MAX(partition_column_name)
+ CASE WHEN MAX(CAST(boundary_value_on_right AS TINYINT)) = 1 THEN
' >= ' + MIN(CASE WHEN [partition_number] = @partition_number THEN boundary_value END)
ELSE
' <= ' + MAX(CASE WHEN [partition_number] = @partition_number THEN boundary_value END)
END
+ ' AND ' + MAX(partition_column_name)
+ CASE WHEN MAX(CAST(boundary_value_on_right AS TINYINT)) = 1 THEN
' < ' + MAX(boundary_value)
ELSE
' > ' + MIN(boundary_value)
END AS filter_clause
, ', PARTITION (' + MAX(partition_column_name) + ' RANGE '
+ CASE WHEN MAX(CAST(boundary_value_on_right AS TINYINT)) = 1 THEN 'RIGHT' ELSE 'LEFT' END
+ ' FOR VALUES(' + STRING_AGG(boundary_value, ',') + '))' AS [partition_clause]
FROM boundary
WHERE [partition_number] BETWEEN @partition_number - 1 AND @partition_number + 1
GROUP BY object_id
)
SELECT
CASE WHEN index_type IN (1, 5) THEN 'ALTER INDEX ALL ON [' + [schema_name] + '].[' + [table_name] + '] REBUILD'
+ CASE WHEN partition_clause.[object_id] IS NOT NULL THEN ' PARTITION = ' + CAST(@partition_number AS VARCHAR(16)) ELSE '' END + ';' END AS [rebuild_script]
, CASE WHEN index_type IN (1, 5) THEN 'ALTER INDEX ALL ON [' + [schema_name] + '].[' + [table_name] + '] REORGANIZE'
+ CASE WHEN partition_clause.[object_id] IS NOT NULL THEN ' PARTITION = ' + CAST(@partition_number AS VARCHAR(16)) ELSE '' END
+ CASE WHEN index_type = 5 THEN ' WITH (COMPRESS_ALL_ROW_GROUPS = ON)' ELSE '' END + ';' END AS [reorganize_script]
, 'CREATE TABLE [' + schema_name + '].[' + table_name + '_p' + CAST(@partition_number AS VARCHAR(16)) + '_tmp] WITH(' + index_type_desc + ISNULL(index_cols, '')
+ ', DISTRIBUTION = ' + distribution_policy_desc + CASE WHEN distribution_policy_desc = 'HASH' THEN '(' + hash_distribution_column_name + ')' ELSE '' END
+ partition_clause.partition_clause + ') AS SELECT * FROM [' + [schema_name] + '].[' + [table_name] + '] ' + filter_clause + CASE WHEN index_type = 5 AND index_cols IS NOT NULL THEN ' OPTION(MAXDOP 1)' ELSE '' END + ';'
+ ' ALTER TABLE [' + schema_name + '].[' + table_name + '_p' + CAST(@partition_number AS VARCHAR(16)) + '_tmp] SWITCH PARTITION ' + CAST(source_partition_number AS VARCHAR(16))
+ ' TO [' + [schema_name] + '].[' + [table_name] + '] PARTITION ' + CAST(@partition_number AS VARCHAR(16))
+ ' WITH (TRUNCATE_TARGET = ON);'
+ ' DROP TABLE [' + schema_name + '].[' + table_name + '_p' + CAST(@partition_number AS VARCHAR(16)) + '_tmp];' AS [partition_switch_script]
FROM base_info
LEFT JOIN partition_clause
ON base_info.object_id = partition_clause.object_id
);
GO
其他相關資訊
若要深入瞭解並取得專用 SQL 集區上 CCI 的額外評定工具,請參閱:
意見反映
https://aka.ms/ContentUserFeedback。
即將推出:我們會在 2024 年淘汰 GitHub 問題,並以全新的意見反應系統取代並作為內容意見反應的渠道。 如需更多資訊,請參閱:提交及檢視以下的意見反映: