Troubleshooting Hash Indexes for Memory-Optimized Tables
Applies to: SQL Server Azure SQL Database Azure SQL Managed Instance
Prerequisite
Important context information for understanding this article is available at:
Practical numbers
When creating a hash index for a memory-optimized table, the number of buckets needs to be specified at create time. In most cases the bucket count would ideally be between 1 and 2 times the number of distinct values in the index key.
However, even if the BUCKET_COUNT is moderately below or above the preferred range, the performance of your hash index is likely to be tolerable or acceptable.
At minimum, consider giving your hash index a BUCKET_COUNT roughly equal to the number of rows you predict your memory-optimized table will grow to have.
Suppose your growing table has 2,000,000 rows, but the prediction is it'll grow 10 times to 20,000,000 rows. Start with a bucket count that is 10 times the number of rows in the table. This gives you room for an increased quantity of rows.
- Ideally you would increase the bucket count when the quantity of rows reaches the initial bucket count.
- Even if the quantity of rows grows to five times larger than the bucket count, the performance is still good in most situations.
Suppose a hash index has 10,000,000 distinct key values.
- A bucket count of 2,000,000 would be about as low as you could accept. The degree of performance degradation could be tolerable.
Too many duplicate values in the index?
If the hash indexed values have a high rate of duplicates, the hash buckets suffer longer chains.
Assume you have the same SupportEvent table from the earlier T-SQL syntax code block. The following T-SQL code demonstrates how you can find and display the ratio of all values to unique values:
-- Calculate ratio of: Rows / Unique_Values.
DECLARE @allValues float(8) = 0.0, @uniqueVals float(8) = 0.0;
SELECT @allValues = Count(*) FROM SupportEvent;
SELECT @uniqueVals = Count(*) FROM
(SELECT DISTINCT SupportEngineerName
FROM SupportEvent) as d;
-- If (All / Unique) >= 10.0, use a nonclustered index, not a hash.
SELECT Cast((@allValues / @uniqueVals) as float) as [All_divby_Unique];
go
- A ratio of 10.0 or higher means a hash would be a poor type of index. Consider using a nonclustered index instead,
Troubleshooting hash index bucket count
This section discusses how to troubleshoot the bucket count for your hash index.
Monitor statistics for chains and empty buckets
You can monitor the statistical health of your hash indexes by running the following T-SQL SELECT. The SELECT uses the data management view (DMV) named sys.dm_db_xtp_hash_index_stats.
SELECT
QUOTENAME(SCHEMA_NAME(t.schema_id)) + N'.' + QUOTENAME(OBJECT_NAME(h.object_id)) as [table],
i.name as [index],
h.total_bucket_count,
h.empty_bucket_count,
FLOOR((
CAST(h.empty_bucket_count as float) /
h.total_bucket_count) * 100)
as [empty_bucket_percent],
h.avg_chain_length,
h.max_chain_length
FROM
sys.dm_db_xtp_hash_index_stats as h
JOIN sys.indexes as i
ON h.object_id = i.object_id
AND h.index_id = i.index_id
JOIN sys.memory_optimized_tables_internal_attributes ia ON h.xtp_object_id=ia.xtp_object_id
JOIN sys.tables t on h.object_id=t.object_id
WHERE ia.type=1
ORDER BY [table], [index];
Compare the SELECT results to the following statistical guidelines:
- Empty buckets:
- 33% is a good target value, but a larger percentage (even 90%) is usually fine.
- When the bucket count equals the number of distinct key values, approximately 33% of the buckets are empty.
- A value below 10% is too low.
- Chains within buckets:
- An average chain length of 1 is ideal in case there are no duplicate index key values. Chain lengths up to 10 are usually acceptable.
- If the average chain length is greater than 10, and the empty bucket percent is greater than 10%, the data has so many duplicates that a hash index might not be the most appropriate type.
Demonstration of chains and empty buckets
The following T-SQL code block gives you an easy way to test a SELECT * FROM sys.dm_db_xtp_hash_index_stats;
. The code block completes in 1 minute. Here are the phases of the following code block:
- Creates a memory-optimized table that has a few hash indexes.
- Populates the table with thousands of rows.
a. A modulo operator is used to configure the rate of duplicate values in the StatusCode column.
b. The loop inserts 262,144 rows in approximately 1 minute. - PRINTs a message asking you to run the earlier SELECT from sys.dm_db_xtp_hash_index_stats.
DROP TABLE IF EXISTS SalesOrder_Mem;
go
CREATE TABLE SalesOrder_Mem
(
SalesOrderId uniqueidentifier NOT NULL DEFAULT newid(),
OrderSequence int NOT NULL,
OrderDate datetime2(3) NOT NULL,
StatusCode tinyint NOT NULL,
PRIMARY KEY NONCLUSTERED
HASH (SalesOrderId) WITH (BUCKET_COUNT = 262144),
INDEX ix_OrderSequence
HASH (OrderSequence) WITH (BUCKET_COUNT = 20000),
INDEX ix_StatusCode
HASH (StatusCode) WITH (BUCKET_COUNT = 8),
INDEX ix_OrderDate NONCLUSTERED (OrderDate DESC)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
go
--------------------
SET NOCOUNT ON;
-- Same as PK bucket_count. 68 seconds to complete.
DECLARE @i int = 262144;
BEGIN TRANSACTION;
WHILE @i > 0
BEGIN
INSERT SalesOrder_Mem
(OrderSequence, OrderDate, StatusCode)
Values
(@i, GetUtcDate(), @i % 8); -- Modulo technique.
SET @i -= 1;
END
COMMIT TRANSACTION;
PRINT 'Next, you should query: sys.dm_db_xtp_hash_index_stats .';
go
The preceding INSERT
loop does the following:
- Inserts unique values for the primary key index, and for ix_OrderSequence.
- Inserts a couple hundred thousands rows which represent only eight distinct values for
StatusCode
. Therefore there is a high rate of value duplication in index ix_StatusCode.
For troubleshooting when the bucket count is not optimal, examine the following output of the SELECT from sys.dm_db_xtp_hash_index_stats. For these results we added WHERE Object_Name(h.object_id) = 'SalesOrder_Mem'
to the SELECT copied from section D.1.
Our SELECT
results are displayed after the code, artificially split into two narrower results tables for better display.
- Here are the results for bucket count.
IndexName | total_bucket_count | empty_bucket_count | EmptyBucketPercent |
---|---|---|---|
ix_OrderSequence | 32768 | 13 | 0 |
ix_StatusCode | 8 | 4 | 50 |
PK_SalesOrd_B14003... | 262144 | 96525 | 36 |
- Next are the results for chain length.
IndexName | avg_chain_length | max_chain_length |
---|---|---|
ix_OrderSequence | 8 | 26 |
ix_StatusCode | 65536 | 65536 |
PK_SalesOrd_B14003... | 1 | 8 |
Let us interpret the preceding results tables for the three hash indexes:
ix_StatusCode:
- 50% of the buckets are empty, which is good.
- However, the average chain length is very high at 65536.
- This indicates a high rate of duplicate values.
- Therefore, using a hash index is not appropriate in this case. A nonclustered index should be used instead.
ix_OrderSequence:
- 0% of the buckets are empty, which is too low.
- The average chain length is 8, even though all values in this index are unique.
- Therefore the bucket count should be increased, to reduce the average chain length closer to 2 or 3.
- Because the index key has 262144 unique values, the bucket count should be at least 262144.
- If future growth is expected, the bucket count should be higher.
Primary key index (PK_SalesOrd_...):
- 36% of the buckets are empty, which is good.
- The average chain length is 1, which is also good. No change is needed.
Balancing the trade-off
OLTP workloads focus on individual rows. Full table scans are not usually in the performance critical path for OLTP workloads. Therefore, the trade-off you must balance is between quantity of memory utilization versus performance of equality tests and insert operations.
If memory utilization is the bigger concern:
- Choose a bucket count close to the number of unique index key values.
- The bucket count should not be significantly lower than the number of unique index key values, as this impacts most DML operations as well the time it takes to recover the database after server restart.
If performance of equality tests is the bigger concern:
- A higher bucket count, of two or three times the number of unique index values, is appropriate. A higher count means:
- Faster retrievals when looking for one specific value.
- An increased memory utilization.
- An increase in the time required for a full scan of the hash index.
Additional reading
Hash Indexes for Memory-Optimized Tables
Nonclustered Indexes for Memory-Optimized Tables