Troubleshoot accelerated database recovery
Applies to: SQL Server 2019 (15.x) Azure SQL Database Azure SQL Managed Instance
This article helps administrators diagnose issues with accelerated database recovery (ADR) in SQL Server 2019 (15.x) and later, Azure SQL Managed Instance, and Azure SQL Database.
Examine the persistent version store (PVS)
Leverage the sys.dm_tran_persistent_version_store_stats DMV to identify if the size of the accelerated database recovery (ADR) PVS is growing larger than expected, and then to determine which factor is preventing persistent version store (PVS) cleanup.
Included in the following sample script is the column
sys.dm_tran_persistent_version_store_stats.pvs_off_row_page_skipped_oldest_aborted_xdesid, which was added in SQL Server 2022 (16.x) and contains the number of pages skipped for reclaim due to oldest aborted transactions. If the version cleaner is slow or invalidated, this will reflect how many pages must be kept for aborted transactions.
The sample query shows all information about the cleanup processes and shows the current PVS size, oldest aborted transaction, and other details:
SELECT db_name(pvss.database_id) AS DBName, pvss.persistent_version_store_size_kb / 1024. / 1024 AS persistent_version_store_size_gb, 100 * pvss.persistent_version_store_size_kb / df.total_db_size_kb AS pvs_pct_of_database_size, df.total_db_size_kb/1024./1024 AS total_db_size_gb, pvss.online_index_version_store_size_kb / 1024. / 1024 AS online_index_version_store_size_gb, pvss.current_aborted_transaction_count, pvss.aborted_version_cleaner_start_time, pvss.aborted_version_cleaner_end_time, dt.database_transaction_begin_time AS oldest_transaction_begin_time, asdt.session_id AS active_transaction_session_id, asdt.elapsed_time_seconds AS active_transaction_elapsed_time_seconds, pvss.pvs_off_row_page_skipped_low_water_mark, pvss.pvs_off_row_page_skipped_min_useful_xts, pvss.pvs_off_row_page_skipped_oldest_aborted_xdesid -- SQL Server 2022 only FROM sys.dm_tran_persistent_version_store_stats AS pvss CROSS APPLY (SELECT SUM(size*8.) AS total_db_size_kb FROM sys.database_files WHERE [state] = 0 and [type] = 0 ) AS df LEFT JOIN sys.dm_tran_database_transactions AS dt ON pvss.oldest_active_transaction_id = dt.transaction_id AND pvss.database_id = dt.database_id LEFT JOIN sys.dm_tran_active_snapshot_database_transactions AS asdt ON pvss.min_transaction_timestamp = asdt.transaction_sequence_num OR pvss.online_index_min_transaction_timestamp = asdt.transaction_sequence_num WHERE pvss.database_id = DB_ID();
pvs_pct_of_database_sizesize, note any difference from the typical, compared to baselines during other periods of application activity. PVS is considered large if it's significantly larger than baseline or if it is close to 50% of the size of the database. Use the following steps as a troubleshooting aid for a PVS that is large.
Active, long-running transactions in any database where ADR is enabled can prevent cleanup of the PVS. Retrieve
oldest_active_transaction_idand check whether this transaction has been active for a long time by querying
sys.dm_tran_database_transactionsbased on the transaction ID. Check for long-running, active transactions with a query like the below sample, which declares variables to set thresholds for duration or log amount:
DECLARE @longTxThreshold int = 1800; --number of seconds to use as a duration threshold for long-running transactions DECLARE @longTransactionLogBytes bigint = 2147483648; --number of bytes to use as a log amount threshold for long-running transactions SELECT dbtr.database_id, transess.session_id, transess.transaction_id, atr.name, sess.login_time, dbtr.database_transaction_log_bytes_used, CASE WHEN getdate() >= dateadd(second, @longTxThreshold, tr.transaction_begin_time) then 'DurationThresholdExceeded' WHEN dbtr.database_transaction_log_bytes_used >= @longTransactionLogBytes then 'LogThresholdExceeded' ELSE 'unknown' END AS Reason FROM sys.dm_tran_active_transactions AS tr INNER JOIN sys.dm_tran_session_transactions AS transess on tr.transaction_id = transess.transaction_id INNER JOIN sys.dm_exec_sessions AS sess on transess.session_id = sess.session_id INNER JOIN sys.dm_tran_database_transactions AS dbtr on tr.transaction_id = dbtr.transaction_id INNER JOIN sys.dm_tran_active_transactions AS atr on atr.transaction_id = transess.transaction_id WHERE transess.session_id <> @@spid AND ( getdate() >= dateadd(second, @longTxThreshold, tr.transaction_begin_time) OR dbtr.database_transaction_log_bytes_used >= @longTransactionLogBytes );
With the session(s) identified, consider killing the session, if allowed. Also, review the application to determine the nature of the problematic active transaction(s).
For more information on troubleshooting long-running queries, see:
The persistent version cleanup may be held up due to long active snapshot scan(s). Statements using read-committed snapshot isolation (RCSI) or SNAPSHOT isolation levels receive instance-level timestamps. A snapshot scan uses the timestamp to decide the row visibility for the RCSI or SNAPSHOT transaction in the PVS where accelerated database recovery is enabled. Every statement using RCSI has its own timestamp, whereas SNAPSHOT isolation has a transaction-level timestamp. These instance-level transaction timestamps are used even in single-database transactions, because the transaction may be promoted to a cross-database transaction. Snapshot scans can therefore prevent cleanup of records in the ADR PVS, or when ADR is not present, in the
tempdbversion store. Therefore, due to this version tracking, long running transactions using SNAPSHOT or RCSI can cause ADR PVS to delay cleanup in database in the instance, causing the ADR PVS to grow in size.
In the original troubleshooting query at the top of this article, the
pvs_off_row_page_skipped_min_useful_xtsvalue shows the number of pages skipped for reclaim due to a long snapshot scan. If
pvs_off_row_page_skipped_min_useful_xtsshows a larger value than normal, it means there is a long snapshot scan preventing PVS cleanup.
This sample query can be used to decide which is the problematic session:
SELECT snap.transaction_id, snap.transaction_sequence_num, session.session_id, session.login_time, GETUTCDATE() as [now], session.host_name, session.program_name, session.login_name, session.last_request_start_time FROM sys.dm_tran_active_snapshot_database_transactions AS snap INNER JOIN sys.dm_exec_sessions AS session ON snap.session_id = session.session_id ORDER BY snap.transaction_sequence_num asc;
To prevent delays to PVS cleanup:
- Consider killing the long active transaction session that is delaying PVS cleanup, if possible. Long-running transactions in any database where ADR is enabled may delay ADR PVS cleanup.
- Tune long-running queries to reduce query duration and locks required. For more information and guidance, see Understand and resolve blocking in SQL Server or Understand and resolve Azure SQL Database blocking problems.
- Review the application to determine the nature of the problematic active snapshot scan. Consider a different isolation level, such as READ COMMITTED, instead of SNAPSHOT or READ COMMITTED SNAPSHOT for long-running queries that are delaying ADR PVS cleanup. This problem occurs more frequently with SNAPSHOT isolation level.
- This issue can occur in SQL Server, Azure SQL Managed Instance, and elastic pools of Azure SQL Database, but not in singleton Azure SQL databases. In Azure SQL Database elastic pools, consider moving databases out of the elastic pool that have long-running queries using READ COMMIT SNAPSHOT or SNAPSHOT isolation levels.
When the PVS size is growing due to long running transactions on primary or secondary replicas, investigate the long running queries and address the bottleneck. The
sys.dm_tran_aborted_transactionsDMV shows all aborted transactions. For more information, see sys.dm_tran_aborted_transactions (Transact-SQL). The
nest_abortedcolumn indicates that the transaction was committed, but there are portions that aborted (savepoints or nested transactions) which can block the PVS cleanup process.
If the database is part of an availability group, check the
secondary_low_water_mark. This is the same as the
sys.dm_hadr_database_replica_statesto see whether one of the replicas is holding this value behind, since this will also prevent PVS cleanup. The version cleanup is held up due to read queries on readable secondaries. Both SQL Server on-premises and Azure SQL DB support readable secondaries. In the
pvs_off_row_page_skipped_low_water_markcan also give indications of a secondary replica delay. For more information, see sys.dm_tran_persistent_version_store_stats.
The solution is same as snapshot scan hold up. Go to the secondaries, find the session that is issuing the long query and consider killing the session, if allowed. Note that the secondary hold up not only impacts ADR version cleanup, it can also prevent ghost records clean up.
online_index_min_transaction_timestampif the online PVS is holding up) and based on that check
sys.dm_tran_active_snapshot_database_transactionsfor the column
transaction_sequence_numto find the session that has the old snapshot transaction holding up PVS cleanup.
If none of the above applies, then it means that the cleanup is held by aborted transactions. Check the last time the
aborted_version_cleaner_last_end_timeto see if the aborted transaction cleanup has completed. The
oldest_aborted_transaction_idshould be moving higher after the aborted transaction cleanup completes. If the
oldest_aborted_transaction_idis much less than
current_abort_transaction_counthas a greater value, there is an old aborted transaction preventing PVS cleanup. To address:
- If possible, stop the workload to let version cleaner making progress.
- Optimize the workload to reduce object level lock usage.
- Review the application to see any high transaction abort issue. Deadlock, duplicate key and other constraint violations may trigger high abort rate.
- If on SQL Server, disable ADR as an emergency-only step to control both PVS size and abort transaction number. See Disable the ADR feature.
If the aborted transaction hasn't completed successfully recently, check the error log for messages reporting
Monitor the SQL Server error log for 'PreallocatePVS' entries. If there are 'PreallocatePVS' entries present, then this means you may need to increase the ADR ability to preallocate pages for background tasks as performance can be improved when the ADR background thread preallocates enough pages and the percentage of foreground PVS allocations is close to 0. You can use the
sp_configure 'ADR Preallocation Factor'to increase this amount. For more information, see ADR preallocation factor server configuration option.
Start PVS cleanup process manually
ADR is not recommended for database environments with a high transaction count of update/deletes, such as high-volume OLTP, without a period of rest/recovery for the PVS cleanup process to reclaim space.
To activate the PVS cleanup process manually between workloads or during maintenance windows, use the system stored procedure sys.sp_persistent_version_cleanup.
EXEC sys.sp_persistent_version_cleanup [database_name];
EXEC sys.sp_persistent_version_cleanup [WideWorldImporters];
Capture cleanup failures
Beginning with SQL Server 2022 (16.x), SQL Server records ADR PVS cleanup behavior in the SQL Server error log. Typically this would result in a new log event recorded every 10 minutes.
Submit and view feedback for