How It Works: SQL Server AlwaysOn Lease Timeout
Nov 2013
Based in recent cases I needed to extend and clarify this post.
The LeaseTimeout controls the lease mechanism and when the lease expires there is a very high probability of a system wide event taking place. Let me explain more and why I make this statement.
The SQL Server resource DLL is responsible for the lease heartbeat activity. Every 1/4 of the LeaseTimeout setting the dedicated, lease thread wakes up and attempts to renew the lease. The lease is a simple handshake between the resource DLL and the SQL Server instance supporting the AG on the same node.
The activity is a two(2) way handshake using a pair of named events. Here is the simple interaction.
Resource DLL Lease Thread | Server Lease Worker/Thread |
SetEvent(Client Event) | |
WaitForSingleObject(Client Event) | |
WaitForSingleObject(Server Event) | SetEvent(Server Event) |
Wait for 1/4 the Lease Timeout | Repeat Loop Until Shutdown or Lease Expires |
**Repeat Loops Until Shutdown or Lease Expires
In addition to this handshake the ‘server lease worker/thread’ is a preemptive worker (think dedicated thread outside of SQLOS control – OS scheduling controlled only) running at ABOVE NORMAL PRIORITY.
For a lease to expire the server lease worker is not able to wait on and respond to the event handshake when it is dedicated to the task and running at a priority above most threads on the system. This is why I stated the lease timeout is commonly due to a system wide event.
If the lease timeout is expiring you likely have a system wide, impacting event taking place that needs to be resolved. If you are experiencing lease timeouts troubleshoot the entire computer you are likely encountering a broader issue than just a SQL Server lease timeout.
For example, one recent case revealed that during the lease renewal process we could not even collect performance monitor data nor was a single message logged in the cluster log. In another case the an application on the system caused massive paging to occur, slowing the entire system.
------------------------------------------------------------------------------------------
Sep 2012
The lease is used between the SQL Server resource DLL and the SQL Server instance to prevent split-brain from occurring for the availability group (AG).
The lease is a standard signaling mechanism between the SQL Server resource DLL and the SQL Server availability group. The figure below depicts the general flow of the lease.
The lease is only present on the primary replica, making sure the SQL Server and Windows cluster state for the availability group remain synchronized.
The Windows cluster components poll to determine if the resource IsAlive or LooksAlive on regular intervals. The resource dll must report the state of the resource to the Windows clustering components. For those familiar with the older, SQL Server failover cluster instances (FCIs) this was the accomplished with generic query execution every ## of seconds to see of the server 'looks alive.'
The new lease design removes all the connectivity components and problems associated with that additional overhead and provides a streamlined design to determine if the SQL Server 'looks alive.' The resource dll and the SQL Server instance use the named memory objects, in shared memory, to communicate. The objects are signaled and checked at regular intervals.
The default signaling interval is 1/3 of the configured 'Health Check Timeout' of the availability group.
[Nov 2013 Correction] Default heartbeat interval is 1/4 the LeaseTimeout
If the LeaseTimeout is exceeded without the signal exchange the lease is declared 'expired' and the SQL Server resource dll reports that the SQL Server availability group no longer 'looks alive' to the Windows cluster manager. The cluster manager undertakes the configured corrective actions. SQL Server prevents further data modifications (avoiding split-brain issues) on the current primary. The cluster manager activity helps select the proper primary location and attempts to online the availability group.
The following is a sample message from the SQL Server error log when the lease has expired.
Error: 19407, Severity: 16, State: 1.
The lease between availability group 'MyAG' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster. AlwaysOn: The local replica of availability group 'MyAG' is going offline because either the lease expired or lease renewal failed. This is an informational message |
Looking at the …\MSSQL\LOG\*DIAG*.XEL file for this issue you can see failures reported. Notice the 'Resource Alive result: 0' - The SQL Server resource dll is going to report to the cluster manager that the availability group DOES NOT LOOK ALIVE.
Note: SQL Server Management Studiomay adjust time values based on your clients time zone settings.
The matching cluster log can output similar information as well. Notice you have to adjust for UTC time.
000015ec.00002a64::2012/09/06-05:34:56.019 INFO [RES] SQL Server Availability Group: [hadrag] SQL Server component 'query_processing' health state has been changed from 'warning' to 'clean' at 2012-09-06 06:34:56.017 000015ec.00002a64::2012/09/06-05:35:36.050 WARN [RES] SQL Server Availability Group: [hadrag] Failed to retrieve data column. Return code -1 000015ec.00001a04::2012/09/06-05:35:36.050 ERR [RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost |
The million dollar question is still why? The answer is that 'it depends.' In this instance the SQL Server has encountered a system level problem and is stuck attempting to allocate memory and generating dump files. Looking at the stacks for the dump 100s of threads are stalled, attempting to allocate memory, indicating a memory stall or problem on the overall system at some level and preventing SQL Server from processing work. |
Bob Dorr - Principal SQL Server Escalation Engineer
Comments
Anonymous
January 17, 2013
Correction: When I stated: "If the HealhCheckTimeout is exceeded without the signal exchange the lease is declared 'expired' and " it should have stated LeaseTimeout instead. Sorry for the confusion.Anonymous
February 10, 2013
I am hitting these timeouts even with both the LeaseTimeout and HealthCheckTimeout settings pumped up to their max. And the Extended Events session has no obviously useful information. What could be causing it? This is with one primary and one mirror, and using async mode.Anonymous
May 02, 2013
The comment has been removedAnonymous
April 17, 2015
Before digging too deep into system-level issues, check what is going on inside SQL Server. The sp_server_diagnostics stored procedure is driving the health check that OS resource DLL is executing. If that stored proc does not return a result after the LeaseTimeout threshold is passed, then SQL will be considered dead and a restart or failover of the cluster resource will be initiated. Happened to me. TempDB pressure caused this stored proc to run long. Saw all the same error messages that are listed in this post.Anonymous
July 20, 2016
Good one