How It Works: SQL Server AlwaysOn Lease Timeout
Based in recent cases I needed to extend and clarify this post.
The LeaseTimeout controls the lease mechanism and when the lease expires there is a very high probability of a system wide event taking place. Let me explain more and why I make this statement.
The SQL Server resource DLL is responsible for the lease heartbeat activity. Every 1/4 of the LeaseTimeout setting the dedicated, lease thread wakes up and attempts to renew the lease. The lease is a simple handshake between the resource DLL and the SQL Server instance supporting the AG on the same node.
The activity is a two(2) way handshake using a pair of named events. Here is the simple interaction.
|Resource DLL Lease Thread||Server Lease Worker/Thread|
|WaitForSingleObject(Server Event)||SetEvent(Server Event)|
|Wait for 1/4 the Lease Timeout||Repeat Loop Until Shutdown or Lease Expires|
**Repeat Loops Until Shutdown or Lease Expires
In addition to this handshake the ‘server lease worker/thread’ is a preemptive worker (think dedicated thread outside of SQLOS control – OS scheduling controlled only) running at ABOVE NORMAL PRIORITY.
For a lease to expire the server lease worker is not able to wait on and respond to the event handshake when it is dedicated to the task and running at a priority above most threads on the system. This is why I stated the lease timeout is commonly due to a system wide event.
If the lease timeout is expiring you likely have a system wide, impacting event taking place that needs to be resolved. If you are experiencing lease timeouts troubleshoot the entire computer you are likely encountering a broader issue than just a SQL Server lease timeout.
For example, one recent case revealed that during the lease renewal process we could not even collect performance monitor data nor was a single message logged in the cluster log. In another case the an application on the system caused massive paging to occur, slowing the entire system.
The lease is used between the SQL Server resource DLL and the SQL Server instance to prevent split-brain from occurring for the availability group (AG).
The lease is only present on the primary replica, making sure the SQL Server and Windows cluster state for the availability group remain synchronized.
The Windows cluster components poll to determine if the resource IsAlive or LooksAlive on regular intervals. The resource dll must report the state of the resource to the Windows clustering components. For those familiar with the older, SQL Server failover cluster instances (FCIs) this was the accomplished with generic query execution every ## of seconds to see of the server 'looks alive.'
The new lease design removes all the connectivity components and problems associated with that additional overhead and provides a streamlined design to determine if the SQL Server 'looks alive.' The resource dll and the SQL Server instance use the named memory objects, in shared memory, to communicate. The objects are signaled and checked at regular intervals.
The default signaling interval is 1/3 of the configured 'Health Check Timeout' of the availability group.
[Nov 2013 Correction] Default heartbeat interval is 1/4 the LeaseTimeout
If the LeaseTimeout is exceeded without the signal exchange the lease is declared 'expired' and the SQL Server resource dll reports that the SQL Server availability group no longer 'looks alive' to the Windows cluster manager. The cluster manager undertakes the configured corrective actions. SQL Server prevents further data modifications (avoiding split-brain issues) on the current primary. The cluster manager activity helps select the proper primary location and attempts to online the availability group.
The following is a sample message from the SQL Server error log when the lease has expired.
|Error: 19407, Severity: 16, State: 1.
The lease between availability group 'MyAG' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.
AlwaysOn: The local replica of availability group 'MyAG' is going offline because either the lease expired or lease renewal failed. This is an informational message
Looking at the …\MSSQL\LOG\*DIAG*.XEL file for this issue you can see failures reported. Notice the 'Resource Alive result: 0' - The SQL Server resource dll is going to report to the cluster manager that the availability group DOES NOT LOOK ALIVE.
Note: SQL Server Management Studiomay adjust time values based on your clients time zone settings.
The matching cluster log can output similar information as well. Notice you have to adjust for UTC time.
|000015ec.00002a64::2012/09/06-05:34:56.019 INFO [RES] SQL Server Availability Group: [hadrag] SQL Server component 'query_processing' health state has been changed from 'warning' to 'clean' at 2012-09-06 06:34:56.017 000015ec.00002a64::2012/09/06-05:35:36.050 WARN [RES] SQL Server Availability Group: [hadrag] Failed to retrieve data column. Return code -1 000015ec.00001a04::2012/09/06-05:35:36.050 ERR [RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost|
The million dollar question is still why?
The answer is that 'it depends.' In this instance the SQL Server has encountered a system level problem and is stuck attempting to allocate memory and generating dump files. Looking at the stacks for the dump 100s of threads are stalled, attempting to allocate memory, indicating a memory stall or problem on the overall system at some level and preventing SQL Server from processing work.
Bob Dorr - Principal SQL Server Escalation Engineer