Enhance AlwaysOn Failover Policy to Test SQL Server Responsiveness
The AlwaysOn failover policy for AlwaysOn availability groups monitors the health of the SQL Server process hosting the primary replica. For example, one health check mechanism ensures that SQL Server is responsive. The SQL Server Resource DLL establishes a local ODBC connection and SQL Server responds to the session within the availability group's HEALTH_CHECK_TIMEOUT setting, which is 30 seconds, by default.
SQL Server responds to the ODBC session using a thread running at ABOVE NORMAL PRIORITY. As a result, SQL Server could develop a health issue that impacts SQL Server worker threads running at NORMAL PRIORITY, servicing your application sessions, but is able to respond to the SQL Resource DLL health check within the HEALTH_CHECK_TIMEOUT period.
In this scenario AlwaysOn failover policy is unable to detect a health issue with SQL Server even though it is not responding to your application's connection and query requests as expected.
You may add a generic script clustered resource to the availability group, which connects to SQL Server and performs a simple query. As a member resource of the availability group resource, if the script resource fails to connect or execute the query, Windows Cluster will attempt to restart or failover the availability group resource.
Attached to this blog is a zipped file, GenericScript_SQLIsAlive.zip, containing
sqlisalive.vbs <- The generic script written in Visual Basic Script and implements Windows Cluster IsAlive
Add_SQLIsAliveScript.ps1 <- The PowerShell script which adds the generic script resource to your availability group resource
readme.txt <- Step by step instructions for implementing the generic script resource and additional instructions on how to test the script.
The generic script resource executes IsAlive every 60 seconds and makes an ODBC connection to the local SQL Server hosting the availability group's primary replica. It tests for successful connection and query execution. It can also connect in the context of a designated availability group database to test database access.
SQL Server Instance The generic script, sqlisalive.vbs, is configured to connect to the default instance of SQL Server using a local pipe connection at both the primary and secondary, if deployed as is. If this is not the case in your environment, for example, the primary is hosted on the default instance of SQL Server and the secondary replica is hosted on a named instance of SQL Server, the script will need to be uniquely modified on each server, to connect to the local instance using the connection string below:
sConn="Driver={SQL Server Native Client 11.0};Server=lpc:(local);Database=" & Db & ";Trusted_Connection=Yes"
Connection and Command Timeouts The generic script, sqlisalive.vbs, contains the implementation of the IsAlive function, so the script is executed every 60 seconds. By default the script is configured with the following connection and query timeout settings. Change these to settings to thresholds that reflect the delay you intend your application to tolerate. The default in the script is:
oConn.ConnectionTimeout = 20
oConn.CommandTimeout = 10
Database Health Check AlwaysOn failover policy is designed to monitor the health of the SQL Server process hosting the primary replica of your availability group and does not detect the health of availability databases. The generic script is configured to connect to the master database. The script can be modified to test for availability database access by changing the Db variable to the desired availability database:
Db="master"
IMPORTANT On failover, the availability group and its resources will need to successfully test the new primary replica environment. If you intend to test database health with the script resource, be aware that long database recovery on the new primary may cause the script resource failing which would prevent the availability group from coming online on the new primary replica. If long recovery is a possibility on failover, steps must be taken in the script to account for this delay if database health is to be detected.
NOTE: It is best to force a local conneciton (lpc:servername or lpc:(local)) to make sure the script is connecting only to the local primary instance. The attached generic script uses a local connection.
1 Ensure your availability group has two replicas configured for automatic failover.
2 Copy the generic script file, sqlisalive.vbs to an identical local storage location like 'C:\temp\sqlisalive.vbs' on both servers whose replicas are configured for automatic failover.
3 On the SQL Server hosting the availability group primary replica, launch Windows PowerShell ISE as Administrator.
4 Use File / Open to open the Add_SQLIsAliveScript.ps1 script.
5 In the Add_SQLIsAliveScript.ps1, change the availability group resource variable '$ag' at the top of the script to your availability group name.
6 For the '$scriptfilepath' variable, set the correct path/file name in Add_SQLIsAliveScript.ps1 to the location of the generic script file on both servers. The sample PowerShell script currently uses C:\temp\sqlisalive.vbs.
7 Execute the modified PowerShell script Add_SQLIsAliveScript.ps1 to add the generic script resource to your availability group resource group.
8 Launch Failover Cluster Manager and review the availability group resource group to confirm addition of the generic script resource to the availability group resource group. The generic script should appear and come online in the availability group resource group under the Resources tab.
NOTES
The attached readme.txt file has instructions on how to test the script resource to ensure that it can failover your availability group resource.
Generate the cluster log for the node hosting the primary replica and search for 'SQLCommandFailed' or SQLConnectionFailed' - you should find the error and description. For example, on a query timeout, the following is found in the Cluster log:
00000b58.00000600::2014/10/09-17:56:05.307 INFO [RES] Generic Script <sqlisalive>: Entering IsAlive
00000b58.00000600::2014/10/09-17:56:05.307 INFO [RES] Generic Script <sqlisalive>: IsAlive SQLCommandFailed
00000b58.00000600::2014/10/09-17:56:05.307 INFO [RES] Generic Script <sqlisalive>: Error: -2147217871; Native Error: 0; SQL State: S1T00; Source: Microsoft OLE DB Provider for ODBC Drivers
00000b58.00000600::2014/10/09-17:56:05.307 INFO [RES] Generic Script <sqlisalive>: [Microsoft][SQL Server Native Client 11.0]Query timeout expired
00000b58.00000600::2014/10/09-17:56:05.307 ERR [RES] Generic Script <sqlisalive>: 'IsAlive' script entry point returned FALSE.'
Or for example, there was a problem detected connecting to SQL Server:
00001f4c.0000149c::2014/10/09-18:03:45.083 INFO [RES] Generic Script <sqlisalive>: Entering IsAlive
00001f4c.0000149c::2014/10/09-18:03:45.133 INFO [RES] Generic Script <sqlisalive>: IsAlive SQLConnectionFailed
00001f4c.0000149c::2014/10/09-18:03:45.133 INFO [RES] Generic Script <sqlisalive>: Error: -2147217843; Native Error: 18456; SQL State: 28000; Source: Microsoft OLE DB Provider for ODBC Drivers
00001f4c.0000149c::2014/10/09-18:03:45.133 INFO [RES] Generic Script <sqlisalive>: [Microsoft][SQL Server Native Client 11.0][SQL Server]Login failed for user ''.
00001f4c.0000149c::2014/10/09-18:03:45.133 ERR [RES] Generic Script <sqlisalive>: 'IsAlive' script entry point returned FALSE.'