Applies to:
SQL Server on Azure VM
A Windows Server Failover Cluster is used for high availability and disaster recovery (HADR) with SQL Server on Azure Virtual Machines (VMs).
This article provides cluster configuration best practices for both failover cluster instances (FCIs) and availability groups when you use them with SQL Server on Azure VMs.
To learn more, see the other articles in this series: Checklist, VM size, Storage, Security, HADR configuration, Collect baseline.
Review the following checklist for a brief overview of the HADR best practices that the rest of the article covers in greater detail.
High availability and disaster recovery (HADR) features, such as Always On availability groups and failover cluster instances, rely on the underlying Windows Server Failover Cluster technology. Review the best practices for modifying your HADR settings to better support the cloud environment.
For your Windows cluster, consider these best practices:

- Deploy your SQL Server VMs to multiple subnets whenever possible to avoid the dependency on an Azure Load Balancer or a distributed network name (DNN).
- Change the cluster heartbeat and threshold settings to the relaxed settings described later in this article.
- Configure a quorum resource; use a cloud witness whenever possible, or a disk witness when your failover cluster instance uses Azure shared disks.
- Use a single NIC per cluster node.
For your SQL Server availability group or failover cluster instance, consider these best practices:

- Ensure the lease timeout is less than `2 * SameSubnetThreshold * SameSubnetDelay`. If you use the `SameSubnetThreshold` and `SameSubnetDelay` values recommended previously, don't exceed 80 seconds for the lease timeout value.
- Use `MultiSubnetFailover = true` in the connection string, even if your cluster only spans one subnet.
- If the client doesn't support `MultiSubnetFailover = True`, you may need to set `RegisterAllProvidersIP = 0` and `HostRecordTTL = 300` to cache client credentials for shorter durations. However, doing so may cause additional queries to the DNS server.
- To connect to an availability group by using the distributed network name (DNN), use a client driver that supports `MultiSubnetFailover = True`, and this parameter must be in the connection string.
- If your SQL Server resource uses a port in the default dynamic port range, add a port exclusion for it, for example: `netsh int ipv4 add excludedportrange tcp startport=59999 numberofports=1 store=persistent`
To compare the HADR checklist with the other best practices, see the comprehensive Performance best practices checklist.
To reduce the effect of downtime, consider VM availability settings such as availability zones and availability sets; see the Checklist article in this series for details.
Although a two-node cluster functions without a quorum resource, customers are strictly required to use a quorum resource to have production support. Cluster validation doesn't pass any cluster without a quorum resource.
Technically, a three-node cluster can survive a single node loss (down to two nodes) without a quorum resource, but once the cluster is down to two nodes, another node loss or communication failure risks taking the clustered resources offline to prevent a split-brain scenario. Configuring a quorum resource allows the cluster to stay online with only one node.
The disk witness is the most resilient quorum option, but using a disk witness with SQL Server on Azure VMs requires an Azure shared disk, which imposes some limitations on the high availability solution. As such, use a disk witness when you're configuring your failover cluster instance with Azure shared disks; otherwise, use a cloud witness whenever possible.
The following table lists the quorum options available for SQL Server on Azure VMs:
| | Cloud witness | Disk witness | File share witness |
|---|---|---|---|
| Supported OS | Windows Server 2016+ | All | All |
To get started, see Configure cluster quorum.
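For example, a cloud witness can be configured with the Failover Clustering PowerShell module. This is a minimal sketch; the storage account name and access key are placeholders for your own values:

```powershell
# Configure a cloud witness that uses an Azure storage account for quorum arbitration.
# <StorageAccountName> and <AccessKey> are placeholders for your own storage account.
Set-ClusterQuorum -CloudWitness -AccountName "<StorageAccountName>" -AccessKey "<AccessKey>"

# Verify the resulting quorum configuration.
Get-ClusterQuorum
```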
It's possible to change the quorum vote of a node participating in a Windows Server Failover Cluster.
When modifying the node vote settings, follow these guidelines:
- Start with each node having no vote by default. Each node should only have a vote with explicit justification.
- Enable votes for cluster nodes that host the primary replica of an availability group, or the preferred owners of a failover cluster instance.
- Enable votes for automatic failover owners. Each node that might host a primary replica or FCI as a result of an automatic failover should have a vote.
- If an availability group has more than one secondary replica, only enable votes for the replicas that have automatic failover.
- Disable votes for nodes that are in secondary disaster recovery sites. Nodes in secondary sites shouldn't contribute to the decision of taking a cluster offline if there's nothing wrong with the primary site.
- Have an odd number of votes, with three quorum votes minimum. Add a quorum witness for an additional vote if necessary in a two-node cluster.
- Reassess vote assignments post-failover. You don't want to fail over into a cluster configuration that doesn't support a healthy quorum.
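As a sketch of how vote assignments can be inspected and changed with the Failover Clustering PowerShell module (the node name DR-NODE is a placeholder):

```powershell
# Review the current vote assignment for each node.
Get-ClusterNode | Format-Table -Property Name, State, NodeWeight

# Remove the vote from a node in the secondary disaster recovery site.
# "DR-NODE" is a placeholder for your own node name.
(Get-ClusterNode -Name "DR-NODE").NodeWeight = 0
```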
To match the on-premises experience for connecting to your availability group listener or failover cluster instance, deploy your SQL Server VMs to multiple subnets within the same virtual network. Having multiple subnets negates the need for the extra dependency on an Azure Load Balancer or a distributed network name to route traffic to your listener.
To simplify your HADR solution, deploy your SQL Server VMs to multiple subnets whenever possible. To learn more, see Multi-subnet AG, and Multi-subnet FCI.
If your SQL Server VMs are in a single subnet, it's possible to configure either a virtual network name (VNN) and an Azure Load Balancer, or a distributed network name (DNN) for both failover cluster instances and availability group listeners.
The distributed network name is the recommended connectivity option, when available:

- The end-to-end solution is more robust since you no longer need to maintain the load balancer resource.
- Eliminating the load balancer probes minimizes failover duration.
- The DNN simplifies provisioning and management of the failover cluster instance or availability group listener with SQL Server on Azure VMs.

Consider the following limitations:

- Client drivers must support the `MultiSubnetFailover=True` parameter.

To learn more, see the Windows Server Failover Cluster overview.
To configure connectivity, see the following articles:
Most SQL Server features work transparently with FCI and availability groups when using the DNN, but there are certain features that might require special consideration. See FCI and DNN interoperability and AG and DNN interoperability to learn more.
Tip
Set the MultiSubnetFailover parameter to true in the connection string, even for HADR solutions that span a single subnet, to support future subnet spanning without needing to update connection strings.
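For example, a connection opened from PowerShell might look like the following sketch; the listener name, port, and database name are placeholder values:

```powershell
# Connect to the availability group listener with MultiSubnetFailover enabled.
# "aglistener", port 1433, and "AdventureWorks" are placeholder values.
$connectionString = "Server=tcp:aglistener,1433;Database=AdventureWorks;" +
    "Integrated Security=SSPI;MultiSubnetFailover=True"

$connection = New-Object System.Data.SqlClient.SqlConnection($connectionString)
$connection.Open()
$connection.State   # Should report 'Open'
$connection.Close()
```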
Change the cluster heartbeat and threshold settings to relaxed settings. The default heartbeat and threshold cluster settings are designed for highly tuned on-premises networks and don't consider the possibility of increased latency in a cloud environment. The heartbeat network is maintained on UDP port 3343, which is traditionally far less reliable than TCP and more prone to incomplete conversations.
Therefore, when running cluster nodes for SQL Server on Azure VM high availability solutions, change the cluster settings to a more relaxed monitoring state to avoid transient failures due to the increased possibility of network latency or failure, Azure maintenance, or hitting resource bottlenecks.
The delay and threshold settings have a cumulative effect to total health detection. For example, setting CrossSubnetDelay to send a heartbeat every 2 seconds and setting the CrossSubnetThreshold to 10 missed heartbeats before taking recovery means the cluster can have a total network tolerance of 20 seconds before recovery action is taken. In general, continuing to send frequent heartbeats but having greater thresholds is preferred.
To ensure recovery during legitimate outages while providing greater tolerance for transient issues, relax your delay and threshold settings to the recommended values detailed in the following table:
| Setting | Windows Server 2012 or later | Windows Server 2008 R2 |
|---|---|---|
| SameSubnetDelay | 1 second | 2 seconds |
| SameSubnetThreshold | 40 heartbeats | 10 heartbeats (max) |
| CrossSubnetDelay | 1 second | 2 seconds |
| CrossSubnetThreshold | 40 heartbeats | 20 heartbeats (max) |
Use PowerShell to change your cluster parameters:
(get-cluster).SameSubnetThreshold = 40
(get-cluster).CrossSubnetThreshold = 40
Use PowerShell to verify your changes:
get-cluster | fl *subnet*
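If you also want to set the heartbeat delays explicitly, note that the delay properties are expressed in milliseconds. A sketch that applies the full set of relaxed values from the preceding table:

```powershell
# Heartbeat interval in milliseconds (1 second), per the relaxed recommendations.
(Get-Cluster).SameSubnetDelay = 1000
(Get-Cluster).CrossSubnetDelay = 1000

# Number of missed heartbeats tolerated before recovery action is taken.
(Get-Cluster).SameSubnetThreshold = 40
(Get-Cluster).CrossSubnetThreshold = 40

# Confirm the new values.
Get-Cluster | Format-List -Property *Subnet*
```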
Consider the following:
Choose relaxed values based on how much downtime is tolerable and how long before a corrective action should occur, depending on your application, business needs, and your environment. If you're not able to exceed the default Windows Server 2019 values, at least try to match them, if possible.
For reference, the following table details the default values:
| Setting | Windows Server 2019 | Windows Server 2016 | Windows Server 2008 - 2012 R2 |
|---|---|---|---|
| SameSubnetDelay | 1 second | 1 second | 1 second |
| SameSubnetThreshold | 20 heartbeats | 10 heartbeats | 5 heartbeats |
| CrossSubnetDelay | 1 second | 1 second | 1 second |
| CrossSubnetThreshold | 20 heartbeats | 10 heartbeats | 5 heartbeats |
To learn more, see Tuning Failover Cluster Network Thresholds.
If tuning your cluster heartbeat and threshold settings as recommended doesn't provide sufficient tolerance and you're still seeing failovers due to transient issues rather than true outages, you can configure your AG or FCI monitoring to be more relaxed. In some scenarios, it might be beneficial to temporarily relax the monitoring for a period of time, given the level of activity. For example, you might want to relax the monitoring when you're running I/O-intensive workloads such as database backups, index maintenance, or DBCC CHECKDB. Once the activity is complete, set your monitoring back to less relaxed values.
Warning
Changing these settings might mask an underlying problem; treat them as a temporary measure to reduce, rather than eliminate, the likelihood of failure. Underlying issues should still be investigated and addressed.
Start by increasing the following parameters from their default values for relaxed monitoring, and adjust as necessary:
| Parameter | Default value | Relaxed value | Description |
|---|---|---|---|
| Healthcheck timeout | 30000 | 60000 | Determines the health of the primary replica or node. The cluster resource DLL sp_server_diagnostics returns results at an interval that equals 1/3 of the health-check timeout threshold. If sp_server_diagnostics is slow or isn't returning information, the resource DLL waits for the full interval of the health-check timeout threshold before determining that the resource is unresponsive and initiating an automatic failover, if configured to do so. |
| Failure-condition level | 3 | 2 | Conditions that trigger an automatic failover. There are five failure-condition levels, which range from the least restrictive (level one) to the most restrictive (level five). |
Use Transact-SQL (T-SQL) to modify the health check and failure conditions for both AGs and FCIs.
For availability groups:
ALTER AVAILABILITY GROUP AG1 SET (HEALTH_CHECK_TIMEOUT = 60000);
ALTER AVAILABILITY GROUP AG1 SET (FAILURE_CONDITION_LEVEL = 2);
For failover cluster instances:
ALTER SERVER CONFIGURATION SET FAILOVER CLUSTER PROPERTY HealthCheckTimeout = 60000;
ALTER SERVER CONFIGURATION SET FAILOVER CLUSTER PROPERTY FailureConditionLevel = 2;
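To confirm the availability group values afterward, you can query sys.availability_groups. The following sketch uses the SqlServer PowerShell module; the instance name is a placeholder:

```powershell
# Requires the SqlServer module (Install-Module SqlServer).
# "SQLVM1" is a placeholder for your own instance name.
Invoke-Sqlcmd -ServerInstance "SQLVM1" -Query @"
SELECT name, failure_condition_level, health_check_timeout
FROM sys.availability_groups;
"@
```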
Specific to availability groups, start with the following recommended parameters, and adjust as necessary:
| Parameter | Default value | Relaxed value | Description |
|---|---|---|---|
| Lease timeout | 20000 | 40000 | Prevents split-brain scenarios. |
| Session timeout | 10000 | 20000 | Checks for communication issues between replicas. The session-timeout period is a replica property that controls how long (in seconds) an availability replica waits for a ping response from a connected replica before considering the connection to have failed. By default, a replica waits 10 seconds for a ping response. This replica property applies only to the connection between a given secondary replica and the primary replica of the availability group. |
| Max failures in specified period | 2 | 6 | Used to avoid indefinite movement of a clustered resource during multiple node failures. Too low a value can leave the availability group in a failed state; increase it so that short disruptions caused by performance issues don't exhaust the failover count. |
Before making any changes, consider the following:
- Ensure the lease timeout is less than `2 * SameSubnetThreshold * SameSubnetDelay`.
- If you use the `SameSubnetThreshold` and `SameSubnetDelay` values recommended previously, don't exceed 80 seconds for the lease timeout value.

Lease timeout
Use the Failover Cluster Manager to modify the lease timeout settings for your availability group. See the SQL Server availability group lease health check documentation for detailed steps.
Session timeout
Use Transact-SQL (T-SQL) to modify the session timeout for an availability group:
ALTER AVAILABILITY GROUP AG1
MODIFY REPLICA ON 'INSTANCE01' WITH (SESSION_TIMEOUT = 20);
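To confirm the new session timeout, you can query sys.availability_replicas; a sketch with a placeholder instance name:

```powershell
# "SQLVM1" is a placeholder for your own instance name.
Invoke-Sqlcmd -ServerInstance "SQLVM1" -Query @"
SELECT replica_server_name, session_timeout
FROM sys.availability_replicas;
"@
```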
Max failures in specified period
Use the Failover Cluster Manager to modify the Max failures in specified period value: select Roles, right-click the availability group role, select Properties, and then adjust the value on the Failover tab.
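The same setting is exposed as a cluster group common property, so a PowerShell sketch like the following can also change it (the role name AG1 is a placeholder):

```powershell
# "AG1" is a placeholder for the name of your availability group role.
$group = Get-ClusterGroup -Name "AG1"

# Allow up to 6 failures within the failover period (FailoverPeriod is in hours).
$group.FailoverThreshold = 6
$group.FailoverPeriod = 6
```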
VM or disk limits could result in a resource bottleneck that impacts the health of the cluster and impedes the health check. If you're experiencing issues with resource limits, consider relaxing your monitoring settings and reviewing the VM size and Storage best practices in this series to ensure your VM and disks meet your workload's demands.
Deploy your SQL Server VMs to multiple subnets whenever possible to avoid the dependency on an Azure Load Balancer or a distributed network name (DNN) to route traffic to your HADR solution.
Use a single NIC per server (cluster node). Azure networking has physical redundancy, which makes additional NICs unnecessary on an Azure virtual machine guest cluster. The cluster validation report warns you that the nodes are reachable only on a single network. You can ignore this warning on Azure virtual machine guest failover clusters.
Bandwidth limits for a particular VM are shared across NICs and adding an additional NIC doesn't improve availability group performance for SQL Server on Azure VMs. As such, there's no need to add a second NIC.
The non-RFC-compliant DHCP service in Azure can cause the creation of certain failover cluster configurations to fail. This failure happens because the cluster network name is assigned a duplicate IP address, such as the same IP address as one of the cluster nodes. This is an issue when you use availability groups, which depend on the Windows failover cluster feature.
Consider the scenario when a two-node cluster is created and brought online: the cluster network name requests a dynamically assigned IP address, but the Azure DHCP service returns the same IP address as the requesting node. Windows detects the duplicate address, and the cluster network name fails to come online.
You can avoid this scenario by assigning an unused static IP address to the cluster network name in order to bring the cluster network name online and add the IP address to Azure Load Balancer.
If the SQL Server database engine, Always On availability group listener, failover cluster instance health probe, database mirroring endpoint, cluster core IP resource, or any other SQL resource is configured to use a port between 49,152 and 65,536 (the default dynamic port range for TCP/IP), add an exclusion for each port. Doing so prevents other system processes from being dynamically assigned the same port. The following example creates an exclusion for port 59999:
netsh int ipv4 add excludedportrange tcp startport=59999 numberofports=1 store=persistent
It's important to configure the port exclusion when the port isn't in use, otherwise the command fails with a message like "The process can't access the file because it's being used by another process."
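To check whether the port is currently in use before you add the exclusion, a PowerShell sketch:

```powershell
# Returns nothing if port 59999 is free; otherwise lists the owning connection.
Get-NetTCPConnection -LocalPort 59999 -ErrorAction SilentlyContinue |
    Format-Table -Property LocalAddress, LocalPort, State, OwningProcess
```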
To confirm that the exclusions have been configured correctly, use the following command: `netsh int ipv4 show excludedportrange tcp`.
Setting this exclusion for the availability group role IP probe port should prevent events such as Event ID: 1069 with status 10048. This event can be seen in the Windows Failover cluster events with the following message:
Cluster resource '<IP name in AG role>' of type 'IP Address' in cluster role '<AG Name>' failed.
An Event ID: 1069 with status 10048 can be identified from cluster logs with events like:
Resource IP Address 10.0.1.0 called SetResourceStatusEx: checkpoint 5. Old state OnlinePending, new state OnlinePending, AppSpErrorCode 0, Flags 0, nores=false
IP Address <IP Address 10.0.1.0>: IpaOnlineThread: Listening on probe port 59999 failed with status 10048
Status [10048](/windows/win32/winsock/windows-sockets-error-codes-2) refers to: This error occurs if an application attempts to bind a socket to an IP address/port that has already been used for an existing socket.
This can be caused by an internal process taking the same port that's defined as the probe port. Remember that the probe port is used by Azure Load Balancer to check the status of a backend pool instance.
If the health probe fails to get a response from a backend instance, then no new connections will be sent to that backend instance until the health probe succeeds again.
Review the resolutions for some commonly known issues and errors.
Exhausting I/O or CPU capacity for the VM can cause your availability group to fail over. Identifying the contention that happens right before the failover is the most reliable way to determine the cause of an automatic failover.
Use I/O Analysis (Preview) in the Azure portal to identify disk performance issues that can cause a failover.
Monitor Azure Virtual Machines to look at the Storage IO Utilization metrics to understand VM or disk level latency.
Follow these steps to review the Azure VM Overall IO Exhaustion event:
Navigate to your Virtual Machine in the Azure portal - not the SQL virtual machines.
Select Metrics under Monitoring to open the Metrics page.
Select Local time to specify the time range you're interested in, and the time zone, either local to the VM, or UTC/GMT.
Select Add metric to add the following two metrics to see the graph:

- VM Cached IOPS Consumed Percentage
- VM Uncached IOPS Consumed Percentage
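The same metrics can be retrieved with the Az PowerShell module; a sketch in which the resource ID and time window are placeholders:

```powershell
# Requires the Az.Monitor module (Install-Module Az).
# $vmResourceId is a placeholder for your VM's full resource ID.
$vmResourceId = "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm-name>"

# Pull the per-minute maximum for the uncached IOPS consumption metric.
Get-AzMetric -ResourceId $vmResourceId `
    -MetricName "VM Uncached IOPS Consumed Percentage" `
    -StartTime (Get-Date).AddHours(-4) -EndTime (Get-Date) `
    -TimeGrain 00:01:00 -AggregationType Maximum
```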
It's possible that an Azure VM HostEvent causes your availability group to fail over. If you believe an Azure VM HostEvent caused a failover, you can check the Azure Monitor Activity log, and the Azure VM Resource Health overview.
The Azure Monitor activity log is a platform log in Azure that provides insight into subscription-level events, such as when a resource is modified or a virtual machine is started. You can view the activity log in the Azure portal, or retrieve entries with PowerShell and the Azure CLI.
To check the Azure Monitor activity log, follow these steps:
Navigate to your Virtual Machine in the Azure portal.
Select Activity Log on the Virtual Machine pane.
Select Timespan and then choose the time frame when your availability group failed over. Select Apply.
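To retrieve activity log entries with PowerShell instead, a sketch using the Az.Monitor module (the resource group name and time window are placeholders):

```powershell
# Requires the Az.Monitor module.
# "<rg>" and the time window are placeholders; adjust to your failover window.
Get-AzActivityLog -ResourceGroupName "<rg>" `
    -StartTime (Get-Date).AddDays(-1) -EndTime (Get-Date) |
    Format-Table -Property EventTimestamp, OperationName, Status, Caller
```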
If Azure has further information about the root cause of a platform-initiated unavailability, that information might be posted on the Azure VM - Resource Health overview page up to 72 hours after the initial unavailability. This information is only available for virtual machines at this time.
You can also configure alerts based on health events from this page.
If the Windows cluster heartbeat and threshold settings are too aggressive for your environment, you might see the following message in the system event log frequently.
Error 1135
Cluster node 'Node1' was removed from the active failover cluster membership.
The Cluster service on this node may have stopped. This could also be due to the node having
lost communication with other active nodes in the failover cluster. Run the Validate a
Configuration Wizard to check your network configuration. If the condition persists, check
for hardware or software errors related to the network adapters on this node. Also check for
failures in any other network components to which the node is connected such as hubs, switches, or bridges.
For more information, review Troubleshooting cluster issue with Event ID 1135.
If monitoring is too aggressive for your environment, you might see frequent availability group or FCI restarts, failures, or failovers. Additionally, for availability groups, you might see the following messages in the SQL Server error log:
Error 19407: The lease between availability group 'PRODAG' and the Windows Server Failover Cluster has expired.
A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster.
To determine whether the availability group is failing over correctly, check the corresponding availability group
resource in the Windows Server Failover Cluster.
Error 19419: The renewal of the lease between availability group '%.*ls' and the Windows Server Failover Cluster
failed because the existing lease is no longer valid.
If the session timeout is too aggressive for your availability group environment, you might see the following messages frequently:
Error 35201: A connection timeout has occurred while attempting to establish a connection to availability
replica 'replicaname' with ID [availability_group_id]. Either a networking or firewall issue exists,
or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.
Error 35206
A connection timeout has occurred on a previously established connection to availability
replica 'replicaname' with ID [availability_group_id]. Either a networking or a firewall issue
exists, or the availability replica has transitioned to the resolving role.
If the Maximum Failures in the Specified Period value is too low and you're experiencing intermittent failures due to transient issues, your availability group could end in a failed state. Increase this value to tolerate more transient failures.
Not failing over group <Resource name>, failoverCount 3, failoverThresholdSetting <Number>, computedFailoverThreshold 2.
This can happen if the Storage Spaces property `AutomaticClusteringEnabled` is set to `True` in an availability group environment; change it to `False`. Also, running a validation report with the storage option can trigger a disk reset or surprise-removal event, and storage-system throttling can also trigger a disk surprise-removal event.
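A sketch of how that property might be changed with the Storage PowerShell module, assuming the clustered storage subsystem's friendly name matches "Clustered*":

```powershell
# Disable automatic clustering of new disks for the clustered storage subsystem.
Get-StorageSubSystem -FriendlyName "Clustered*" |
    Set-StorageSubSystem -AutomaticClusteringEnabled $false
```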
The computer object associated with the resource couldn't be updated in the domain. Make sure you have the appropriate permissions on the domain.
You might encounter issues while setting up a Windows failover cluster or its connectivity if you don't have Cluster Service Ports open for communication.
If you're on Windows Server 2019 and you don't see a Windows cluster IP address, you've configured a distributed network name (DNN), which is only supported on SQL Server 2019. If you have an earlier version of SQL Server, you can remove and re-create the cluster by using a virtual network name.
Review other Windows failover clustering events and errors, and their solutions, in the Windows failover clustering troubleshooting documentation.
To learn more, see:
- Tutorial: Prerequisites for AG in multiple subnets - SQL Server on Azure VMs: shows how to configure the prerequisites for creating an Always On availability group in multiple subnets for SQL Server on Azure Virtual Machines (VMs).
- Migrate a SQL Server Always On availability group from a single subnet to a multiple subnet (multi-subnet) environment.
- Overview of SQL Server Always On availability groups - SQL Server on Azure VMs: introduces SQL Server Always On availability groups on Azure Virtual Machines.