Failover

The Cluster service attempts to fail over a group when:

  • The node currently hosting the group becomes inactive for any reason.
  • One of the resources within the group fails, and the resource's RestartAction property is set to ClusterResourceRestartNotify.
  • An administrator or developer forces failover.

A failover attempt consists of the following steps:

  1. The Cluster service takes all the resources in the group offline in an order determined by the group's dependency hierarchy: dependent resources first, followed by the resources on which they depend. For example, if an application depends on a Physical Disk resource, the Cluster service takes the application offline first, allowing the application to write changes to the disk before the disk is taken offline.

    The Cluster service takes a resource offline by invoking, through a Resource Monitor, the Offline entry point function of the resource DLL managing the resource. Offline implements shutdown procedures for the resource. If the resource does not shut down within a time limit specified by the resource's PendingTimeout property, the Cluster service forcefully terminates the resource by calling the resource DLL's Terminate entry point function.

  2. When all of the resources are offline, the Cluster service attempts to transfer the group to the node that is listed next on the group's list of preferred host nodes.

  3. If the Cluster service successfully moves the group to another node, it attempts to bring all of the group's resources online in the reverse order: resources at the bottom of the dependency hierarchy (those on which other resources depend) first, followed by their dependents. Failover is complete when all of the group's resources are online on the new node.
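The ordering rules in the steps above can be illustrated with a small simulation. This is only a sketch, not the Cluster service's implementation: the function names, the resource names, and the dictionary layout are hypothetical, and wrapping around to the start of the preferred list when its end is reached is an assumption made for the example.

```python
from graphlib import TopologicalSorter  # Python 3.9 or later


def online_order(depends_on):
    """Order for bringing resources online: providers before their dependents.

    `depends_on` maps each resource name to the set of resources it depends on.
    """
    # TopologicalSorter expects {node: predecessors} and yields predecessors first.
    return list(TopologicalSorter(depends_on).static_order())


def offline_order(depends_on):
    """Order for taking resources offline: dependents before their providers."""
    return list(reversed(online_order(depends_on)))


def next_preferred_node(preferred, current, active):
    """Return the next active node after `current` in the preferred list.

    Wrapping around to the start of the list is an assumption of this sketch.
    Returns None if no other active node is available.
    """
    start = preferred.index(current) + 1 if current in preferred else 0
    for offset in range(len(preferred)):
        candidate = preferred[(start + offset) % len(preferred)]
        if candidate != current and candidate in active:
            return candidate
    return None


# Hypothetical group: an application that depends on a disk and a network name.
group = {
    "App": {"Disk", "NetName"},
    "NetName": {"IPAddress"},
    "Disk": set(),
    "IPAddress": set(),
}
going_down = offline_order(group)   # "App" comes before "Disk" and "NetName"
coming_up = online_order(group)     # "Disk" and "NetName" come before "App"
target = next_preferred_node(["Node1", "Node2", "Node3"], "Node1", {"Node3"})
```

In this example the application is taken offline before the disk it writes to, and `target` is `Node3` because `Node2` is not active.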

The Cluster service continues attempting to fail over a group until it succeeds or until a specified number of attempts have been made within a specified time span. The group's FailoverThreshold property sets the maximum number of failover attempts, and its FailoverPeriod property sets the interval of time in which those attempts are counted. If the number of attempts reaches this limit, the Cluster service concludes that the group cannot be brought online anywhere in the cluster and stops trying to fail over the group.
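The interaction between FailoverThreshold and FailoverPeriod can be sketched as a counter over a sliding time window. The class below is a hypothetical model for illustration only. Its name, its methods, and the exact window semantics are assumptions, and the period is expressed in hours purely as a convention of the sketch.

```python
from collections import deque


class FailoverBudget:
    """Hypothetical model of a 'threshold attempts per period' limit."""

    def __init__(self, threshold, period_hours):
        self.threshold = threshold
        self.period_seconds = period_hours * 3600
        self._attempts = deque()  # timestamps (seconds) of recent attempts

    def try_record_attempt(self, now):
        """Record an attempt at time `now` (seconds) if the limit allows it.

        Returns True if the attempt is allowed, False if the limit is reached.
        """
        # Forget attempts that have aged out of the window.
        while self._attempts and now - self._attempts[0] >= self.period_seconds:
            self._attempts.popleft()
        if len(self._attempts) >= self.threshold:
            return False
        self._attempts.append(now)
        return True


budget = FailoverBudget(threshold=2, period_hours=1)
results = [budget.try_record_attempt(t) for t in (0, 10, 20, 3600, 3605)]
```

With a threshold of 2 and a period of one hour, the third attempt is refused, a further attempt is allowed once the first one ages out of the window, and the attempt after that is refused again.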

Ways to control failover policy

  • Define how the failover cluster detects and responds to the failure of individual resources in the group. For information, see Resource Failure.
  • Control the order in which the Cluster service takes resources offline by establishing dependency relationships between resources. For information, see Resource Dependencies.
  • When developing custom resource types, provide implementations of Offline specific to the needs of the resource being supported. Offline should shut down the resource quickly and gracefully. For information, see Implementing Offline.
  • Specify a resource's PendingTimeout property to control how long the Cluster service waits for the resource to shut down gracefully.
  • Maintain an accurate, prioritized list of the nodes that can act as host for the group. Make sure the nodes on the list have the capacity to host the group effectively. For information, see SetClusterGroupNodeList and Enumerating Objects.
  • Adjust a group's FailoverThreshold and FailoverPeriod properties to control how the group responds to unsuccessful failover. For information, see Setting Properties.
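The effect of PendingTimeout mentioned in the list above, a graceful Offline followed by a forced Terminate if the time limit expires, can be sketched as follows. This is a simplified simulation rather than the Resource Monitor's actual mechanism: `take_offline`, its parameters, and the use of a Python thread are all hypothetical, and the timeout is given in seconds only for convenience.

```python
import threading


def take_offline(offline, terminate, pending_timeout_s):
    """Run `offline`; if it has not finished within the timeout, call `terminate`.

    `offline` and `terminate` stand in for a resource DLL's Offline and
    Terminate entry points. Returns "offline" or "terminated".
    """
    worker = threading.Thread(target=offline, daemon=True)
    worker.start()
    worker.join(pending_timeout_s)  # wait for a graceful shutdown
    if worker.is_alive():
        # Graceful shutdown did not finish in time, so force it.
        terminate()
        return "terminated"
    return "offline"


# A resource that shuts down promptly.
quick_result = take_offline(lambda: None, lambda: None, pending_timeout_s=1.0)

# A resource whose Offline blocks until it is forcefully released.
stuck = threading.Event()
slow_result = take_offline(stuck.wait, stuck.set, pending_timeout_s=0.05)
```

A shorter timeout lets failover proceed sooner when a resource hangs, at the cost of giving well-behaved resources less time to shut down cleanly.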