Failover and failback

Applies To: Windows Server 2003, Windows Server 2003 R2, Windows Server 2003 with SP1, Windows Server 2003 with SP2


Failover

If an individual application in a server cluster fails (but the node does not), the Cluster service typically tries to restart the application on the same node. If the restart fails, the Cluster service moves the application's resources to another node of the server cluster and restarts them there. This process is called failover. A cluster administrator can use the Cluster Administrator graphical console to set recovery policies, such as dependencies between applications, whether to restart an application on the same server, and whether to automatically rebalance, or fail back, workloads when a failed server comes back online.

The Cluster service attempts to fail over a group when:

  • The node currently hosting the group becomes inactive for any reason.

  • One of the resources within the group fails, and it is configured to affect the group. For more information on configuring the failure of a resource to affect the group, see View or set resource properties.

  • You force failover. For more information on forcing a resource to fail over, see Initiate a resource failure.

A failover attempt consists of the following steps:

  1. The Cluster service takes all the resources in the group offline in an order determined by the group's dependency hierarchy: dependent resources first, followed by the resources on which they depend. For example, if an application depends on a Physical Disk resource, the Cluster service takes the application offline first, allowing the application to write changes to the disk before the disk is taken offline.

    The Cluster service takes a resource offline by invoking, through a Resource Monitor, the resource DLL that manages the resource. If the resource does not shut down within a specified time limit, the Cluster service forcibly terminates the resource. For more information on setting the time-out value for a resource, see Specify the restart policy for a resource.

  2. When all of the resources are offline, the Cluster service attempts to transfer the group to the node that is listed next on the group's list of preferred host nodes. For more information on how the Cluster service determines which node to fail over or fail back to, see Determining failover and move policies for groups.

  3. If the Cluster service successfully moves the group to another node, it tries to bring all of the group's resources online. Failover is complete when all of the group's resources are online on the new node.
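The three steps above can be sketched in a few lines of Python. This is an illustrative model only, not the Cluster service's implementation. The resource and node names, the depends_on mapping, and the online_order, offline_order, and fail_over helpers are all hypothetical. The sketch also assumes that the "next" preferred node wraps around to the start of the list and that the target node is available.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def online_order(depends_on):
    """Resources in the order they can come online: dependencies first."""
    # depends_on maps each resource to the set of resources it depends on.
    return list(TopologicalSorter(depends_on).static_order())


def offline_order(depends_on):
    """Resources in the order they go offline: dependents first."""
    return list(reversed(online_order(depends_on)))


def fail_over(depends_on, preferred_nodes, current_node):
    """Model one failover attempt and return (target_node, event_log)."""
    log = []
    # Step 1: take every resource offline, dependents before dependencies.
    for resource in offline_order(depends_on):
        log.append(("offline", resource, current_node))
    # Step 2: move the group to the node listed next in the preferred list.
    # Assumption: the list wraps around and the target node is available.
    index = preferred_nodes.index(current_node)
    target = preferred_nodes[(index + 1) % len(preferred_nodes)]
    log.append(("move", current_node, target))
    # Step 3: bring every resource online, dependencies before dependents.
    for resource in online_order(depends_on):
        log.append(("online", resource, target))
    return target, log


# Hypothetical group: an application that needs a disk and a network name.
deps = {
    "App": {"Disk", "NetworkName"},
    "NetworkName": {"IPAddress"},
    "IPAddress": set(),
    "Disk": set(),
}
target, log = fail_over(deps, ["NodeA", "NodeB"], "NodeA")
```

In this model the application is the first resource taken offline and the last one brought online, which matches the Physical Disk example in step 1.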

The Cluster service continues attempting to fail over a group until it succeeds or until a defined number of attempts have been made within a given time span. A group's failover policy specifies the maximum number of failover attempts that can occur in an interval of time. If the Cluster service exceeds this limit, it concludes that the group cannot be brought online anywhere in the cluster and stops trying to fail over the group. For information on how to set group failover policy, see Set group failover policy.
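The limit described above (a maximum number of attempts within an interval of time) can be modeled as a counter over a time window. The following is a minimal sketch, assuming a sliding window. The FailoverPolicy class, its parameter names, and the use of hours as the time unit are illustrative assumptions, and the Cluster service's own accounting may differ.

```python
from collections import deque


class FailoverPolicy:
    """Hypothetical model of a group's failover threshold and period."""

    def __init__(self, threshold, period_hours):
        self.threshold = threshold        # maximum attempts allowed per period
        self.period_hours = period_hours  # length of the period, in hours
        self._attempts = deque()          # timestamps of counted attempts

    def may_attempt(self, now_hours):
        """Record and allow an attempt, or refuse it if the limit is reached."""
        # Forget attempts that have aged out of the window.
        while self._attempts and now_hours - self._attempts[0] >= self.period_hours:
            self._attempts.popleft()
        if len(self._attempts) >= self.threshold:
            return False  # limit reached: stop trying to fail over the group
        self._attempts.append(now_hours)
        return True


# At most 3 attempts in any 6-hour span.
policy = FailoverPolicy(threshold=3, period_hours=6)
results = [policy.may_attempt(t) for t in (0.0, 0.5, 1.0, 1.5, 6.2)]
```

Here the fourth attempt is refused because three attempts already fall inside the window. The fifth is allowed because the first attempt has aged out by then.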

Ways to control failover policy

  • Define how the Cluster service detects and responds to the failure of individual resources in the group. For more information, see Resource failure.

  • Control the order in which the Cluster service takes resources offline by establishing dependency relationships between resources. For more information, see Resource dependencies.

  • Specify time-out, failover threshold, and failover period for resources. The time-out controls how long the Cluster service waits for the resource to shut down. The failover threshold and period control how many times the Cluster service attempts to fail over a resource in a particular period of time. For more information on how to set these values, see Specify the restart policy for a resource.

  • Specify a possible owner list for resources. The possible owner list for a resource controls which cluster nodes are allowed to host the resource. For more information on how to set the possible owner list for a resource, see Specify which nodes can own a resource.
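To illustrate how possible owner lists narrow the choice of failover target, the sketch below computes which nodes could host an entire group. It assumes that a group can move only to an active node that appears in the possible owner list of every resource in the group. The candidate_hosts function and the resource and node names are hypothetical.

```python
def candidate_hosts(possible_owners, active_nodes, preferred_nodes):
    """Return the nodes that could host the whole group, preferred nodes first.

    possible_owners maps each resource name to the nodes allowed to host it.
    """
    allowed = set(active_nodes)
    # Assumption: a node qualifies only if every resource may run on it.
    for owners in possible_owners.values():
        allowed &= set(owners)
    # Keep the group's preferred order, then append any remaining nodes.
    ordered = [node for node in preferred_nodes if node in allowed]
    ordered += sorted(allowed - set(preferred_nodes))
    return ordered


owners = {
    "Disk": ["NodeA", "NodeB", "NodeC"],
    "App": ["NodeA", "NodeC"],  # the application is not installed on NodeB
}
preferred = ["NodeA", "NodeC", "NodeB"]

all_up = candidate_hosts(owners, ["NodeA", "NodeB", "NodeC"], preferred)
node_a_down = candidate_hosts(owners, ["NodeB", "NodeC"], preferred)
```

With NodeA down, only NodeC remains a candidate, because NodeB is not a possible owner of the application resource.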

For more information on the Cluster service, see Cluster service.

Failback

When a node becomes inactive for any reason, the Cluster service fails over any groups hosted by the node. When the node becomes active again, the Cluster service can fail back the groups originally hosted by the node.

The Cluster service fails back a group using the same procedures it performs during failover. That is, the Cluster service takes all of the resources in the group offline, moves the group, and then brings all of the resources in the group online.

You can restrict failback to a specific time period. Setting this period is important because failback takes the group's resources offline while the group moves, which you may not want during hours of peak usage.
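A failback time period can be modeled as a window of hours within the day. The sketch below is one way to test whether a given hour falls inside such a window, including windows that span midnight. The function name, the 0 to 23 hour convention, and the choice to include the start hour but exclude the end hour are assumptions made for illustration.

```python
def in_failback_window(hour, window_start, window_end):
    """Return True if hour (0-23) falls within [window_start, window_end)."""
    if window_start <= window_end:
        # Window within a single day, for example 01:00 to 05:00.
        return window_start <= hour < window_end
    # Window that wraps past midnight, for example 22:00 to 06:00.
    return hour >= window_start or hour < window_end


# Allow failback only overnight, outside peak hours.
overnight = [in_failback_window(h, 22, 6) for h in (23, 3, 6, 12)]
```

A check like this would let failback proceed at 23:00 or 03:00 but defer it at 06:00 or at noon.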

For more information on how to configure a group's failback policy, see Set group failback policy.