Windows Cluster Architecture

 

Microsoft Cluster Server (MSCS) in Microsoft Windows NT Server 4.0 Enterprise Edition was the first server cluster technology offered by Microsoft. The individual servers that compose a cluster are referred to as nodes. The Cluster service is the collection of components on each node that performs cluster-specific tasks. Hardware and software components in the cluster that are managed by the Cluster service are referred to as resources. Server clusters provide the instrumentation mechanism for managing resources through resource DLLs, which define resource abstractions (that is, they decouple a clustered resource from any specific physical node, enabling the resource to move from one node to another), communication interfaces, and management operations.

Resources are elements in a cluster that are:

  • Brought online (in service) and taken offline (out of service)

  • Managed in a server cluster

  • Owned by only one node at a time

A resource group is a collection of resources that the Cluster service manages as a single logical unit. This unit is often referred to as a failover unit, because the entire group moves between nodes as one. Resources and cluster elements are grouped logically according to the needs of the application they support. When a Cluster service operation is performed on a resource group, the operation affects all of the individual resources contained in the group. Typically, a resource group is created to contain all of the individual resources required by a particular clustered program.

Cluster resources may include physical hardware devices, such as disk drives and network cards, and logical items such as IP addresses, network names, and application components.

Clusters also include common resources, such as external data storage arrays and private cluster networks. Common resources are accessible by each node in the cluster. One common resource is the quorum resource, which plays a critical role in cluster operations. The quorum resource must be accessible for all node operations, including forming, joining or modifying a cluster.
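
Nodes, resource groups, and resources are all exposed to management tools through the Cluster API (clusapi.h), which is discussed later in this topic. The following is a minimal illustrative sketch, not taken from the product, that opens the local cluster and lists its resource groups and resources; it assumes a node running the Cluster service and a program linked against ClusAPI.lib, and it trims error handling for brevity.

    /* Minimal sketch: enumerate the groups and resources of the local cluster
       through the Cluster API (clusapi.h). Link with ClusAPI.lib.
       Error handling is trimmed for brevity. */
    #define UNICODE
    #include <windows.h>
    #include <clusapi.h>
    #include <stdio.h>

    int wmain(void)
    {
        HCLUSTER hCluster = OpenCluster(NULL);   /* NULL = the local cluster */
        if (hCluster == NULL) {
            wprintf(L"OpenCluster failed: %lu\n", GetLastError());
            return 1;
        }

        /* Enumerate groups (failover units) and resources in one pass. */
        HCLUSENUM hEnum = ClusterOpenEnum(hCluster,
                                          CLUSTER_ENUM_GROUP | CLUSTER_ENUM_RESOURCE);
        WCHAR name[256];
        DWORD index = 0, type, cch;

        for (;;) {
            cch = ARRAYSIZE(name);
            DWORD status = ClusterEnum(hEnum, index, &type, name, &cch);
            if (status == ERROR_NO_MORE_ITEMS)
                break;
            if (status == ERROR_SUCCESS)
                wprintf(L"%s: %s\n",
                        (type == CLUSTER_ENUM_GROUP) ? L"Group" : L"Resource",
                        name);
            index++;
        }

        ClusterCloseEnum(hEnum);
        CloseCluster(hCluster);
        return 0;
    }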

Server Clusters

Windows Server 2003 Enterprise Edition provides two cluster technologies for use with Exchange Server 2003 Enterprise Edition. The first is the Cluster service, which provides failover support for back-end mailbox servers that require a high level of availability. The second is Network Load Balancing (NLB), which complements server clusters by supporting highly available and scalable clusters of front-end Exchange protocol virtual servers (for example, HTTP, IMAP4, and POP3).

Server clusters use a shared-nothing model. Model types define how servers in a cluster manage and use local and common cluster devices and resources. In the shared-nothing cluster, each server owns and manages its local devices. Devices common to the cluster, such as common disk arrays and connection media, are selectively owned and managed by only one node at a time.

Server clusters use standard Windows drivers to connect to local storage devices and media. Server clusters support multiple connection media for the external common devices, which must be accessible by all servers in the cluster. External storage devices support standard PCI-based SCSI connections, SCSI over Fibre Channel, and SCSI bus with multiple initiators. Fibre Channel connections are SCSI devices that are hosted on a Fibre Channel bus instead of on a SCSI bus.

The following figure illustrates the components of a two-node server cluster composed of servers running Windows Server 2003 Enterprise Edition, with shared storage device connections that use SCSI or SCSI over Fibre Channel.

Sample two-node Windows cluster


Server Cluster Architecture

Server clusters are designed as separate, isolated sets of components that work closely with Windows Server 2003. Installing the Cluster service makes a number of modifications to the operating system, including the following:

  • Support for dynamic creation and deletion of network names and addresses

  • Modifications to the file system, to enable closing open files during disk drive dismounts

  • Modifications to the storage subsystem, to enable sharing disks and volumes among multiple nodes

Apart from these and other minor modifications, a server running the Windows Cluster service runs identically to a server that is not running the Windows Cluster service.

The Cluster service is at the core of server clusters. It is composed of multiple functional units, including Node Manager, Failover Manager, Database Manager, Global Update Manager, Checkpoint Manager, Log Manager, Event Log Replication Manager, and Backup/Restore Manager.

Cluster Service Components

The Cluster service runs on Windows Server 2003 Enterprise Edition, using network drivers, device drivers, and resource instrumentation processes specifically designed for server clusters and their component processes. The Cluster service includes the following components:

  • Checkpoint Manager   This component saves application registry keys in a cluster directory stored on the quorum resource. To make sure that the Cluster service can recover from a resource failure, Checkpoint Manager checks registry keys when a resource is brought online and writes checkpoint data to the quorum resource when a resource is taken offline. Checkpoint Manager also supports resources that have application-specific registry trees, which are instantiated at the cluster node where the resource comes online. A resource can have one or more registry trees associated with it. When the resource is online, Checkpoint Manager monitors changes to these registry trees. If Checkpoint Manager detects changes, it saves the registry tree to a file on the node that owns the resource and then transfers the file to the owner node of the quorum resource. Checkpoint Manager performs batch transfers, so that frequent changes to registry trees do not place too heavy a load on the Cluster service.
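
    The registry trees that Checkpoint Manager tracks are registered against a resource as registry checkpoints. The following is a hedged sketch of how a cluster-aware setup program might register such a tree through the Cluster API; the resource name and registry key are placeholders, and error handling is omitted.

        /* Hedged sketch: register an application registry tree as a checkpoint on a
           resource, so that Checkpoint Manager replicates it through the quorum
           resource. The resource name "ExampleService" and the key are placeholders.
           Link with ClusAPI.lib; error handling omitted. */
        #define UNICODE
        #include <windows.h>
        #include <clusapi.h>

        void AddRegistryCheckpoint(void)
        {
            HCLUSTER  hCluster  = OpenCluster(NULL);
            HRESOURCE hResource = OpenClusterResource(hCluster, L"ExampleService");

            /* Key is relative to HKEY_LOCAL_MACHINE on the node that owns the resource. */
            const WCHAR key[] = L"SOFTWARE\\ExampleVendor\\ExampleService";
            DWORD bytesReturned = 0;

            ClusterResourceControl(hResource,
                                   NULL,                      /* let the owner node handle it */
                                   CLUSCTL_RESOURCE_ADD_REGISTRY_CHECKPOINT,
                                   (LPVOID)key,
                                   sizeof(key),               /* includes the terminating NUL */
                                   NULL, 0, &bytesReturned);

            CloseClusterResource(hResource);
            CloseCluster(hCluster);
        }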

  • Database Manager   Database Manager maintains cluster configuration information about all physical and logical entities in a cluster. These entities include the cluster itself, cluster node membership, resource groups, resource types, and descriptions of specific resources, such as disks and IP addresses.

    Persistent and volatile information stored in the configuration database tracks the current and desired state of a cluster. Each instance of Database Manager running on each node in the cluster cooperates to maintain consistent configuration information across the cluster and to ensure consistency of the configuration database copies on all nodes.

    Database Manager also provides an interface for use by other Cluster components, such as Failover Manager and Node Manager. This interface is similar to the registry interface of Microsoft Win32 APIs. However, the Database Manager interface writes changes made to cluster entities in both the registry and in the quorum resource.

    Database Manager supports transactional updates of the cluster registry hive and only presents interfaces to internal Cluster service components. Failover Manager and Node Manager typically use this transactional support to get replicated transactions. The Cluster API presents all Database Manager functions to clients, with the exception of transactional support functions. For additional information on the Cluster API, see Cluster API on MSDN.

    Note

    The application registry key data and changes are recorded by Checkpoint Manager in quorum log files, in the quorum resource.

  • Event Service   Event Service serves as a switchboard, sending events to and from applications, and to the Cluster service components on each node. The Event Processor component of the Event Service helps Cluster service components to disseminate information about important events to all other components. The Event Processor component supports the Cluster API event mechanism. It also performs miscellaneous services, such as delivering signal events to cluster-aware applications and maintaining cluster objects.
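
    Clients consume the Cluster API event mechanism through a notification port. The following minimal sketch, with a placeholder filter selection and trimmed error handling, registers for node, group, and resource state changes on the local cluster and prints a few of them.

        /* Minimal sketch: receive cluster state-change events through the Cluster API
           notification mechanism. Link with ClusAPI.lib; error handling trimmed. */
        #define UNICODE
        #include <windows.h>
        #include <clusapi.h>
        #include <stdio.h>

        int wmain(void)
        {
            HCLUSTER hCluster = OpenCluster(NULL);      /* local cluster */

            /* INVALID_HANDLE_VALUE creates a new notification port. */
            HCHANGE hChange = CreateClusterNotifyPort(INVALID_HANDLE_VALUE,
                                                      hCluster,
                                                      CLUSTER_CHANGE_NODE_STATE |
                                                      CLUSTER_CHANGE_GROUP_STATE |
                                                      CLUSTER_CHANGE_RESOURCE_STATE,
                                                      0 /* notification key */);

            for (int i = 0; i < 10; i++) {              /* read a handful of events */
                DWORD_PTR key;
                DWORD filter, cch;
                WCHAR name[256];
                cch = ARRAYSIZE(name);
                if (GetClusterNotify(hChange, &key, &filter, name, &cch, 30000) == ERROR_SUCCESS)
                    wprintf(L"Event 0x%08lx on object %s\n", filter, name);
            }

            CloseClusterNotifyPort(hChange);
            CloseCluster(hCluster);
            return 0;
        }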

  • Event Log Replication Manager   The Event Log Replication Manager replicates event log entries from one node to all other nodes in the cluster. By default, the Cluster service interacts with the Windows Event Log service in the cluster to replicate event log entries to all cluster nodes. When the Cluster service starts on a node, it invokes a private API in the local Event Log service and requests that the Event Log service bind to the Cluster service. The Event Log service then binds to the CLUSAPI interface by using local remote procedure call (LRPC). When the Event Log service receives an event to be logged, it logs the event locally, drops it into a persistent batch queue, and schedules a timer thread to run within the next 20 seconds, if a timer thread is not already active. When the timer thread fires, it processes the batch queue and sends the events, as one consolidated buffer, to the Cluster API interface to which the Event Log service previously bound. The Cluster API interface then sends the events to the Cluster service.

    After the Cluster service receives batched events from the Event Log service, it drops the events into a local outgoing queue and returns from the RPC. The event broadcaster thread, in the Cluster service, then processes this queue and sends the events, using the intra-cluster RPC, to all active cluster nodes. The server side API then drops the events into an incoming queue. An event log writer thread then processes this queue and requests, through a private RPC, that the local Event Log service write the events locally.

    The Cluster service uses lightweight remote procedure call (LRPC) to invoke the Event Log service's private RPC interfaces. The Event Log service also uses LRPCs to invoke the Cluster API interface and then request that the Cluster service replicate events.

  • Failover Manager   Failover Manager performs resource management and initiates appropriate actions, such as startup, restart, and failover. Failover Manager stops and starts resources, manages resource dependencies, and initiates failover of resource groups. To perform these actions, Failover Manager receives resource and system state information from Resource Monitors and cluster nodes.

    Failover Manager also decides which nodes in the cluster should own which resource group. When resource group arbitration finishes, nodes that own an individual resource group return control of the resources in the resource group to Node Manager. If a node cannot handle a failure of one of its resource groups, Failover Managers on each node work together to reassign ownership of the resource group.

    If a resource fails, Failover Manager restarts the resource or takes the resource offline, together with its dependent resources. If Failover Manager takes the resource offline, it indicates that ownership of the resource should be moved to another node, where the resource is then restarted under the ownership of the new node. This is referred to as failover, as explained in the section "Cluster Failover" later in this topic.
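
    The same failover machinery is exercised when an administrator moves a group deliberately. The following illustrative sketch shows how a management tool might initiate such a move through the Cluster API; the group and node names are placeholders.

        /* Hedged sketch: manually initiated failover of a resource group through the
           Cluster API. The group and node names are placeholders. Link with ClusAPI.lib. */
        #define UNICODE
        #include <windows.h>
        #include <clusapi.h>

        DWORD MoveGroupToNode(LPCWSTR groupName, LPCWSTR nodeName)
        {
            HCLUSTER hCluster = OpenCluster(NULL);
            HGROUP   hGroup   = OpenClusterGroup(hCluster, groupName);

            /* A NULL destination lets Failover Manager pick the best available node. */
            HNODE hNode = (nodeName != NULL) ? OpenClusterNode(hCluster, nodeName) : NULL;

            DWORD status = MoveClusterGroup(hGroup, hNode);   /* may return ERROR_IO_PENDING */

            if (hNode) CloseClusterNode(hNode);
            CloseClusterGroup(hGroup);
            CloseCluster(hCluster);
            return status;
        }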

  • Global Update Manager   Global Update Manager provides the global update service that is used by cluster components. Global Update Manager is used by internal cluster components, such as Failover Manager, Node Manager, and Database Manager, to replicate changes to the cluster database across nodes. Global Update Manager updates are typically initiated as a result of a Cluster API call. When a Global Update Manager update is initiated at a client node, it first requests a locker node to obtain a global lock. If the lock is not available, the client waits for one to become available.

    When the lock is available, the locker grants the lock to the client, and issues the update locally (on the locker node). The client then issues the update to all other healthy nodes, including itself. If an update succeeds on the locker, but fails on some other node, that node will be removed from the current cluster membership. If the update fails on the locker node itself, the locker merely returns the failure to the client.

  • Log Manager   Log Manager writes changes to recovery logs that are stored on the quorum resource. Log Manager, together with Checkpoint Manager, ensures that the recovery log on the quorum resource contains the most recent configuration data and change checkpoints. If one or more cluster nodes are down, configuration changes can still be made to the remaining nodes. While these nodes are down, Database Manager uses Log Manager to log configuration changes to the quorum resource.

    When failed nodes return to service, they read the location of the quorum resource from their local cluster registry hives. Because the hive data could be stale, mechanisms are in place to detect invalid quorum resources read from a stale cluster configuration database. Database Manager then requests that Log Manager update the local copy of the cluster hive, using the checkpoint file in the quorum resource. The log file on the quorum disk is then replayed, starting from the checkpoint log sequence number. The result is a completely updated cluster hive. Cluster hive snapshots are taken whenever the quorum log is reset and once every four hours.

  • Membership Manager   Membership Manager monitors cluster membership and the health of all nodes in the cluster. Membership Manager (also referred to as the Regroup Engine) maintains a consistent view of which cluster nodes are currently up or down. The core of the Membership Manager component is a regroup algorithm that is invoked whenever there is evidence that one or more nodes failed. At the completion of the algorithm, all participating nodes reach identical conclusions on the new cluster membership.

  • Node Manager   Node Manager assigns resource group ownership to nodes, based on group preference lists and node availability. Node Manager runs on each node and maintains a local list of nodes that belong to the cluster. Periodically, Node Manager sends messages, named heartbeats, to its counterparts running on other nodes in the cluster to detect node failures. All nodes in the cluster must have exactly the same view of cluster membership.

    If a cluster node detects a communication failure with another cluster node, it transmits a multicast message to the entire cluster. This regroup event causes all members to verify their view of the current cluster membership. During the regroup event, the Cluster service prevents write operations to any disk devices common to all nodes in the cluster, until the membership stabilizes. If an instance of Node Manager on an individual node does not respond, the node is removed from the cluster, and its active resource groups are moved to another active node. To make this change, Node Manager identifies possible owners (nodes) that may own individual resources and the node on which a resource group prefers to run. Node Manager then selects the node and moves the resource group. In a two-node cluster, Node Manager simply moves resource groups from a failed node to the remaining node. In a cluster comprised of three or more nodes, Node Manager selectively distributes resource groups among the remaining nodes.

    Node Manager also acts as a gatekeeper, allowing joiner nodes into the cluster and processing requests to add or evict a node.

  • Resource Monitor   Resource Monitor verifies the health of each cluster resource by using callbacks to resource DLLs. Resource Monitors run in a separate process and communicate with the Cluster service through remote procedure calls (RPCs). This protects the Cluster service from failures of individual cluster resources.

    Resource Monitors provide the communication interface between resource DLLs and the Cluster service. When the Cluster service must obtain data from a resource, Resource Monitor receives the request and forwards it to the appropriate resource DLL. Conversely, when a resource DLL must report its status or notify the Cluster service of an event, Resource Monitor forwards the information from the resource to the Cluster service.

    The Resource Monitor process (RESRCMON.EXE) is a child process of the Cluster service process (CLUSSVC.EXE). Resource Monitor loads the resource DLLs that monitor cluster resources into its process space. Loading the resource DLLs into a process separate from the Cluster service process helps to isolate faults. Multiple Resource Monitors can be instantiated at the same time.

    Each Resource Monitor functions as an LRPC server for the Cluster service process. When the Cluster service receives a Cluster API call that requires talking to a resource DLL, it uses the LRPC interface to invoke the Resource Monitor RPC. To receive responses from Resource Monitor, the Cluster service creates one notification thread per Resource Monitor process. This notification thread invokes an RPC that is located permanently in Resource Monitor. The thread acquires notifications when they are generated. The thread is released only when Resource Monitor fails or when the thread is manually stopped by a shutdown command from the Cluster service.

    Resource Monitor does not maintain a persistent state on its own. It retains a limited, in-memory state of the resources, but all of its initial state information is supplied by the Cluster service. Resource Monitor communicates with the resource DLLs through well-defined entry points that the DLLs must present. Resource Monitor completes the following operations on its own:

    • It polls resource DLLs through the LooksAlive and IsAlive entry points, or, alternatively, checks for failure events signaled by the resource DLLs.

    • It spawns timer threads to monitor pending timeouts of resource DLLs that return ERROR_IO_PENDING from their Online or Offline entry points.

    • It detects crashes of the Cluster service and shuts down the resources.

    Its other actions occur as a result of operations requested by the Cluster service through the RPC interface. The Cluster service does not perform hang detection on Resource Monitor. It does, however, monitor for crashes, and it restarts Resource Monitor if it detects a process crash.

    The Cluster service and Resource Monitor process share a memory-mapped section backed by the paging file. The handle to the section is passed to Resource Monitor at Resource Monitor startup. Resource Monitor then duplicates the handle and records the entry point number and resource name into this section immediately before calling a resource DLL entry point. If Resource Monitor crashes, the Cluster service reads the shared section to detect the resource and the entry point that caused the crash.
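
    The entry points that Resource Monitor calls are defined by the Resource API (resapi.h). The following schematic sketch shows only the two health-check entry points for a hypothetical resource type; a complete resource DLL also implements Startup, Open, Online, Offline, Terminate, and Close, and hands these functions to Resource Monitor in a function table.

        /* Schematic sketch of the health-check entry points a resource DLL exposes to
           Resource Monitor (resapi.h). A complete resource DLL also implements Startup,
           Open, Online, Offline, Terminate, and Close, and returns these functions in a
           CLRES_FUNCTION_TABLE. The check itself is a placeholder. */
        #include <windows.h>
        #include <clusapi.h>
        #include <resapi.h>

        static BOOL CheckApplicationHealth(RESID resourceId, BOOL thorough)
        {
            /* Placeholder: a real DLL would ping its service, probe a port, and so on. */
            UNREFERENCED_PARAMETER(resourceId);
            UNREFERENCED_PARAMETER(thorough);
            return TRUE;
        }

        /* Cursory check; called frequently by Resource Monitor. */
        BOOL WINAPI ExampleLooksAlive(RESID resourceId)
        {
            return CheckApplicationHealth(resourceId, FALSE);
        }

        /* Definitive check; called less often and allowed to take longer. */
        BOOL WINAPI ExampleIsAlive(RESID resourceId)
        {
            return CheckApplicationHealth(resourceId, TRUE);
        }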

  • Backup/Restore Manager   Backup/Restore Manager works with Failover Manager and Database Manager to back up or restore the quorum log file and all checkpoint files. The Cluster service uses the BackupClusterDatabase API for database backup. First, the BackupClusterDatabase API contacts the Failover Manager layer. The Failover Manager layer forwards the request to the node that currently owns the quorum resource. That node then invokes Database Manager, which makes a backup of the quorum log file and all checkpoint files.

    The Cluster service also registers itself at startup as a backup writer with Volume Shadow Copy service. When a backup client invokes the Volume Shadow Copy service to perform a system state backup, it also invokes the Cluster service, through a series of entry point calls, to perform the cluster database backup. The server code in the Cluster service invokes the Failover Manager to perform the backup, and the rest of the operation occurs via the BackupClusterDatabase API.

    The Cluster service uses the RestoreClusterDatabase API to restore the cluster database from a backup path. This API can only be invoked locally from one of the cluster nodes. When the RestoreClusterDatabase API is invoked, it stops the Cluster service, restores the cluster database from the backup, sets a registry value that contains the backup path, and then restarts the Cluster service. On startup, the Cluster service detects that a restore is requested and restores the cluster database from the backup path to the quorum resource.
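
    Because BackupClusterDatabase is a public Cluster API call, a management tool can trigger the same backup directly. The following minimal sketch uses a placeholder backup path, which must already exist and be reachable from the node that owns the quorum resource; error handling is trimmed.

        /* Minimal sketch: back up the cluster database (quorum log and checkpoint files)
           to a directory. The path is a placeholder. Link with ClusAPI.lib. */
        #define UNICODE
        #include <windows.h>
        #include <clusapi.h>
        #include <stdio.h>

        int wmain(void)
        {
            HCLUSTER hCluster = OpenCluster(NULL);
            DWORD status = BackupClusterDatabase(hCluster, L"\\\\BackupServer\\ClusterBackup");
            wprintf(L"BackupClusterDatabase returned %lu\n", status);
            CloseCluster(hCluster);
            return (status == ERROR_SUCCESS) ? 0 : 1;
        }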

Cluster Failover

Failover can occur automatically because of an unplanned hardware or software failure, or it can be initiated manually by an administrator. The algorithm and behavior in both situations are almost identical. However, in a manually initiated failover, resources are shut down in an orderly way, whereas in an unplanned failover, resources are shut down suddenly and disruptively (for example, the power goes out or a crucial hardware component fails).

When an entire node in a cluster fails, its resource groups transfer to one or more available nodes in the cluster. Automatic failover is similar to planned administrative reassignment of resource ownership. However, it is more complicated, because the orderly steps of a planned shutdown might be interrupted or might not have occurred at all. Therefore, extra steps are required to evaluate the state of the cluster at the time of failure.

When automatic failover occurs, the cluster must determine which groups were running on the failed node and which nodes should take ownership of the various resource groups. All nodes in the cluster that are capable of hosting the resource groups negotiate for ownership. This negotiation is based on node capabilities, current load, application feedback, the node preference list, or the use of the AntiAffinityClassNames property, which is discussed in the Cluster-Specific Configurations section. When negotiation of the resource group is complete, all nodes in the cluster update their databases to track which node owns each resource group.

In clusters with more than two nodes, the node preference list for each resource group can specify a preferred server, plus one or more prioritized alternatives. This enables cascading failover, in which a resource group can survive multiple server failures, each time cascading, or failing over to the next server on its node preference list.

An alternative to cascading failover is commonly called N+I failover. This method sets the node preference lists of all cluster groups to identify the standby cluster nodes to which resources are moved at the first failover. The standby nodes are servers in the cluster that are mostly idle or that have workloads that can easily be preempted if a failed server's workload must be moved to the standby node.

Cascading failover assumes that every other server in the cluster has some excess capacity and can absorb a portion of any other failed server's workload. N+I failover assumes that the +I standby servers are the primary recipients of excess capacity.
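
Both cascading and N+I failover are driven by the node preference list of each resource group. The following hedged sketch sets a preferred-owners list for a group through the Cluster API; the group and node names are placeholders, and error handling is omitted.

    /* Hedged sketch: define a node preference (preferred owners) list for a resource
       group through the Cluster API. Group and node names are placeholders.
       Link with ClusAPI.lib; error handling omitted. */
    #define UNICODE
    #include <windows.h>
    #include <clusapi.h>

    void SetPreferredOwners(void)
    {
        HCLUSTER hCluster = OpenCluster(NULL);
        HGROUP   hGroup   = OpenClusterGroup(hCluster, L"ExampleGroup");

        /* Order matters: the first node is the preferred owner, and the rest are
           prioritized alternatives used for cascading failover. */
        HNODE nodes[2];
        nodes[0] = OpenClusterNode(hCluster, L"NODE1");
        nodes[1] = OpenClusterNode(hCluster, L"NODE2");

        SetClusterGroupNodeList(hGroup, 2, nodes);

        CloseClusterNode(nodes[0]);
        CloseClusterNode(nodes[1]);
        CloseClusterGroup(hGroup);
        CloseCluster(hCluster);
    }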

Cluster Failback

When a node comes back online, Failover Manager can decide to move one or more resource groups back to the recovered node. This is referred to as failback. For a resource group to fail back to a recovered or restarted node, the properties of the resource group must define a preferred owner. Resource groups for which the recovered or restarted node is the preferred owner are moved from the current owner to the recovered or restarted node.

Failback properties of a resource group can include the hours of the day during which failback is allowed and a limit on the number of times failback is attempted. This enables the Cluster service to prevent failback of resource groups during peak processing times or to nodes that have not been correctly recovered or restarted.

Cluster Quorum

Each cluster has a special resource referred to as the quorum resource. A quorum resource can be any resource that does the following:

  • Provides a means for arbitration leading to membership and cluster state decisions

  • Provides physical storage to hold configuration information

A quorum log is a configuration database for the entire server cluster. The quorum log contains cluster configuration information, such as the servers that are part of the cluster, the resources that are installed in the cluster, and the state of those resources (for example, online or offline).

The quorum is important in a cluster for the following two reasons:

  • Consistency   A cluster is made up of multiple physical servers acting as a single virtual server. It is critical that each of the physical servers has a consistent view of the cluster configuration. The quorum acts as the definitive repository for all configuration information relating to the cluster. If the Cluster service is unable to access and read the quorum, it cannot start.

  • Tie-breaking   The quorum is used as a tie-breaker to avoid split-cluster scenarios. A split-cluster scenario occurs when all network communication links between two or more cluster nodes fail. If this occurs, the cluster might be split into two or more partitions that cannot communicate with each other. The quorum ensures that cluster resources are brought online on one node only. It does this by allowing the partition that owns the quorum to continue, while the other partitions are evicted from the cluster.

Standard Quorum

As mentioned earlier in this section, the quorum is a configuration database for the Cluster service that is stored in the quorum log file. A standard quorum uses a quorum log file, located on a disk hosted in the shared storage array, which is accessible by all members of the cluster.

Each member connects to the shared storage using SCSI or Fibre Channel. Storage is made up of external hard disks (usually configured as RAID disks) or a SAN, in which logical slices of the SAN are presented as physical disks.

Note

It is important that the quorum uses a physical disk resource, rather than a disk partition, because the entire physical disk resource is moved during failover. Furthermore, it is possible to configure server clusters to use the local hard disk on a server to store the quorum. This type of implementation, referred to as a lone wolf cluster, is supported only for testing and development purposes. Lone wolf clusters should not be used to cluster Exchange 2003 in a production environment because, with only one node, they cannot provide failover.

Majority Node Set Quorums

From a server cluster perspective, a majority node set (MNS) quorum is a single quorum resource. The data is stored by default on the local disk of each node in the cluster. The MNS resource makes sure that the cluster configuration data stored on the MNS resource is consistent across the different disks. The MNS implementation provided in Windows Server 2003 uses a directory on each node's local disk to store the quorum data. If the configuration of the cluster changes, that change is reflected on each node's local disk. The change is considered committed, or made persistent, only if it is written to a majority of the nodes, that is, to at least (number of nodes / 2) + 1 of them (using integer division). For example, in a four-node cluster, a change must be written to at least three nodes.

The MNS quorum makes sure that most nodes have an up-to-date copy of the data. The Cluster service starts up and brings resources online only if a majority of the nodes that are configured as part of the cluster are up and are running the Cluster service. If the MNS quorum determines that a majority does not exist, the cluster is considered not to have quorum, and the Cluster service waits in a restart loop until more nodes try to join. When a majority or quorum of nodes is available, the Cluster service starts and brings the resources online. Because the up-to-date configuration is written to a majority of the nodes, regardless of node failures, the cluster always guarantees that it has the most current configuration at startup.
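
The majority rule also determines how many node failures an MNS cluster can survive. The following illustrative calculation (not product code) applies the (number of nodes / 2) + 1 formula to small clusters.

    /* Illustrative arithmetic only: the MNS majority rule and the number of node
       failures a majority node set cluster can tolerate. */
    #include <stdio.h>

    int main(void)
    {
        for (int nodes = 2; nodes <= 8; nodes++) {
            int majority  = nodes / 2 + 1;        /* nodes required for quorum */
            int tolerated = nodes - majority;     /* node failures the cluster survives */
            printf("%d nodes: majority = %d, tolerated failures = %d\n",
                   nodes, majority, tolerated);
        }
        return 0;
    }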

If a cluster failure occurs, or if the cluster somehow enters a split-cluster scenario, all partitions that do not contain a majority of nodes are taken offline. This ensures that if there is a partition running that contains a majority of the nodes, it can safely start any resources that are not running on that partition, because it is the only partition in the cluster that is running resources.

Because of the differences in the way the shared disk quorum clusters behave compared to MNS quorum clusters, you must consider carefully when deciding which model to use. For example, if you have only two nodes in your cluster, the MNS model is not recommended. In this instance, failure of one node leads to failure of the entire cluster, because a majority of nodes is impossible.

Majority node set (MNS) quorums are available in Windows Server 2003 Enterprise Edition and Windows Server 2003 Datacenter Edition clusters. The only benefit that MNS clusters provide for Exchange clusters is to eliminate the need for a dedicated disk in the shared storage array on which to store the quorum resource.

Cluster Resources

The Cluster service manages all resource objects using Resource Monitors and resource DLLs. The Resource Monitor interface provides a standard communication interface that enables the Cluster service to initiate resource management commands and obtain resource status data. The Resource Monitor obtains actual command functions and data through resource DLLs. The Cluster service uses resource DLLs to bring resources online, manage their interaction with other resources in the cluster, and monitor their health.

To enable resource management, a resource DLL uses a few simple resource interfaces and properties. Resource Monitor loads a particular resource DLL in its address space, as privileged code running under the SYSTEM account. The SYSTEM account (that is, LocalSystem), is a security principal account that represents the operating system. The Cluster service, which runs under a user security context, uses the SYSTEM account to perform security functions within the operating system.

When resources depend on the availability of other resources to function, these dependencies can be defined by the resource DLL. When a resource is dependent on other resources, the Cluster service brings the dependent resource online only after it brings the resources on which it depends online in the correct sequence.

Resources are taken offline in a similar manner. The Cluster service takes resources offline only after any dependent resources are taken offline. This prevents introducing circular dependencies when loading resources.
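
Dependencies are also exposed through the Cluster API. The following hedged sketch creates a resource and declares a dependency on an existing disk resource, so that the Cluster service brings the disk online first; all names are placeholders, and error handling is omitted.

    /* Hedged sketch: create a resource and declare a dependency on an existing disk
       resource, so the Cluster service brings the disk online first. Names are
       placeholders; "Generic Application" is a built-in resource type.
       Link with ClusAPI.lib; error handling omitted. */
    #define UNICODE
    #include <windows.h>
    #include <clusapi.h>

    void CreateDependentResource(void)
    {
        HCLUSTER hCluster = OpenCluster(NULL);
        HGROUP   hGroup   = OpenClusterGroup(hCluster, L"ExampleGroup");

        HRESOURCE hApp  = CreateClusterResource(hGroup, L"Example App",
                                                L"Generic Application", 0);
        HRESOURCE hDisk = OpenClusterResource(hCluster, L"Disk E:");

        /* "Example App" now depends on "Disk E:"; it is brought online only after
           the disk, and taken offline before it. */
        AddClusterResourceDependency(hApp, hDisk);

        CloseClusterResource(hDisk);
        CloseClusterResource(hApp);
        CloseClusterGroup(hGroup);
        CloseCluster(hCluster);
    }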

Each resource DLL can also define the type of computer and device connection required by the resource. For example, a disk resource may require ownership only by a node that is physically connected to the disk device. Local restart policies and desired actions during failover events can also be defined in the resource DLL.

Cluster Administration

Clusters are managed using Cluster Administrator, a graphical administrative tool, or the Cluster.exe command-line tool, both of which can be used to perform maintenance, monitoring, and failover administration. Server clusters also provide an automation interface that can be used to create custom scripting tools for administering cluster resources, nodes, and the cluster itself. Applications and administration tools, such as Cluster Administrator, access this interface using RPCs, whether the tool is running on a node in the cluster or on an external computer.

Cluster Formation and Operation

When the Cluster service is installed and running on a server, the server can participate in a cluster. Cluster operations reduce single points of failure and enable high availability of clustered resources. The following sections briefly describe node behavior during cluster creation and operation.

Creating a Cluster

Server clusters include a cluster installation utility that is used to install the cluster software on a server and create a new cluster. To create a new cluster, the utility is run on the computer selected as the first member of the cluster. This first step defines the new cluster by establishing a cluster name, and creating the cluster database and initial cluster membership list.

The next step in creating a cluster is to add the common data storage devices that will be available to all members of the cluster. This establishes the new cluster with a single node and its own local data storage devices and cluster common resources (generally disk or data storage and connection media resources).

The final step in creating a cluster is to run the installation utility on each additional computer that will be a member of the cluster. As each new node is added to the cluster, it automatically receives a copy of the existing cluster database from the original member of the cluster. When a node joins or forms a cluster, the Cluster service updates the node's private copy of the configuration database.

Forming a Cluster

A server can form a cluster if it is running the Cluster service and cannot locate other nodes in the cluster. To form the cluster, a node must be able to acquire exclusive ownership of the quorum resource.

When a cluster is formed, the first node in the cluster contains the cluster configuration database. As each additional node joins the cluster, it receives and maintains its own local copy of the cluster configuration database. The quorum resource stores the most current version of the configuration database as recovery logs. The logs contain node-independent cluster configuration and state data.

During cluster operations, the Cluster service uses the quorum recovery logs to do the following:

  • Guarantee that only one set of active nodes is allowed to form a cluster

  • Enable a node to form a cluster only if it can gain control of the quorum resource

  • Allow a node to join or remain in an existing cluster only if it can communicate with the node that controls the quorum resource

When a cluster is formed, each node in the cluster can be in one of three distinct states. These states are recorded by Event Processor (described earlier in this topic) and replicated by Event Log Replication Manager to the other nodes in the cluster. The three Cluster service states are as follows:

  • Offline   The node is not an active member of the cluster. The node and its Cluster service might or might not be running.

  • Online   The node is an active member of the cluster. It adheres to cluster database updates, contributes input into the quorum algorithm, maintains cluster network and storage heartbeats, and can own and run resource groups.

  • Paused   The node is an active member of the cluster. The node adheres to cluster database updates, contributes input into the quorum algorithm, and maintains network and storage heartbeats, but it cannot accept resource groups. It can support only those resource groups that it currently owns. The paused state enables maintenance to be performed. Online and paused states are treated as equivalent states by the majority of the server cluster components.
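
These states are visible to management tools through the Cluster API. The following minimal sketch, with error handling trimmed, enumerates the nodes of the local cluster and reports whether each is up, down, paused, or joining.

    /* Minimal sketch: enumerate cluster nodes and report whether each is up, down,
       paused, or joining. Link with ClusAPI.lib; error handling trimmed. */
    #define UNICODE
    #include <windows.h>
    #include <clusapi.h>
    #include <stdio.h>

    static LPCWSTR StateName(CLUSTER_NODE_STATE state)
    {
        switch (state) {
            case ClusterNodeUp:      return L"Online";
            case ClusterNodeDown:    return L"Offline";
            case ClusterNodePaused:  return L"Paused";
            case ClusterNodeJoining: return L"Joining";
            default:                 return L"Unknown";
        }
    }

    int wmain(void)
    {
        HCLUSTER  hCluster = OpenCluster(NULL);
        HCLUSENUM hEnum    = ClusterOpenEnum(hCluster, CLUSTER_ENUM_NODE);

        WCHAR name[256];
        DWORD index = 0, type, cch;

        while (cch = ARRAYSIZE(name),
               ClusterEnum(hEnum, index++, &type, name, &cch) == ERROR_SUCCESS) {
            HNODE hNode = OpenClusterNode(hCluster, name);
            wprintf(L"%s: %s\n", name, StateName(GetClusterNodeState(hNode)));
            CloseClusterNode(hNode);
        }

        ClusterCloseEnum(hEnum);
        CloseCluster(hCluster);
        return 0;
    }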

Joining a Cluster

To join an existing cluster, a server must be running the Cluster service, and it must successfully locate another node in the cluster. After finding another node in the cluster, the joining server must be authenticated for membership in the cluster and must receive a replicated copy of the cluster configuration database.

The process of joining an existing cluster begins when the Windows Service Control Manager starts the Cluster service on the node. During the startup process, the Cluster service configures and mounts the node's local data devices. It does not attempt to bring the common cluster data devices online, because the existing cluster might be using the devices.

To locate other nodes, a discovery process is started. When the node discovers any member of the cluster, it performs an authentication sequence. The first cluster member authenticates the new node and returns a status of success if the new node is successfully authenticated. If authentication is not successful, as when a joining node is not recognized as a cluster member or has an invalid account password, the request to join the cluster is denied.

After successful authentication, the first node online in the cluster checks the copy of the configuration database of the joining node. If it is out-of-date, the cluster node sends the joining server an updated copy of the database. After receiving the replicated database, the node joining the cluster can use it to find shared resources and bring them online as needed.

Leaving a Cluster

A node can leave a cluster when it shuts down or when the Cluster service is stopped. However, a node can also be evicted from a cluster when the node fails to perform cluster operations (such as failure to commit an update to the cluster configuration database).

When a node leaves a cluster, as in a planned shutdown, it sends a ClusterExit message to all other members of the cluster, notifying them that it is leaving. The node does not wait for any responses and immediately proceeds to shut down resources and close all cluster connections. Because the remaining nodes receive this exit message, they do not perform the regroup process to reestablish cluster membership that occurs when a node unexpectedly fails or network communications stop.

Failure Detection

Failure detection and prevention are key benefits provided by server clusters. When a node or application in a cluster fails, server clusters can respond by restarting the failed application or distributing the work from the failed system to remaining nodes in the cluster. Server cluster failure detection and prevention include bi-directional failover, application failover, parallel recovery, and automatic failback.

When the Cluster service detects failures of individual resources or an entire node, it dynamically moves and restarts application, data, and file resources on an available, healthy server in the cluster. This allows resources such as databases, file shares, and applications to remain highly available to users and to client applications.

Server clusters are designed with two different failure detection mechanisms:

  • Heartbeats for detecting node failures   Periodically, each node exchanges User Datagram Protocol (UDP) messages with the other nodes in the cluster over the private cluster network. These messages are referred to as the heartbeat. The heartbeat exchange enables each node to check the availability of other nodes and their resources. If a server fails to respond to a heartbeat exchange, the surviving servers initiate failover processes, including ownership arbitration for resources and applications owned by the failed server. Arbitration is performed using a challenge and defense protocol. The node that appears to have failed is given a time window to demonstrate, in any one of several ways, that it is still running correctly and can communicate with the surviving nodes. If the node is unable to respond, it is removed from the cluster.

    Failure to respond to a heartbeat message is caused by several events, such as computer failure, network interface failure, network failure, or even periods of unusually high activity. Typically, when all nodes are communicating, the Configuration Database Manager sends global configuration database updates to each node. When a heartbeat exchange failure occurs, Log Manager saves configuration database changes to the quorum resource. This ensures that remaining nodes can access the most recent cluster configuration and local node registry data during the recovery processes.

    The failure detection algorithm is very conservative. If the cause of the heartbeat response failure is temporary, it is best to avoid the potential disruption a failover might cause. However, there is no way to know whether the node will respond in another millisecond, or if it suffered a catastrophic failure. Therefore, a failover is initiated after a timeout period.

  • Resource Monitor and resource DLLs for detecting resource failures   Failover Manager and Resource Monitor work together to detect and recover from resource failures. Resource Monitors keep track of resource status by using the resource DLLs to periodically poll resources. Polling involves two steps: a cursory LooksAlive query and a longer, more definitive IsAlive query. When Resource Monitor detects a resource failure, it notifies Failover Manager and continues to monitor the resource.

    Failover Manager maintains resources and resource group status. It also performs recovery when a resource fails and invokes Resource Monitors in response to user actions or failures.

    After a resource failure is detected, Failover Manager performs recovery actions that include restarting a resource and its dependent resources, or moving the entire resource group to another node. The recovery action that is taken is determined by resource and resource group properties, in addition to node availability.

    During failover, the resource group is treated as the unit of failover. This ensures that resource dependencies are correctly recovered. When a resource recovers from a failure, Resource Monitor notifies Failover Manager. Failover Manager then performs automatic failback of the resource group, based on the configuration of the resource group failback properties.