MS Cluster Server Troubleshooting and Maintenance
Published: May 6, 1999
By Martin Lucas, Microsoft Premier Enterprise Support
On This Page
Abstract
Introduction
Chapter 1: Preinstallation
Chapter 2: Installation Problems
Chapter 3: Post-Installation Problems
Chapter 4: Administrative Issues
Chapter 5: Troubleshooting the Shared SCSI Bus
Chapter 6: Client Connectivity Problems
Chapter 7: Maintenance
Appendix A: MSCS Event messages
Appendix B: Using and Reading the Cluster Logfile
Appendix C: Command-Line Administration
For More Information
Abstract
This white paper details troubleshooting and maintenance techniques for Microsoft® Cluster Server version 1.0. Because cluster configurations vary, this document discusses techniques in general terms. Many of these techniques can be applied to different configurations and conditions.
Introduction
This white paper discusses troubleshooting and maintenance techniques for the first implementation of Microsoft® Cluster Server (MSCS) version 1.0. The initial phase of the product supports a maximum of two servers in a cluster, which are often referred to as nodes. Since there are so many different types of resources that may be managed within a cluster, it may be difficult at times for an administrator to determine what component or resource may be causing failures. In many cases, MSCS can automatically detect and recover from server or application failures. However, in some cases, it may be necessary to troubleshoot attached resources or applications.
Clustering and Microsoft Cluster Server (MSCS)
The term clustering has been used for many years within the computing industry. Clustering is a familiar subject to many users, but it can seem complicated because earlier implementations were large, complex, and sometimes difficult to configure. Earlier clusters were also a challenge to maintain without extensive training and an experienced administrator.
Microsoft has extended the capabilities of the Microsoft® Windows NT® Server operating system through the Enterprise Edition. Microsoft® Windows NT® Server, Enterprise Edition, contains Microsoft Cluster Server (MSCS). MSCS adds clustering capabilities to Windows NT, to achieve high availability, easier manageability, and greater scalability.
Chapter 1: Preinstallation
MSCS Hardware Compatibility List (HCL)
The installation process emphasizes the importance of using certified hardware for clusters. MSCS uses industry-standard hardware, which allows hardware to be easily added or replaced as needed. Supported configurations use only hardware validated using the MSCS Cluster Hardware Compatibility Test (HCT). These tests go above and beyond the standard compatibility testing for Microsoft Windows NT, and are quite intensive. Microsoft supports MSCS only when MSCS is used on a validated cluster configuration. Validation is available only for complete configurations as tested together. The MSCS HCL is available on the Microsoft Web site at: https://support.microsoft.com/kb/131900.
Configuring the Hardware
The MSCS installation process relies heavily on properly configured hardware. Therefore, it is important that you configure and test each device before you run the MSCS installation program. A typical cluster configuration consists of two servers, each with two network adapters and local storage, and one or more shared SCSI buses with one or more disks. While it is possible to configure a cluster using only one network adapter in each server, you are strongly encouraged to have a second, isolated network for cluster communications. For clusters to be certified, they must have at least one isolated network for cluster communications. The cluster may also be configured to use the primary, non-isolated network for cluster communications if the isolated network fails. The cluster nodes must communicate with each other on a time-critical basis; this communication between nodes is sometimes referred to as the heartbeat. Because it is important that the heartbeat packets be sent and received in a timely manner, only PCI-based network adapters should be used, because the PCI bus has the highest priority.
Figure 1:
The shared SCSI bus consists of a compatible PCI SCSI adapter in each server, with both systems connected to the same SCSI bus. One SCSI host adapter uses the default ID 7, and the other uses ID 6. This ensures that the host adapters have the highest priority on the SCSI bus. The bus is referred to as the shared SCSI bus because both systems connect to it and take turns holding exclusive access to one or more disk devices on the bus. MSCS controls exclusive access to a device through the reserve and release commands in the SCSI specification.
Other storage subsystems may be available from system vendors as an alternative to SCSI, which, in some cases, may offer additional speed or flexibility. Some of these storage types may require installation procedures other than those specified in the Microsoft Cluster Server Administrator's Guide. These storage types may also require special drivers or resource DLLs as provided by the manufacturer. If the manufacturer provides installation procedures for Microsoft Cluster Server, use those procedures instead of the generic installation directions provided in the Administrator's Guide.
Installing the Operating System
Before you install Microsoft Windows NT Server, Enterprise Edition, you must decide what role each computer will have in the domain. As the Administrator's Guide indicates, you may install MSCS as a member server or as a domain controller. The following information focuses on performance issues with each configuration:
The member server role for each cluster node is a viable solution, but it has a few drawbacks. While member servers do not incur the overhead of performing authentication for other systems within the domain, this configuration is vulnerable to loss of communication with domain controllers on the network. Node-to-node communications and various registry operations within the cluster require domain authentication, which may be needed at any time during normal operations. Member servers rely on domain controllers elsewhere on the network for this authentication. Lack of connectivity with a domain controller may severely affect performance, and may also cause one or more cluster nodes to stop responding until a connection with a domain controller has been re-established. In a worst-case scenario, loss of network connectivity with domain controllers may cause complete failure of the cluster.
The primary domain controller to backup domain controller (PDC to BDC) configuration is a better alternative than the member server option, because it removes the need for the cluster node to be authenticated by an external source. If an activity requires authentication, either of the nodes can supply it. Thus, authentication is not a failure point as it is in the member server configuration. However, primary domain controllers may require special configuration in a multihomed environment. Additionally, the domain overhead may not be well distributed in this model because one node may have more domain activity than the other one.
The BDC to BDC configuration is the most favorable configuration, because it provides authentication, regardless of public network status, and, the overhead associated with domain activities is balanced between the nodes. Additionally, BDCs are easier to configure in a multihomed environment.
Configuring Network Adapters
In a typical MSCS installation, each server in the cluster (referred to as a node) has at least two network adapters: one adapter configured as the public network for client connections, and the other for private communications between cluster nodes. This second interface is called the cluster interconnect. If the cluster interconnect fails, MSCS (if so configured) automatically attempts to use the public network for communication between cluster nodes. In many two-node installations, the private network uses a crossover cable or an isolated segment. It is important to restrict traffic on this interface to cluster communications only. Additionally, each server should use PCI network adapters. ISA, PCMCIA, or other bus architecture network adapters may compete with the faster PCI devices in the system for the CPU's attention, and the delays they introduce may cause premature failover of cluster resources. Complete systems will likely not have these types of adapters; keep this in mind if you decide to add adapters to the configuration.
Follow standard Windows NT configuration guidelines for network adapter configuration. For example, each network adapter must have an IP address that is on a different network or subnet. Do not use the same IP address for both network adapters, even though they are connected to two distinctly different physical networks. Each adapter must have a different address, and the addresses cannot be on the same network. Consider the table of addresses in Figure 2 below.
| Adapter 1 (Public Network) | Adapter 2 (Private Network) | Valid Combination? |
|---|---|---|
| 192.168.0.1 | 192.168.0.1 | NO |
| 192.168.0.1 | 192.168.0.2 | NO |
| 192.168.0.1 | 192.168.1.1 | YES |
| 192.168.0.1 | 10.0.0.1 | YES |
Figure 2
In fact, because the private network is isolated, you can use almost any pair of IP addresses you like for this network, as long as they are on the same subnet. If you want to, you can use addresses that the Internet Assigned Numbers Authority (IANA) designates for private use. The private-use address ranges are noted in Figure 3.
| Address Class | Starting Address | Ending Address |
|---|---|---|
| Class A | 10.0.0.0 | 10.255.255.255 |
| Class B | 172.16.0.0 | 172.31.255.255 |
| Class C | 192.168.0.0 | 192.168.255.255 |
Figure 3
The first and last addresses are designated as the network and broadcast addresses for the address range. For example, in the reserved Class C range, the actual range for host addresses is 192.168.0.1 through 192.168.255.254. Use 192.168.0.1 and 192.168.0.2 to keep it simple, because you'll have only two adapters on this isolated network. Do not declare default gateway or WINS server addresses for this network. You may need to consult with your network administrator about these addresses, in case they are already in use within your enterprise.
When you've obtained the proper addresses for network adapters in each system, use the Network utility in Control Panel to set these options. Use the PING utility from the command prompt to check each network adapter for connectivity with the loopback address (127.0.0.1), the card's own IP address, and the IP address of another system. Before you attempt to install MSCS, make sure that each adapter works properly and can communicate properly on each network. You will find more information on network adapter configuration in the Windows NT Online documentation, the Windows NT Server 4.0 Resource Kit, or in the Microsoft Knowledge Base.
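For example, if the public adapters use 192.168.0.1 and 192.168.0.2 and the private adapters use 10.0.0.1 and 10.0.0.2 (substitute the addresses you actually assigned), a quick check from the first node might look like this:

    rem Check the loopback address, this node's own addresses, then the other node's addresses
    ping 127.0.0.1
    ping 192.168.0.1
    ping 10.0.0.1
    ping 192.168.0.2
    ping 10.0.0.2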
The following are related Microsoft Knowledge Base articles regarding network adapter configuration, TCP/IP configuration, and related troubleshooting:
| Article | Title |
|---|---|
| 164015 | Understanding TCP/IP Addressing and Subnetting Basics |
| 102908 | How to Troubleshoot TCP/IP Connectivity with Windows NT |
| 151280 | TCP/IP Does Not Function After Adding a Second Adapter |
| 174812 | Effects of Using Autodetect Setting on Cluster NIC |
| 175767 | Expected Behavior of Multiple Adapters on Same Network |
| 170771 | Cluster May Fail If IP Address Used from DHCP Server |
| 168567 | Clustering Information on IP Address Failover |
| 193890 | Recommended WINS Configuration for MSCS |
| 217199 | Static WINS Entries Cause the Network Name to Go Offline |
| 201616 | Network Card Detection in Microsoft Cluster Server |
Configuring the Shared SCSI Bus
In a normal configuration with a single server, the server has a SCSI host adapter that connects directly to one or more SCSI devices, and each end of the SCSI bus has a bus terminator. The terminators help stabilize the signals on the bus and help ensure high-speed data transmission. They also help eliminate line noise.
Configuring Host Adapters
The shared SCSI bus, as used in a Microsoft cluster, differs from most common SCSI implementations in one way: the shared SCSI bus uses two SCSI host adapters. Each cluster node has a separate SCSI host adapter for shared access to this bus, in addition to the other disk controllers that the server uses for local storage (or the operating system). As with the SCSI specification, each device on the bus must have a different ID number. Therefore, the ID for one of these host adapters must be changed. Typically, this means that one host adapter uses the default ID of 7, while the other adapter uses ID 6.
Note: It is important to use ID 6 and 7 for the host adapters on the shared bus so that they have priority over other connected devices on the same channel. A cluster may have more than one shared SCSI bus as needed for additional shared storage.
SCSI Cables
SCSI bus failures can be the result of low-quality cables. Inexpensive cables may be attractive because of the low price, but may not be worth the headaches associated with them. An easy way to compare cheaper cables with more expensive ones is to hold a cable in each hand, about 10 inches from the connector, and observe the arc of each cable. Higher quality cables don't bend as easily, because they use better shielding and may use a different gauge of wire. If you use the less expensive cables, you may spend more supporting them than it would have cost to buy the better quality cables in the first place. This shouldn't be much of a concern for complete systems purchased from a hardware vendor, because these certified systems likely have matched cable sets. In the event you ever need to replace one of these cables, consult your hardware vendor.
Some configurations may use standard SCSI cables, while others may use Y cables (or adapters). The Y cables are recommended for the shared SCSI bus. These cables allow bus termination at each end, independent of the host adapters. Some adapters do not continue to provide bus termination when turned off, and also cannot maintain bus termination if they are disconnected for maintenance. Y cables avoid these points of failure and help achieve high availability.
Even with high quality cables, it is important to consider total cable length. Transfer rate, the number of connected SCSI devices, cable quality, and termination may influence the total allowable cable length for the SCSI bus. While it is common knowledge that a standard SCSI bus using a 5-megabit transfer rate may have a maximum total cable length of approximately 6 meters, the maximum length decreases as the transfer rate increases. Most SCSI devices on the market today achieve much higher transfer rates and demand a shorter total cable length. Some manufacturers of complete systems that are certified for MSCS may use differential SCSI with a maximum total cable length of 25 meters. Consider these implications when adding devices to an existing bus or certified system. In some cases, it may be necessary to install another shared SCSI bus.
SCSI Termination
Microsoft recommends active termination for each end of the shared SCSI bus; passive terminators may not reliably maintain adequate termination under certain conditions. A SCSI bus has two ends, and each end must be terminated with an active terminator. For best results, do not rely on automatic termination provided by host adapters or newer SCSI devices. Avoid duplicate termination, and avoid placing termination in the middle of the bus.
Drives, Partitions, and File Systems
Whether you use individual SCSI disk drives on the shared bus, shared hardware RAID arrays, or a combination of both, each disk or logical drive on the shared bus needs to be partitioned and formatted before you install MSCS. The Microsoft Cluster Server Administrator's Guide covers the necessary steps to perform this procedure. In most cases, a drive contains only one partition. Some RAID controllers can partition arrays as multiple logical drives, or as a single large partition. In the case of a single large partition, you will probably prefer to have a few logical drives for your data: one drive or disk for each group of resources, with one drive designated as the quorum disk.
If you partition drives at the operating system level into multiple partitions, remember that all partitions on shared disks move together from one node to another. Thus, physical drives are exclusively owned by one node at a time. In turn, all partitions on a shared disk are owned by one node at a time. If you transfer ownership of a drive to another node through MSCS, the partitions move in tandem, and may not be split between nodes. Any partitions on shared drives must be formatted with the NTFS file system, and must not be members of any software-based fault tolerant sets.
CD-ROM Drives and Tape Drives
Do not connect CD-ROM drives, tape drives, or other devices that are not physical disks to the shared SCSI bus. MSCS version 1.0 supports only non-removable physical disk drives that are listed on the MSCS HCL. The cluster disk driver may or may not recognize other device types. If you attach unsupported devices to the shared bus, they may appear usable to the Windows NT operating system. However, because of SCSI bus arbitration between the two systems and the use of SCSI resets, these devices may experience problems if attached to the shared SCSI bus, and may also create problems for other devices on the bus. For best results, attach noncluster devices to a separate controller not used by the cluster.
Preinstallation Checklist
Before you install MSCS, there are several items to check to help ensure proper operation and configuration. After proper configuration and testing, most installations of MSCS should complete without error. The following checklist is fairly general. It may not include all possible system options that you need to evaluate before installation:
Use only certified hardware as listed on the MSCS Hardware Compatibility List (HCL).
Determine which role these servers will play in the domain. Will each server be a domain controller or a member server? Recommended role: backup domain controller (BDC).
Install Microsoft Windows NT Server, Enterprise Edition, on both servers.
Install Service Pack 3 on each server.
Verify cables and termination of the shared SCSI bus.
Check drive letter assignment and NTFS formatting of shared drives with only one server turned on at a time.
If both systems have ever been allowed to access drives on the shared bus at the same time (without MSCS installed), the drives must be repartitioned and reformatted prior to the next installation. Failure to do so may result in unexpected file system corruption.
Ensure that only physical disks or hardware RAID arrays are attached to the shared SCSI bus.
Make sure that disks on the shared SCSI bus are not members of any software fault tolerance sets.
Check network connectivity with the primary network adapters on each system.
Evaluate network connectivity on any secondary network adapters that may be used for private cluster communications.
Ensure that the system and application event logs are free of errors and warnings.
Make sure that each server is a member of the same domain, and that you have administrative rights to each server.
Ensure that each server has a properly sized pagefile and that the paging files reside only on local disks. Do not place pagefiles on any drives attached to the shared SCSI bus.
Determine what name you will use for the cluster. This name will be used for administrative purposes within the cluster and must not conflict with any existing names on the network (computer, server, printer, domain, and so forth). This is not a network name for clients to attach to.
Obtain a static IP address and subnet mask for the cluster. This address will be associated with the cluster name. You may need additional IP addresses later for groups of resources (virtual servers) within the cluster.
Set multi-speed network adapters to a specific speed. Do not use the autodetect setting if available. For more information, see the Microsoft Knowledge Base article 174812.
Decide the name of the folder and location for cluster files to be stored on each server. The default location is %WinDir%\Cluster, where %WinDir% is your Windows NT folder.
Determine what account the cluster service (ClusSvc) will run under. If you need to create a new account for this purpose, do so before installation. Make the domain account a member of the local Administrators group. Though the Domain Admins group may be a member of the Administrators group, this is not sufficient; the account must be a direct member of the Administrators group. Do not place any password restrictions on the account. Also ensure that the account has the Log on as a service and Lock pages in memory rights.
Installation on systems using custom disk hardware
If your hardware uses something other than standard SCSI controllers and requires special drivers and custom resource types, use the software and installation instructions provided by the manufacturer. The standard installation procedures for MSCS will fail on these systems, because they require additional device drivers and DLLs supplied by the manufacturer. These systems also require special cabling.
Chapter 2: Installation Problems
The installation process for Microsoft Cluster Server (MSCS) is very simple compared to other network server applications, and usually completes in just a few minutes. For a software package that does so much, the speed with which MSCS installs might surprise you. In reality, MSCS is more complex behind the scenes, and installation depends greatly on the compatibility and proper configuration of the system hardware and networks. If the hardware configuration is not acceptable, installation problems are to be expected. After installation, be sure to evaluate the proper operation of the entire cluster before installing additional software.
MSCS Installation Problems with the First Node
Is Hardware Compatible?
It is important to use certified systems for MSCS installations. Use systems and components from the MSCS Hardware Compatibility List (HCL). For many, the main reason for installing a cluster is to achieve high availability of their valuable resources. Why compromise availability by using unsupported hardware? Microsoft supports only MSCS installations that use certified complete systems from the MSCS Hardware Compatibility List. If the system fails and you need support, unsupported hardware may compromise high availability.
Is the Shared SCSI Bus Connected and Configured Properly?
MSCS relies heavily on the shared SCSI bus. You must have at least one device on the shared bus for the cluster to store the quorum logfile and act as the cluster's quorum disk. Access to this disk is vital to the cluster. In the event of a system failure or loss of network communication between nodes, cluster nodes will arbitrate for access to the quorum disk to determine which system will take control and make decisions. The quorum logfile holds information regarding configuration changes made within the cluster when another node may be offline or unreachable. The installation process requires at least one device on the shared bus for this purpose. A hardware RAID logical partition or separate physical disk drive will be sufficient to store the quorum logfile and function as the quorum disk.
To check proper operation of the shared SCSI bus, see Chapter 5, "Troubleshooting the Shared SCSI Bus," later in this document.
Install Windows NT Server, Enterprise Edition, and Service Pack 3
MSCS version 1.0 requires Microsoft Windows NT Server, Enterprise Edition, version 4.0 with Service Pack 3 or later. If you add network adapters or other hardware devices and drivers later, it's important to reapply the service pack to ensure that all drivers, DLLs, and system components are of the same version. Hotfixes may require reapplication if they are overwritten. Check with Microsoft Product Support Services or the Microsoft Knowledge Base regarding applied hotfixes, and to determine whether the hotfix needs to be reapplied.
Does the System Disk Have Adequate Free Space to Install the Product?
MSCS requires only a few megabytes to store files on each system. The Setup program prompts for the path to store these files. The path should be to local storage on each server, not to a drive on the shared SCSI bus. Make sure that free space exists on the system disk, both for installation requirements and for normal system operation.
Does the Server Have a Properly Sized System Paging File?
If you've experienced reduced system performance or near system lockup during the installation process, check the Performance tab using the System utility of the Control Panel. Make sure the system has acceptable paging file space (the minimum space required is the amount of physical RAM plus 11 MB), and that the system drive has enough free space to hold a memory dump file, should a system crash occur. Also, make sure pagefiles are on local disks only, not on shared drives. Performance Monitor may be a valuable resource for troubleshooting virtual memory problems.
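As an illustration of that minimum, a server with 256 MB of physical RAM needs a paging file of at least 267 MB (256 MB + 11 MB).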
Do Both Servers Belong to the Same Domain?
Both servers in the cluster must have membership in the same domain. Also, the service account that the cluster service uses must be the same on both servers. Cluster nodes may be domain controllers or domain member servers. However, if functioning as a domain member server, a domain controller must be accessible for cluster service account authentication. This is a requirement for any service that starts using a domain account.
Is the Primary Domain Controller (PDC) Accessible?
During the installation process, Setup must be able to communicate with the PDC. Otherwise, the setup process will fail. Additionally, after setup, the cluster service may not start if domain controllers are unavailable to authenticate the cluster service account. For best results, make sure each system has connectivity with the PDC, and install each node as a backup domain controller in the same domain.
Are You Installing While Logged On as an Administrator?
To install MSCS, you must have administrative rights on each server. For best results, log on to the server with an administrative account before you start Setup.
Do the Drives on the Shared SCSI Bus Appear to Be Functioning Properly?
Devices on the shared SCSI bus must be turned on, configured, and functioning properly. Consult the Microsoft Cluster Server Administrator's Guide for information on testing the drives before setup.
Are Any Errors Listed in the Event Log?
Before you install new software of any kind, it is good practice to check the system and application event logs for errors. This resource can indicate the state of the system before you make configuration changes. Events may be posted to these logs in the event of installation errors or hardware malfunctions during the installation process. Attempt to correct any problems you find. Appendix A of this document contains information regarding some events that may be related to MSCS and possible resolutions.
Is the Network Configured and Functioning Properly?
MSCS relies heavily on configured networks for communications between cluster nodes, and for client access. With improper function or configuration, the cluster software cannot function properly. The installation process attempts to validate attached networks and needs to use them during the process. Make sure that the network adapters and TCP/IP protocol are configured properly with correct IP addresses. If necessary, consult with your network administrator for proper addressing.
For best results, use statically assigned addresses and do not rely on DHCP to supply addresses for these servers. Also, make sure you're using the correct network adapter driver. Some adapter drivers may appear to work, because they are similar enough to the actual driver needed but are not an exact match. For example, an OEM or integrated network adapter may use the same chipset as a standard version of the adapter. Use of the same chipset may cause the standard version of the driver to load instead of an OEM supplied driver. Some of these adapters work more reliably with the driver supplied by the OEM, and may not attain acceptable performance if using the standard driver. In some cases, this combination may prevent the adapter from functioning at all, even though no errors appear in the system event log for the adapter.
Cannot Install MSCS on the Second Node
The previous section, "MSCS Installation Problems with the First Node," contains questions you need to ask if installation on the second node fails. Please consult this section first, before you continue with additional troubleshooting questions in this section.
During Installation, Are You Specifying the Same Cluster Name to Join?
When you install the second node, select the Join an Existing Cluster option. The first node you installed must be online at the time, with the cluster service running.
Is the RPC Service Running on Both Systems?
MSCS uses remote procedure calls (RPC) and requires that the RPC service be running on both systems. Check to make sure that the RPC service is running on both systems and that the system event logs on each server do not have any RPC-related errors.
Can Each Node Communicate with One Another Over Configured Networks?
Evaluate network connectivity between systems. If you used the procedures in the preinstallation section of this document, then you've already covered the basics. During installation of the second node, the installation program communicates through the server's primary network and through any other networks that were configured during installation of the first node. Therefore, you should test connectivity again with the IP addresses on these adapters. Additionally, the cluster name and associated IP address you configured earlier will be used. Make sure the cluster service is running on the first node and that the cluster name and cluster IP address resources are online and available. Also, make sure that the correct network was specified for the cluster IP address when the first node was installed; otherwise, the cluster service may be registering the cluster name on the wrong network. The cluster name resource should be registered on the network that clients will use to connect to the cluster.
Are Both Nodes Connected to the Same Network or Subnet?
Both nodes need to use unique addresses on the same network or subnet. The cluster nodes need to be able to communicate directly, without routers or bridges between them. If the nodes are not directly connected to the same public network, it will not be possible to fail over IP addresses.
Cannot Reinstall MSCS After Node Evicted
If you evict a node from the cluster, it may no longer participate in cluster operations. If you restart the evicted node and have not removed MSCS from it, the node will still attempt to join, and cluster membership will be denied. You must remove MSCS with the Add/Remove Programs utility in Control Panel. This action requires that you restart the system. If you ignore the option to restart, and attempt to reinstall the software anyway, you may receive the following error message:
If you receive this message, restart the affected system and reinstall the MSCS software to join the existing cluster.
Chapter 3: Post-Installation Problems
As you troubleshoot or perform cluster maintenance, it may be possible to keep resources available on one of the two nodes. If you are able to use at least one of the nodes for resources while troubleshooting, you may be able to keep as many resources as possible available to users during administrative activity. In some cases, it may be desirable to run with some unavailable resources rather than none at all.
The most likely causes for one or all nodes to be down are usually related to the shared SCSI bus. If only one node is down, check for SCSI-related problems or for communication problems between the nodes. These are the most likely sources of problems that lead to node failures.
Entire Cluster Is Down
If the entire cluster is down, try to bring at least one node online. If you can achieve this goal, the effect on users may be substantially reduced. When a node is online, gather event log data or other information that may be helpful to troubleshoot the failure. Check for the existence of a recent Memory.dmp file that may have been created from a recent crash. If necessary, contact Microsoft Product Support Services for assistance with this file.
One Node Is Down
If a single node is unavailable, make sure that resources and groups are available on the other node. If they are, begin troubleshooting the failed node. Try to bring it up and gather error data from the event log or cluster diagnostic logfile.
Applying Service Packs and Hotfixes
If you're applying service packs or hotfixes, avoid applying them to both nodes at one time, unless otherwise directed by release notes, KB articles, or other instructions. It may be possible to apply the updates to a single node at a time to avoid rendering both nodes unavailable for a short or long duration. More information on this topic may be found in Microsoft Knowledge Base article 174799, "How to Install Service Packs in a Cluster."
One or More Servers Quit Responding
If one or more servers are not responding but have not crashed or otherwise failed, the problem may be related to configuration, software, or driver issues. You can also check the shared SCSI bus or connected disk devices.
If the servers are installed as member servers (non-domain controllers), it is possible that one or both nodes may stop responding if connectivity with domain controllers becomes unavailable. Both the cluster service and other applications use remote procedure calls (RPCs). Many RPC-related operations require domain authentication. As cluster nodes must participate in domain security, it is necessary to have reliable domain authentication available. Check network connectivity with domain controllers and for other network problems. To avoid this potential problem, it is preferred that the nodes be installed as backup domain controllers (BDC). The BDC configuration allows each node to perform authentication for itself despite problems that could exist on a wide area network (WAN).
Cluster Service Will Not Start
There are a variety of conditions that could prevent the Cluster Service (ClusSvc) from starting. Many of these conditions may be the result of configuration or hardware-related problems. The first things to check when diagnosing this condition are the items on which the Cluster Service depends. Many of these items are covered in Chapter 1 of this document. Common causes of this problem, along with their error messages, are noted below.
Check the service account under which ClusSvc runs. This domain account needs to be a member of the local Administrators group on each server. The account needs the Log on as a service and Lock pages in memory rights. Make sure the account is not disabled and that password expiration is not a factor. If the failure is because of a problem related to the service account, the Service Control Manager (SCM) will not allow the service to load, much less run. As a result, if you've enabled diagnostic logging for the Cluster Service, no new entries will be written to the log, and a previous logfile may exist. Failures related to the service account may result in Event ID 7000 or Event ID 7013 errors in the event log. In addition, you may receive the following pop-up error message:
Could not start the Cluster Service on \\computername. Error 1069: The service did not start because of a logon failure.
Check to make sure the quorum disk is online and that the shared SCSI bus has proper termination and proper function. If the quorum disk is not accessible during startup, the following popup error message may occur:
Could not start the Cluster Service on \\computername. Error 0021: The device is not ready.
Also, if diagnostic logging for the Cluster Service is enabled, the logfile entries may indicate problems attaching to the disk. See Appendix B for more information and a detailed example of the logfile entries for this condition, Example 1: Quorum Disk Turned Off.
If the Cluster Service is running on the other cluster node, check the cluster logfile (if it is enabled) on that system for indications of whether or not the other node attempted to join the cluster. If the node did try to join the cluster and the request was denied, the logfile may contain details of the event. For example, if you evict a node from the cluster, but do not remove and reinstall MSCS on that node, when the server attempts to join the cluster, the request to join will be denied. The following are sample error messages and event messages:
Could not start the Cluster Service on \\computername. Error 5028: Size of job is %1 bytes.
Event ID 1009, Event ID 1063, Event ID 1069, Event ID 1070, Event ID 7023
For examples of logfile entries for this type of failure, see the Example 4: Evicted Node Attempts to Join Existing Cluster section in Appendix B of this document.
If the Cluster Service won't start, check the event log for Event 7000 and 7013. These events may indicate a problem authenticating the Cluster Service account. Make sure the password specified for the Cluster Service account is correct. Also make sure that a domain controller is available to authenticate the account, if the servers are non-domain controllers.
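From a command prompt on the affected node, you can translate the error code reported in these events and then retry the service interactively (ClusSvc is the service name noted earlier):

    rem Display the text for error 1069, then attempt to start the Cluster Service
    net helpmsg 1069
    net start clussvc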
Cluster Service Starts but Cluster Administrator Won't Connect
If the Services utility in Control Panel indicates that the service is running, and you cannot connect with Cluster Administrator to administer the cluster, the problem may be related to the Cluster Network Name or to the cluster IP address resources. There may also be RPC-related problems. Check to make sure the RPC Service is running on both nodes. If it is, try to connect to a known running cluster node by the computer name. This is probably the best name to use when troubleshooting to avoid RPC timeout delays during failover of the cluster group. If running Cluster Administrator on the local node, you may specify a period (.) in place of the name when prompted. This will create a local connection and will not require name resolution.
If you can connect through the computer name or ".", check the cluster network name and cluster IP address resources. Make sure that these and other resources in the cluster group are online. These resources may fail if a duplicate name or IP address on the network conflicts with either of these resources. A duplicate IP address on the network may cause the network adapter to shut down. Check the system event log for errors.
Examples of logfile entries for this type of failure may be found in the Example 3: Duplicate Cluster IP Address section in Appendix B of this document.
Group/Resource Failover Problems
A group typically fails to fail over properly because of problems with resources within the group. For example, if you elect to move a group from one node to another, the resources within the group are taken offline, and ownership of the group is transferred to the other node. On receiving ownership, the node attempts to bring resources online, according to the dependencies defined for the resources. If resources fail to go online, MSCS attempts again to bring them online. After repeated failures, the failing resource or resources may affect the group and cause the group to transition back to the previous node. Eventually, if failures continue, the group or affected resources may be taken offline. You can configure the number of attempts and allowed failures through resource and group properties.
When you experience problems with group or resource failover, evaluate which resource or resources may be failing. Determine why the resource won't go online. Check resource dependencies for proper configuration and make sure they are available. Also, make sure that the "Possible Owners" list includes both nodes. The "Preferred Owners" list is designed for automatic failback or initial group placement within the cluster. In a two-node cluster, this list should only contain the name of the preferred node for the group, and should not contain multiple entries.
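If you prefer the command line, Cluster.exe can list the possible owners for a resource. The following is a minimal sketch; the resource name is a placeholder for one of your own resources, and the exact syntax should be verified with cluster /? or Appendix C:

    rem List the nodes that may own the named resource
    cluster resource "Disk Y:" /listowners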
If resource properties do not appear to be part of the problem, check the event log or cluster logfile for details. These files may contain helpful information related to the resource or resources in question.
Physical Disk Resource Problems
Problems with physical disk resources are usually hardware related. Cables, termination, or SCSI host adapter configuration may cause problems with failover, or may cause premature failure of the resource. The system event log may often show events related to physical disk or controller problems. However, some cable or termination problems may not yield such helpful information. It is important to verify the configuration of the shared SCSI bus and attached devices, whenever you detect trouble with one of these devices. Marginal cable connections or cable quality can cause intermittent failures that are difficult to troubleshoot. BIOS or firmware problems might also be factors.
Quorum Resource Failures
If the Cluster Service won't start because of a quorum disk failure, check the corresponding device. If necessary, use the -fixquorum startup option for the Cluster Service, to gain access to the cluster and redesignate the quorum disk. This process may be necessary if you replace a failed drive, or attempt to use a different device in the interim. To view or change the quorum drive settings, right-click the cluster name at the top of the tree, listed on the left portion of the Cluster Administrator window, and select Properties. The Cluster Properties window contains three different tabs, one of which is for the quorum disk. From this tab, you may view or change quorum disk settings. You may also re-designate the quorum resource. More information on this topic may be found in Microsoft Knowledge Base article 172944, "How to Change Quorum Disk Designation."
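For example, assuming the default service name (ClusSvc), one way to start the service with this option is from a command prompt on one node; you can also type -fixquorum in the Startup Parameters box of the Services utility in Control Panel. Treat the exact switch syntax as something to verify against the documentation for your version:

    rem Start the Cluster Service in fixquorum mode on one node only
    net start clussvc /fixquorum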
Failures of the quorum device while the cluster is in operation are usually related to hardware problems, or to configuration of the shared SCSI bus. Use troubleshooting techniques to evaluate proper operation of the shared SCSI bus and attached devices.
File Share Won't Go Online
For a file share to reach online status, the dependent resources must exist and be online. The path for the share must exist. Permissions on the file share directory must also include at least Read access for the Cluster Service account.
Problems Accessing Drive
If you attempt to access a shared drive through the drive letter, it is possible that you may receive the Incorrect Function error. The error may be a result of the drive not being online on the node you're accessing it from. The drive may be owned by another cluster node and may be online. Check Cluster Administrator for ownership of the resource and online status. If necessary, consult the Physical Disk Resource Problems section of this document. The error could also indicate drive or controller problems.
Chapter 4: Administrative Issues
Cannot Connect to Cluster Through Cluster Administrator
If you try to administer the cluster from a remote workstation, the most common way to do so would be to use the network name you defined during the setup process as Cluster Name. This resource is located in the Cluster Group. Cluster Administrator needs to establish a connection using RPC. If the RPC service has failed on the cluster node that owns the Cluster Group, it will not be possible to connect through the Cluster Name or the name of the computer. Try to connect, instead, using the computer names of each cluster node. If this works, this indicates a problem with either the IP address or Network Name resources in the Cluster Group. There may also be a name resolution problem on the network that may prevent access through the Cluster Name.
Failure to connect using the Cluster Name or computer names of either node may indicate problems with the server, with RPC connectivity, or with security. Make sure that you are logged on with an administrative account in the domain, and that the account has access to administer the cluster. Access may be granted to additional accounts by using Cluster Administrator on one of the cluster nodes. For more information on controlling administrative access to the cluster, see "Specifying Which Users Can Administer a Cluster" in the MSCS Administrator's Guide.
If Cluster Administrator cannot connect from the local console of one of the cluster nodes, check to see if the Cluster Service is started. Check the system event log for errors. You may want to enable diagnostic logging for the Cluster Service. If the problem occurs after recently starting the system, wait 30 to 60 seconds for the Cluster Service to start, and then try to run Cluster Administrator again.
Cluster Administrator Loses Connection or Stops Responding on Failover
The Cluster Administrator application uses RPC communications to connect with the cluster. If you use the Cluster Name to establish the connection, Cluster Administrator may appear to stop responding during a failover of the Cluster Group and its resources. This normal delay occurs during the registration of the IP address and network name resources within the group, and the establishment of a new RPC connection. If a problem occurs with the registration of these resources, the process may take an extended amount of time until these resources become available. The first RPC connection must time out before the application attempts to establish another connection. As a result, Cluster Administrator may eventually time out if there are problems bringing the IP address or network name resources online within the Cluster Group. In this situation, try to connect using the computer name of one of the cluster nodes instead of the cluster name. This usually allows a more real-time display of resource and group transitions, without delay.
Cannot Move a Group
To move a group from one node to another, you must have administrative rights to run Cluster Administrator. The destination node must be online and the cluster service started. The state of the node must be online and not Paused. In a paused state, the node is a fully active member in the cluster, but cannot own or run groups.
Both cluster nodes should be listed in the Possible Owners list for the resources within the group; otherwise the group may only be owned by a single node and will not fail over. While this restriction may be intentional in some configurations, in most it would be a mistake, because it would prevent the entire group from failing over. Also, to move a group, resources within the group cannot be in a pending state. To initiate a Move Group request, resources must be in one of the following three states: online, offline, or failed.
Cannot Delete a Group
To properly delete a group from the cluster, the group must not contain resources. You may either delete the resources contained within the group, or move them to another group in the cluster.
Problems Adding, Deleting, or Moving Resources
Adding Resources
Resources are usually easy to add. However, it is important to understand the various resource types and their requirements. Some resource types have prerequisites for other resources that must exist within the same group; as you work with MSCS, you will become more familiar with these requirements. You may find that a resource depends on one or more resources within the same group. Examples might include IP addresses, network names, or physical disks. The resource wizard will typically indicate mandatory requirements for other resources. However, in some cases, it may be a good idea to add related resources to the dependency list even when they are not strictly required. While Cluster.exe may allow the addition of resources and groups, the command-line utility does not impose the dependency and resource property constraints that Cluster Administrator does, because these activities may consist of multiple commands.
For example, suppose you want to create a network name resource in a new group. If you try to create the network name resource first, the wizard will indicate that it depends on an IP address resource. The wizard lists available resources in the group from which you select. If this is a new group, the list may be empty. Therefore, you will need to create the required IP address resource before you create the network name.
If you create another resource in the group and make it dependent on the network name resource, the resource will not go online without the network name resource in an online state. A good example might be a File Share resource. Thus, the share will not be brought online until the network name is online. Because the network name resource depends on an IP address resource, it would be repetitive to make the share also dependent on the same IP address. The established dependency with the network name implies a dependency on the address. You can think of this as a cascading dependency.
You might ask, "What about the disk where the data will be? Shouldn't the share depend on the existence or online status of the disk?" Yes, you should create a dependency on the physical disk resource, although this dependency is not required. If the resource wizard did impose this requirement, it would imply that the only data source that could be used for a file share is a physical disk resource on the shared SCSI bus. For volatile data, shared storage is the way to go, and a dependency should be created for it. This way, if the disk experiences a momentary failure, the share will be taken offline and restored when the disk becomes available. However, because a dependency on a physical disk resource is not required, the administrator has additional flexibility to use other disk storage for holding data. Use of non-physical-disk storage for the share implies that, for the share to be moved to the other node, equivalent storage with the same drive letter and the same information must also be available there. Further, there must be some method of data replication or mirroring for this type of storage if the data is volatile. Some third parties may have solutions for this situation. Use of local storage in this manner is not recommended for read/write shares. For read-only information, the two data sources can remain in sync, and problems with out-of-sync data are avoided.
If you use a shared drive for data storage, make sure to establish the dependency with the share and with any other resources that depend on it. Failure to do so may cause erratic or undesired behavior of resources that depend on the disk resource. Some applications or services that rely on the disk may terminate as a result of not having the dependency.
If you use Cluster.exe to create the same resources, note that it is possible to create a network name resource without the required IP address resource. However, the network name will not go online, and such an attempt will generate errors.
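To illustrate, a rough Cluster.exe sketch for creating the IP address and network name pair might look like the following. The resource names, group name, address values, and network name are placeholders, and the private property names should be verified on your installation (for example, by running: cluster resource "SomeIP" /priv); see also Appendix C for command-line syntax:

    rem Create the IP address resource and set its private properties
    cluster resource "SomeIP" /create /group:"SomeGroup" /type:"IP Address"
    cluster resource "SomeIP" /priv Address=192.168.0.50
    cluster resource "SomeIP" /priv SubnetMask=255.255.255.0
    cluster resource "SomeIP" /priv Network="Public Network"
    rem Create the network name resource, make it depend on the IP address, and bring it online
    cluster resource "SomeName" /create /group:"SomeGroup" /type:"Network Name"
    cluster resource "SomeName" /priv Name=SOMESERVER
    cluster resource "SomeName" /adddep:"SomeIP"
    cluster resource "SomeName" /online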
Using the Generic Application/Service Resources for Third-Party Applications
While some third-party services may require modification for use within a cluster, many services function normally while controlled by the generic service resource type provided with MSCS. If you have a program that runs as an application on the server's desktop and that you want to be highly available, you may be able to use the generic application resource type to control this application within the cluster.
The parameters for each of these generic resource types are similar. However, when planning to have MSCS manage these resources, it is necessary to first be familiar with the software and with the resources that software requires. For example, the software might create a share of some kind for clients to access data. Most applications need access to their installation directory to access DLL or INI files, to access stored data, or, perhaps, to create temporary files. In some cases, it may be wise to install the software on a shared drive in the cluster, so that the software and necessary components may be available to either node, if the group that contains the service moves to another cluster node.
Consider a service called SomeService. Assume this is a third-party service that does something useful. The service requires that the share, SS_SHARE, must exist, and that it maps to a directory called DATA beneath the installation directory. The startup mode for the service is set for AUTOMATIC, so that the service will start automatically after the system starts. Normally, the service would be installed to C:\SomeService, and it stores dynamic configuration details in the following registry key:
HKEY_LOCAL_MACHINE\Software\SomeCompany\SomeService
If you wanted to configure MSCS to manage this service and make it available through the cluster, you would probably take the following actions:
1. Create a group using Cluster Administrator. You might call it SomeGroup to remain consistent with the software naming convention.
2. Make sure the group has a physical disk resource to store the data and the software, an IP address resource, and a network name resource. For the network name, you might use something like SomeServer, for clients to access the share that will be in the group.
3. Install the software on the shared drive (drive Y, for example).
4. Using Cluster Administrator, create a File Share resource named SS_SHARE in the group. Make the file share resource dependent on the physical disk and network name; if either of these resources fails or goes offline, you want the share to follow its state. Set the path to the Data directory on the shared drive. According to what you know about the software, this should be Y:\SomeService\Data.
5. Set the startup mode for the service to MANUAL. Because MSCS will be controlling the service, the service does not need to start itself before MSCS has a chance to start and bring the physical disk and other resources online.
6. Create a generic service resource in the group. The name for the resource should describe what it corresponds to; you might want to call it SomeService, to match the service name. Allow both cluster nodes as possible owners. Make the resource dependent on the physical disk resource and network name. Specify the service name and any necessary service parameters. Click to select the Use network name for computer name option. This will cause the application's API call requesting the computer name to return the network name in the group. Specify that the registry key should be replicated by adding the following line under the Registry Replication tab: Software\SomeCompany\SomeService.
7. Bring all the resources in the group online and test the service.
8. If the service works correctly, stop the service by taking the generic service resource offline.
9. Move the group to the other node.
10. Install the service on the other node using the same parameters and installation directory on the shared drive.
11. Make sure to set the startup mode to MANUAL using the Services utility in Control Panel.
12. Bring all the required resources and the generic service resource online, and test the service.
Note: If you evict a node from the cluster at any time, and have to completely reinstall a cluster node from the beginning, you will likely need to repeat steps 10 through 12 on the node if you add it back to the cluster. The procedure described here is generic in nature, and may be adaptable to various applications. If you are uncertain how to configure a service in the cluster, contact the application software vendor for more information.
Applications follow a similar procedure, except that you must substitute the generic application resource type for the generic service resource type used in the above procedure. If you have a simple application that is already installed on both systems, then you may adapt the following steps to the procedure previously described:
1. Create a generic application resource in a group. For this example, we will make Notepad.exe a highly available application.
2. For the command line, specify c:\WinNT\System32\Notepad.exe (or a different path, depending on your Windows NT installation directory). The path must be the same on both cluster nodes. Be sure to specify the working directory as needed, and click to select the Allow application to interact with the desktop option, so that Notepad.exe is not put in the background.
3. Skip the Registry Replication tab, because Notepad.exe does not have registry keys requiring replication.
4. Bring the resource online and notice that it appears on the desktop. Choose Move Group, and the application should appear on the other node's desktop. (A command-line sketch of these steps follows.)
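A similar sketch applies to the generic application resource. The private property names shown (CommandLine, CurrentDirectory, InteractWithDesktop) and the node name NODE2 are assumptions for illustration; confirm the property names for the Generic Application resource type on your system before relying on them.

REM Create the generic application resource for Notepad and bring it online.
cluster MYCLUSTER resource "Notepad" /create /group:"SomeGroup" /type:"Generic Application"
cluster MYCLUSTER resource "Notepad" /priv CommandLine="c:\WinNT\System32\Notepad.exe"
cluster MYCLUSTER resource "Notepad" /priv CurrentDirectory="c:\WinNT"
cluster MYCLUSTER resource "Notepad" /priv InteractWithDesktop=1
cluster MYCLUSTER resource "Notepad" /online
REM Move the group to confirm that the application fails over to the other node.
cluster MYCLUSTER group "SomeGroup" /moveto:NODE2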
Some cluster-aware applications may not require this type of setup and they may have setup wizards to create necessary cluster resources.
Deleting Resources
Some resources may be difficult to delete if any cluster nodes are offline. For example, you may be able to delete an IP address resource if only one cluster node is online. However, if you try to delete a physical disk resource while in this condition, an error message dialog box may appear.
Physical disk resources affect the disk configuration on each node in the cluster and must be dealt with accordingly on each system at the same time. Therefore, all cluster nodes must be online to remove this type of resource from the cluster.
If you attempt to remove a resource on which other resources depend, a dialog box listing the related resources will be displayed. These resources will also be deleted, as they are linked by dependency to the individual resource chosen for removal. To avoid removal of these resources, first change or remove the configured dependencies.
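A resource can also be removed from the command line. The sketch below uses the hypothetical SS_SHARE resource and cluster name MYCLUSTER from earlier in this paper; verify the exact switch names (such as /removedep and /delete) with CLUSTER.EXE /? before relying on them.

REM Take the resource offline, clear its dependencies, and then delete it.
cluster MYCLUSTER resource "SS_SHARE" /offline
cluster MYCLUSTER resource "SS_SHARE" /removedep:"Disk Y:"
cluster MYCLUSTER resource "SS_SHARE" /delete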
Moving Resources from One Group to Another
To move resources from one group to another, both groups must be owned by the same cluster node. Attempts to move resources between groups with different owners may result in a pop-up error message.
To move resources between groups, the groups must have the same owner. This situation may be easily corrected by moving one of the groups so that both groups have the same owner. Equally important is the fact that resources to be moved may have dependent resources. If a dependency exists between the resource to be moved and another resource, a prompt may appear that lists related resources that need to move with the resource.
Problems moving resources between groups other than those mentioned in this section may be caused by system problems or configuration-related issues. Check event logs or cluster logfiles for more information that may relate to the resource in question.
Chkdsk and Autochk
Disks attached to the shared SCSI bus interact differently with Chkdsk and the companion system startup version of the same program, Autochk. Autochk does not perform Chkdsk operations on shared drives when the system starts up, even if the operations are needed. MSCS performs a file system integrity check for each drive, when bringing a physical disk online. MSCS automatically launches Chkdsk, as necessary.
If you need to run Chkdsk on a drive, consult the following Microsoft Knowledge Base articles:
174617 - Chkdsk Runs while Running Microsoft Cluster Server Setup
176970 - Chkdsk /f Does Not Run on the Shared Cluster Disk
174797 - How to Run CHKDSK on a Shared Drive
Chapter 5: Troubleshooting the shared SCSI bus
Verifying Configuration
For the shared SCSI bus to work correctly, the SCSI host adapters must be configured correctly. As the SCSI specification requires, each device on the bus must have a unique ID number. For proper operation, ensure that each host adapter is set to a unique ID. For best results, set one adapter to ID 6 and the other adapter to ID 7 to ensure that the host adapters have adequate priority on the bus. Also, make sure that both adapters have the same firmware revision level. Because the shared SCSI bus is not used for booting the operating system, disable the BIOS on each adapter unless otherwise directed by the hardware vendor.
Make sure that you connect only physical disk or hardware RAID devices to the shared bus. Devices other than these, such as tape drives, CD-ROM drives, or removable media devices, should not be used on the shared bus. You may use them on another bus for local storage.
Cables and termination are vital parts of the SCSI bus configuration, and should not be compromised. Cables need to be of high quality and within SCSI specifications. The total cable length on the shared SCSI bus needs to be within specifications. Cables supplied with complete certified systems should be correct for use with the shared SCSI bus. Check for bent pins on SCSI cable connectors and devices, and ensure that each cable is attached firmly.
Correct termination is also important. Terminate the bus at both ends, and use active terminators. Use of SCSI Y cables may allow disconnection of one of the nodes from the shared bus without losing termination. If you have terminators attached to each end of the bus, make sure that the controllers are not trying to also terminate the bus.
Make sure that all devices connected to the bus are rated for the type of controllers used. For example, do not attach differential SCSI devices to a standard SCSI controller. Verify that the controllers can each identify every disk device attached to the bus. Make sure that the configuration of each disk device is correct. Some newer smart devices can automatically terminate the bus or negotiate for SCSI IDs. If the controllers do not support this, configure the drives manually. A mixture of smart devices with others that require manual configuration can lead to problems in some configurations. For best results, configure the devices manually.
Also, make sure that the SCSI controllers on the shared bus are configured correctly and with the same parameters (other than SCSI ID). Differences in data transfer rate or other parameters between the two controllers may cause unpredictable behavior.
Adding Devices to the Shared SCSI Bus
To add disk devices to the shared SCSI bus, you must properly shut down all equipment and both cluster nodes. This is necessary because the SCSI bus may be disconnected while adding the device or devices. Attempting to add devices while the cluster and devices are in use may induce failures or other serious problems that may not be recoverable. Add the new device or devices in the same way you add a device to a standard SCSI bus. This means you must choose a unique SCSI ID for the new device, and ensure that the device configuration is correct for the bus and termination scheme. Verify cable and termination before applying power. Turn on one cluster node, and use Disk Administrator to assign a drive letter and format each new device. Before turning on the other node, create a physical disk resource using Cluster Administrator. After you create the physical disk resource and verify that the resource will go online successfully, turn on the other cluster node and allow it to join the cluster. Allowing both nodes to be online without first creating a disk resource for the new device can lead to file system corruption, as both nodes may have different interpretations of disk structure.
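After you create the physical disk resource in Cluster Administrator, you can confirm from the command line that it goes online before turning on the other node. The cluster name MYCLUSTER and the resource name Disk Z: below are illustrative only; verify the switch names with CLUSTER.EXE /?.

REM Bring the new disk resource online and check its state.
cluster MYCLUSTER resource "Disk Z:" /online
cluster MYCLUSTER resource "Disk Z:" /status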
Verifying Cables and Termination
A good procedure for verification of cable and termination integrity is to first use the SCSI host adapter utilities to determine whether the adapter can identify all disk devices on the bus. Perform this check with only one node turned on. Then, turn off the computer and perform the same check on the other system. If this initial check succeeds, the next step is to check drive identification from the operating system level, with only one of the nodes turned on. If MSCS is already installed, then the cluster service will need to be started for shared drives to go online. Check to make sure the shared drives go online. If the device fails, there may be a problem with the device, or perhaps a cable or termination problem.
Chapter 6: Client Connectivity Problems
Clients Have Intermittent Connectivity Based on Group Ownership
If clients successfully connect to clustered resources only when a specific node is the owner, a few possible problems could lead to this condition. Check the system event log on each server for possible errors. Check to make sure that the group has at least one IP address resource and one network name resource, and that clients use one of these to access the resource or resources within the group. If clients connect with any other network name or IP address, they may not be accessing the correct server in the event that ownership of the resources changes. As a result of improper addressing, access to these resources may appear limited to a particular node.
If you are able to confirm that clients use proper addressing for the resource or resources, check the IP address and network name resources to see that they are online. Check network connectivity with the server that owns the resources. For example, try some of the following techniques:
From the server:
PING server's primary adapter IP address (on client network)
PING other server's primary adapter IP address (on client network)
PING IP address of the group
PING Network Name of the group
PING Router/Gateway between client and server (if any)
PING Client IP address
If the above tests work correctly up to the router/gateway check, the problem may be elsewhere on the network because you have connectivity with the other server and local addresses. If tests complete up to the client IP address test, there may be a client configuration or routing problem.
From the client:
PING Client IP address
PING Router/Gateway between client and server (if any)
PING server's primary adapter IP address (on client network)
PING other server's primary adapter IP address (on client network)
PING IP address of the group
PING Network Name of the group
If the tests from the server all pass, but you experience failures performing tests from the client, there may be client configuration problems. If all tests complete except the test using the network name of the group, there may be a name resolution problem. This may be related to client configuration, or it may be a problem with the client's designated WINS server. These problems may require network administrator intervention.
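As a concrete illustration, suppose the group's IP address resource is 192.0.2.50 and its network name is SOMESERVER (both hypothetical). The final client-side checks, plus a quick look at local NetBIOS name resolution, might be:

ping 192.0.2.50
ping SOMESERVER
nbtstat -c
REM nbtstat -c displays the local NetBIOS name cache; nbtstat -R purges it so
REM the name can be re-resolved from WINS on the next attempt.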
Clients Do Not Have Any Connectivity with the Cluster
If clients lose connectivity with both cluster nodes, check to make sure that the Cluster Service is running on each node. Check the system event log for possible errors. Check network connectivity between cluster nodes, and with other network devices, by using the procedure in the previous section. If the Cluster Service is running, and there are no apparent connectivity problems between the two servers, there is likely a network or client configuration problem that does not directly involve the cluster. Check to make sure the client uses the TCP/IP protocol, and has a valid IP address on the network. Also, make sure that the client is using the correct network name or IP address to access the cluster.
Clients Have Problems Accessing Data Through a File Share
If clients experience problems accessing cluster file shares, first check the resource and make sure it is online, and that any dependent resources (disks, network names, and so on) are online. Check the system event log for possible errors. Next, check network connectivity between the client and the server that owns the resource. If the data for the share is on a shared drive (using a physical disk resource), make sure that the file share resource has a dependency declared for the physical disk resource. You can reset the file share by toggling the file share resource offline and back online again. Cluster file shares behave essentially the same as standard file shares. So, make sure that clients have appropriate access at both the file system level and the share level. Also, make sure that the server has the proper number of client access licenses loaded for the clients connecting, in the event that the client cannot connect because of insufficient available connections.
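You can also reset the share from the command line with CLUSTER.EXE; the cluster name MYCLUSTER and the resource name SS_SHARE are the hypothetical names used earlier in this paper.

REM Toggle the file share resource offline and back online.
cluster MYCLUSTER resource "SS_SHARE" /offline
cluster MYCLUSTER resource "SS_SHARE" /online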
Clients Cannot Access Cluster Resources Immediately After IP Address Change
If you create a new IP address resource or change the IP address of an existing resource, it is possible that clients may experience some delay if you use WINS for name resolution on the network. This problem may occur because of delays in replication between WINS servers on the network. Such delays cannot be controlled by MSCS, and must be allowed sufficient time to replicate. If you suspect there is a WINS-database problem, consult your network administrator, or contact Microsoft Product Support Services for TCP/IP support.
Clients Experience Intermittent Access
Network adapter configuration is one possible cause of intermittent access to the cluster, and of premature failover. Some autosense settings for network speed can spontaneously redetect network speed. During the detection, network traffic through the adapter may be compromised. For best results, set the network speed manually to avoid the recalibration. Also, make sure to use the correct network adapter drivers. Some adapters may require special drivers, although they may be detected as a similar device.
Chapter 7: Maintenance
Most maintenance operations within a cluster may be performed with one or more nodes online, and usually without taking the entire cluster offline. This ability allows higher availability of cluster resources.
Installing Service Packs
Microsoft Windows NT service packs can normally be installed on one node at a time and tested before you move resources to that node. This is one advantage of having a cluster: if something goes wrong while updating one node, the other node remains untouched and continues to make resources available. Because some service packs may not support being applied to a single node at a time, consult the release notes for the service pack for special instructions when installing on a cluster.
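A rolling installation might look like the following sketch, which assumes a cluster named MYCLUSTER with nodes NODE1 and NODE2. Always defer to the service pack release notes.

REM Move each group away from the node to be serviced (SomeGroup shown as an example).
cluster MYCLUSTER group "SomeGroup" /moveto:NODE2
REM Apply the service pack on NODE1, restart, and verify cluster operation.
REM Then move the groups back and repeat the process on NODE2.
cluster MYCLUSTER group "SomeGroup" /moveto:NODE1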
Service Packs and Interoperability Issues
To avoid potential issues or compatibility problems with other applications, check the Microsoft Knowledge Base for articles that may apply. For example, the following articles discuss installation steps or interoperability issues with Windows NT Option Pack, Microsoft SQL Server, and Windows NT Service Pack 4:
218922 - Installing NTOP on Cluster Server with SP4
223258 - How to install NTOP on MSCS 1.0 with SQL
223259 - How to install FTP from NTOP on Microsoft Cluster Server 1.0
191138 - How to install Windows NT Option Pack on Cluster Server
Replacing Adapters
Adapter replacement may usually be performed after moving resources and groups to the other node. If replacing a network adapter, ensure the new adapter configuration for TCP/IP exactly matches that of the old adapter. If replacing a SCSI adapter and using Y cables with external termination, it may be possible to disconnect the SCSI adapter without affecting the remaining cluster node. Check with your hardware vendor for proper replacement techniques if you want to attempt replacement without shutting down the entire cluster. This may be possible in some configurations.
Shared Disk Subsystem Replacement
With most clusters, shared disk subsystem replacement may result in the need to shut down the cluster. Check with your manufacturer and with Microsoft Product Support Services for proper procedures. Some replacements may not require much intervention, while others may require adjustments to configuration. Further information on this topic is available in the Microsoft Cluster Server Administrator's Guide and in the Microsoft Knowledge Base.
Emergency Repair Disk
The emergency repair disk (updated with Rdisk.exe) contains vital information about a particular system that you can use to help recover a system that will not start, allowing you to restore a backup, if necessary. It is recommended that the disk be updated whenever the system configuration changes. It is important to note that the cluster configuration is not stored on the emergency repair disk. The service and driver information for the Cluster Service is stored in the system registry. However, cluster resource and group configuration is stored in a separate registry hive and may be restored from a recent system backup. NTBACKUP will back up this hive when backing up registry files (if selected). Other backup software may or may not include the cluster hive. The file associated with the cluster hive is CLUSDB and is stored with the other cluster files (usually in c:\winnt\cluster). Be sure to check system backups to ensure this hive is included.
System Backups and Recovery
The configuration for cluster resources and groups is stored in the cluster registry hive. This registry hive may be backed up and restored with NTBackup. Some third-party backup software may not include this registry hive when backing up system registry files. It is important, if you rely on a third-party backup solution, that you verify your ability to back up and restore this hive. The registry file for the cluster hive may be found in the directory where the cluster software was installed — not on the quorum disk.
As most backup software (at the time of this writing) is not cluster-aware, it may be important to establish a network path to shared data for use in system backups. For example, if you use a local path to the data (example: G:\), and if the node loses ownership of the drive, the backup operation may fail because it cannot reach the data using the local device path. However, if you create a cluster-available share to the disk structure, and map a drive letter to it, the connection may be re-established if ownership of the actual disk changes. Although the ultimate solution would be a fully cluster-aware backup utility, this technique may be a better alternative until such a utility is available.
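For example, a backup job might connect through the cluster network name rather than through a local drive letter. The share name SS_SHARE and network name SomeServer are the hypothetical names used earlier in this paper.

REM Map a drive letter through the cluster network name instead of the local path.
net use X: \\SomeServer\SS_SHARE
REM Point the backup job at X:\ so the connection can be re-established if
REM ownership of the underlying disk moves to the other node.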
What not to do on a cluster server
Below is a list of things not to do with a cluster. While there may be more items that can cause problems, these are definite words of warning. Article numbers for related Microsoft Knowledge Base articles are noted where applicable.
Do not create software fault tolerant sets with shared disks as members. (171052)
Do not add resources to the cluster group. (168948)
Do not install MSCS when both nodes have been online and connected to the shared storage at the same time without MSCS installed and running on at least one node first.
Do not change computer names of either node.
Do not use WINS static entries for cluster nodes or cluster addresses. (217199)
Do not configure WINS or default gateway addresses for the private interconnect. (193890)
Do not attempt to configure cluster resources to use unsupported network protocols or related network services (IPX, NetBEUI, DLC, AppleTalk, Services for Macintosh, and so on). Microsoft Cluster Server works only with the TCP/IP protocol.
Do not delete the HKEY_LOCAL_MACHINE\System\Disk registry key while the cluster is running, or if you are using local software fault tolerance.
Appendix A: MSCS Event messages
Event ID 1000
Source: ClusSvc
Description: Microsoft Cluster Server suffered an unexpected fatal error at line ### of source module %path%. The error code was 1006.
Problem: Messages similar to this may occur in the event of a fatal error that may cause the Cluster Service to terminate on the node that experienced the error.
Solution: Check the system event log and the cluster diagnostic logfile for additional information. It is possible that the cluster service may restart itself after the error. This event message may indicate serious problems that may be related to hardware or other causes.
Event ID 1002
Source: ClusSvc
Description: Microsoft Cluster Server handled an unexpected error at line 528 of source module G:\Nt\Private\Cluster\Resmon\Rmapi.c. The error code was 5007.
Problem: Messages similar to this may occur after installation of Microsoft Cluster Server. If the cluster service starts and successfully forms or joins the cluster, they may be ignored. Otherwise, these errors may indicate a corrupt quorum logfile or other problem.
Solution: Ignore the error if the cluster appears to be working properly. Otherwise, you may want to try creating a new quorum logfile using the -noquorumlogging or -fixquorum parameters as documented in the Microsoft Cluster Server Administrator's Guide.
Event ID 1006
Source: ClusSvc
Description: Microsoft Cluster Server was halted because of a cluster membership or communications error. The error code was 4.
Problem: An error may have occurred between communicating cluster nodes that affected cluster membership. This error may occur if nodes lose the ability to communicate with each other.
Solution: Check network adapters and connections between nodes. Check the system event log for errors. There may be a network problem preventing reliable communication between cluster nodes.
Event ID 1007
Source: ClusSvc
Description: A new node, "ComputerName", has been added to the cluster.
Information: The Microsoft Cluster Server Setup program ran on an adjacent computer. The setup process completed, and the node was admitted for cluster membership. No action required.
Event ID 1009
Source: ClusSvc
Description: Microsoft Cluster Server could not join an existing cluster and could not form a new cluster. Microsoft Cluster Server has terminated.
Problem: The cluster service started and attempted to join a cluster. The node may not be a member of an existing cluster because of eviction by an administrator. After a cluster node has been evicted from the cluster, the cluster software must be removed and reinstalled if you want it to rejoin the cluster. And, because a cluster already exists with the same cluster name, the node could not form a new cluster with the same name.
Solution: Remove MSCS from the affected node, and reinstall MSCS on that system if desired.
Event ID 1010
Source: ClusSvc
Description: Microsoft Cluster Server is shutting down because the current node is not a member of any cluster. Microsoft Cluster Server must be reinstalled to make this node a member of a cluster.
Problem: The cluster service attempted to run but found that it is not a member of an existing cluster. This may be due to eviction by an administrator or incomplete attempt to join a cluster. This error indicates a need to remove and reinstall the cluster software.
Solution: Remove MSCS from the affected node, and reinstall MSCS on that server if desired.
Event ID 1011
Source: ClusSvc
Description: Cluster Node "ComputerName" has been evicted from the cluster.
Information: A cluster administrator evicted the specified node from the cluster.
Event ID 1012
Source: ClusSvc
Description: Microsoft Cluster Server did not start because the current version of Windows NT is not correct. Microsoft Cluster Server runs only on Windows NT Server, Enterprise Edition.
Information: The cluster node must be running the Enterprise Edition version of Windows NT Server, and must have Service Pack 3 or later installed. This error may occur if you force an upgrade using the installation disks, which effectively removes any service packs installed.
Event ID 1015
Source: ClusSvc
Description: No checkpoint record was found in the logfile W:\Mscs\Quolog.log; the checkpoint file is invalid or was deleted.
Problem: The Cluster Service experienced difficulty reading data from the quorum logfile. The logfile could be corrupted.
Solution: If the Cluster Service fails to start because of this problem, try manually starting the cluster service with the -noquorumlogging parameter. If you need to adjust the quorum disk designation, use the -fixquorum startup parameter when starting the cluster service. Both of these parameters are covered in the MSCS Administrator's Guide.
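As a hedged illustration of these startup parameters, assuming the service short name ClusSvc (the Administrator's Guide documents entering the parameter in the Startup Parameters box of the Services tool in Control Panel, which accomplishes the same thing):

REM Start the Cluster Service with a recovery switch from a command prompt.
net stop clussvc
net start clussvc /fixquorum
REM Or, to discard a damaged quorum log:
net start clussvc /noquorumlogging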
Event ID 1016
Source: ClusSvc
Description: Microsoft Cluster Server failed to obtain a checkpoint from the cluster database for log file W:\Mscs\Quolog.log.
Problem: The cluster service experienced difficulty establishing a checkpoint for the quorum logfile. The logfile could be corrupt, or there may be a disk problem.
Solution: You may need to use procedures to recover from a corrupt quorum logfile. You may also need to run chkdsk on the volume to ensure against file system corruption.
Event ID 1019
Source: ClusSvc
Description: The log file D:\MSCS\Quolog.log was found to be corrupt. An attempt will be made to reset it, or you should use the Cluster Administrator utility to adjust the maximum size.
Problem: The quorum logfile for the cluster was found to be corrupt. The system will attempt to resolve the problem.
Solution: The system will attempt to resolve this problem. This error may also be an indication that the cluster property for maximum size should be increased through the Quorum tab. You can manually resolve this problem by using the -noquorumlogging parameter.
Event ID 1021
Source: ClusSvc
Description: There is insufficient disk space remaining on the quorum device. Please free up some space on the quorum device. If there is no space on the disk for the quorum log files then changes to the cluster registry will be prevented.
Problem: Available disk space is low on the quorum disk and must be resolved.
Solution: Remove data or unnecessary files from the quorum disk so that sufficient free space exists for the cluster to operate. If necessary, designate another disk with adequate free space as the quorum device.
Event ID 1022
Source: ClusSvc
Description: There is insufficient space left on the quorum device. The Microsoft Cluster Server cannot start.
Problem: Available disk space is low on the quorum disk and is preventing the startup of the cluster service.
Solution: Remove data or unnecessary files from the quorum disk so that sufficient free space exists for the cluster to operate. If necessary, use the -fixquorum startup option to start one node. Bring the quorum resource online and adjust free space or designate another disk with adequate free space as the quorum device.
Event ID 1023
Source: ClusSvc
Description: The quorum resource was not found. The Microsoft Cluster Server has terminated.
Problem: The device designated as the quorum resource could not be found. This could be due to the device having failed at the hardware level, or that the disk resource corresponding to the quorum drive letter does not match or no longer exists.
Solution: Use the -fixquorum startup option for the cluster service. Investigate and resolve the problem with the quorum disk. If necessary, designate another disk as the quorum device and restart the cluster service before starting other nodes.
Event ID 1024
Source: ClusSvc
Description: The registry checkpoint for cluster resource "resourcename" could not be restored to registry key registrykeyname. The resource may not function correctly. Make sure that no other processes have open handles to registry keys in this registry subkey.
Problem: The registry key checkpoint imposed by the cluster service failed because an application or process has an open handle to the registry key or subkey.
Solution: Close any applications that may have an open handle to the registry key so that it may be replicated as configured with the resource properties. If necessary, contact the application vendor about this problem.
Event ID 1034
Source: ClusSvc
Description: The disk associated with cluster disk resource resource name could not be found. The expected signature of the disk was signature. If the disk was removed from the cluster, the resource should be deleted. If the disk was replaced, the resource must be deleted and created again to bring the disk online. If the disk has not been removed or replaced, it may be inaccessible at this time because it is reserved by another cluster node.
Problem: The cluster service attempted to mount a physical disk resource in the cluster. The cluster disk driver could not locate a disk with this signature. The disk may be offline or may have failed. This error may also occur if the drive has been replaced or reformatted. This error may also occur if another system continues to hold a reservation for the disk.
Solution: Determine why the disk is offline or non-operational. Check cables, termination, and power for the device. If the drive has failed, replace the drive and restore the resource to the same group as the old drive. Remove the old resource. Restore data from a backup and adjust resource dependencies within the group to point to the new disk resource.
Event ID 1035
Source: ClusSvc
Description: Cluster disk resource %1 could not be mounted.
Problem: The cluster service attempted to mount a disk resource in the cluster and could not complete the operation. This could be due to a file system problem, hardware issue, or drive letter conflict.
Solution: Check for drive letter conflicts, evidence of file system issues in the system event log, and for hardware problems.
Event ID 1036
Source: ClusSvc
Description: Cluster disk resource "resourcename" did not respond to a SCSI inquiry command.
Problem: The disk did not respond to the issued SCSI command. This usually indicates a hardware problem.
Solution: Check SCSI bus configuration. Check the configuration of SCSI adapters and devices. This may indicate a misconfigured or a failing device.
Event ID 1037
Source: ClusSvc
Description: Cluster disk resource %1 has failed a filesystem check. Please check your disk configuration.
Problem: The cluster service attempted to mount a disk resource in the cluster. A filesystem check was necessary and failed during the process.
Solution: Check cables, termination, and device configuration. If the drive has failed, replace the drive and restore data. This may also indicate a need to reformat the partition and restore data from a current backup.
Event ID 1038
Source: ClusSvc
Description: Reservation of cluster disk "Disk W:" has been lost. Please check your system and disk configuration.
Problem: The cluster service had exclusive use of the disk, and lost the reservation of the device on the shared SCSI bus.
Solution: The disk may have gone offline or failed. Another node may have taken control of the disk or a SCSI bus reset command was issued on the bus that caused a loss of reservation.
Event ID 1040
Source: ClusSvc
Description: Cluster generic service "ServiceName" could not be found.
Problem: The cluster service attempted to bring the specified generic service resource online. The service could not be located and could not be managed by the Cluster Service.
Solution: Remove the generic service resource if this service is no longer installed. The parameters for the resource may be invalid. Check the generic service resource properties and confirm correct configuration.
Event ID 1041
Source: ClusSvc
Description: Cluster generic service "ServiceName" could not be started.
Problem: The cluster service attempted to bring the specified generic service resource online. The service could not be started at the operating system level.
Solution: Remove the generic service resource if this service is no longer installed. The parameters for the resource may be invalid. Check the generic service resource properties and confirm correct configuration. Check to make sure the service account has not expired, that it has the correct password, and has necessary rights for the service to start. Check the system event log for any related errors.
Event ID 1042
Source: ClusSvc
Description: Cluster generic service "resourcename" failed.
Problem: The service associated with the mentioned generic service resource failed.
Solution: Check the generic service properties and service configuration for errors. Check system and application event logs for errors.
Event ID 1043
Source: ClusSvc
Description: The NetBIOS interface for "IP Address" resource has failed.
Problem: The network adapter for the specified IP address resource has experienced a failure. As a result, the IP address is either offline, or the group has moved to a surviving node in the cluster.
Solution: Check the network adapter and network connection for problems. Resolve the network-related problem.
Event ID 1044
Source: ClusSvc
Description: Cluster IP Address resource %1 could not create the required NetBios interface.
Problem: The cluster service attempted to initialize an IP Address resource and could not establish a context with NetBios.
Solution: This could be a network adapter or network adapter driver related issue. Make sure the adapter is using a current driver and the correct driver for the adapter. If this is an embedded adapter, check with the OEM to determine if a specific OEM version of the driver is a requirement. If you already have many IP Address resources defined, make sure you have not reached the NetBios limit of 64 addresses. If you have IP Address resources defined that do not have a need for NetBios affiliation, use the IP Address private property to disable NetBios for the address. This option is available in SP4 and helps to conserve NetBios address slots.
Event ID 1045
Source: ClusSvc
Description: Cluster IP address "IP address" could not create the required TCP/IP Interface.
Problem: The cluster service tried to bring an IP address online. The resource properties may specify an invalid network or malfunctioning adapter. This error may occur if you replace a network adapter with a different model and continue to use the old or inappropriate driver. As a result, the IP address resource cannot be bound to the specified network.
Solution: Resolve the network adapter problem or change the properties of the IP address resource to reflect the proper network for the resource.
Event ID 1046
Source: ClusSvc
Description: Cluster IP Address resource %1 cannot be brought online because the subnet mask parameter is invalid. Please check your network configuration.
Problem: The cluster service tried to bring an IP address resource online but could not do so. The subnet mask for the resource is either blank or otherwise invalid.
Solution: Correct the subnet mask for the resource.
Event ID 1047
Source: ClusSvc
Description: Cluster IP Address resource %1 cannot be brought online because the IP address parameter is invalid. Please check your network configuration.
Problem: The cluster service tried to bring an IP address resource online but could not do so. The IP address property contains an invalid value. This may be caused by incorrectly creating the resource through an API or the command line interface.
Solution: Correct the IP address properties for the resource.
Event ID 1048
Source: ClusSvc
Description: Cluster IP address, "IP address," cannot be brought online because the specified adapter name is invalid.
Problem: The cluster service tried to bring an IP address online. The resource properties may specify an invalid network or a malfunctioning adapter. This error may occur if you replace a network adapter with a different model. As a result, the IP address resource cannot be bound to the specified network.
Solution: Resolve the network adapter problem or change the properties of the IP address resource to reflect the proper network for the resource.
Event ID 1049
Source: ClusSvc
Description: Cluster IP address "IP address" cannot be brought online because the address IP address is already present on the network. Please check your network configuration.
Problem: The cluster service tried to bring an IP address online. The address is already in use on the network and cannot be registered. Therefore, the resource cannot be brought online.
Solution: Resolve the IP address conflict, or choose another address for the resource.
Event ID 1050
Source: ClusSvc
Description: Cluster Network Name resource %1 cannot be brought online because the name %2 is already present on the network. Please check your network configuration.
Problem: The cluster service tried to bring a Network Name resource online. The name is already in use on the network and cannot be registered. Therefore, the resource cannot be brought online.
Solution: Resolve the conflict, or choose another network name.
Event ID 1051
Source: ClusSvc
Description: Cluster Network Name resource "resourcename" cannot be brought online because it does not depend on an IP address resource. Please add an IP address dependency.
Problem: The cluster service attempted to bring the network name resource online, and found that a required dependency was missing.
Solution: Microsoft Cluster Server requires an IP address dependency for network name resource types. Cluster Administrator presents a pop-up message if you attempt to remove this dependency without specifying another like dependency. To resolve this error, replace the IP address dependency for this resource. Because it is difficult to remove this dependency, Event 1051 may be an indication of problems within the cluster registry. Check other resources for possible dependency problems.
Event ID 1052
Source: ClusSvc
Description: Cluster Network Name resource "resourcename" cannot be brought online because the name could not be added to the system.
Problem: The cluster service attempted to bring the network name resource online but the attempt failed.
Solution: Check the system event log for errors. Check network adapter configuration and operation. Check TCP/IP configuration and name resolution methods. Check WINS servers for possible database problems or invalid static mappings.
Event ID 1053
Source: ClusSvc
Description: Cluster File Share "resourcename" cannot be brought online because the share could not be created.
Problem: The cluster service attempted to bring the share online but the attempt to create the share failed.
Solution: Make sure the Server service is started and functioning properly. Check the path for the share. Check ownership and permissions on the directory. Check the system event log for details. Also, if diagnostic logging is enabled, check the log for an entry related to this failure. Use the net helpmsg errornumber command with the error code found in the log entry.
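For example, translating a status code found in the log takes one command (error code 5 is shown purely as an illustration):

net helpmsg 5
REM Returns "Access is denied." - a common cause of file share failures when the
REM Cluster Service account lacks permission to the shared directory.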
Event ID 1054
Source: ClusSvc
Description: Cluster File Share %1 could not be found.
Problem: The share corresponding to the named File Share resource was deleted using a mechanism other than Cluster Administrator. This may occur if you select the share with Explorer and choose 'Not Shared'.
Solution: Delete shares or take them offline via Cluster Administrator or the command line program CLUSTER.EXE.
Event ID 1055
Source: ClusSvc
Description: Cluster File Share "sharename" has failed a status check.
Problem: The cluster service (through resource monitors) periodically monitors the status of cluster resources. In this case, a file share failed a status check. This could mean that someone attempted to delete the share through Windows NT Explorer or Server Manager, instead of through Cluster Administrator. This event could also indicate a problem with the Server service, or access to the shared directory.
Solution: Check the system event log for errors. Check the cluster diagnostic log (if it is enabled) for status codes that may be related to this event. Check the resource properties for proper configuration. Also, make sure the file share has proper dependencies defined for related resources.
Event ID 1056
Source: ClusSvc
Description: The cluster database on the local node is in an invalid state. Please start another node before starting this node.
Problem: The cluster database on the local node may be in a default state from the installation process and the node has not properly joined with an existing node.
Solution: Make sure another node of the same cluster is online first before starting this node. Upon joining with another cluster node, the node will receive an updated copy of the official cluster database and should alleviate this error.
Event ID 1057
Source: ClusSvc
Description: The cluster service CLUSDB could not be opened.
Problem: The Cluster Service tried to open the CLUSDB registry hive and could not do so. As a result, the cluster service cannot be brought online.
Solution: Check the cluster installation directory for the existence of a file called CLUSDB. Make sure the registry file is not held open by any applications, and that permissions on the file allow the cluster service access to this file and directory.
Event ID 1058
Source: ClusSvc
Description: The Cluster Resource Monitor could not load the DLL %1 for resource type %2.
Problem: The Cluster Service tried to bring a resource online that requires a specific resource DLL for the resource type. The DLL is either missing, corrupt, or an incompatible version. As a result, the resource cannot be brought online.
Solution: Check the cluster installation directory for the existence of the named resource DLL. Make sure the DLL exists in the proper directory on both nodes.
Event ID 1059
Source: ClusSvc
Description: The Cluster Resource DLL %1 for resource type %2 failed to initialize.
Problem: The Cluster Service tried to load the named resource DLL and it failed to initialize. The DLL could be corrupt, or an incompatible version. As a result, the resource cannot be brought online.
Solution: Check the cluster installation directory for the existence of the named resource DLL. Make sure the DLL exists in the proper directory on both nodes and is of proper version. If the DLL is clusres.dll, this is the default resource DLL that comes with MSCS. Check to make sure the version/date stamp is equivalent to or with a later date than the version contained in the service pack in use.
Event ID 1061
Source: ClusSvc
Description: Microsoft Cluster Server successfully formed a cluster on this node.
Information: This informational message indicates that an existing cluster of the same name was not detected on the network, and that this node elected to form the cluster and own access to the quorum disk.
Event ID 1062
Source: ClusSvc
Description: Microsoft Cluster Server successfully joined the cluster.
Information: When the Cluster Service started, it detected an existing cluster on the network and was able to successfully join the cluster. No action needed.
Event ID 1063
Source: ClusSvc
Description: Microsoft Cluster Server was successfully stopped.
Information: The Cluster Service was stopped manually by the administrator.
Event ID 1064
Source: ClusSvc
Description: The quorum resource was changed. The old quorum resource could not be marked as obsolete. If there is a partition in time, you may lose changes to your database, because the node that is down will not be able to get to the new quorum resource.
Problem: The administrator changed the quorum disk designation without all cluster nodes present.
Solution: When other cluster nodes attempt to join the existing cluster, they may not be able to connect to the quorum disk, and may not participate in the cluster, because their configuration indicates a different quorum device. For any nodes that meet this criterion, you may need to use the -fixquorum option to start the Cluster Service on these nodes and make configuration changes.
Event ID 1065
Source: ClusSvc
Description: Cluster resource %1 failed to come online.
Problem: The cluster service attempted to bring the resource online, but the resource could not reach an online status. The resource may have exhausted the timeout period allotted for the resource to reach an online state.
Solution: Check any parameters related to the resource and check the event log for details.
Event ID 1066
Source: ClusSvc
Description: Cluster disk resource resourcename is corrupted. Running Chkdsk /F to repair problems.
Problem: The Cluster Service detected corruption on the indicated disk resource and started Chkdsk /f on the volume to repair the structure. The Cluster Service will automatically perform this operation, but only for cluster-defined disk resources (not local disks).
Solution: Scan the event log for additional errors. The disk corruption could be indicative of other problems. Check related hardware and devices on the shared bus and ensure proper cables and termination. This error may be a symptom of failing hardware or a deteriorating drive.
Event ID 1067
Source: ClusSvc
Description: Cluster disk resource %1 has corrupt files. Running Chkdsk /F to repair problems.
Problem: The Cluster Service detected corruption on the indicated disk resource and started Chkdsk /f on the volume to repair the structure. The Cluster Service will automatically perform this operation, but only for cluster-defined disk resources (not local disks).
Solution: Scan the event log for additional errors. The disk corruption could be indicative of other problems. Check related hardware and devices on the shared bus and ensure proper cables and termination. This error may be a symptom of failing hardware or a deteriorating drive.
Event ID 1068
Source: ClusSvc
Description: The cluster file share resource resourcename failed to start. Error 5.
Problem: The file share cannot be brought online. The problem may be caused by permissions to the directory or disk in which the directory resides. This may also be related to permission problems within the domain.
Solution: Check to make sure that the Cluster Service account has rights to the directory to be shared. Make sure a domain controller is accessible on the network. Make sure dependencies for the share and for other resources in the group are set correctly. Error 5 translates to "Access Denied."
Event ID 1069
Source: ClusSvc
Description: Cluster resource "Disk G:" failed.
Problem: The named resource failed and the cluster service logged the event. In this example, a disk resource failed.
Solution: For disk resources, check the device for proper operation. Check cables, termination, and logfiles on both cluster nodes. For other resources, check resource properties for proper configuration, and check to make sure dependencies are configured correctly. Check the diagnostic log (if it is enabled) for status codes corresponding to the failure.
Event ID 1070
Source: ClusSvc
Description: Cluster node attempted to join the cluster but failed with error 5052.
Problem: The cluster node attempted to join an existing cluster but was unable to complete the process. This problem may occur if the node was previously evicted from the cluster.
Solution: If the node was previously evicted from the cluster, you must remove and reinstall MSCS on the affected server.
Event ID 1071
Source: ClusSvc
Description: Cluster node 2 attempted to join but was refused. Error 5052.
Problem: Another node attempted to join the cluster and this node refused the request.
Solution: If the node was previously evicted from the cluster, you must remove and reinstall MSCS on the affected server. Look in Cluster Administrator to see if the other node is listed as a possible cluster member.
Event ID 1073
Source: ClusSvc
Description: Microsoft Cluster Server was halted to prevent an inconsistency within the cluster. The error code was 5028.
Problem: The cluster service on the affected node was halted because of some kind of inconsistency between cluster nodes.
Solution: Check connectivity between systems. This error may be an indication of configuration or hardware problems.
Event ID 1077
Source: ClusSvc
Description: The TCP/IP interface for cluster IP address resourcename has failed.
Problem: The IP address resource depends on the proper operation of a specific network interface as configured in the resource properties. The network interface failed.
Solution: Check the system event log for errors. Check the network adapter for proper operation and replace the adapter if necessary. Check to make sure the proper adapter driver is loaded for the device and check for newer versions of the driver.
Event ID 1080
Source: ClusSvc
Description: The Microsoft Cluster Server could not write file W:\MSCS\Chk7f5.tmp. The disk may be low on disk space, or some other serious condition exists.
Problem: The cluster service attempted to create a temporary file in the MSCS directory on the quorum disk. Lack of disk space or other factors prevented successful completion of the operation.
Solution: Check the quorum drive for available disk space. The file system may be corrupted or the device may be failing. Check file system permissions to ensure that the cluster service account has full access to the drive and directory.
Event ID 1093
Source: ClusSvc
Description: Node %1 is not a member of cluster %2. If the name of the node has changed, Microsoft Cluster Server must be reinstalled.
Problem: The cluster service attempted to start but found that it was not a valid member of the cluster.
Solution: Microsoft Cluster Server may need to be reinstalled on this node. If this is the result of a server name change, be sure to evict the node from the cluster (from an operational node) prior to reinstallation.
Event ID 1096
Source: ClusSvc
Description: Microsoft Cluster Server cannot use network adapter %1 because it does not have a valid IP address assigned to it.
Problem: The network configuration for the adapter has changed and the cluster service cannot make use of the adapter for the network that was assigned to it.
Solution: Check the network configuration. If a DHCP address was used for the primary address of the adapter, the address may have been lost. For best results, use a static address.
Event ID 1097
Source: ClusSvc
Description: Microsoft Cluster Server did not find any network adapters with valid IP addresses installed in the system. The node will not be able to join a cluster.
Problem: The network configuration for the system needs to be corrected to match the same connected networks as the other node of the cluster.
Solution: Check the network configuration and make sure it agrees with the working node of the cluster. Make sure the same networks are accessible from all systems in the cluster.
Event ID 1098
Source: ClusSvc
Description: The node is no longer attached to cluster network network_id by adapter adapter. Microsoft Cluster Server will delete network interface interface from the cluster configuration.
Information: The Cluster Service observed a change in network configuration that might be induced by a change of adapter type or by removal of a network. The network will be removed from the list of available networks.
Event ID 1100
Source: ClusSvc
Description: Microsoft Cluster Server discovered that the node is now attached to cluster network network_id by adapter adapter. A new cluster network interface will be added to the cluster configuration.
Information: The Cluster Service noticed a new network accessible by the cluster nodes, and has added the new network to the list of accessible networks.
Event ID 1102
Source: ClusSvc
Description: Microsoft Cluster Server discovered that the node is attached to a new network by adapter adapter. A new network and network interface will be added to the cluster configuration.
Information: The cluster service noticed the addition of a new network. The network will be added to the list of available networks.
Event ID 1104
Source: ClusSvc
Description: Microsoft Cluster Server failed to update the configuration for one of the node's network interfaces. The error code was errorcode.
Problem: The cluster service attempted to update a cluster node and could not perform the operation.
Solution: Use the net helpmsg errorcode command to find an explanation of the underlying error. For example, error 1393 indicates that a corrupted disk caused the operation to fail.
Event ID 1105
Source: ClusSvc
Description: Microsoft Cluster Server failed to initialize the RPC services. The error code was %1.
Problem: The cluster service attempted to utilize required RPC services and could not successfully perform the operation.
Solution: Use the net helpmsg errorcode command to find an explanation of the underlying error. Check the system event log for other RPC related errors or performance problems.
Event ID 1107
Source: ClusSvc
Description: Cluster node node name failed to make a connection to the node over network network name. The error code was 1715.
Problem: The cluster service attempted to connect to another cluster node over a specific network and could not establish a connection. This error is a warning message.
Solution: Check to make sure that the specified network is available and functioning correctly. If the node experiences this problem, it may try other available networks to establish the desired connection.
Event ID 1109
Source: ClusSvc
Description: The node was unable to secure its connection to cluster node %1. The error code was %2. Check that both nodes can communicate with their domain controllers.
Problem: The cluster service attempted to connect to another cluster node and could not establish a secure connection. This could indicate domain connectivity problems.
Solution: Check to make sure that the networks are available and functioning correctly. This may be a symptom of larger network problems or domain security issues.
Event ID 1115
Source: ClusSvc
Description: An unrecoverable error caused the join of node nodename to the cluster to be aborted. The error code was errorcode.
Problem: A node attempted to join the cluster but was unable to obtain successful membership.
Solution: Use the NET HELPMSG errorcode command to obtain further description of the error that prevented the join operation. For example, error code 1393 indicates that a disk structure is corrupted and nonreadable. An error code like this could indicate a corrupted quorum disk.
Related Event Messages
Event ID 9
Source: |
Disk |
---|---|
Description: |
The device, \Device\ScsiPort2, did not respond within the timeout period. |
Problem: |
An I/O request was sent to a SCSI device and was not serviced within acceptable time. The device timeout was logged by this event. |
Solution: |
You may have a device or controller problem. Check SCSI cables, termination, and adapter configuration. Excessive recurrence of this event message may indicate a serious problem that could indicate potential for data loss or corruption. If necessary, contact your hardware vendor for help troubleshooting this problem. |
Event ID 101
Source: |
W3SVC |
---|---|
Description: |
The server was unable to add the virtual root "/" for the directory "path" because of the following error: The system cannot find the path specified. The data is the error. |
Problem: |
The World Wide Web Publishing service could not create a virtual root for the IIS Virtual Root resource. The directory path may have been deleted. |
Solution: |
Re-create or restore the directory and contents. Check the resource properties for the IIS Virtual Root resource and ensure that the path is correct. This problem may occur if you had an IIS Virtual Root resource defined and then uninstalled Microsoft Cluster Server without first deleting the resource. In this case, you may evaluate and change virtual root properties by using the Internet Service Manager. |
Event ID 1004
Source: DHCP
Description: DHCP IP address lease "IP address" for the card with network address "media access control address" has been denied.
Problem: This system uses a DHCP-assigned IP address for a network adapter. The system attempted to renew the leased address and the DHCP server denied the request. The address may already be allocated to another system. The DHCP server may also have a problem. Network connectivity may be affected by this problem.
Solution: Resolve the problem by correcting DHCP server problems or assigning a static IP address. For best results within a cluster, use statically assigned IP addresses.
Event ID 1005
Source: DHCP
Description: DHCP failed to renew a lease for the card with network address "MAC address." The following error occurred: The semaphore timeout period has expired.
Problem: This system uses a DHCP-assigned IP address for a network adapter. The system attempted to renew the leased address and was unable to renew the lease. Network operations on this system may be affected.
Solution: There may be a connectivity problem preventing access to the DHCP server that leased the address, or the DHCP server may be offline. For best results within a cluster, use statically assigned IP addresses.
Event ID 2511
Source: Server
Description: The server service was unable to recreate the share "Sharename" because the directory "path" no longer exists.
Problem: The Server service attempted to create a share using the specified directory path. This problem may occur if you create a share (outside of Cluster Administrator) on a cluster shared device. If the device is not exclusively available to this computer, the Server service cannot create the share. Also, the directory may no longer exist, or there may be RPC-related issues.
Solution: Correct the problem by creating a shared resource through Cluster Administrator, or correct the problem with the missing directory. Check the dates of the RPC files in the system32 directory and make sure they match those contained in the service pack in use, or in any hotfixes applied.
Event ID 4199
Source: TCPIP
Description: The system detected an address conflict for IP address "IP address" with the system having network hardware address "media access control address." Network operations on this system may be disrupted as a result.
Problem: Another system on the network may be using one of the addresses configured on this computer.
Solution: Resolve the IP address conflict. Check network adapter configuration and any IP address resources defined within the cluster.
Event ID 5719
Source: Netlogon
Description: No Windows NT Domain controller is available for domain "domain." (This event is expected and can be ignored when booting with the "No Net" hardware profile.) The following error occurred: There are currently no logon servers available to service the logon request.
Problem: A domain controller for the domain could not be contacted. As a result, proper authentication of accounts could not be completed. This may occur if the network is disconnected or disabled through system configuration.
Solution: Resolve the connectivity problem with the domain controller and restart the system.
Event ID 7000
Source: Service Control Manager
Description: The Cluster Service failed to start because of the following error: The service did not start because of a logon failure.
Problem: The service control manager attempted to start a service (possibly ClusSvc). It could not authenticate the service account. This error may be seen with Event 7013.
Solution: The service account could not be authenticated. This may be because of a failure contacting a domain controller, or because account credentials are invalid. Check the service account name and password and ensure that the account is available and that credentials are correct. You may also try running the cluster service from a command prompt (if currently logged on as an administrator) by changing to the %systemroot%\Cluster directory (or where you installed the software) and typing ClusSvc -debug. If the service starts and runs correctly, stop it by pressing CTRL+C and troubleshoot the service account problem. This error may also occur if network connectivity is disabled through the system configuration or hardware profile. Microsoft Cluster Server requires network connectivity.
Event ID 7013
Source: Service Control Manager
Description: Logon attempt with current password failed with the following error: There are currently no logon servers available to service the logon request.
More Info: The description for this error message may vary somewhat based on the actual error. For example, another error that may be listed in the event detail might be: "Logon Failure: unknown username or bad password."
Problem: The service control manager attempted to start a service (possibly ClusSvc). It could not authenticate the service account with a domain controller.
Solution: The service account may be in another domain, or this system may not be a domain controller. It is acceptable for the node not to be a domain controller, but the node needs access to a domain controller in its own domain as well as in the domain to which the service account belongs. Inability to contact the domain controller may be because of a problem with the server, the network, or other factors. This problem is not related to the cluster software and must be resolved before you start the cluster software. This error may also occur if network connectivity is disabled through the system configuration or hardware profile. Microsoft Cluster Server requires network connectivity.
Event ID 7023
Source: Service Control Manager
Description: The Cluster Server service terminated with the following error: The quorum log could not be created or mounted successfully.
Problem: The Cluster Service attempted to start but could not gain access to the quorum log on the quorum disk. This may be because of problems gaining access to the disk or problems joining a cluster that has already formed.
Solution: Check the disk and quorum log for problems. If necessary, check the cluster logfile for more information. Other events in the system event log may give additional information.
Appendix B: Using AND Reading THE Cluster Logfile
CLUSTERLOG Environment Variable
If you set the CLUSTERLOG environment variable, the Cluster Service will create a logfile that contains diagnostic information at the path specified. Important events during the operation of the Cluster Service are logged in this file. Because so many different events occur, the logfile may be somewhat cryptic or hard to read. This section gives some hints about how to read the logfile and describes what items to look for.
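For example, when you run the Cluster Service interactively for troubleshooting (the ClusSvc -debug technique mentioned with Event ID 7000 in Appendix A), you might set the variable in the same command prompt session before starting the service. The path shown below is only an illustration; choose any local path with sufficient free space. For the service when it is started normally, set CLUSTERLOG as a system environment variable (Control Panel, System, Environment tab) and restart the node so that the new value takes effect.
C:\>SET CLUSTERLOG=C:\WINNT\Cluster\cluster.log
C:\>CD %systemroot%\Cluster
C:\>ClusSvc -debug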
Note: Each time you attempt to start the Cluster Service, the log will be cleared and a new logfile started. Each component of MSCS that places an entry in the logfile will indicate itself by abbreviation in square brackets. For example, the Node Manager component would be abbreviated [NM]. Logfile entries will vary from one cluster to another. As a result, other logfiles may vary from excerpts referenced in this document.
Note: Log entry lines in the following sections have been wrapped for space constraints in this document. The lines do not normally wrap.
Operating System Version Number and Service Pack Level
Near the beginning of the logfile, notice the build number of MSCS, followed by the operating system version number and service pack level. If you call for support, engineers may ask for this information:
082::14-21:29:26.625 Cluster Service started - Cluster Version 1.224.
082::14-21:29:26.625 OS Version 4.0.1381 - Service Pack 3.
Cluster Service Startup
Following the version information, some initialization steps occur. Those steps are followed by an attempt to join the cluster, if one node already exists in a running state. If the Cluster Service could not detect any other cluster members, it will attempt to form the cluster. Consider the following log entries:
0b5::12-20:15:23.531 We're initing Ep... 0b5::12-20:15:23.531 [DM]: Initialization 0b5::12-20:15:23.531 [DM] DmpRestartFlusher: Entry 0b5::12-20:15:23.531 [DM] DmpStartFlusher: Entry 0b5::12-20:15:23.531 [DM] DmpStartFlusher: thread created 0b5::12-20:15:23.531 [NMINIT] Initializing the Node Manager... 0b5::12-20:15:23.546 [NMINIT] Local node name = NODEA. 0b5::12-20:15:23.546 [NMINIT] Local node ID = 1. 0b5::12-20:15:23.546 [NM] Creating object for node 1 (NODEA) 0b5::12-20:15:23.546 [NM] node 1 state 1 0b5::12-20:15:23.546 [NM] Initializing networks. 0b5::12-20:15:23.546 [NM] Initializing network interface facilities. 0b5::12-20:15:23.546 [NMINIT] Initialization complete. 0b5::12-20:15:23.546 [FM] Starting worker thread... 0b5::12-20:15:23.546 [API] Initializing 0a9::12-20:15:23.546 [FM] Worker thread running 0b5::12-20:15:23.546 [lm] :LmInitialize Entry. 0b5::12-20:15:23.546 [lm] :TimerActInitialize Entry. 0b5::12-20:15:23.546 [CS] Initializing RPC server. 0b5::12-20:15:23.609 [INIT] Attempting to join cluster MDLCLUSTER 0b5::12-20:15:23.609 [JOIN] Spawning thread to connect to sponsor 192.88.80.114 06c::12-20:15:23.609 [JOIN] Asking 192.88.80.114 to sponsor us. 0b5::12-20:15:23.609 [JOIN] Waiting for all connect threads to terminate. 06c::12-20:15:32.750 [JOIN] Sponsor 192.88.80.114 is not available, status=1722. 0b5::12-20:15:32.750 [JOIN] All connect threads have terminated. 0b5::12-20:15:32.750 [JOIN] Unable to connect to any sponsor node. 0b5::12-20:15:32.750 [INIT] Failed to join cluster, status 53 0b5::12-20:15:32.750 [INIT] Attempting to form cluster MDLCLUSTER 0b5::12-20:15:32.750 [Ep]: EpInitPhase1 0b5::12-20:15:32.750 [API] Online read only 04b::12-20:15:32.765 [RM] Main: Initializing.
Note that the cluster service attempts to join the cluster. If it cannot connect with an existing member, the software decides to form the cluster. The next series of steps attempts to form groups and resources necessary to accomplish this task. It is important to note that the cluster service must arbitrate control of the quorum disk.
0b5::12-20:15:32.781 [FM] Creating group a1a13a86-0eaf-11d1 -8427-0000f8034599 0b5::12-20:15:32.781 [FM] Group a1a13a86-0eaf-11 d1-8427-0000f8034599 contains a1a13a87-0eaf-11d1-8427-0000f8034599. 0b5::12-20:15:32.781 [FM] Creating resource a1a13a87-0eaf- 11d1-8427-0000f8034599 0b5::12-20:15:32.781 [FM] FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427-0000f8034599 possible node list 0b5::12-20:15:32.781 [FMX] Found the quorum resource a1a13a87-0eaf-11d1-8427-0000f8034599. 0b5::12-20:15:32.781 [FM] All dependencies for a 1a13a87-0eaf-11d1-8427-0000f8034599 created 0b5::12-20:15:32.781 [FM] arbitrate for quorum resource id a1a13a87-0eaf-11d1-8427-0000f8034599. 0b5::12-20:15:32.781 FmpRmCreateResource: creating resource a1a13a87-0eaf-11d1-8427-0000f8034599 in shared resource monitor 0b5::12-20:15:32.812 FmpRmCreateResource: created resource a1a13a87-0eaf-11d1-8427-0000f8034599, resid 1363016 0dc::12-20:15:32.828 Physical Disk <Disk D:>: Arbitrate returned status 0. 0b5::12-20:15:32.828 [FM] FmGetQuorumResource successful 0b5::12-20:15:32.828 FmpRmOnlineResource: bringing resource a1a13a87-0eaf-11d1-8427-0000f8034599 (resid 1363016) online. 0b5::12-20:15:32.843 [CP] CppResourceNotify for resource Disk D: 0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker waiting type 0 context 8 0b5::12-20:15:32.843 [GUM] Thread 0xb5 UpdateLock wait on Type 0 0b5::12-20:15:32.843 [GUM] DoLockingUpdate successful, lock granted to 1 0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker dispatching seq 388 type 0 context 8 0b5::12-20:15:32.843 [GUM] GumpDoUnlockingUpdate releasing lock ownership 0b5::12-20:15:32.843 [GUM] GumSendUpdate: completed update seq 388 type 0 context 8 0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker waiting type 0 context 9 0b5::12-20:15:32.843 [GUM] Thread 0xb5 UpdateLock wait on Type 0 0b5::12-20:15:32.843 [GUM] DoLockingUpdate successful, lock granted to 1 0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker dispatching seq 389 type 0 context 9 0b5::12-20:15:32.843 [GUM] GumpDoUnlockingUpdate releasing lock ownership 0b5::12-20:15:32.843 [GUM] GumSendUpdate: completed update seq 389 type 0 context 9 0b5::12-20:15:32.843 FmpRmOnlineResource: Resource a1a13a87-0eaf-11d1-8427-0000f8034599 pending 0e1::12-20:15:33.359 Physical Disk <Disk D:>: Online, created registry watcher thread. 090::12-20:15:33.359 [FM] NotifyCallBackRoutine: enqueuing event 04d::12-20:15:33.359 [FM] WorkerThread, processing transition event for a1a13a87-0eaf-11 d1-8427-0000f8034599, oldState = 129, newState = 2. 04d::12-20:15:33.359 [FM] HandleResourceTransition: Resource Name = a1a13a87-0eaf-11d1-8427-0000f8034599 old state=129 new state=2 04d::12-20:15:33.359 [DM] DmpQuoObjNotifyCb: Quorum resource is online 04d::12-20:15:33.375 [DM] DmpQuoObjNotifyCb: Own quorum resource, try open the quorum log 04d::12-20:15:33.375 [DM] DmpQuoObjNotifyCb: the name of the quorum file is D:\MSCS\quolog.log 04d::12-20:15:33.375 [lm] LogCreate : Entry FileName=D:\MSCS\quolog.log MaxFileSize= 0x00010000 04d::12-20:15:33.375 [lm] LogpCreate : Entry
In this case, the node forms the cluster group and quorum disk resource, gains control of the disk, and opens the quorum logfile. From here, the cluster performs operations with the logfile, and proceeds to form the cluster. This involves configuring network interfaces and bringing them online.
0b5::12-20:15:33.718 [NM] Beginning form process. 0b5::12-20:15:33.718 [NM] Synchronizing node information. 0b5::12-20:15:33.718 [NM] Creating node objects. 0b5::12-20:15:33.718 [NM] Configuring networks & interfaces. 0b5::12-20:15:33.718 [NM] Synchronizing network information. 0b5::12-20:15:33.718 [NM] Synchronizing interface information. 0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Entry 0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Exit, pLocalXsaction=0x00151c20 dwError=0x00000000 0b5::12-20:15:33.718 [NM] Setting database entry for interface a1a13a7f-0eaf-11d1-8427-0000f8034599 0b5::12-20:15:33.718 [dm] DmCommitLocalUpdate Entry 0b5::12-20:15:33.718 [dm] DmCommitLocalUpdate Exit, dwError=0x00000000 0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Entry 0b5::12-20:15:33.875 [dm] DmBeginLocalUpdate Exit, pLocalXsaction=0x00151c20 dwError=0x00000000 0b5::12-20:15:33.875 [NM] Setting database entry for interface a1a13a81-0eaf-11d1-8427-0000f8034599 0b5::12-20:15:33.875 [dm] DmCommitLocalUpdate Entry 0b5::12-20:15:33.875 [dm] DmCommitLocalUpdate Exit, dwError=0x00000000 0b5::12-20:15:33.875 [NM] Matched 2 networks, created 0 new networks. 0b5::12-20:15:33.875 [NM] Resynchronizing network information. 0b5::12-20:15:33.875 [NM] Resynchronizing interface information. 0b5::12-20:15:33.875 [NM] Creating network objects. 0b5::12-20:15:33.875 [NM] Creating object for network a1a13a7e-0eaf-11d1- 8427-0000f8034599 0b5::12-20:15:33.875 [NM] Creating object for network a1a13a80-0eaf-11d1- 8427-0000f8034599 0b5::12-20:15:33.875 [NM] Creating interface objects. 0b5::12-20:15:33.875 [NM] Creating object for interface a1a13a7f-0eaf-11d1-8427- 0000f8034599. 0b5::12-20:15:33.875 [NM] Registering network a1a13a7e-0eaf-11d1-8427- 0000f8034599 with cluster transport. 0b5::12-20:15:33.875 [NM] Registering interfaces for network a1a13a7e-0eaf-11d1-8427- 0000f8034599 with cluster transport. 0b5::12-20:15:33.875 [NM] Registering interface a1a13a7f-0eaf- 11d1-8427-0000f8034599 with cluster transport, addr 9.9.9.2, endpoint 3003. 0b5::12-20:15:33.890 [NM] Instructing cluster transport to bring network a1a13a7e-0eaf-11d1- 8427-0000f8034599 online. 0b5::12-20:15:33.890 [NM] Creating object for interface a1a13a81-0eaf-11d1- 8427-0000f8034599. 0b5::12-20:15:33.890 [NM] Registering network a1a13a80-0eaf-11d1-8427- 0000f8034599 with cluster transport. 0b5::12-20:15:33.890 [NM] Registering interfaces for network a1a13a80-0eaf-11d1-8427- 0000f8034599 with cluster transport. 0b5::12-20:15:33.890 [NM] Registering interface a1a13a81-0eaf-11d1-8427- 0000f8034599 with cluster transport, addr 192.88.80.190, endpoint 3003. 0b5::12-20:15:33.890 [NM] Instructing cluster transport to bring network a1a13a80-0eaf-11d1- 8427-0000f8034599 online.
After initializing network interfaces, the cluster will continue formation with the enumeration of cluster nodes. In this case, as a newly formed cluster, the cluster will contain only one node. If this session had been joining an existing cluster, the node enumeration would show two nodes. Next, the cluster will bring the Cluster IP address and Cluster Name resources online.
0b5::12-20:15:34.015 [FM] OnlineGroup: setting group state to Online for f901aa29-0eaf-11d1- 8427-0000f8034599 069::12-20:15:34.015 IP address < Cluster IP address>: Created NBT interface \Device\NetBt_ If6 (instance 355833456). 0b5::12-20:15:34.015 [FM] FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427 -0000f8034599 possible node list 0b5::12-20:15:34.015 [FM] FmFormNewClusterPhase2 complete. . . . 0b5::12-20:15:34.281 [INIT] Successfully formed a cluster. 09c::12-20:15:34.281 [lm] :ReSyncTimerHandles Entry. 09c::12-20:15:34.281 [lm] :ReSyncTimerHandles Exit gdwNumHandles=3 0b5::12-20:15:34.281 [INIT] Cluster Started! Original Min WS is 204800, Max WS is 1413120. 08c::12-20:15:34.296 [CPROXY] clussvc initialized 069::12-20:15:40.421 IP address <Cluster IP Address>: IP Address 192.88.80.114 on adapter DC21X41 online . . . 04d::12-20:15:40.421 [FM] OnlineWaitingTree, a1a13a84-0eaf-11d1-8427-0000f8034599 depends on a1a13a83-0eaf-11d1-8427-0000f8034599. Start first 04d::12-20:15:40.421 [FM] OnlineWaitingTree, Start resource a1a13a84-0eaf-11d1-8427-0000f8034599 04d::12-20:15:40.421 [FM] OnlineResource: a1a13a84-0eaf-11d1-8427-0000f8034599 depends on a1a13a83-0eaf-11d1-8427-0000f8034599. Bring online first. 04d::12-20:15:40.421 FmpRmOnlineResource: bringing resource a1a13a84-0eaf-11d1-8427-0000f8034599 (resid 1391032) online. 04d::12-20:15:40.421 [CP] CppResourceNotify for resource Cluster Name 04d::12-20:15:40.421 [GUM] GumSendUpdate: Locker waiting type 0 context 8 04d::12-20:15:40.437 [GUM] Thread 0x4d UpdateLock wait on Type 0 04d::12-20:15:40.437 [GUM] DoLockingUpdate successful, lock granted to 1 076::12-20:15:40.437 Network Name <Cluster Name>: Bringing resource online... 04d::12-20:15:40.437 [GUM] GumSendUpdate: Locker dispatching seq 411 type 0 context 8 04d::12-20:15:40.437 [GUM] GumpDoUnlockingUpdate releasing lock ownership 04d::12-20:15:40.437 [GUM] GumSendUpdate: completed update seq 411 type 0 context 8 04d::12-20:15:40.437 [GUM] GumSendUpdate: Locker waiting type 0 context 11 . . . 076::12-20:15:43.515 Network Name <Cluster Name>: Registered server name MDLCLUSTER on transport \Device\NetBt_If6. 076::12-20:15:46.578 Network Name <Cluster Name>: Registered workstation name MDLCLUSTER on transport \Device\NetBt_If6. 076::12-20:15:46.578 Network Name <Cluster Name>: Network Name MDLCLUSTER is now online
Following these steps, the cluster will attempt to bring other resources and groups online. The logfile will continue to increase in size as the cluster service runs. Therefore, it may be a good idea to enable this option when you are having problems, rather than leaving it on for days or weeks at a time.
Logfile Entries for Common Failures
After reviewing a successful startup of the Cluster Service, you may want to examine some errors that may appear because of various failures. The following examples illustrate possible log entries for four different failures.
Example 1: Quorum Disk Turned Off
If the cluster attempts to form and cannot connect to the quorum disk, entries similar to the following may appear in the logfile. Because of the failure, the cluster cannot form, and the Cluster Service terminates.
0b9::14-20:59:42.921 [RM] Main: Initializing. 08f::14-20:59:42.937 [FM] Creating group a1a13a86-0eaf-11d1-8427- 0000f8034599 08f::14-20:59:42.937 [FM] Group a1a13a86-0eaf-11d1-8427-0000f8034599 contains a1a13a87-0eaf-11d1-8427-0000f8034599. 08f::14-20:59:42.937 [FM] Creating resource a1a13a87-0eaf-11d1-8427- 0000f8034599 08f::14-20:59:42.937 [FM] FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427- 0000f8034599 possible node list 08f::14-20:59:42.937 [FMX] Found the quorum resource a1a13a87-0eaf-11d1-8427- 0000f8034599. 08f::14-20:59:42.937 [FM] All dependencies for a1a13a87-0eaf-11d1-8427- 0000f8034599 created 08f::14-20:59:42.937 [FM] arbitrate for quorum resource id a1a13a87-0eaf-11d1-8427- 0000f8034599. 08f::14-20:59:42.937 FmpRmCreateResource: creating resource a1a13a87-0eaf-11d1-8427- 0000f8034599 in shared resource monitor 08f::14-20:59:42.968 FmpRmCreateResource: created resource a1a13a87-0eaf-11d1-8427- 0000f8034599, resid 1362616 0e9::14-20:59:43.765 Physical Disk <Disk D:>: SCSI, error reserving disk, error 21. 0e9::14-20:59:54.125 Physical Disk <Disk D:>: SCSI, error reserving disk, error 21. 0e9::14-20:59:54.140 Physical Disk <Disk D:>: Arbitrate returned status 21. 08f::14-20:59:54.140 [FM] FmGetQuorumResource failed, error 21. 08f::14-20:59:54.140 [INIT] Cleaning up failed form attempt. 08f::14-20:59:54.140 [INIT] Failed to form cluster, status 3213068. 08f::14-20:59:54.140 [CS] ClusterInitialize failed 21 08f::14-20:59:54.140 [INIT] The cluster service is shutting down. 08f::14-20:59:54.140 [evt] EvShutdown 08f::14-20:59:54.140 [FM] Shutdown: Failover Manager requested to shutdown groups. 08f::14-20:59:54.140 [FM] DestroyGroup: destroying a1a13a86-0eaf-11d1-8427-0000f8034599 08f::14-20:59:54.140 [FM] DestroyResource: destroying a1a13a87-0eaf-11d1-8427-0000f8034599 08f::14-20:59:54.140 [OM] Deleting object Physical Disk 08f::14-20:59:54.140 [FM] Resource a1a13a87-0eaf-11d1-8427-0000f8034599 destroyed. 08f::14-20:59:54.140 [FM] Group a1a13a86-0eaf-11d1-8427-0000f8034599 destroyed. 08f::14-20:59:54.140 [Dm] DmShutdown 08f::14-20:59:54.140 [DM] DmpShutdownFlusher: Entry 08f::14-20:59:54.156 [DM] DmpShutdownFlusher: Setting event 062::14-20:59:54.156 [DM] DmpRegistryFlusher: got 0 062::14-20:59:54.156 [DM] DmpRegistryFlusher: exiting 0ca::14-20:59:54.156 [FM] WorkItem, delete resource <Disk D:> status 0 0ca::14-20:59:54.156 [OM] Deleting object Disk Group 1 (a1a13a86-0eaf-11d1- 8427-0000f8034599) 0e7::14-20:59:54.375 [CPROXY] clussvc terminated, error 0. 0e7::14-20:59:54.375 [CPROXY] Service Stopping... 0b9::14-20:59:54.375 [RM] Going away, Status = 1, Shutdown = 0. 02c::14-20:59:54.375 [RM] PollerThread stopping. Shutdown = 1, Status = 0, WaitFailed = 0, NotifyEvent address = 196. 0e7::14-20:59:54.375 [CPROXY] Cleaning up 0b9::14-20:59:54.375 [RM] RundownResources posting shutdown notification. 0e7::14-20:59:54.375 [CPROXY] Cleanup complete. 0e3::14-20:59:54.375 [RM] NotifyChanges shutting down. 0e7::14-20:59:54.375 [CPROXY] Service Stopped.
Perhaps the most meaningful lines from above are:
0e9::14-20:59:43.765 Physical Disk <Disk D:>: SCSI, error reserving disk, error 21.
0e9::14-20:59:54.125 Physical Disk <Disk D:>: SCSI, error reserving disk, error 21.
0e9::14-20:59:54.140 Physical Disk <Disk D:>: Arbitrate returned status 21.
Note: The error code on these logfile entries is 21. You can issue net helpmsg 21 from the command line and receive the explanation of the error status code. Status code 21 means, "The device is not ready." This indicates a possible problem with the device. In this case, the device was turned off, and the error status correctly indicates the problem.
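For example, checking this status code from a command prompt returns the message text quoted above (the exact output formatting may vary slightly):
C:\>NET HELPMSG 21
The device is not ready.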
Example 2: Quorum Disk Failure
In this example, the drive has failed or has been reformatted from the SCSI controller. As a result, the cluster service cannot locate a drive with the specific signature it is looking for.
0b8::14-21:11:46.515 [RM] Main: Initializing. 074::14-21:11:46.531 [FM] Creating group a1a13a86-0eaf-11d1-8427-0000f8034599 074::14-21:11:46.531 [FM] Group a1a13a86-0eaf-11d1-8427-0000f8034599 contains a1a13a87-0eaf-11d1-8427-0000f8034599. 074::14-21:11:46.531 [FM] Creating resource a1a13a87-0eaf-11d1-8427-0000f8034599 074::14-21:11:46.531 [FM] FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427- 0000f8034599 possible node list 074::14-21:11:46.531 [FMX] Found the quorum resource a1a13a87-0eaf-11d1-8427- 0000f8034599. 074::14-21:11:46.531 [FM] All dependencies for a1a13a87-0eaf-11d1-8427-0000f8034599 created 074::14-21:11:46.531 [FM] arbitrate for quorum resource id a1a13a87-0eaf-11d1-8427- 0000f8034599. 074::14-21:11:46.531 FmpRmCreateResource: creating resource a1a13a87-0eaf-11d1-8427-0000f8034599 in shared resource monitor 074::14-21:11:46.562 FmpRmCreateResource: created resource a1a13a87-0eaf-11d1-8427-0000f8034599, resid 1362696 075::14-21:11:46.671 Physical Disk <Disk D:>: SCSI,Performing bus rescan. 075::14-21:11:51.843 Physical Disk <Disk D:>: SCSI,error attaching to signature 71cd0549, error 2. 075::14-21:11:51.843 Physical Disk <Disk D:>: Unable to attach to signature 71cd0549. Error: 2. 074::14-21:11:51.859 [FM] FmGetQuorumResource failed, error 2. 074::14-21:11:51.859 [INIT] Cleaning up failed form attempt.
In this case, the most important logfile entries are:
075::14-21:11:51.843 Physical Disk <Disk D:>: SCSI, error attaching to signature 71cd0549, error 2.
075::14-21:11:51.843 Physical Disk <Disk D:>: Unable to attach to signature 71cd0549. Error: 2.
Status code 2 means, "The system cannot find the file specified." The error in this case may mean that it cannot find the disk, or that, because of some kind of problem, it cannot locate the quorum logfile that should be on the disk.
Example 3: Duplicate Cluster IP Address
If another computer on the network has the same IP address as the cluster IP address resource, the resource will be prevented from coming online. Further, the cluster name will not be registered on the network, because it depends on the IP address resource. Because this name is the network name used for cluster administration, you will not be able to administer the cluster by using this name during this type of failure. However, you may be able to use the computer name of the cluster node to connect with Cluster Administrator. Additionally, you may be able to connect locally from the console by using the loopback address. The following sample entries are from a cluster logfile during this type of failure:
0b9::14-21:32:59.968 IP Address <Cluster IP Address>: The IP address is already in use on the network, status 5057. 0d2::14-21:32:59.984 [FM] NotifyCallBackRoutine: enqueuing event 03e::14-21:32:59.984 [FM] WorkerThread, processing transition event for a1a13a83-0eaf-11d1-8427-0000f8034599, oldState = 129, newState = 4.03e . . . 03e::14-21:32:59.984 FmpHandleResourceFailure: taking resource a1a13a83-0eaf-11d1-8427-0000f8034599 and dependents offline 03e::14-21:32:59.984 [FM] TerminateResource: a1a13a84-0eaf-11d1-8427-0000f8034599 depends on a1a13a83-0eaf-11d1-8427-0000f8034599. Terminating first 0d3::14-21:32:59.984 Network Name <Cluster Name>: Terminating name MDLCLUSTER... 0d3::14-21:32:59.984 Network Name <Cluster Name>: Name MDLCLUSTER is already offline. . . . 03e::14-21:33:00.000 FmpRmTerminateResource: a1a13a84-0eaf-11d1-8427-0000f8034599 is now offline 0c7::14-21:33:00.000 IP Address <Cluster IP Address>: Terminating resource... 0c7::14-21:33:00.000 IP Address <Cluster IP Address>: Address 192.88.80.114 on adapter DC21X41 offline.
Example 4: Evicted Node Attempts to Join Existing Cluster
If you evict a node from a cluster, the cluster software on that node must be reinstalled to gain access to the cluster again. If you start the evicted node, and the Cluster Service attempts to join the cluster, entries similar to the following may appear in the cluster logfile:
032::26-16:11:45.109 [INIT] Attempting to join cluster MDLCLUSTER 032::26-16:11:45.109 [JOIN] Spawning thread to connect to sponsor 192.88.80.115 040::26-16:11:45.109 [JOIN] Asking 192.88.80.115 to sponsor us. 032::26-16:11:45.109 [JOIN] Spawning thread to connect to sponsor 9.9.9.2 032::26-16:11:45.109 [JOIN] Spawning thread to connect to sponsor 192.88.80.190 099::26-16:11:45.109 [JOIN] Asking 9.9.9.2 to sponsor us. 032::26-16:11:45.109 [JOIN] Spawning thread to connect to sponsor NODEA 098::26-16:11:45.109 [JOIN] Asking 192.88.80.190 to sponsor us. 032::26-16:11:45.125 [JOIN] Waiting for all connect threads to terminate. 092::26-16:11:45.125 [JOIN] Asking NODEA to sponsor us. 040::26-16:12:18.640 [JOIN] Sponsor 192.88.80.115 is not available (JoinVersion), status=1722. 098::26-16:12:18.640 [JOIN] Sponsor 192.88.80.190 is not available (JoinVersion), status=1722. 099::26-16:12:18.640 [JOIN] Sponsor 9.9.9.2 is not available (JoinVersion), status=1722. 098::26-16:12:18.640 [JOIN] JoinVersion data for sponsor 157.57.224.190 is invalid, status 1722. 099::26-16:12:18.640 [JOIN] JoinVersion data for sponsor 9.9.9.2 is invalid, status 1722. 040::26-16:12:18.640 [JOIN] JoinVersion data for sponsor 157.58.80.115 is invalid, status 1722. 092::26-16:12:18.703 [JOIN] Sponsor NODEA is not available (JoinVersion), status=1722. 092::26-16:12:18.703 [JOIN] JoinVersion data for sponsor NODEA is invalid, status 1722. 032::26-16:12:18.703 [JOIN] All connect threads have terminated. 032::26-16:12:18.703 [JOIN] Unable to connect to any sponsor node. 032::26-16:12:18.703 [INIT] Failed to join cluster, status 0 032::26-16:12:18.703 [INIT] Attempting to form cluster MDLCLUSTER . . . 032::26-16:12:18.734 [FM] arbitrate for quorum resource id 24acc093-1e28-11d1-9e5d-0000f8034599. 032::26-16:12:18.734 [FM] FmpQueryResourceInfo:initialize the resource with the registry information 032::26-16:12:18.734 FmpRmCreateResource: creating resource 24acc093-1e28-11d1-9e5d-0000f8034599 in shared resource monitor 032::26-16:12:18.765 FmpRmCreateResource: created resource 24acc093-1e28-11d1-9e5d-0000f8034599, resid 1360000 06d::26-16:12:18.812 Physical Disk <Disk G:>: SCSI, error attaching to signature b2320a9b, error 2. 06d::26-16:12:18.812 Physical Disk <Disk G:>: Unable to attach to signature b2320a9b. Error: 2. 032::26-16:12:18.812 [FM] FmGetQuorumResource failed, error 2. 032::26-16:12:18.812 [INIT] Cleaning up failed form attempt. 032::26-16:12:18.812 [INIT] Failed to form cluster, status 2. 032::26-16:12:18.828 [CS] ClusterInitialize failed 2
The node attempts to join the existing cluster, but has invalid credentials, because it was previously evicted. Therefore, the existing node refuses to communicate with it. The node may attempt to form its own version of the cluster, but cannot gain control of the quorum disk, because the existing cluster node maintains ownership. Examination of the logfile on the existing cluster node reveals that the Cluster Service posted entries to reflect the failed attempt to join:
0c4::29-18:13:31.035 [NMJOIN] Processing request by node 2 to begin joining.
0c4::29-18:13:31.035 [NMJOIN] Node 2 is not a member of this cluster. Cannot join.
Appendix C: Command-Line Administration
You can perform many of the administrative tasks for MSCS from the Windows NT command prompt, without using the provided graphical interface. While the graphical interface offers easier administration and at-a-glance status of cluster resources, MSCS provides the capability to issue most administrative commands from the command line. This ability opens up interesting possibilities for batch files, scheduled commands, and other techniques in which many tasks may be automated.
Using Cluster.exe
Cluster.exe is a companion program and is installed with Cluster Administrator. While the Microsoft Cluster Server Administrator's Guide details basic syntax for this utility, the intention of this section is to complement the existing documentation and to offer examples. All examples in this section assume a cluster name of MYCLUSTER, installed in the domain called MYDOMAIN, with NODEA and NODEB as servers in the cluster. All examples are given as a single command line.
Note: Specify any names that contain spaces within quotation marks.
Basic Syntax
With the exception of the cluster /? command, which returns basic syntax for the command, every command line uses the syntax:
CLUSTER [cluster name] /option
To test connectivity with a cluster, or to ensure you can use Cluster.exe, try the simple command in the next section to check the version number (/version).
Cluster Commands
Version Number
To check the version number of your cluster, use a command similar to the following:
CLUSTER mycluster /version
If your cluster were named MYCLUSTER, the above command would return the version information for the product.
Listing Clusters in the Domain
To list all clusters within a single domain, use a command including the /list option like this:
CLUSTER mycluster /LIST:mydomain
Node Commands
All commands directed toward a specific cluster node must use the following syntax:
CLUSTER [cluster name] NODE [node name] /option
Node Status
To obtain the status of a particular cluster node, use the /status command. For example:
CLUSTER mycluster NODE NodeA /Status
The node name is optional only for the /status command, so the following command will report the status of all nodes in the cluster:
CLUSTER mycluster NODE /Status
Pause or Resume
The pause option allows the cluster service to continue running and communicating in the cluster. However, the paused node may not own groups or resources. For example, to pause a node, use the /pause switch:
CLUSTER mycluster NODE NodeB /Pause
An example of the use of this command might be to transfer groups to another node while you perform some other kind of task, such as running a backup or disk defragmentation utility. To resume the node, simply use the /resume switch instead:
CLUSTER mycluster NODE NodeB /Resume
Evict a Node
The evict option removes the ability of a node to participate in the cluster. In other words, the cluster node loses membership rights in the cluster. The only way to grant membership rights again to the evicted node is to:
1. Remove the cluster software from the evicted node through Add/Remove Programs in Control Panel.
2. Restart the node.
3. Reinstall MSCS on the previously evicted node through the MSCS Setup program.
To perform this action, use a command similar to the following:
CLUSTER mycluster NODE NodeB /Evict
Changing Node Properties
A cluster node has only one property that may be changed with Cluster.exe: the node description. The following example illustrates how to change it:
CLUSTER mycluster NODE NodeA /Properties Description=" The best node in MyCluster."
A good use for changing this property might be in environments with multiple administrators. For example, you pause a node to run a large application on it and want the node description to reflect this. The field can serve as a reminder to you and to other administrators of why the node was paused, and that someone may want to /resume it later. You might also build the /pause, the description change, and the eventual /resume into a batch file that prepares the node for the designated task; a sketch of such a batch file follows.
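The following is a minimal sketch of such a batch file, using the example cluster and node names from this section. The group name mygroup and the maintenance step are placeholders (group commands, including /MoveTo, are described later in this appendix); adjust all names for your environment, and make sure the account running the batch file has permission to administer the cluster.
@echo off
REM Pause NodeB and record the reason in its description.
CLUSTER mycluster NODE NodeB /Pause
CLUSTER mycluster NODE NodeB /Properties Description="Paused for nightly backup; resume when finished"
REM Optionally move a group to the other node before starting the maintenance task.
CLUSTER mycluster GROUP mygroup /MoveTo:NodeA /Wait:120
REM Placeholder: run the backup or disk defragmentation utility here.
REM When the task is complete, update the description and resume the node.
CLUSTER mycluster NODE NodeB /Properties Description="Available; nightly backup complete"
CLUSTER mycluster NODE NodeB /Resume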
Group Commands
All group commands use the syntax:
CLUSTER [cluster name] GROUP [group name] /option
Group Status
To obtain the status of a group, you may use the /status option. This is the only group option for which the group name is optional; if you omit the group name, the status of all groups is displayed. Another status option (/node) displays group status by node.
Example 1: Status of all groups:
CLUSTER mycluster GROUP /Status
Example 2: Status of all groups owned by a specific node:
CLUSTER mycluster GROUP /Status /Node:nodea
Example 3: Status of a specific group:
CLUSTER mycluster GROUP "Cluster Group"
Create a New Group
It is easy to create a new group from the command line.
Note: The following example creates a group called mygroup:
CLUSTER mycluster GROUP mygroup /create
Delete a Group
Deleting a group is as simple as creating one; use the /delete option from the command line. However, the group must be empty before it can be deleted.
CLUSTER mycluster GROUP mygroup /delete
Rename a Group
To rename a group, use the following syntax:
CLUSTER mycluster GROUP mygroup /rename:yourgroup
Move, Online, and Offline Group Commands
The move group command may be used to transfer ownership of a group and its resources to another node. By design, the move command must take the group offline and bring it online on the other node. Further, a timeout value (number of seconds) may be supplied to specify the time to wait before cancellation of the move request. By default, Cluster.exe waits indefinitely until the state of the group changes to the desired state.
Examples:
CLUSTER mycluster GROUP mygroup /MoveTo:Nodeb /wait:120
CLUSTER mycluster GROUP mygroup /Offline
CLUSTER mycluster GROUP mygroup /Online
Group Properties
Use the /property option to display or set group properties. Documentation on common properties for groups may be found in the Microsoft Cluster Server Administrator's Guide. One additional property not documented is LoadBalState. This property is not used in MSCS version 1.0, and is reserved for future use.
Examples:
CLUSTER mycluster GROUP mygroup /Properties
CLUSTER mycluster GROUP mygroup /Properties Description="My favorite group"
Preferred Owner
You may specify a preferred owner for a group. The preferred owner is the node on which you prefer the group to run. If a node fails, the remaining node takes over the groups from the failed node. By setting the failback option at the group level, groups may fail back to their preferred node when it becomes available. A group does not fail back if a preferred owner is not specified. MSCS version 1.0 is limited to two nodes in a cluster, so for best results, specify no more than one preferred owner. In future releases, this property may use a list of more than one preferred owner.
Example: To list the preferred owner for a group, type:
CLUSTER mycluster GROUP mygroup /Listowner
Example: To specify the preferred owner list, type:
CLUSTER mycluster GROUP mygroup /Setowners:Nodea
Resource Commands
Resource Status
To list the status of resources or a particular resource, you can use the /status option. Note the following examples:
CLUSTER mycluster RESOURCE /Status
CLUSTER mycluster RESOURCE myshare /Status
Create a New Resource
To create a new resource, use the /create option.
Note: The /create option allows creation of resources in an incomplete state. To avoid errors, specify all required parameters for the resource and set additional resource properties as appropriate with subsequent commands.
Example: Command sequence to add a file share resource
CLUSTER mycluster RESOURCE myshare /Create /Group:mygroup /Type:"File Share"
CLUSTER mycluster RESOURCE myshare /PrivProp ShareName="myshare"
CLUSTER mycluster RESOURCE myshare /PrivProp Path="w:\myshare"
CLUSTER mycluster RESOURCE myshare /PrivProp Maxusers=-1
CLUSTER mycluster RESOURCE myshare /AddDependency:"Disk W"
Note: Command lines in the sections above have been wrapped for space constraints in this document. The lines do not normally wrap.
Simulating Resource Failure
You can simulate resource failure in a cluster from the command line by using the /fail option for a resource. This option is similar to using the Initiate Failure command from Cluster Administrator. The command assumes that the resource is already online.
Example:
CLUSTER mycluster RESOURCE myshare /Fail
Online/Offline Resource Commands
The /online and /offline resource commands work very much the same way as the corresponding group commands, and also may use the /wait option to specify a time limit (in seconds) for the operation to complete.
Examples:
CLUSTER mycluster RESOURCE myshare /Offline
CLUSTER mycluster RESOURCE myshare /Online
Dependencies
Resource dependency relationships may be listed or changed from the command line. To add or remove a dependency, you must know the name of the resource to be added or removed as a dependency.
Examples:
CLUSTER mycluster RESOURCE myshare /ListDependencies
CLUSTER mycluster RESOURCE myshare /AddDependency:"Disk W:"
CLUSTER mycluster RESOURCE myshare /RemoveDependency:"Disk W:"
Note: Command lines in the sections above have been wrapped for space constraints in this document. The lines do not normally wrap.
Example Batch Job
The following example takes an existing group, Mygroup, and creates resources within the group. The example creates a network name resource and initiates failures to test failover. During the process, it uses various reporting commands to obtain the status of the group and resources. This example shows the output from all commands given. The commands in this example work, but may require minor alteration for the cluster, group, resource, network names, and IP addresses configured in your environment, if you choose to use them.
Note: The LoadBal properties reported in the example are reserved for future use. The EnableNetBIOS property for the IP address resource is a Service Pack 4 addition and must be set to 1 for the resource to be a valid dependency for a network name resource.
C:\>REM Get group status C:\>CLUSTER mycluster GROUP mygroup /status Listing status for resource group 'mygroup': Group Node Status -------------------- --------------- ------ mygroup NodeA Online C:\>REM Create the IP Address resource: myip C:\>CLUSTER mycluster RESOURCE myip /create /Group:mygroup /Type:"Ip Address" Creating resource 'myip'... Resource Group Node Status -------------------- -------------------- --------------- ------ myip mygroup NodeA Offline C:\>REM Define the IP Address parameters C:\>CLUSTER mycluster RESOURCE myip /priv network:client C:\>CLUSTER mycluster RESOURCE myip /priv address:157.57.152.23 C:\>REM Redundant. Subnet mask should already be same as network uses. C:\>CLUSTER mycluster RESOURCE myip /priv subnetmask:255.255.252.0 C:\>CLUSTER mycluster RESOURCE myip /priv EnableNetBIOS:1 C:\>REM Check the status C:\>CLUSTER mycluster RESOURCE myip /Stat Listing status for resource 'myip': Resource Group Node Status -------------------- -------------------- --------------- ------ myip mygroup NodeA Offline C:\>REM View the properties C:\>CLUSTER mycluster RESOURCE myip /prop Listing properties for 'myip': R Name Value --------------------------------- ------------------------------- R Name myip Type IP Address Description DebugPrefix SeparateMonitor 0 (0x0) PersistentState 0 (0x0) LooksAlivePollInterval 5000 (0x1388) IsAlivePollInterval 60000 (0xea60) RestartAction 2 (0x2) RestartThreshold 3 (0x3) RestartPeriod 900000 (0xdbba0) PendingTimeout 180000 (0x2bf20) LoadBalStartupInterval 300000 (0x493e0) LoadBalSampleInterval 10000 (0x2710) LoadBalAnalysisInterval 300000 (0x493e0) LoadBalMinProcessorUnits 0 (0x0) LoadBalMinMemoryUnits 0 (0x0) C:\>REM View the private properties C:\>CLUSTER mycluster RESOURCE myip /priv Listing private properties for 'myip': R Name Value --------------------------------- ------------------------------- Network Client Address 157.57.152.23 SubnetMask 255.255.252.0 EnableNetBIOS 1 (0x1) C:\>REM Bring online and wait 60 sec. for completion C:\>CLUSTER mycluster RESOURCE myip /Online /Wait:60 Bringing resource 'myip' online... Resource Group Node Status -------------------- -------------------- --------------- ------ myip mygroup NodeA Online C:\>REM Check the status again. C:\>CLUSTER mycluster RESOURCE myip /Stat Listing status for resource 'myip': Resource Group Node Status -------------------- -------------------- --------------- ------ myip mygroup NodeA Online C:\>REM Define a network name resource C:\>CLUSTER mycluster RESOURCE mynetname /Create / Group:mygroup /Type:"Network Name" Creating resource 'mynetname'... Resource Group Node Status -------------------- -------------------- --------------- ------ mynetname mygroup NodeA Offline C:\>CLUSTER mycluster RESOURCE mynetname /priv Name:"mynetname" C:\>CLUSTER mycluster RESOURCE mynetname /Adddependency:myip Making resource 'mynetname' depend on resource 'myip'... C:\>REM Status check C:\>CLUSTER mycluster RESOURCE mynetname /Stat Listing status for resource 'mynetname': Resource Group Node Status -------------------- -------------------- --------------- ------ mynetname mygroup NodeA Offline C:\>REM Bring the network name online C:\>CLUSTER mycluster RESOURCE mynetname /Online /Wait:60 Bringing resource 'mynetname' online... 
Resource Group Node Status -------------------- -------------------- --------------- ------ mynetname mygroup NodeA Online C:\>REM Status check C:\>CLUSTER mycluster Group mygroup /stat Listing status for resource group 'mygroup': Group Node Status -------------------- --------------- ------ mygroup NodeA Online C:\>REM Let's simulate a failure of the IP address C:\>CLUSTER mycluster RESOURCE myip /Fail Failing resource 'myip'... Resource Group Node Status -------------------- -------------------- --------------- ------ myip mygroup NodeA Online Pending C:\>REM Get group status C:\>CLUSTER mycluster GROUP mygroup /status Listing status for resource group 'mygroup': Group Node Status -------------------- --------------- ------ mygroup NodeA Online
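As noted at the beginning of this appendix, these commands may also be scheduled. As a minimal sketch, assuming the Schedule service is running under an account with permission to administer the cluster, the following AT command would move the example group each Sunday night. The time, day, and names are placeholders only, and you may need to supply the full path to Cluster.exe if it is not on the system path:
C:\>AT 23:00 /every:Su "CLUSTER mycluster GROUP mygroup /MoveTo:NodeB /Wait:120"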
For More Information
For the latest information about Windows NT Server, check out Microsoft TechNet, visit https://www.microsoft.com/backofficeserver/, or see the Windows NT Server Forum on the Microsoft Network (GO WORD: MSNTS).
For the latest information on Windows NT Server, Enterprise Edition, and Microsoft Cluster Server, use the following links:
https://www.microsoft.com/ntserver/default.asp
https://support.microsoft.com/ph/3194
Alpha AXP is a trademark of Digital Equipment Corporation.
DEC is a trademark of Digital Equipment Corporation.
Intel is a registered trademark of Intel Corporation.
IBM is a registered trademark of International Business Machines Corporation.
PowerPC is a trademark of International Business Machines Corporation.
MIPS is a registered trademark of MIPS Computer Systems, Inc.