Troubleshooting Cluster Deployment
Applies To: Windows Compute Cluster Server 2003
Common Deployment Problems
The problems that you might encounter when deploying Windows Compute Cluster Server 2003 generally fall into the following areas:
Setup
Networking
Remote Installation Services (RIS)
Node management
Setup
Compute Cluster Pack Setup on the head node sometimes requires rebooting the server and restarting Setup.
- Windows Compute Cluster Pack (CCP) requires certain software prerequisites before it can be installed. During installation of a head node, CCP Setup will detect any missing prerequisites and provide links to external Internet sites where the software can be downloaded and installed. When all prerequisite software is installed, then the installation program installs Compute Cluster Pack. In some situations, particularly when the head node is a domain controller, a given prerequisite may require rebooting the server before head node setup is completed. After the server is restarted, you will have to run CCP Setup again, and continue installing any remaining prerequisites and then install Compute Cluster Pack itself. The best way to avoid this is to install all prerequisite software on the head node server BEFORE running CCP Setup. See Compute Cluster Software Requirements for details.
Unattended Compute Cluster Pack Setup on the head node does not run to completion.
- The most common reason unattended head node setup (for example, from the command line or from a script) does not complete is that a prerequisite is not installed. Make sure that prerequisite software and updates are installed before performing an unattended head node setup. Refer to Compute Cluster Software Requirements for a list of hotfix and software prerequisites.
Compute Nodes are added to the cluster but do not appear in the node list.
Verify that the Compute Cluster Pack services are installed and started on the compute node. See Compute Cluster Pack Directories and Services.
Check network connectivity and name resolution between the compute cluster head node and the compute node.
If a quick check verifies that the compute node has network connectivity and that Compute Cluster Pack services are running, try adding the node to the cluster using the Manual Addition method. See Adding Compute Nodes to the Cluster.
Verify that the name of the cluster head node is specified in the following registry key on the compute node:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\CCP ClusterName=HeadNode
If the specified head node is incorrect, do one of the following:
Remove Compute Cluster Pack (CCP) from the compute node using Add or Remove Programs. Then reinstall CCP on the node, specifying the correct cluster head node name.
Edit the registry of the compute node, specifying the correct head node name. Then restart the Compute Cluster Pack services on the compute node.
Look in the event log of both head node and compute node for Compute Cluster Pack error messages. If you want to turn on verbose logging, see Enabling Verbose Logging.
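The registry check described above can be scripted. The following is a minimal sketch in Python, assuming Python is available on the compute node and that ClusterName is a string value under the CCP key named in this section. The function names and the placeholder head node name are hypothetical.

```python
KEY_PATH = r"SOFTWARE\Microsoft\CCP"  # registry key named in this section
VALUE_NAME = "ClusterName"            # value that holds the head node name


def read_configured_head_node():
    """Return the ClusterName value from the registry (Windows only)."""
    import winreg  # imported lazily because the module exists only on Windows

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return value


def head_node_matches(configured, expected):
    """Compare two host names, ignoring case and surrounding whitespace."""
    return configured.strip().lower() == expected.strip().lower()
```

On a compute node, head_node_matches(read_configured_head_node(), "HEADNODE01") would report whether the node points at the expected head node, where HEADNODE01 is a placeholder name.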
Networking
Connectivity
Check basic connectivity between systems.
Are IP Addresses assigned on each machine, either statically or via DHCP?
Is Name Resolution (via DNS) working in both directions between servers?
Are the networking drivers installed and up to date?
Are all cables properly connected?
Are all switches and routers functioning properly?
Are your users exceeding the number of concurrent connections allowed by your licensing agreement?
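The name resolution item in the checklist above can be spot-checked with a short script. This is a sketch only, using Python's standard socket module. Run it on the head node against a compute node name, and again on the compute node against the head node name, to cover both directions. The function name is hypothetical, and the reverse lookup is an extra check that may legitimately fail if no reverse zone is configured.

```python
import socket


def check_name_resolution(hostname, forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an address, then resolve that address back.

    Returns (address, reverse_name); either is None if that lookup failed.
    The forward and reverse parameters exist so the lookups can be replaced
    in tests.
    """
    try:
        address = forward(hostname)
    except OSError:  # socket.gaierror and socket.herror derive from OSError
        return None, None
    try:
        reverse_name = reverse(address)[0]
    except OSError:
        return address, None
    return address, reverse_name
```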
Compute nodes on the private network cannot access resources on the public network
Confirm that NAT is enabled on the head node and functioning properly.
If the head node is a domain controller, confirm that DNS Forwarding is properly configured. Confirm that Name Resolution is working properly.
If you are using IP Security (IPsec) in your environment, investigate whether domain or server isolation policies are causing this problem. If you are using IPsec, you might consider making the head node a boundary server. In addition, when a private network isolates the compute nodes and ICS is enabled, sessions initiated by the compute nodes may fail if the default cluster private IP address range 192.168.0.0/24 is on the IPsec exempt list.
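When reviewing IPsec exemptions, it can help to confirm whether a compute node's private address falls inside the default cluster private range mentioned above. A minimal sketch using Python's standard ipaddress module follows; the function name and the sample addresses are hypothetical.

```python
import ipaddress

# Default cluster private range named in this section.
DEFAULT_PRIVATE_RANGE = ipaddress.ip_network("192.168.0.0/24")


def in_default_private_range(address):
    """Return True if the address lies in 192.168.0.0/24."""
    return ipaddress.ip_address(address) in DEFAULT_PRIVATE_RANGE
```

For example, in_default_private_range("192.168.0.42") is true, while an address such as 192.168.1.42 lies outside the range.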
Compute nodes cannot access the Internet.
Compute nodes might need access to the Internet for license authorization, file exchange, or other reasons. If nodes can access resources on the public network but cannot access the Internet, do the following:
If you are using a cluster topology that isolates the compute nodes from the public network, and you enabled ICS NAT when setting up the head node (or provided network address translation by other means), check that NAT is working properly. Verify that compute nodes can access resources such as file shares on the public network.
If compute nodes can access public network resources, but cannot access the Internet, verify that the correct proxy server is specified on each compute node. Automatic Proxy Server detection by compute nodes will fail if ICS NAT is enabled on the head node. In these scenarios, the cluster administrator must specify the name of the proxy server in an Internet Explorer options setting on each compute node. This can be done manually, by using Group Policy, by running a script while a node is paused, or by other means.
Nodes cannot join Active Directory
Ensure that the user account being used to add the machines has the necessary permissions.
For compute nodes on a private network, confirm that NAT is functioning properly on the head node.
Check for basic connectivity between the node and the Domain Controller.
Ensure that all networking drivers are installed and up-to-date. For RIS installations, make sure that these up-to-date networking drivers are properly included in the RIS image.
A node has a status of Unreachable.
By default, the head node attempts to contact each compute node once a minute. When contact fails three consecutive times, the node status is changed to Unreachable. The most probable reasons are that the node is disconnected from the network or that name resolution has failed.
If the node becomes unreachable, attempt to ping the node from the head node. If you can ping the node, check name resolution for that node.
Confirm that name resolution is functioning properly, both from the head node to the compute node and from the compute node back to the head node.
Confirm that all CCP services are running on the compute node. If they are not, check the log files and view events to diagnose the problem.
Note
The frequency of node health queries made to cluster nodes by the head node can be changed by the administrator using the cluscfg command with the HeartbeatInterval parameter and specifying a new interval. The number of times a node can fail to respond before being marked as Unreachable is also user-configurable. See the Compute Cluster Server Command Line Interface Reference in the Windows Compute Cluster Server 2003 User's Guide.
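The Unreachable rule described above (three consecutive failed contacts by default) amounts to a simple counter. The following Python sketch illustrates that logic only; it is not the scheduler's implementation. The "Reachable" label and the assumption that one successful contact resets the count are illustrative choices, not taken from the product documentation.

```python
def node_statuses(heartbeat_results, max_missed=3):
    """Return the status after each heartbeat attempt.

    heartbeat_results is a sequence of booleans (True = node answered).
    A node is reported as "Unreachable" once max_missed consecutive
    attempts have failed; a successful attempt resets the count (assumed).
    """
    missed = 0
    statuses = []
    for answered in heartbeat_results:
        missed = 0 if answered else missed + 1
        statuses.append("Unreachable" if missed >= max_missed else "Reachable")
    return statuses
```

With the default one-minute interval, a node that stops responding would therefore be marked Unreachable at the third missed contact.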
Conflict Between GPOs and Manage Windows Firewall Settings through ToDoList
If you enable Windows Firewall on the public network using the Manage Windows Firewall Settings wizard in the ToDoList, and also have a Group Policy object (GPO) in Active Directory that disables it on some or all compute nodes belonging to an organizational unit, the two settings will override one another. If you are using GPOs to enable or disable Windows Firewall, ensure that you configure the same setting through the ToDoList.
MPI traffic goes to private network when MPI network is present
Switching MPI traffic to the private network is normal behavior when the system detects that the MPI network is offline. When this happens, the network status is displayed as offline and MPICH_NETMASK defaults to the private network. (MPICH_NETMASK is the clusterwide environment variable that identifies the IP address and netmask of the MPI network adaptor.) The cause may be a physical connectivity problem with the MPI adaptor or it may be a failure in the MPI network driver.
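To see which network MPICH_NETMASK currently points at, you can compare its value with a node's adapter addresses. The sketch below assumes the variable holds a value in address/netmask form, such as 10.0.0.0/255.255.255.0; that format, the function names, and the sample addresses are assumptions for illustration.

```python
import ipaddress


def parse_mpich_netmask(value):
    """Parse an 'address/netmask' string into a network object."""
    # strict=False tolerates a host address in place of the network address.
    return ipaddress.ip_network(value, strict=False)


def uses_mpi_network(adapter_address, mpich_netmask):
    """Return True if the adapter address lies in the MPICH_NETMASK network."""
    return ipaddress.ip_address(adapter_address) in parse_mpich_netmask(mpich_netmask)
```

If the value matches the private network rather than the MPI network, the fallback described above has occurred, and the MPI adaptor and its driver are the next things to check.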
RIS
Remote Installation Services cannot be used unless certain prerequisites are present, such as a dedicated private cluster network and a valid RIS image with a valid license key (PID). For a list of RIS prerequisites, see Installing Remote Installation Services.
RIS fails when adding compute nodes because servers have duplicate GUIDs.
A GUID is a 128-bit integer (16 bytes) used by Microsoft operating systems that can be used across all computers and networks wherever a unique identifier is required. Such an identifier has a very low probability of being duplicated.
Every computer is usually assigned a unique GUID by the original equipment manufacturer (OEM). However, in rare cases, OEMs assign the same GUID to multiple computers. When compute nodes have duplicate GUIDs, the Automated Addition method of adding nodes will fail. This failure occurs because Remote Installation Services (RIS) uses the node's GUID when creating a new computer account in Active Directory. If any compute nodes have duplicate GUIDs, RIS will not be able to create unique computer accounts in Active Directory for each compute node. As a result, automated installation will fail. To resolve the problem, do one of the following:
Contact the OEM and obtain a BIOS update for each computer involved.
Edit the registry on the head node, placing the duplicated GUID on the Banned GUID list.
The computer GUID can be seen in the PXE boot phase of computer startup. If you find duplicate GUIDs among the computers that you intend to use as nodes, access the head node and edit the registry. Add the duplicated GUID to a registry key named BannedGuids located under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BINLSVC\Parameters.
Note
Modification of this registry setting (that is, the addition of a Banned GUID) must be made while the RIS service on the head node is stopped (for example, by running net stop binlsvc from a command window) and the Compute Cluster Administrator Console snap-in is closed. Once the modification is made, the Compute Cluster Administrator Console snap-in can be opened again and automated deployments of compute nodes initiated.
If any of the GUIDs in the Banned GUID list are detected during PXE boot, RIS automatically uses the MAC address of the private network adaptor with the last 12 digits of the GUID. This creates a unique identifier for each computer. RIS then creates the computer account in Active Directory using this identifier, which solves the duplicate GUID problem.
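If you record the GUID shown during PXE boot for each prospective node, duplicates can be found with a short script before you attempt automated addition. This is a minimal sketch using Python's standard uuid module; the inventory format, function name, and sample values are hypothetical.

```python
import uuid
from collections import defaultdict


def find_duplicate_guids(inventory):
    """Return GUIDs that appear on more than one computer.

    inventory is an iterable of (computer_name, guid_text) pairs.
    The result maps each duplicated GUID (lowercase, hyphenated)
    to the list of computer names that reported it.
    """
    names_by_guid = defaultdict(list)
    for name, guid_text in inventory:
        guid = uuid.UUID(guid_text)  # normalizes braces, case, and hyphens
        names_by_guid[str(guid)].append(name)
    return {g: names for g, names in names_by_guid.items() if len(names) > 1}
```

Any GUID returned by this function is a candidate for the BannedGuids entry described above.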
Compute node fails to connect to PXE service when Automated Addition method is used.
When RIS is enabled on the head node, sometimes a node will fail to connect to the PXE boot service on the head node.
Verify that there is a DHCP service running on the private network. Enabling ICS on the head node will provide mini-DHCP services to the private network interfaces of the compute nodes in the cluster. Otherwise, provide DHCP service on the private network.
Verify that the compute node is capable of and configured to boot using PXE.
Verify in the computer BIOS that PXE boot or network boot server is listed first in the boot order.
Confirm that the private network adaptor on the node supports PXE booting.
You can see PXE boot failures by watching the boot sequence while it is happening on the compute node. You can also view PXE boot failures by looking in the RIS log file (Binlsvc.log). This file is displayed in summary form by the Automated Add Node Wizard at the end of the automated add node process. To view all detail in the log file, select Display debug log on the Image Nodes page of the wizard. Binlsvc.log is located on the head node in C:\Windows\Debug.
Compute Nodes boot using PXE, but Setup fails in text mode.
- During the text mode portion of the Windows operating system installation, Windows detects the basic system devices, formats the hard disk, and copies installation files from the RIS image on the head node. If installation stops during text mode without formatting the disk or copying any files, the most likely cause is that your compute nodes require text mode drivers that are not included in the RIS image being used. Modify the RIS image to include these drivers. See Modifying a RIS Image.
Compute Nodes boot using PXE, the text mode portion of Setup completes, but at the end of the graphical portion of Setup, the node lacks network connectivity and although a compute account is created in Active Directory, the node is not joined to any Active Directory domain.
- The most likely cause is that your compute nodes require Plug and Play drivers that are not included in the RIS image being used. Modify the RIS image to include these drivers. See Modifying a RIS Image.
Compute Nodes boot using PXE but in the text mode portion of Setup, an error dialog reports that unsigned drivers are being used and Setup waits for user input.
- Microsoft recommends that you use signed drivers. However, if you need to use unsigned drivers, disable the signing policy in the [Unattended] section of RISTNDRD.SIF in the RIS image you are using. See Modifying a RIS Image.
Compute nodes cannot perform a PXE network boot
Verify that the node BIOS is correctly configured to initially boot using PXE before trying other media.
Verify that the private network adaptor on the node supports PXE booting.
Verify that the node has a unique hardware GUID. When two compute nodes have the same GUID, RIS will fail. See the discussion of duplicate GUIDs earlier in this section.
Verify that DHCP service is running on the private network. If compute nodes cannot obtain IP addresses from a DHCP server during startup, PXE boot will fail.
Node management
Manual Addition of a cluster node fails
Common reasons manual addition fails include:
Entering the wrong computer name for the node when using the Manual Add Node Wizard.
No network connectivity to node.
Compute Node services are not installed on the node.
Applications running on Compute nodes cannot communicate with each other
If your cluster topology has a private network, check name resolution on the private network.
Look in any application logs on the cluster nodes.
Application running on the public network cannot access client applications running on cluster compute nodes.
- Some applications use a specific port for communication with client applications or other services. If this is true for your application, perhaps the port is being blocked by the cluster firewall. Check that the port is specified on the list of excepted ports on the Windows Firewall running on the head node.
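A quick way to test whether a given TCP port is reachable from another machine is to attempt a connection to it. The sketch below uses Python's standard socket module; the function name is hypothetical, it covers TCP only, and a failed connection does not by itself prove that the firewall is the cause (the application may simply not be listening).

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name lookup failed
        return False
```

If the connection fails from the public network but succeeds when run on the cluster itself, the firewall exception list is a likely place to look.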
An application running on the cluster cannot access resources on the public network.
Verify that network address translation is operating correctly on the cluster head node if you are using cluster topology scenarios 1 or 3, where the compute nodes are isolated from the public network behind the head node.
Check name resolution (DNS).
A user on the public network cannot establish a Remote Desktop connection to a compute node
- If the compute nodes in your cluster are isolated from the public network, as in topology scenarios 1 and 3, a user must first create a remote desktop session to the cluster head node. From the head node, a user can create a remote desktop connection to the compute node, either using tools bundled with the operating system or by running node actions using the Compute Cluster Administrator.
Remote Command Execution stops responding.
- Verify that the command you are running on the node does not provide an interactive user interface. Examples of programs with a user interface are notepad.exe and regedit.exe.
A newly added compute node remains in a state of Configuring.
- The Configuring node state is an initial transition state, usually brief, shown during Compute Cluster Pack Setup. Compute nodes should automatically move from the Configuring state to the Pending for Approval state. If the node remains in the Configuring state for any length of time, this typically indicates a problem in the configuration phase of setup.