Troubleshooting PXE Boot Failures During Baremetal Node Deployment

 

Applies To: Windows HPC Server 2008 R2, Windows HPC Server 2008

This topic lists the most common causes of Pre-Boot Execution Environment (PXE) failures when deploying compute nodes from bare metal or reimaging in Windows® HPC Server 2008 R2 and Windows® HPC Server 2008. This topic also provides procedures for diagnosing and resolving some common causes of PXE boot failure.

Common causes

PXE boot can fail for any of the following reasons:

  • Compute node boot order is not configured correctly in the BIOS (network adapters should be listed before hard disk, DVD drive, or other media).

  • The wrong NIC is selected in the BIOS (if only one is selected).

  • Network cables are swapped between the Private and the Application networks (PXE boot over InfiniBand (IB) networks is not supported).

  • Dynamic Host Configuration Protocol (DHCP) Server or Windows Deployment Services are misconfigured or not running.

  • HPC Management Service is not running.

  • Node is unidentified and HPC deployment is configured to ignore unknown nodes.

  • Group Policy is blocking PXE responses (mostly in enterprise settings where the head node is not a boundary server or where the cluster is part of the enterprise domain).

  • Bad firmware on NIC.

  • Switch is misconfigured.

  • Mismatched MAC address or GUID information.

What happens when deploying operating systems/reimaging nodes

The following steps outline what happens when deploying operating systems to compute nodes in Windows HPC Server. The steps assume that the head node is configured as the Windows Deployment Server for the cluster (this is the default configuration), and that the cluster has a private network (bare metal deployment is not supported on Topology 5).

  1. The compute node boots from the NIC on the Private network (if correctly configured in the BIOS boot order) and sends out a PXE request.

  2. The head node responds and sends the boot.wim file (WinPE) to the compute nodes using Trivial File Transfer Protocol (TFTP). This file is approximately 148 MB.

  3. The compute node loads the WinPE file and does the following:

    1. Sets up networking (IP address, etc.)

    2. Maps a network drive to the remote install share (a volume on the head node)

    3. Copies the disk partition information (diskpart.txt)

    4. Partitions the disk using a script for diskpart

    5. Asks the head node for the operating system image

  4. The head node sends the operating system image to the compute nodes in one of the following ways:

    • Multicast (requires a switch that supports multicasting)

    • Try Multicast, and if that fails, use Unicast (Unicast Fallback set to True in the node template)

    • Unicast

    Note

    If your switch does not support Multicast, deployment usually falls back to Unicast automatically. For more information about monitoring and troubleshooting this phase, see Use Windows Performance Monitor to troubleshoot early phases of deployment later in this topic.
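
To get a feel for why the TFTP phase in step 2 can take a noticeable share of early deployment time, the following minimal Python sketch estimates how many TFTP data packets the boot.wim transfer needs. It assumes that 148 MB means 148 × 2^20 bytes and uses the 512-byte default block size of the basic TFTP specification. The 1456-byte value is only an illustrative larger block size, because this topic does not state which block size Windows Deployment Services negotiates.

```python
# Approximate size of boot.wim taken from step 2 (assumed to mean 148 * 2**20 bytes).
BOOT_WIM_BYTES = 148 * 1024 * 1024


def tftp_data_packets(file_bytes: int, block_size: int = 512) -> int:
    """Return the number of TFTP DATA packets needed to send a file.

    TFTP marks the end of a transfer with a data packet that is shorter than
    the block size, so a file whose size is an exact multiple of the block
    size needs one additional, empty packet.
    """
    if file_bytes < 0 or block_size <= 0:
        raise ValueError("file_bytes must be >= 0 and block_size must be > 0")
    return file_bytes // block_size + 1


if __name__ == "__main__":
    for size in (512, 1456):  # 1456 is an illustrative value, not a WDS setting
        print(f"block size {size}: {tftp_data_packets(BOOT_WIM_BYTES, size)} packets")
```

In the basic protocol each data packet is acknowledged before the next one is sent, so the packet count multiplied by the network round-trip time gives a rough lower bound for the transfer time per node.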

Diagnosis

The following methods can help you troubleshoot PXE boot failures:

To verify a resolution, initiate PXE boot again by restarting the compute node.

Is the compute node BIOS configured for PXE boot?

If the compute node is properly configured for PXE boot, you should see words such as PXE Network Boot or Boot Agent PXE in the first few lines that are displayed on the compute node.

If you see that the compute node attempts to boot from a CD, DVD, or hard drive, or if it boots into an existing operating system, then PXE boot is either disabled, or it is not the first option.

In the BIOS configuration of the compute node, verify that the computer boots from the network adapter that is connected to the Private network, rather than from the local hard drive or another device, and that PXE boot is enabled for that network adapter.

Consult the documentation for your hardware regarding how to enable PXE boot and configure the boot order so that PXE boot is the first option.

Can the compute node contact the DHCP server?

If you see an error message displayed on the compute node indicating that the node was unable to contact a DHCP server, such as No DHCP offers were received, verify the DHCP configuration, verify that the service is running, and try restarting the service.

DHCP configuration

When you install HPC Pack 2008, the head node is configured with the DHCP Server role. A DHCP server assigns IP addresses to network clients. Depending on the detected configuration of your HPC cluster and the network topology that you choose for your cluster, the compute nodes will receive IP addresses from either the head node running DHCP, or from a dedicated DHCP server on the private network, or via DHCP services coming from a server on the enterprise network.

Considerations for DHCP server configuration:

  • If you have more than one DHCP server on the same network, verify that their scopes do not overlap.

  • If you want to use an existing DHCP server for your private network, ensure that it is configured to recognize the head node as the Windows Deployment Services server in the network. For more information about how to use the Windows Deployment Services role, see Windows Deployment Services Step-by-Step Guide.

  • If you want to enable DHCP server on your head node for the private or application networks and there are other DHCP servers connected to those networks, you must disable those DHCP servers.
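
The first consideration above (scopes on the same network must not overlap) can be checked mechanically. The following is a minimal Python sketch, assuming that each scope is described by its first and last IPv4 address. The server names and address ranges are hypothetical examples, not values from a real cluster.

```python
from ipaddress import IPv4Address
from itertools import combinations


def scopes_overlap(a, b):
    """Return True if two (start, end) IPv4 ranges share at least one address."""
    a_start, a_end = (IPv4Address(x) for x in a)
    b_start, b_end = (IPv4Address(x) for x in b)
    return a_start <= b_end and b_start <= a_end


def find_overlaps(scopes):
    """Return the pairs of scope names whose address ranges overlap."""
    return [
        (name1, name2)
        for (name1, range1), (name2, range2) in combinations(scopes.items(), 2)
        if scopes_overlap(range1, range2)
    ]


# Hypothetical scopes for illustration only.
scopes = {
    "head-node": ("10.0.0.10", "10.0.0.200"),
    "lab-dhcp": ("10.0.0.150", "10.0.0.250"),
    "enterprise": ("192.168.1.50", "192.168.1.99"),
}

if __name__ == "__main__":
    print(find_overlaps(scopes))
```

Any pair that is reported needs attention: either narrow one of the scopes or disable one of the DHCP servers on that network.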

Restart the DHCP Server service

The following procedure describes how to restart the DHCP Server service on the head node. If you are using a different DHCP server, the following procedure can serve as a guideline.

To restart the DHCP Server service

  1. On the head node, click Start, point to Administrative Tools, then click Server Manager.

  2. Expand the Roles folder, then select DHCP Server.

  3. In the view pane, in Summary, under System Services, select DHCP Server and then click Restart.

Note

If you see error messages or icons for the DHCP role that is associated with the Private network for your head node, try uninstalling and then reinstalling the role.

Are the IP addresses for the DHCP server and the compute node correct?

If a DHCP server responds, the compute node displays the IP address of the DHCP server and the IP address that the DHCP server assigned to the compute node.

Verify that the IP address of the DHCP server matches the expected address. For example, if your head node is the DHCP server for your cluster, then you should see the IP address of your head node that is associated with the private network.

Verify that the IP address that is assigned to the compute node is in the scope of the IPv4 private network for your cluster.

If the IP address for the server corresponds to the application network, your application and private network cables might be swapped. If the application and private network cables are swapped, you might also see a No bootfile name received error. Swap the cables and restart the compute node.
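
The address checks in this section can be expressed as a short Python sketch. It assumes that the head node is the DHCP server and approximates each DHCP scope by its whole subnet. The subnets and the head node address are hypothetical placeholders that you would replace with the values for your cluster.

```python
from ipaddress import ip_address, ip_network

# Hypothetical cluster values; replace them with your own.
PRIVATE_NET = ip_network("10.0.0.0/16")
APPLICATION_NET = ip_network("10.1.0.0/16")
HEAD_NODE_PRIVATE_IP = ip_address("10.0.0.1")


def check_pxe_addresses(dhcp_server, client):
    """Compare the two addresses shown on the PXE screen with the expected values.

    Returns a list of findings; an empty list means both addresses look right.
    """
    server = ip_address(dhcp_server)
    node = ip_address(client)
    findings = []
    if server in APPLICATION_NET:
        findings.append("DHCP server is on the application network: check for swapped cables")
    elif server != HEAD_NODE_PRIVATE_IP:
        findings.append("DHCP server is not the head node's private network address")
    if node not in PRIVATE_NET:
        findings.append("compute node address is outside the private network")
    return findings


if __name__ == "__main__":
    for finding in check_pxe_addresses("10.1.0.1", "10.1.0.50"):
        print(finding)
```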

Does the compute node show a “No bootfile name received” error?

If you see the No bootfile name received error, try the following:

  • Restart the HPC Management Service.

  • Verify that the compute node is attempting to boot from the Private network (which is bound to Windows Deployment Services).

  • Verify that the Windows Deployment Services on your cluster is configured and try restarting the service.

  • Verify that HPC deployment is set to respond to all PXE requests.

  • Verify that the MAC addresses and GUIDs match.

Restart the HPC Management Service

When a node boots into PXE it contacts the DHCP server and receives an IP address assignment. Then the DHCP server contacts the HPC Node Management Service on the head node and gives it the IP address of the new node. The HPC Node Management Service

To verify and restart the HPC Management Service

  1. Log on to the head node as a user with administrative permissions.

  2. Open the Services snap-in: Click Start, point to Administrative Tools, and then click Services.

  3. Verify that the Status of the HPC Management Service is Started.

    If the service is not started, right-click the service, then click Start. If the service is started, right-click the service, then click Restart.
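
If you prefer to check the service from a script, the following Python sketch parses the output of the `sc query` command and chooses between starting and restarting the service, mirroring step 3. The short service name `HpcManagement` is an assumption; confirm the actual name on the General tab of the service properties in the Services snap-in before relying on it. The sample output is illustrative text in the usual `sc query` layout, not captured from a head node.

```python
import re
import subprocess

SERVICE_NAME = "HpcManagement"  # assumed short name; verify it in the Services snap-in

# Matches the STATE line of `sc query` output, for example "STATE : 4  RUNNING".
STATE_RE = re.compile(r"^\s*STATE\s*:\s*\d+\s+(\w+)", re.MULTILINE)


def parse_state(sc_output):
    """Return the state name found in `sc query` output, or None if absent."""
    match = STATE_RE.search(sc_output)
    return match.group(1) if match else None


def query_state(service=SERVICE_NAME):
    """Run `sc query` for the service and return its state (Windows only)."""
    result = subprocess.run(["sc", "query", service], capture_output=True, text=True)
    return parse_state(result.stdout)


def commands_for(state, service=SERVICE_NAME):
    """Return the commands for step 3: restart if running, otherwise start."""
    if state == "RUNNING":
        return [["net", "stop", service], ["net", "start", service]]
    return [["net", "start", service]]


# Illustrative output only.
SAMPLE = """
SERVICE_NAME: HpcManagement
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 4  RUNNING
                                (STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)
"""

if __name__ == "__main__":
    state = parse_state(SAMPLE)
    print(state, commands_for(state))
```

On the head node, call `query_state()` from an elevated prompt and pass each returned command to `subprocess.run` to apply it.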

Verify that the compute node is deploying from the Private network

When a DHCP server responds, the compute node displays the IP address of the DHCP server. In Windows HPC Server 2008, only the private network is bound to Windows Deployment Services. If your head node is the DHCP server for your cluster, then you should see the IP address of your head node that is associated with the private network. Additionally, verify that the IP address that is assigned to the compute node is in the scope of the IPv4 private network for your cluster.

Restart Windows Deployment Services

Windows Deployment Services enables remote Windows installation to PXE-enabled computers. The following procedure describes how to restart the Windows Deployment Services on the head node. If you are using a different server for Windows Deployment Services, the following procedures can be helpful as a guideline.

Note

Windows HPC Server 2008 uses only the Transport Server role service in the Windows Deployment Services role. The Deployment Server role service does not need to be installed.

To restart Windows Deployment Services Server

  1. On the head node, click Start, point to Administrative Tools, then click Server Manager.

  2. Expand the Roles folder, then select Windows Deployment Services.

  3. In the view pane, in Summary, under System Services, select the Windows Deployment Services Server service and click Restart.

Set HPC deployment mode to respond to all PXE requests

The way in which the head node processes PXE requests is determined by the mode in which Windows Deployment Services is running on the head node. The options are:

  • Respond only to PXE requests that come from existing compute nodes: This is the default setting. Any new computer that contacts the head node with a PXE request will be ignored.

  • Respond to all PXE requests: If a new computer contacts the head node with a PXE request, Windows Deployment Services will respond to that request and alert the HPC Management Service. The computer is then assigned a name according to the compute node naming series specified during configuration, and it is listed with that name in Node Management, under the Unknown state.
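
The difference between the two modes can be summarized as a small decision function. The following Python sketch is a simplified model of the behavior described above, not the Windows Deployment Services implementation. The function, the enumeration, and the MAC addresses are hypothetical names chosen for illustration.

```python
from enum import Enum


class PxeMode(Enum):
    EXISTING_NODES_ONLY = "Respond only to PXE requests that come from existing compute nodes"
    ALL_REQUESTS = "Respond to all PXE requests"


def handle_pxe_request(mac, known_macs, mode):
    """Return what the head node does with a PXE request in this simplified model."""
    if mac in known_macs:
        return "respond"
    if mode is PxeMode.ALL_REQUESTS:
        # The new computer is named from the naming series and listed as a new node.
        return "respond and add as new node"
    return "ignore"


if __name__ == "__main__":
    known = {"00155D010203"}  # hypothetical MAC address of an existing node
    for mode in PxeMode:
        print(mode.name, handle_pxe_request("00155D0A0B0C", known, mode))
```

The model makes the failure mode easy to see: in the default mode, a node whose recorded identity does not match what it actually sends is treated as unknown and is ignored.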

To temporarily set the Windows Deployment Services mode to respond to all PXE requests, you can use the Add Node Wizard to deploy nodes from bare metal, or you can configure the setting manually. The following procedure describes how to configure the setting manually:

To set the Windows Deployment Services Mode

  1. Open HPC Cluster Manager. To open HPC Cluster Manager, click Start, point to All Programs, click Microsoft HPC Pack, and then click HPC Cluster Manager. If the User Account Control dialog box appears, confirm that the action it displays is what you want, and then click Continue.

  2. In the menu bar, click Options, and then click Deployment Settings. The Deployment Settings dialog box appears.

  3. Select the Respond to all PXE requests radio button.

  4. To set the new Windows Deployment Services mode and close the dialog box, click OK.

Verify that MAC addresses or GUIDs match

Incorrect or mismatched MAC addresses or GUIDs can cause a no bootfile error or no response to the PXE requests. This can happen if the head node already has information about the name, GUID, and MAC address of the node that does not match the actual values. For example, if you are using node XML when importing the node, the XML could be out of date, or could have a typo. Another case could be that the node had already joined the cluster and then hardware was updated on that node. Hardware values such as MAC addresses or GUIDs can change if hardware on the node is upgraded or repaired. The following steps can help you troubleshoot:

To verify and resolve a mismatched MAC or GUID – no bootfile error or PXE requests being ignored

  1. In HPC Cluster Manager, delete the node from the node list and set HPC deployment settings to Respond to all PXE requests (see previous procedure).

  2. Reboot the compute node. If the node re-appears in the node list with a new name and in the Unapproved state, this confirms that the values known by the head node were incorrect and it is a mismatched configuration issue.

  3. Right-click the node, click Edit Properties, and change the node name to the original name (you can change the name of the node while it is in the Unapproved state).

  4. Assign the desired node template.

  5. Change HPC deployment mode back to Respond only to PXE requests that come from existing nodes.

  6. Optionally, when all nodes are successfully deployed or reimaged, export the node XML.
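
Before deleting a node, it can help to compare the values that the head node has recorded with the values shown on the PXE screen. MAC addresses and GUIDs are often written with different separators, letter case, or braces, so the following Python sketch normalizes both before comparing them. The dictionaries and their values are hypothetical; how you obtain the recorded values (for example, from an exported node XML file) is outside the scope of the sketch.

```python
import re
import uuid


def normalize_mac(mac):
    """Return a MAC address as 12 uppercase hex digits without separators."""
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac)
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return digits.upper()


def normalize_guid(guid):
    """Return a GUID in lowercase hyphenated form, ignoring braces and case."""
    return str(uuid.UUID(guid))


def mismatches(recorded, observed):
    """Return the names of the fields that differ after normalization."""
    problems = []
    if normalize_mac(recorded["mac"]) != normalize_mac(observed["mac"]):
        problems.append("mac")
    if normalize_guid(recorded["guid"]) != normalize_guid(observed["guid"]):
        problems.append("guid")
    return problems


# Hypothetical values for illustration only.
recorded = {"mac": "00-15-5D-01-02-03", "guid": "{4C4C4544-0038-4D10-8051-B4C04F4A3153}"}
observed = {"mac": "00:15:5d:01:02:0a", "guid": "4c4c4544-0038-4d10-8051-b4c04f4a3153"}

if __name__ == "__main__":
    print(mismatches(recorded, observed))
```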

WinPE errors

If the compute node shows an error related to GUID mismatch or security, it typically means that the GUID that is being reported by the network adapter is different than the GUID that is being reported in the WinPE environment. This can happen if the BIOS or the drivers are not up to date. In this case, you might see the following error message:

Ensure the machine GUID for this node in HPC Cluster Manager, matches the SIMBIOS GUID in this log

Some hardware platforms might remain mismatched even after updating BIOS and drivers. In those circumstances, you can configure the head node to ignore GUIDs when performing security checks between the compute nodes and the head node (see procedure below).

The following procedure can help you troubleshoot this issue:

Warning

The following procedure includes steps to modify the registry. Incorrectly editing the registry may severely damage your system. Before making changes to the registry, you should back up any valued data on the computer.

To verify and resolve a mismatched MAC or GUID – errors in the WinPE environment

  1. Verify that the BIOS versions on the head node and compute node are up to date.

  2. Ensure that the drivers for all network adapters are up to date.

  3. Verify that the GUID and MAC address that are shown when the PXE request is sent are the same as those shown within the WinPE environment (after the boot.wim file is copied to the node and loaded).

  4. If the MAC address and GUID continue to be mismatched after updating the BIOS and drivers, then you can set a registry key on the head node to ignore GUIDs:

    HKLM\SOFTWARE\Microsoft\HPC\DisableSMBIOS (type: DWORD, value: 1)

    (To open Registry Editor, click Start, click Run, type regedit, and then click OK.)

  5. Reboot the compute node and try the deployment again.
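
The registry change in step 4 can also be prepared as a .reg file so that it can be reviewed before it is imported. The following Python sketch generates the text of such a file. It assumes that DisableSMBIOS is a DWORD value under the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\HPC key, which is one reading of the path given in step 4; the helper function name is hypothetical.

```python
def reg_dword_file(key, value_name, data):
    """Return the text of a .reg file that sets one DWORD value under a key."""
    if not 0 <= data <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    return "\n".join([
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"{value_name}"=dword:{data:08x}',
        "",
    ])


if __name__ == "__main__":
    # .reg files need the full root key name rather than the HKLM abbreviation.
    print(reg_dword_file(r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\HPC", "DisableSMBIOS", 1))
```

The same warning applies as for editing the registry directly: back up valued data before importing the file on the head node.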

Group policy conflicts

Contact your system administrator to determine if Internet Protocol security (IPsec) is enforced on your domain through Group Policy. If IPsec is enforced on your domain through Group Policy, you may experience issues during deployment. A workaround is to make your head node an IPsec boundary server so that compute nodes can communicate with the head node during PXE boot.

The following issues might indicate that the IPsec policy is disallowing the compute nodes from talking to the head node by blocking the ports:

  • The provisioning log shows SSPI errors during domain join.

  • The provisioning log shows timeout errors during Windows Preinstallation Environment (Windows PE) boot.

You can see the provisioning log in HPC Cluster Manager in Node Management. Select a node in the list view, and then click the Provisioning Log tab in the Details Pane.

Poor performance or failures when using multicast over 10 GigE connections

There are some reported issues with Multicast and 10 GigE connections where the performance is very slow, and deployment does not fail over to Unicast.

If you see poor performance while copying the image to the nodes using multicast, try the following steps:

  1. Stop the deployment operation.

  2. Ensure that all of the drivers and firmware for the network adapters are up to date.

  3. Retry the deployment on a small set of nodes using multicast copy. If that succeeds, then continue with your normal deployment. If that fails, try the following:

    1. Create a new node template that uses unicast copy. In the Create Node Template Wizard, on the Select OS Image tab, ensure that the check box to use multicast is not selected.

      Note

      It is easier to create a new template using the node template wizard than to try to convert a multicast node template to use unicast copy instead.

    2. Deploy the nodes using the new template.

Use Windows Performance Monitor to troubleshoot early phases of deployment

You can use Windows Performance Monitor to help you monitor and troubleshoot the early phases of deployment. For more information about monitoring performance of WDS, see Using Performance Monitoring.

When you first start the PXE boot process, you can monitor and verify progress by viewing the counters that are reported in Performance Monitor under WDS TFTP Server. As long as you have nodes that are reported in HPC Cluster Manager as waiting to boot into WinPE, you should see some Active Requests reported under WDS TFTP Server. If you see 0 active requests, you should troubleshoot why the nodes are not responding to PXE requests (rather than waiting for the node template to time out).

If you are using Multicast copy, as the nodes successfully complete the TFTP phase, you should see the numbers move into the WDS Multicast Server counters. You can view the number of Active Clients under WDS Multicast Server, and as the nodes finish copying the image, the number of active clients will gradually decrease to 0. You can monitor performance during this phase by viewing the counters in the WDS Multicast Server section. High numbers in the following counters in particular indicate performance problems: Total Auto Kicked Clients, Total Master Client Switches, Total NACK Packets, and Total Repair Packets. For more information about what the counters mean, see Using Performance Monitoring.

The following image illustrates the WDS information that you can see in Performance Monitor:

[Figure: WDS counters in Windows Performance Monitor]
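
The two checks described in this section can be sketched in Python as follows. The sketch assumes that you have already read the counter values into dictionaries keyed by counter name (for example, by noting them from Performance Monitor at two points in time). How the values are collected, the example numbers, and the threshold parameter are all hypothetical, because this topic does not define what counts as a high value.

```python
# Counters that this topic names as indicators of multicast performance problems.
WATCHED_COUNTERS = (
    "Total Auto Kicked Clients",
    "Total Master Client Switches",
    "Total NACK Packets",
    "Total Repair Packets",
)


def tftp_phase_stalled(nodes_waiting_for_winpe, active_requests):
    """Return True if nodes are waiting to boot into WinPE but TFTP shows no activity."""
    return nodes_waiting_for_winpe > 0 and active_requests == 0


def rising_counters(earlier, later, threshold=0):
    """Return the watched counters that grew by more than `threshold` between samples."""
    growth = {}
    for name in WATCHED_COUNTERS:
        delta = later.get(name, 0) - earlier.get(name, 0)
        if delta > threshold:
            growth[name] = delta
    return growth


if __name__ == "__main__":
    # Hypothetical samples for illustration only.
    first = {"Total NACK Packets": 120, "Total Repair Packets": 40}
    second = {"Total NACK Packets": 5120, "Total Repair Packets": 45}
    print(tftp_phase_stalled(nodes_waiting_for_winpe=8, active_requests=0))
    print(rising_counters(first, second, threshold=100))
```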

Enable WDS tracing

The following procedure describes how to enable WDS tracing. You can use the information in the tracing log file to help troubleshoot issues with WDS. For detailed information, see 936625: How to enable logging in Windows Deployment Services (WDS) in Windows Server 2003 and in Windows Server 2008.

Warning

The following procedure includes steps to modify the registry. Incorrectly editing the registry may severely damage your system. Before making changes to the registry, you should back up any valued data on the computer.

To enable WDS tracing and view the log file

  1. To open Registry Editor, click Start, click Run, type regedit, and then click OK.

  2. Locate and then click the following registry subkey:

    HKEY_LOCAL_MACHINE\Software\Microsoft\Tracing\WDSServer

  3. Right-click EnableFileTracing, and then click Modify.

  4. In the Value data box, type 1, and then click OK.

  5. View the trace log for the WDSServer component in the following location:

    %windir%\tracing\wdsserver.log

If you installed both the Domain Name System (DNS) Server service and Windows Deployment Services on the head node, the DNS Server service might bind to all ports in the WDS port range. If you are having that issue, you might find one or more error messages that resemble the following in the Wdsserver.log tracing log file:

[2416] 16:01:36: [d:\w7rtm\base\ntsetup\opktools\wds\wdssrv\server\src\udpportrange.cpp:755] Expression: , Win32 Error=0x2
[2416] 16:01:36: [d:\w7rtm\base\ntsetup\opktools\wds\wdssrv\server\src\regudpendpoint.cpp:192] Expression: , Win32 Error=0x2
[2416] 16:01:36: [d:\w7rtm\base\ntsetup\opktools\wds\wdssrv\server\inc\RegEndpoint.h:354] Expression: , Win32 Error=0x2
[2416] 16:01:36: [WDSTFTP][UDP][Ep=0] Registration Failed (rc=2)

To resolve that issue, see 977512: The DNS Server service binds to all ports in the Windows Deployment Services port range on a server that is running Windows Server 2008 R2 or Windows Server 2008.
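
Because the trace log can be long, it may be easier to search it with a script than by eye. The following Python sketch scans log text for the two patterns that appear in the example messages above: Win32 error codes and failed endpoint registrations. The regular expressions are derived only from that example, so treat them as an assumption about the log format rather than a complete description of it.

```python
import re

# Patterns derived from the example messages in this topic.
WIN32_ERROR_RE = re.compile(r"Win32 Error=0x([0-9A-Fa-f]+)")
REGISTRATION_RE = re.compile(r"\[(\w+)\]\[UDP\]\[Ep=(\d+)\] Registration Failed \(rc=(\d+)\)")


def scan_wds_log(text):
    """Return the Win32 error codes and failed UDP registrations found in log text."""
    return {
        "win32_errors": [int(code, 16) for code in WIN32_ERROR_RE.findall(text)],
        "failed_registrations": [
            (provider, int(endpoint), int(rc))
            for provider, endpoint, rc in REGISTRATION_RE.findall(text)
        ],
    }


# The example messages from this topic.
SAMPLE = r"""
[2416] 16:01:36: [d:\w7rtm\base\ntsetup\opktools\wds\wdssrv\server\src\udpportrange.cpp:755] Expression: , Win32 Error=0x2
[2416] 16:01:36: [d:\w7rtm\base\ntsetup\opktools\wds\wdssrv\server\src\regudpendpoint.cpp:192] Expression: , Win32 Error=0x2
[2416] 16:01:36: [d:\w7rtm\base\ntsetup\opktools\wds\wdssrv\server\inc\RegEndpoint.h:354] Expression: , Win32 Error=0x2
[2416] 16:01:36: [WDSTFTP][UDP][Ep=0] Registration Failed (rc=2)
"""

if __name__ == "__main__":
    print(scan_wds_log(SAMPLE))
```

On the head node, pass the contents of %windir%\tracing\wdsserver.log to `scan_wds_log` instead of the sample text. A failed WDSTFTP registration is the symptom that the article referenced above addresses.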

See Also

Windows HPC Server 2008 R2: Troubleshooting

Design and Deployment Guide for Windows HPC Server 2008 R2

Windows HPC Server 2008 Design and Deployment Guide