Understanding Node Metrics and Properties in HPC Cluster Manager

 

Applies To: Microsoft HPC Pack 2012, Microsoft HPC Pack 2012 R2

This topic describes the node properties and metrics that are available in HPC Cluster Manager to help you monitor your cluster. You can modify the node list and heat map views in HPC Cluster Manager to display various node metrics and properties. The heat map view displays only metrics. For information about creating custom node views, see Understanding Node List, Heat Map, and Custom Tab Views. For information about adding more metrics, see Customize Metrics Collection in Windows HPC Server.

In this topic:

  • Alphabetical list of node properties and metrics

  • Node properties and metrics by conceptual categories

  • Additional considerations

  • Additional references

Alphabetical list of node properties and metrics

The following table describes the available values for node properties and metrics in HPC Cluster Manager.

Note

In the “Property or metric” column, the names of metrics and of node properties that reflect node status are denoted by bold font.

Property or metric

Description

Category

Affinity

Displays the affinity setting for this node. Possible values:

  • Null – affinity for the node is managed according to the job scheduler affinity policy (see Understanding Affinity)

  • True – the HPC Node Manager Service sets affinity for all tasks that run on this node

  • False – affinity on the node is not managed by the HPC services, and the operating system or the application manages placement of tasks on physical cores

This value is set by the HPC cluster administrator.

Cores/memory/disk

Application IP

The IP address for the network adapter that is bound to the Application network.

Network

Application Link Speed

The link speed for the network adapter that is bound to the Application network.

Network

Application Link State

The link state for the network adapter that is bound to the Application network. If your cluster topology does not include an Application network, or if the node is not connected to this network, the value appears as Disconnected. Possible values are Connected and Disconnected.

This value is periodically updated by the HPC Management Service during the discovery operation.

Network

Application NetworkDirect

Whether or not a NetworkDirect provider is installed for the Application network. Possible values are True and False.

This value is periodically updated by the HPC Management Service.

Network

Available Physical Memory (MBytes)

The amount of physical memory available to processes running on the computer, in megabytes. AvailableMBytes is calculated by adding the amount of space on the Zeroed, Free, and Standby memory lists. Free memory is ready for use; Zeroed memory is pages of memory filled with zeros to prevent later processes from seeing data used by a previous process; Standby memory is memory removed from a process's working set (its physical memory) en route to disk but still available to be recalled. This counter displays the last observed value only; it is not an average.

Cores/memory/disk
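The sum described for Available Physical Memory (MBytes) can be sketched in Python as follows. This is only an illustration: the function name and the sample list sizes are hypothetical, and the sketch assumes that one megabyte is 1024 × 1024 bytes.

```python
BYTES_PER_MBYTE = 1024 * 1024  # assumption: binary megabytes


def available_mbytes(zeroed_bytes, free_bytes, standby_bytes):
    """Add the Zeroed, Free, and Standby list sizes and convert to megabytes."""
    return (zeroed_bytes + free_bytes + standby_bytes) / BYTES_PER_MBYTE


# Hypothetical list sizes: 512 MB zeroed, 1024 MB free, 2048 MB standby.
total = available_mbytes(
    512 * BYTES_PER_MBYTE,
    1024 * BYTES_PER_MBYTE,
    2048 * BYTES_PER_MBYTE,
)
print(total)  # 3584.0
```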

Boot Information

Information related to booting over the network from an iSCSI server. This specifies how the head node should respond to a PXE request from the node.

Deployment

Context Switches / second

The combined rate at which all processors on the computer are switched from one thread to another. Context switches occur when a running thread voluntarily relinquishes the processor, is preempted by a higher priority ready thread, or switches between user-mode and privileged (kernel) mode to use an Executive or subsystem service.

Cores/memory/disk

Cores

The number of physical cores on the computer.

This value is periodically updated by the HPC Management Service during the discovery operation.

Note

If you change the hardware configuration of a compute node, ensure that the configuration change is detected and updated in the job scheduling database by taking the node Offline (preferably before making the hardware change), and then bringing the node Online again.

Cores/memory/disk

Cores In Use

The number of physical cores that are currently allocated to jobs.

Cores/memory/disk

CPU Usage (%)

User and system time for all physical cores on the node, divided by the sampling interval times the total number of physical cores on the node.

Cores/memory/disk
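As a rough illustration of the CPU Usage (%) calculation described above, the following Python sketch divides the busy (user plus system) time accumulated across all cores by the sampling interval times the core count. The function name, parameter names, and sample numbers are hypothetical and are not part of HPC Pack.

```python
def cpu_usage_percent(busy_seconds_per_core, interval_seconds):
    """Return node-wide CPU usage as a percentage.

    busy_seconds_per_core holds the user + system time that each physical
    core spent busy during the sampling interval (hypothetical input).
    """
    cores = len(busy_seconds_per_core)
    if cores == 0 or interval_seconds <= 0:
        raise ValueError("need at least one core and a positive interval")
    total_busy = sum(busy_seconds_per_core)
    return 100.0 * total_busy / (interval_seconds * cores)


# Four cores sampled over one second: two fully busy, one half busy, one idle.
usage = cpu_usage_percent([1.0, 1.0, 0.5, 0.0], interval_seconds=1.0)
print(f"{usage:.1f}%")  # 62.5%
```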

Description

A description for the node.

This value is set by the HPC cluster administrator.

Deployment

Disk Queue Length

An indication of the number of transactions that are waiting to be processed. This counter provides a primary measure of disk congestion. The queue length is representative of not only the number of transactions, but also the length and frequency of each transaction.

Cores/memory/disk

Disk Throughput (Bytes/sec)

An indication of the rate at which data is being transferred. Describes the throughput of the disk subsystem.

Cores/memory/disk

DNS Name

The fully qualified DNS name for the node, including the DNS suffix. For example, “myNode.myDomain.com”.

Network

Domain Name

The domain name specifications for the node.

Network

Durable Queues Total Bytes

Total number of bytes of Message Queuing messages on the broker node. The broker node stores messages using Microsoft Message Queuing (MSMQ) when SOA clients create sessions on the cluster using the Durable Session APIs. Responses that are stored by the broker can be retrieved by the client at any time, even after intentional or unintentional disconnect. Messages are deleted when SOA clients retrieve their responses and close the session, or when the job history retention period is reached (by default, this is set to three days).

By default, the MSMQ storage limit is 8 GB. When the MSMQ quota is reached, durable sessions stop working.

SOA
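One way to use the Durable Queues Total Bytes metric is to compare it against the MSMQ storage limit. The following Python sketch assumes the 8 GB default mentioned above, interpreted as binary gigabytes. The function names and the 80% warning threshold are hypothetical choices for illustration, not HPC Pack settings.

```python
DEFAULT_MSMQ_QUOTA_BYTES = 8 * 1024**3  # 8 GB default; binary gigabytes assumed


def quota_used_fraction(durable_queues_total_bytes,
                        quota_bytes=DEFAULT_MSMQ_QUOTA_BYTES):
    """Fraction of the MSMQ storage quota consumed by durable queue messages."""
    return durable_queues_total_bytes / quota_bytes


def needs_attention(durable_queues_total_bytes, warn_fraction=0.8):
    # warn_fraction is an arbitrary illustrative threshold.
    return quota_used_fraction(durable_queues_total_bytes) >= warn_fraction


sample_bytes = 7 * 1024**3  # hypothetical metric reading
print(f"{quota_used_fraction(sample_bytes):.1%} used, "
      f"attention needed: {needs_attention(sample_bytes)}")
```

Because durable sessions stop working once the quota is reached, a check like this is most useful when it fires well before the limit.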

Durable Queues Total Messages

Total number of Message Queuing messages on the broker node.

SOA

Durable Requests Queue Length

Total number of requests stored in local Message Queuing.

SOA

Durable Responses Queue Length

Total number of responses stored in local Message Queuing.

SOA

Enterprise IP

The IP address for the network adapter that is bound to the Enterprise network.

Network

Enterprise Link Speed

The link speed for the network adapter that is bound to the Enterprise network.

Network

Enterprise Link State

The link state for the network adapter that is bound to the Enterprise network. If the node is not connected to this network, the value appears as Disconnected. Possible values are Connected and Disconnected.

This value is periodically updated by the HPC Management Service during the discovery operation.

Network

Enterprise NetworkDirect

Whether or not a NetworkDirect provider is installed for the Enterprise network. Possible values are True and False.

This value is periodically updated by the HPC Management Service.

Network

Free Disk Space (%)

Percentage of total usable space on the local disk.

Cores/memory/disk

Groups

The node groups to which the node belongs. Membership in the default node groups is determined at deployment or by changing the node role. Membership in custom node groups is determined by the HPC cluster administrator.

Status/workload

HPC SOA Calculations/Sec

The number of calls from the broker node that are currently being calculated. This is a moving average of the past N seconds. This value can be significantly higher than the number of cores because of caching on the service host.

The HPC SOA metrics, along with the memory and CPU metrics, can help you determine how to scale your broker nodes. For example, when the SOA throughput, memory, and CPU usage are high on your broker nodes, add more brokers. When these metrics are low, convert some brokers to compute nodes. For more information, see Multiple roles and broker scaling.

SOA
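The broker scaling guidance above can be sketched as a simple decision function. This Python example is only an illustration: the class, the function, and every threshold are hypothetical, and the memory-used percentage is assumed to be derived separately from the Memory and Available Physical Memory values.

```python
from dataclasses import dataclass


@dataclass
class BrokerSample:
    soa_requests_per_sec: float  # HPC SOA Requests/Sec reading
    cpu_usage_percent: float     # CPU Usage (%) reading
    memory_used_percent: float   # derived value, not a built-in metric


def scaling_hint(sample, high=(500.0, 80.0, 80.0), low=(50.0, 20.0, 20.0)):
    """Suggest a scaling action; all thresholds are illustrative only."""
    values = (
        sample.soa_requests_per_sec,
        sample.cpu_usage_percent,
        sample.memory_used_percent,
    )
    if all(v >= h for v, h in zip(values, high)):
        return "add brokers"
    if all(v <= l for v, l in zip(values, low)):
        return "convert brokers to compute nodes"
    return "no change"


print(scaling_hint(BrokerSample(900.0, 95.0, 90.0)))  # add brokers
```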

HPC SOA Faults/Sec

The number of faulted calls on the node per second.

SOA

HPC SOA Requests/Sec

The number of requests to the broker node per second.

SOA

HPC SOA Responses/Sec

The number of responses per second on the broker node. This is a moving average of the past N seconds.

SOA

Idle

Whether or not the workstation node is idle. Possible values:

  • Null – applied to any node that is not a workstation node, and to workstation nodes that do not use the activity detection policy.

  • True – the user activity that is detected on this node is below the threshold that is defined in the Workstation Availability Policy. The node can be used to run jobs.

  • False – the user activity that is detected on this node is above the threshold that is defined in the Workstation Availability Policy. The node cannot be used to run jobs.

Status/workload
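Because Idle has three possible values, code that consumes it should not treat Null as False. The following Python sketch shows one way to keep the three cases distinct. It assumes the value has already been read into a variable that is None, True, or False; the function name is hypothetical.

```python
from typing import Optional


def describe_idle(idle: Optional[bool]) -> str:
    """Map the three Idle values to the meanings listed above."""
    if idle is None:
        # Not a workstation node, or activity detection is not in use.
        return "activity detection does not apply"
    if idle:
        return "activity below threshold; node can run jobs"
    return "activity above threshold; node cannot run jobs"


for value in (None, True, False):
    print(value, "->", describe_idle(value))
```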

Install Path

The path where the HPC Pack software is installed.

This value is not listed for Windows Azure nodes.

Deployment

Installed Service Roles

The HPC node roles that are installed on the node. Node roles that are installed can be enabled or disabled by changing the node role (enabled roles are listed in the Node Role property). For more information, see Understanding Node Roles in Microsoft HPC Pack.

Dedicated, on-premises nodes can have the following node roles installed:

  • HeadNode (head nodes only)

  • BrokerNode

  • ComputeNode

Windows Azure nodes can have one of the following node roles installed:

  • Windows Azure Worker Node

  • Windows Azure Virtual Machine Node

Note

The Windows Azure Worker Node role is available starting with HPC Pack 2008 R2 with Service Pack 1 (SP1). The Windows Azure Virtual Machine Node role is available starting with HPC Pack 2008 R2 with Service Pack 2 (SP2).

Workstation nodes can have the following role installed:

  • Workstation Node

Unmanaged server nodes can have the following role installed:

  • Unmanaged Server Node

    Note

    The Unmanaged Server Node role is available starting with HPC Pack 2008 R2 with Service Pack 3 (SP3).

Deployment

Location

The primary, secondary, and tertiary location details for the node. For example, data center, server rack, chassis.

This property value can be specified by the HPC cluster administrator.

Deployment

LUN Mapping

A GUID that identifies the iSCSI boot node.

Deployment

Machine Guid

The SMBIOS GUID of the node.

Deployment

Management Ip Address

The out-of-band management IP address for the node that you can use for scriptable power control tools such as Intelligent Platform Management Interface (IPMI) scripts. For example, this can be set to the IP address for the Baseboard Management Controller (BMC) of the compute node. For more information, see Scriptable Power Control Tools.

This property value can be set by the HPC cluster administrator.

Deployment

Memory

The amount of memory installed on the node.

Cores/memory/disk

Memory Paging (Hard Faults/second)

The number of hard page faults per second. A hard fault occurs when a page that a program needs is no longer in main memory because it has been swapped out to the paging file, so the system must read it from the hard disk. Frequent hard faults cause slowdowns and increased hard disk activity. Excessive hard faults can lead to hard disk thrashing (a program stops responding, but the hard drive continues to run for an extended period).

Cores/memory/disk

Name

The name of the node, including the domain. For example, DOMAIN\nodename.

For Windows Azure nodes, this name is AZURE\nodename.

Deployment

NetBoot MAC Address

The MAC address of the network adapter that is bound to the Private network. This is the network that is used when deploying an operating system image to the node (PXE boot).

Deployment

Network Usage (Bytes/second)

An indication of the total network throughput for all networks on a node. This does not include NetworkDirect traffic, because NetworkDirect bypasses TCP/IP.

Network

Node Health

The overall indication of node health. Indicates whether or not there are any warnings or errors that the HPC services are aware of on that node, if the node is performing an operation that was initiated by the HPC cluster administrator, or if the node has not been added to the cluster. For information about node health values, see Understanding Node States, Health, and Operations.

Status/workload

Node Name

The name of the node.

For nodes that are deployed from bare metal, this name is automatically assigned according to the node naming series that the HPC cluster administrator defines in the node template.

For Windows Azure nodes, the name starts with “AzureCN-” followed by a number. For example, AzureCN-0001.

Deployment

Node Role

The node roles that are enabled for the node. Dedicated, on-premises nodes can have more than one role enabled, depending on what roles are installed (installed roles are listed in the Installed Service Roles property). Possible values:

  • ComputeNode

  • BrokerNode

  • Unmanaged Server Node

  • Windows Azure Worker Node

  • Windows Azure Virtual Machine Node

  • Workstation Node

The head node role is not displayed in this property.

Note

The Unmanaged Server Node role is available starting with HPC Pack 2008 R2 with Service Pack 3 (SP3).

Note

The Windows Azure Worker Node role is available starting with HPC Pack 2008 R2 with Service Pack 1 (SP1). The Windows Azure Virtual Machine Node role is available starting with HPC Pack 2008 R2 with Service Pack 2 (SP2).

For more information, see Understanding Node Roles in Microsoft HPC Pack.

Status/workload

Node State

The node’s deployment state, or whether or not an administrator wants the node to be available as a resource for cluster jobs (Online or Offline). For information about node state values, see Understanding Node States, Health, and Operations.

Status/workload

Node Template

The name of the node template that was used to deploy the node or to join the node to the cluster.

Deployment

OS Architecture

The operating system architecture on the node.

Deployment

OS Version

The operating system version on the node.

Deployment

Primary HeadNode

For a head node that is configured for high availability in a failover cluster, the initial head node computer on which HPC Pack is installed has a value set to True for this property.

Warning

This property is removed starting with HPC Pack 2012.

Status/workload

Private IP

The IP address for the network adapter that is bound to the Private network.

Network

Private Link Speed

The link speed for the network adapter that is bound to the Private network.

Network

Private Link State

The link state for the network adapter that is bound to the Private network. If your cluster topology does not include a Private network, or if the node is not connected to this network, the value appears as Disconnected. Possible values are Connected and Disconnected.

This value is periodically updated by the HPC Management Service during the discovery operation.

Network

Private NetworkDirect

Whether or not a NetworkDirect provider is installed for the Private network. Possible values are True and False.

This value is periodically updated by the HPC Management Service.

Network

Processors

Name and properties of the processors that are installed on the node.

Cores/memory/disk

Product Key

The Windows product key that will be used to activate the operating system on the node.

This property value can be specified by the HPC cluster administrator.

Deployment

Progress

The most recent deployment log entry during deployment or provisioning operations. You can sort by this column to help monitor deployment progress.

Deployment

Provisioned

Whether or not HPC Pack is installed on the node. Possible values are True and False.

Note

If you assign a node template that includes steps to deploy an operating system and this property is True, only the tasks in the Maintenance phase of the node template will run. If you want to reinstall the operating system, you can assign the template, then run the Reimage action.

Deployment

Running Jobs

The number of jobs that are currently using this node.

Status/workload

Running Tasks

The number of tasks, subtasks, or task processes (such as an MPI rank) that are currently using this node. The number can be higher than the number of physical cores or sockets if the subscribed cores or sockets properties are set on the node.

Status/workload

Service Health

The overall indication of the health of the HPC services. Indicates whether or not there are any warnings or errors that the HPC services are aware of on that node.

Status/workload

Sockets

The number of physical sockets on the node.

Cores/memory/disk

Subscribed Cores

The number of logical cores that the HPC Job Scheduler Service will use when it is allocating tasks to the node. It can be larger or smaller than the number of physical cores. Note: The Cores In Use metric reflects how many physical cores are in use. The Running Tasks metric can help you monitor how many subscribed cores are in use.

This value is set by the HPC cluster administrator. For more information, see Over-subscribe or under-subscribe core or socket counts on cluster nodes.

Cores/memory/disk
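The relationship between Cores and Subscribed Cores can be illustrated with a short Python sketch. The function name and the sample values are hypothetical. The sketch only classifies a node by comparing the two numbers and does not query HPC Pack.

```python
def subscription_status(physical_cores, subscribed_cores):
    """Classify a node by comparing subscribed cores with physical cores."""
    if subscribed_cores > physical_cores:
        return "over-subscribed"
    if subscribed_cores < physical_cores:
        return "under-subscribed"
    return "matches physical cores"


# Hypothetical node with 8 physical cores and 16 subscribed cores.
print(subscription_status(physical_cores=8, subscribed_cores=16))  # over-subscribed
```

On an over-subscribed node, Running Tasks can exceed Cores, which is why that metric is the better indicator of subscribed-core usage.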

Subscribed Sockets

The number of logical sockets that the HPC Job Scheduler Service will use when it is allocating tasks to the node. It can be larger or smaller than the number of physical sockets.

This value is set by the HPC cluster administrator. For more information, see Over-subscribe or under-subscribe core or socket counts on cluster nodes.

Cores/memory/disk

System Calls / second

The number of calls per second made to system components (kernel-mode services). This is a measure of how busy the system is managing applications and services. Comparing this value with the Interrupts/Sec counter can indicate whether processor issues are hardware or software related.

Cores/memory/disk

UnattendSetup

Whether or not setup.exe ran with the -unattend flag.

Deployment

Version

The version number of HPC Pack that is installed on the node. For example:

  • HPC Pack 2008 R2 has a value of 3.0.xxxx.x.

  • HPC Pack 2008 R2 with SP4 has a value of 3.4.xxxx.x.

  • HPC Pack 2012 has a value of 4.0.xxxx.x.

Deployment
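The version examples above can be turned into a small lookup. This Python sketch maps the major and minor parts of a Version value to a release name. It covers only the three releases listed in this topic, and the function name and the sample build number are hypothetical.

```python
KNOWN_RELEASES = {
    (3, 0): "HPC Pack 2008 R2",
    (3, 4): "HPC Pack 2008 R2 with SP4",
    (4, 0): "HPC Pack 2012",
}


def release_name(version):
    """Return the release for a 'major.minor.build.revision' string."""
    parts = version.split(".")
    key = (int(parts[0]), int(parts[1]))
    # Versions not listed in this topic are reported as unknown.
    return KNOWN_RELEASES.get(key, "unknown")


print(release_name("4.0.1234.0"))  # build number is a placeholder
```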

Windows Azure Instance Name

The computer name of the Windows Azure role instance. This value is assigned by Windows Azure.

Azure

Windows Azure Node Address

The IP address of the Windows Azure node. This value is assigned by Windows Azure. For a list of the public IP ranges, see the posted IP Ranges.

Azure

Windows Azure Node Size

The size of the Windows Azure node instance. The size determines the number of CPU cores, memory capacity, and disk space as defined by Windows Azure.

This value is specified by the HPC cluster administrator when adding Windows Azure nodes to the cluster.

Azure

Windows Azure Service Name

The public name of the hosted service (in the Windows Azure subscription) in which this Windows Azure node is deployed.

This value is defined by the HPC cluster administrator in the node template.

Azure

Windows Azure Storage Service Name

The public name of the storage account (in the Windows Azure subscription) that is associated with the Windows Azure node.

This value is defined by the HPC cluster administrator in the node template.

Azure

Windows Azure Subscription ID

The unique ID for the Windows Azure subscription account associated with the Windows Azure node.

This value is defined by the HPC cluster administrator in the node template.

Azure

Node properties and metrics by conceptual categories

The following lists group the properties and metrics by functional categories so that you can quickly identify what values are available for different aspects of the cluster. These lists can help you select which values to display in custom node views to help monitor different aspects of cluster performance. In the following lists, the names of metrics and of node properties that reflect node status are denoted by bold font.

Cores/memory/disk

  • Processors

  • Cores

  • Sockets

  • Cores In Use

  • CPU Usage (%)

  • Context Switches / second

  • System Calls / second

  • Affinity

  • Subscribed Cores

  • Subscribed Sockets

  • Memory

  • Available Physical Memory (MBytes)

  • Memory Paging (Hard Faults/second)

  • Free Disk Space (%)

  • Disk Queue Length

  • Disk Throughput (Bytes/sec)

Status/workload

  • Node State

  • Node Health

  • Node Role

  • Groups

  • Primary HeadNode

  • Service Health

  • Idle

  • Running Jobs

  • Running Tasks

SOA

  • Durable Queues Total Bytes

  • Durable Queues Total Messages

  • Durable Requests Queue Length

  • Durable Responses Queue Length

  • HPC SOA Calculations/Sec

  • HPC SOA Faults/Sec

  • HPC SOA Requests/Sec

  • HPC SOA Responses/Sec

Network

  • DNS Name

  • Domain Name

  • Enterprise IP

  • Enterprise Link Speed

  • Enterprise Link State

  • Enterprise NetworkDirect

  • Private IP

  • Private Link Speed

  • Private Link State

  • Private NetworkDirect

  • Application IP

  • Application Link Speed

  • Application Link State

  • Application NetworkDirect

  • Network Usage (Bytes/second)

Deployment

  • Name

  • Node Name

  • Node Template

  • Description

  • Location

  • Machine Guid

  • NetBoot MAC Address

  • Boot Information

  • Install Path

  • Version

  • Installed Service Roles

  • OS Architecture

  • OS Version

  • Product Key

  • Management Ip Address

  • LUN Mapping

  • Provisioned

  • UnattendSetup

  • Progress

Azure

  • Size

  • Windows Azure Instance Name

  • Windows Azure Node Address

  • Windows Azure Node Size

  • Windows Azure Service Name

  • Windows Azure Storage Service Name

  • Windows Azure Subscription ID

Additional considerations

HPC Pack 2008 R2 SP1 additions

The following properties or metrics were added in Service Pack 1 of HPC Pack 2008 R2. These changes are related to the ability to add Windows Azure nodes to the cluster. For more information, see Deploying Azure Nodes with Microsoft HPC Pack [RETIRED].

  • Size

  • Windows Azure Node Address

  • Windows Azure Service Name

  • Windows Azure Storage Service Name

  • Windows Azure Subscription ID

HPC Pack 2008 R2 SP2 additions

The following properties or metrics were added in Service Pack 2 of HPC Pack 2008 R2. These changes are related to the ability to oversubscribe and undersubscribe nodes. For more information, see Over-subscribe or under-subscribe core or socket counts on cluster nodes.

  • Affinity

  • Subscribed Cores

  • Subscribed Sockets

Additional references