Understanding Node Metrics and Properties in HPC Cluster Manager
Applies To: Microsoft HPC Pack 2012, Microsoft HPC Pack 2012 R2
This topic describes the node properties and metrics that are available in HPC Cluster Manager to help you monitor your cluster. You can modify the node list and heat map views in HPC Cluster Manager to display various node metrics and properties. The heat map view displays only metrics. For information about creating custom node views, see Understanding Node List, Heat Map, and Custom Tab Views. For information about adding more metrics, see Customize Metrics Collection in Windows HPC Server.
In this topic:
Alphabetical list of node properties and metrics
Node properties and metrics by conceptual categories
Additional considerations
Additional references
Alphabetical list of node properties and metrics
The following table describes the available values for node properties and metrics in HPC Cluster Manager.
Note
In the “Property or metric” column, the names of metrics and of node properties that reflect node status are denoted by bold font.
Property or metric |
Description |
Category |
---|---|---|
Affinity |
Displays the affinity setting for this node. Possible values:
This value is set by the HPC cluster administrator. |
Cores/memory/disk |
Application IP |
The IP address for the network adapter that is bound to the Application network. |
Network |
Application Link Speed |
The link speed for the network adapter that is bound to the Application network. |
Network |
Application Link State |
The link state for the network adapter that is bound to the Application network. If your cluster topology does not include an Application network, or if the node is not connected to this network, the value appears as Disconnected. Possible values are Connected and Disconnected. This value is periodically updated by the HPC Management Service during the discovery operation. |
Network |
Application NetworkDirect |
Whether or not a NetworkDirect provider is installed for the Application network. Possible values are True and False. This value is periodically updated by the HPC Management Service. |
Network |
Available Physical Memory (MBytes) |
The amount of physical memory available to processes running on the computer, in megabytes. AvailableMBytes is calculated by adding the amount of space on the Zeroed, Free, and Standby memory lists. Free memory is ready for use; Zeroed memory consists of pages filled with zeros to prevent later processes from seeing data used by a previous process; Standby memory is memory removed from a process's working set (its physical memory) en route to disk but still available to be recalled. This counter displays the last observed value only; it is not an average. |
Cores/memory/disk |
Boot Information |
Information related to booting over the network from an iSCSI server. This specifies how the head node should respond to a PXE request from the node. |
Deployment |
Context Switches / second |
The combined rate at which all processors on the computer are switched from one thread to another. Context switches occur when a running thread voluntarily relinquishes the processor, is preempted by a higher priority ready thread, or switches between user-mode and privileged (kernel) mode to use an Executive or subsystem service. |
Cores/memory/disk |
Cores |
The number of physical cores on the computer. This value is periodically updated by the HPC Management Service during the discovery operation. Note If you change the hardware configuration of a compute node, ensure that the configuration change is detected and updated in the job scheduling database by taking the node Offline (preferably before making the hardware change), and then bringing the node Online again. |
Cores/memory/disk |
Cores In Use |
The number of physical cores that are currently allocated to jobs. |
Cores/memory/disk |
CPU Usage (%) |
User and system time for all physical cores on the node, divided by the sampling interval times the total number of physical cores on the node. |
Cores/memory/disk |
Description |
A description for the node. This value is set by the HPC cluster administrator. |
Deployment |
Disk Queue Length |
An indication of the number of transactions that are waiting to be processed. This counter provides a primary measure of disk congestion. The queue length reflects not only the number of transactions, but also the length and frequency of each transaction. |
Cores/memory/disk |
Disk Throughput (Bytes/sec) |
An indication of the rate at which data is being transferred. Describes the throughput performance of the disk subsystem. |
Cores/memory/disk |
DNS Name |
The fully qualified DNS name for the node, including the DNS suffix. For example, “myNode.myDomain.com”. |
Network |
Domain Name |
The domain name specifications for the node. |
Network |
Durable Queues Total Bytes |
Total number of bytes of Message Queuing messages on the broker node. The broker node stores messages using Microsoft Message Queuing (MSMQ) when SOA clients create sessions on the cluster using the Durable Session APIs. Responses that are stored by the broker can be retrieved by the client at any time, even after an intentional or unintentional disconnect. Messages are deleted when SOA clients retrieve their responses and close the session, or when the job history retention period is reached (by default, this is set to three days). By default, the MSMQ storage limit is 8 GB. When the MSMQ quota is reached, durable sessions stop working. |
SOA |
Durable Queues Total Messages |
Total number of Message Queuing messages on the broker node. |
SOA |
Durable Requests Queue Length |
Total number of requests stored in local Message Queuing. |
SOA |
Durable Responses Queue Length |
Total number of responses stored in local Message Queuing. |
SOA |
Enterprise IP |
The IP address for the network adapter that is bound to the Enterprise network. |
Network |
Enterprise Link Speed |
The link speed for the network adapter that is bound to the Enterprise network. |
Network |
Enterprise Link State |
The link state for the network adapter that is bound to the Enterprise network. If the node is not connected to this network, the value appears as Disconnected. Possible values are Connected and Disconnected. This value is periodically updated by the HPC Management Service during the discovery operation. |
Network |
Enterprise NetworkDirect |
Whether or not a NetworkDirect provider is installed for the Enterprise network. Possible values are True and False. This value is periodically updated by the HPC Management Service. |
Network |
Free Disk Space (%) |
Percentage of total usable space on the local disk that is free. |
Cores/memory/disk |
Groups |
The node groups to which the node belongs. Membership in the default node groups is determined at deployment or by changing the node role. Membership in custom node groups is determined by the HPC cluster administrator. |
Status/workload |
HPC SOA Calculations/Sec |
Current calculating calls from the broker node. This is a moving average of the past N seconds. This value can be significantly higher than the number of cores because of caching on the service host. The HPC SOA metrics, along with the memory and CPU metrics, can help you determine how to scale your broker nodes. For example, when the SOA throughput, memory, and CPU usage are high on your broker nodes, add more brokers. When these metrics are low, convert some brokers to compute nodes. For more information, see Multiple roles and broker scaling. |
SOA |
HPC SOA Faults/Sec |
The number of faulted calls on the node per second. |
SOA |
HPC SOA Requests/Sec |
The number of requests to the broker node per second. |
SOA |
HPC SOA Responses/Sec |
The number of responses on the broker node. This is a moving average of the past N seconds. |
SOA |
Idle |
Whether or not the workstation node is idle. Possible values:
|
Status/workload |
Install Path |
The path where the HPC Pack software is installed. This value is not listed for Windows Azure nodes. |
Deployment |
Installed Service Roles |
The HPC node roles that are installed on the node. Node roles that are installed can be enabled or disabled by changing the node role (enabled roles are listed in the Node Role property). For more information, see Understanding Node Roles in Microsoft HPC Pack. Dedicated, on-premises nodes can have the following node roles installed:
Windows Azure nodes can have one of the following node roles installed:
Note The Windows Azure Work Node role is available starting with HPC Pack 2008 R2 with Service Pack 1 (SP1). The Windows Azure Virtual Machine Node role is available starting with HPC Pack 2008 R2 with Service Pack 2 (SP2).
Workstation nodes can have the following role installed:
Unmanaged server nodes can have the following role installed:
|
Deployment |
Location |
The primary, secondary, and tertiary location details for the node. For example, data center, server rack, chassis. This property value can be specified by the HPC cluster administrator. |
Deployment |
LUN Mapping |
A GUID that identifies the iSCSI boot node. |
Deployment |
Machine Guid |
The SMBIOS GUID of the node. |
Deployment |
Management Ip Address |
The out-of-band management IP address for the node that you can use for scriptable power control tools such as Intelligent Platform Management Interface (IPMI) scripts. For example, this can be set to the IP address for the Baseboard Management Controller (BMC) of the compute node. For more information, see Scriptable Power Control Tools. This property value can be set by the HPC cluster administrator. |
Deployment |
Memory |
The amount of memory installed on the node. |
Cores/memory/disk |
Memory Paging (Hard Faults/second) |
The number of hard page faults per second. A hard fault occurs when part of a program is no longer in main memory because it has been swapped out to the paging file, so the system must read it back from the hard disk. Frequent hard faults cause slowdowns and increased hard disk activity. Excessive hard faults can lead to hard disk thrashing (when a program stops responding, but the hard drive continues to run for an extended period). |
Cores/memory/disk |
Name |
The name of the node, including the domain. For example, DOMAIN\nodename. For Windows Azure nodes, this name is AZURE\nodename. |
Deployment |
NetBoot MAC Address |
The MAC address of the network adapter that is bound to the Private network. This is the network that is used when deploying an operating system image to the node (PXE boot). |
Deployment |
Network Usage (Bytes/second) |
An indication of the total network throughput for all networks on a node. This does not include NetworkDirect traffic, because NetworkDirect bypasses TCP/IP. |
Network |
Node Health |
The overall indication of node health. Indicates whether or not there are any warnings or errors that the HPC services are aware of on that node, if the node is performing an operation that was initiated by the HPC cluster administrator, or if the node has not been added to the cluster. For information about node health values, see Understanding Node States, Health, and Operations. |
Status/workload |
Node Name |
The name of the node. For nodes that are deployed from bare metal, this name is automatically assigned according to the node naming series that the HPC cluster administrator defines in the node template. For Windows Azure nodes, the name starts with “AzureCN-” followed by a number. For example, AzureCN-0001. |
Deployment |
Node Role |
The node roles that are enabled for the node. Dedicated, on-premises nodes can have more than one role enabled, depending on what roles are installed (installed roles are listed in the Installed Service Roles property). Possible values:
The head node role is not displayed in this property. Note The Unmanaged Server Node role is available starting with HPC Pack 2008 R2 with Service Pack 3 (SP3). Note The Windows Azure Work Node role is available starting with HPC Pack 2008 R2 with Service Pack 1 (SP1). The Windows Azure Virtual Machine Node role is available starting with HPC Pack 2008 R2 with Service Pack 2 (SP2). For more information, see Understanding Node Roles in Microsoft HPC Pack. |
Status/workload |
Node State |
The node’s deployment state, or whether or not an administrator wants the node to be available as a resource for cluster jobs (Online or Offline). For information about node state values, see Understanding Node States, Health, and Operations. |
Status/workload |
Node Template |
The name of the node template that was used to deploy the node or to join the node to the cluster. |
Deployment |
OS Architecture |
The operating system architecture on the node. |
Deployment |
OS Version |
The operating system version on the node. |
Deployment |
Primary HeadNode |
For a head node that is configured for high availability in a failover cluster, the initial head node computer on which HPC Pack is installed has a value set to True for this property. Warning This property is removed starting with HPC Pack 2012. |
Status/workload |
Private IP |
The IP address for the network adapter that is bound to the Private network. |
Network |
Private Link Speed |
The link speed for the network adapter that is bound to the Private network. |
Network |
Private Link State |
The link state for the network adapter that is bound to the Private network. If your cluster topology does not include a Private network, or if the node is not connected to this network, the value appears as Disconnected. Possible values are Connected and Disconnected. This value is periodically updated by the HPC Management Service during the discovery operation. |
Network |
Private NetworkDirect |
Whether or not a NetworkDirect provider is installed for the Private network. Possible values are True and False. This value is periodically updated by the HPC Management Service. |
Network |
Processors |
Name and properties of the processors that are installed on the node. |
Cores/memory/disk |
Product Key |
The Windows product key that will be used to activate the operating system on the node. This property value can be specified by the HPC cluster administrator. |
Deployment |
Progress |
The most recent deployment log entry during deployment or provisioning operations. You can sort by this column to help monitor deployment progress. |
Deployment |
Provisioned |
Whether or not HPC Pack is installed on the node. Possible values are True and False. Note If you assign a node template that includes steps to deploy an operating system and this property is True, only the tasks in the Maintenance phase of the node template will run. If you want to reinstall the operating system, you can assign the template, then run the Reimage action. |
Deployment |
Running Jobs |
The number of jobs that are currently using this node. |
Status/workload |
Running Tasks |
The number of tasks, subtasks, or task processes (such as an MPI rank) that are currently using this node. The number can be higher than the number of physical cores or sockets if the subscribed cores or sockets properties are set on the node. |
Status/workload |
Service Health |
The overall indication of the health of the HPC services. Indicates whether or not there are any warnings or errors that the HPC services are aware of on that node. |
Status/workload |
Sockets |
The number of physical sockets on the node. |
Cores/memory/disk |
Subscribed Cores |
The number of logical cores that the HPC Job Scheduler Service will use when it is allocating tasks to the node. It can be larger or smaller than the number of physical cores. Note: The “Cores In Use” metric reflects how many physical cores are in use. The “Running Tasks” metric can help you monitor how many subscribed cores are in use. This value is set by the HPC cluster administrator. For more information, see Over-subscribe or under-subscribe core or socket counts on cluster nodes. |
Cores/memory/disk |
Subscribed Sockets |
The number of logical sockets that the HPC Job Scheduler Service will use when it is allocating tasks to the node. It can be larger or smaller than the number of physical sockets. This value is set by the HPC cluster administrator. For more information, see Over-subscribe or under-subscribe core or socket counts on cluster nodes. |
Cores/memory/disk |
System Calls / second |
A measure of the number of calls made to system components (kernel-mode services). This indicates how busy the system is managing applications and services. When compared to Interrupts/sec, it can indicate whether processor issues are hardware or software related. |
Cores/memory/disk |
UnattendSetup |
Whether or not setup.exe ran with the -unattend flag. |
Deployment |
Version |
The version number of HPC Pack that is installed on the node. For example:
|
Deployment |
Windows Azure Instance Name |
The computer name of the Windows Azure role instance. This value is assigned by Windows Azure. |
Azure |
Windows Azure Node Address |
The IP address of the Windows Azure node. This value is assigned by Windows Azure. For a list of the public IP ranges, see the posted IP Ranges. |
Azure |
Windows Azure Node Size |
The size of the Windows Azure node instance. The size determines the number of CPU cores, memory capacity, and disk space as defined by Windows Azure. This value is specified by the HPC cluster administrator when adding Windows Azure nodes to the cluster. |
Azure |
Windows Azure Service Name |
The public name of the hosted service (in the Windows Azure subscription) in which this Windows Azure node is deployed. This value is defined by the HPC cluster administrator in the node template. |
Azure |
Windows Azure Storage Service Name |
The public name of the storage account (in the Windows Azure subscription) that is associated with the Windows Azure node. This value is defined by the HPC cluster administrator in the node template. |
Azure |
Windows Azure Subscription ID |
The unique ID for the Windows Azure subscription account associated with the Windows Azure node. This value is defined by the HPC cluster administrator in the node template. |
Azure |
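The CPU Usage (%) entry in the table above describes a ratio: combined user and system time across all physical cores, divided by the sampling interval times the number of physical cores. The following Python sketch works through that arithmetic. It is an illustration only: the function name `cpu_usage_percent` and its parameters are hypothetical and are not part of HPC Pack, the inputs are assumed to be in seconds, and the ratio is multiplied by 100 to express it as a percentage.

```python
def cpu_usage_percent(busy_seconds_per_core, interval_seconds):
    """Return CPU usage as a percentage for one sampling interval.

    busy_seconds_per_core: user + system time (in seconds) accumulated by
        each physical core during the interval (hypothetical input format).
    interval_seconds: length of the sampling interval, in seconds.
    """
    core_count = len(busy_seconds_per_core)
    if core_count == 0 or interval_seconds <= 0:
        raise ValueError("need at least one core and a positive interval")
    total_busy = sum(busy_seconds_per_core)
    # Busy time divided by (interval x core count), scaled to a percentage.
    return 100.0 * total_busy / (interval_seconds * core_count)


# Four cores sampled over 2 seconds; the busy times are illustrative values.
usage = cpu_usage_percent([2.0, 1.0, 0.5, 0.5], 2.0)
print(f"{usage:.1f}%")  # prints "50.0%"
```

In this example one core is fully busy and the others are partly idle, so the node as a whole reports half of its available core time in use.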
Node properties and metrics by conceptual categories
The following lists group the properties and metrics by functional categories so that you can quickly identify what values are available for different aspects of the cluster. These lists can help you select which values to display in custom node views to help monitor different aspects of cluster performance. In the following lists, the names of metrics and of node properties that reflect node status are denoted by bold font.
Cores/memory/disk
Processors
Cores
Sockets
Cores In Use
CPU Usage (%)
Context Switches / second
System Calls / second
Affinity
Subscribed Cores
Subscribed Sockets
Memory
Available Physical Memory (MBytes)
Memory Paging (Hard Faults/second)
Free Disk Space (%)
Disk Queue Length
Disk Throughput (Bytes/sec)
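The description of Available Physical Memory (MBytes) states that the value is the sum of the Zeroed, Free, and Standby memory lists. The short Python sketch below shows that sum. The function name and the byte-valued inputs are hypothetical, and the conversion assumes that one megabyte is 1024 × 1024 bytes.

```python
BYTES_PER_MB = 1024 * 1024  # assumed definition of a megabyte


def available_mbytes(zeroed_bytes, free_bytes, standby_bytes):
    """Sum the three memory lists and convert the total to whole megabytes."""
    total_bytes = zeroed_bytes + free_bytes + standby_bytes
    return total_bytes // BYTES_PER_MB


# Illustrative values: 1 MB zeroed, 2 MB free, 5 MB standby.
print(available_mbytes(1 * BYTES_PER_MB, 2 * BYTES_PER_MB, 5 * BYTES_PER_MB))  # prints 8
```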
Status/workload
Node State
Node Health
Node Role
Groups
Primary HeadNode
Service Health
Idle
Running Jobs
Running Tasks
SOA
Durable Queues Total Bytes
Durable Queues Total Messages
Durable Requests Queue Length
Durable Responses Queue Length
HPC SOA Calculations/Sec
HPC SOA Faults/Sec
HPC SOA Requests/Sec
HPC SOA Responses/Sec
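The Durable Queues Total Bytes description notes that the default MSMQ storage limit is 8 GB and that durable sessions stop working when the quota is reached. A monitoring script could compare the metric with that limit, as in the following Python sketch. The function name, the status strings, and the 80 percent warning threshold are hypothetical choices made for illustration, and "8 GB" is assumed to mean 8 × 1024³ bytes; none of these come from HPC Pack itself.

```python
DEFAULT_MSMQ_LIMIT_BYTES = 8 * 1024**3  # assumes "8 GB" means 8 x 1024^3 bytes


def durable_queue_status(total_bytes,
                         limit_bytes=DEFAULT_MSMQ_LIMIT_BYTES,
                         warn_fraction=0.8):
    """Classify a Durable Queues Total Bytes reading against the MSMQ limit.

    warn_fraction is a hypothetical threshold, not an HPC Pack setting.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    fraction_used = total_bytes / limit_bytes
    if fraction_used >= 1.0:
        return "quota reached"  # durable sessions stop working at this point
    if fraction_used >= warn_fraction:
        return "warning"
    return "ok"


# 7 x 1024^3 bytes is 87.5% of the assumed default limit.
print(durable_queue_status(7 * 1024**3))  # prints "warning"
```

Passing a different `limit_bytes` lets the same check follow a cluster where the administrator has changed the MSMQ quota.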
Network
DNS Name
Domain Name
Enterprise IP
Enterprise Link Speed
Enterprise Link State
Enterprise NetworkDirect
Private IP
Private Link Speed
Private Link State
Private NetworkDirect
Application IP
Application Link Speed
Application Link State
Application NetworkDirect
Network Usage (Bytes/second)
Deployment
Name
Node Name
Node Template
Description
Location
Machine Guid
NetBoot MAC Address
Boot Information
Install Path
Version
Installed Service Roles
OS Architecture
OS Version
Product Key
Management Ip Address
LUN Mapping
Provisioned
UnattendSetup
Progress
Azure
Size
Windows Azure Instance Name
Windows Azure Node Address
Windows Azure Node Size
Windows Azure Service Name
Windows Azure Storage Service Name
Windows Azure Subscription ID
Additional considerations
HPC Pack 2008 R2 SP1 additions
The following properties or metrics were added in Service Pack 1 of HPC Pack 2008 R2. These changes are related to the ability to add Windows Azure nodes to the cluster. For more information, see Deploying Azure Nodes with Microsoft HPC Pack [RETIRED].
Size
Windows Azure Node Address
Windows Azure Service Name
Windows Azure Storage Service Name
Windows Azure Subscription ID
HPC Pack 2008 R2 SP2 additions
The following properties or metrics were added in Service Pack 2 of HPC Pack 2008 R2. These changes are related to the ability to oversubscribe and undersubscribe nodes. For more information, see Over-subscribe or under-subscribe core or socket counts on cluster nodes.
Affinity
Subscribed Cores
Subscribed Sockets
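Because Subscribed Cores and Subscribed Sockets can be larger or smaller than the physical counts, it can be useful to label each node accordingly when reviewing a node list. The Python sketch below shows one way to do that comparison. The function name and the returned labels are hypothetical, and the counts are assumed to have been read from the Cores and Subscribed Cores (or Sockets and Subscribed Sockets) properties by some other means.

```python
def subscription_mode(physical_count, subscribed_count):
    """Compare a subscribed core or socket count with the physical count."""
    if physical_count <= 0 or subscribed_count <= 0:
        raise ValueError("counts must be positive")
    if subscribed_count > physical_count:
        return "over-subscribed"
    if subscribed_count < physical_count:
        return "under-subscribed"
    return "matched"


# Illustrative values: a node with 8 physical cores and 16 subscribed cores.
print(subscription_mode(8, 16))  # prints "over-subscribed"
```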