Understanding the Workstation Node and Unmanaged Server Node Availability Policy
Applies To: Microsoft HPC Pack 2008 R2, Microsoft HPC Pack 2012, Microsoft HPC Pack 2012 R2
Workstation nodes and unmanaged server nodes can be brought online to run jobs, and taken offline again, either manually or automatically. To have these nodes brought online and taken offline automatically, you must specify a weekly availability policy in the node template.
Note
Both workstation nodes and unmanaged server nodes are supported starting in HPC Pack 2008 R2 with SP3. Only workstation nodes are supported in earlier versions of HPC Pack 2008 R2.
The availability policy specifies one or more time periods during each week when the nodes are made available (brought to the online state) to run cluster jobs. You can specify multiple online time blocks each week, for example, every night on weekdays and all day on weekends. At the beginning of each online time block, the cluster automatically brings the workstation nodes and unmanaged server nodes online, and the nodes are immediately available to run jobs that have been submitted to the cluster. At the end of each time block, the nodes are automatically taken offline. Optionally, you can specify a time interval before the end of an online time block during which any jobs running on the workstation nodes and unmanaged server nodes are drained.
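The weekly schedule described above can be modeled in a few lines of Python. This is only an illustrative sketch, not how HPC Pack stores or evaluates the policy: the OnlineBlock type, the minutes-since-Monday encoding, the node_state function, and the example POLICY are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class OnlineBlock:
    """One online time block, in minutes since Monday 00:00 (hypothetical encoding)."""
    start: int
    end: int  # exclusive; must be greater than start


def minute_of_week(moment: datetime) -> int:
    # datetime.weekday() returns 0 for Monday and 6 for Sunday.
    return moment.weekday() * MINUTES_PER_DAY + moment.hour * 60 + moment.minute


def node_state(blocks, moment, drain_minutes=0):
    """Return 'online', 'draining', or 'offline' for the given moment."""
    now = minute_of_week(moment)
    for block in blocks:
        if block.start <= now < block.end:
            # The optional drain interval is the tail of the online block.
            if now >= block.end - drain_minutes:
                return "draining"
            return "online"
    return "offline"


# Example policy (an assumption for illustration): 20:00 to midnight on
# Monday through Thursday, then Friday 20:00 through the end of Sunday.
POLICY = [
    OnlineBlock(day * MINUTES_PER_DAY + 20 * 60, (day + 1) * MINUTES_PER_DAY)
    for day in range(4)
] + [OnlineBlock(4 * MINUTES_PER_DAY + 20 * 60, 7 * MINUTES_PER_DAY)]
```

For example, node_state(POLICY, datetime(2024, 1, 3, 12, 0)) evaluates a Wednesday at noon, which falls outside every block in this example policy. Encoding each block as an offset into the week keeps a block that spans several days, such as the weekend block here, as a single interval.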
If your version of Microsoft® HPC Pack supports it, you can also configure user activity detection settings in the template. The user activity detection settings ensure that, during an online time block, the cluster runs jobs only on workstation nodes and unmanaged server nodes that are not otherwise in use (based on keyboard, mouse, or CPU activity). For more information, see Understanding User Activity Detection.
Interaction of the availability policy with the Task Cancel Grace Period setting
When an automatic availability policy is configured, the workstation nodes and unmanaged server nodes do not start jobs after an online time block ends. However, HPC tasks that are still running at the end of an online time block can continue to run for a short period if the Task Cancel Grace Period setting is configured. The Task Cancel Grace Period cluster property gives applications a period in which to save state information and clean up before exiting (the default period is 15 seconds). The exact time that a task ends depends on whether and how quickly the task responds to the CTRL_BREAK event (the equivalent of the CTRL+BREAK key combination). Tasks that do not process the event exit immediately, while those that do process the event can take as long as the Task Cancel Grace Period to exit gracefully.
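The following minimal sketch shows one way a task could process the CTRL_BREAK event so that it exits well within the grace period. It assumes the task is a Python console process on Windows, where the event is delivered as signal.SIGBREAK. The names stop_requested, on_break, install_break_handler, and run_task are hypothetical, and the fallback to SIGTERM exists only so that the sketch can be tried on other platforms.

```python
import signal
import threading

# Set by the handler, polled by the work loop.
stop_requested = threading.Event()


def on_break(signum, frame):
    # Keep the handler tiny: record the request and let the loop clean up.
    stop_requested.set()


def install_break_handler():
    # signal.SIGBREAK exists only on Windows. SIGTERM is a stand-in elsewhere
    # (an assumption made for illustration, not HPC Pack behavior).
    break_signal = getattr(signal, "SIGBREAK", signal.SIGTERM)
    signal.signal(break_signal, on_break)  # must be called from the main thread
    return break_signal


def run_task(work_items, checkpoint):
    """Process items until finished or a stop is requested.

    checkpoint is a caller-supplied callable that saves state; it receives
    the number of items completed. Returns that same count.
    """
    done = 0
    for _item in work_items:
        if stop_requested.is_set():
            break  # stop between small units of work
        # ... perform one small unit of work here ...
        done += 1
    checkpoint(done)  # save state quickly, well inside the grace period
    return done
```

A task would call install_break_handler() once at startup and then run_task(...) with its own work items and checkpoint function. Keeping each unit of work small matters more than the handler itself, because the loop can only notice the stop request between units.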
Because the Task Cancel Grace Period always starts at the end of the online time block for workstation nodes and unmanaged server nodes, those nodes might continue to run HPC tasks for the duration of the Task Cancel Grace Period (or until the tasks process the CTRL_BREAK event and stop). It is possible for HPC tasks to continue after users have resumed activity on the nodes; however, the time of potential overlap is likely to be short.
Note
The beginning of the Task Cancel Grace Period on workstation nodes is not affected by the configuration of a task draining period in the availability policy.
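The timing relationship described above can be written out as a short calculation. This is a sketch for reasoning about worst-case overlap only. The BlockEndTimeline type and block_end_timeline function are hypothetical names, and the 15-second default is the default Task Cancel Grace Period mentioned earlier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BlockEndTimeline:
    drain_start: datetime      # when the optional drain interval begins
    grace_start: datetime      # when the Task Cancel Grace Period begins
    latest_task_end: datetime  # latest time a task may still be running


def block_end_timeline(block_end, drain_minutes=0, grace_seconds=15):
    """Compute the key times around the end of one online time block."""
    return BlockEndTimeline(
        # Draining happens before the block ends ...
        drain_start=block_end - timedelta(minutes=drain_minutes),
        # ... but the grace period always starts at the block end itself.
        grace_start=block_end,
        latest_task_end=block_end + timedelta(seconds=grace_seconds),
    )
```

For a block that ends at 06:00 with a 30-minute drain interval and the default grace period, draining begins at 05:30, the grace period begins at 06:00, and a task that handles CTRL_BREAK slowly could still be running until 06:00:15. Changing drain_minutes moves only drain_start.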
The following are recommended best practices to avoid inadvertently running HPC tasks on workstation nodes and unmanaged server nodes during unscheduled times if a Task Cancel Grace Period is configured:
Specify as small a value for the Task Cancel Grace Period as possible (for example, a value in seconds, not minutes).
Ensure that the HPC applications that run on the workstation nodes and that use the Task Cancel Grace Period can clean up and exit quickly. Applications that do not exit soon after receiving the CTRL_BREAK event can continue running for as long as the Task Cancel Grace Period.
If supported by your version of HPC Pack, configure user activity detection settings in the availability policy. These settings help ensure that HPC tasks run with Below Normal priority on the workstations, and that the tasks relinquish the system as soon as user activity is detected on the workstations.
Additional considerations
Workstation nodes and unmanaged server nodes that are configured to be brought online and offline according to a weekly availability policy cannot be brought online or offline manually. To configure these nodes to be brought online and offline manually, you must assign to them a different workstation node template, or you must modify their current workstation node template.
Changes made to a node template affect all workstation nodes and unmanaged server nodes to which the template is assigned.
If you want to have different availability policies for different groups of workstation nodes and unmanaged server nodes, create a different node template to apply to each group.
See Also
Understanding Node States, Health, and Operations
Understanding User Activity Detection
Task cancel grace period