Perform an HPC Database Synchronization
Updated: July 2011
Applies To: Windows HPC Server 2008 R2
Windows® HPC Server 2008 R2 includes a restore mode for the HPC Job Scheduler service, which can be configured by setting the Restore registry key to 1. When you restore the HPC databases, you must configure the cluster to enter restore mode before you restart the HPC Job Scheduler service, and follow several other steps to help synchronize the HPC databases and to return the system to a stable state.
Tip |
---|
In a cluster running at least Windows HPC Server 2008 R2 with Service Pack 2, you can configure the cluster to enter restore mode by running the Set-HPCClusterProperty HPC PowerShell command with the -RestoreMode parameter, instead of manually setting a registry key. |
Important |
---|
Because of the variety of cluster deployment options, including options to configure the head node for high availability in a failover cluster, and the different restoration scenarios that are possible, you should use the steps to start the HPC Job Scheduler service in restore mode that are appropriate for your cluster configuration and situation. |
In this section:
Overview: Bringing the cluster to a consistent state during a database restore
What happens when the HPC Job Scheduler service starts in restore mode?
Start the job scheduler in restore mode
To start the job scheduler in restore mode during a full-system restore
To start the job scheduler in restore mode during a database restore on a single head node
To start the job scheduler in restore mode during a database restore on a head node that is configured for high availability in a failover cluster
Verify the restore operations and bring the cluster to a stable state
Filter and sort the job list to see the jobs that were canceled during restore mode
Delete the message queue on WCF broker nodes
Overview: Bringing the cluster to a consistent state during a database restore
After you restore HPC databases from a backup, the job queue in the restored databases will not be consistent with what is running on the cluster. The databases will contain jobs in the state they were in when the backup was made. Many of those jobs may already have finished. Additionally, the compute nodes may be running jobs that were submitted after the backup was made. The restored databases will have no records for these jobs.
When you restore the HPC databases, you need to perform additional steps to help return the cluster to a consistent state. The following procedures describe these additional steps. The exact steps for restoring your system or your databases depend on your backup and restore solution (for example, Windows Server Backup, SQL Server Backup, Data Protection Manager, or non-Microsoft solutions).
To restore the HPC databases, you need to:
Have backups of the HPC databases.
Understand what the HPC Job Scheduler service does in restore mode.
Know the steps for restoring the databases according to the backup method that you used, and know the location where you saved the backups. For more information, consult the documentation for the backup solution that you used.
Start the HPC Job Scheduler service in restore mode.
Verify the restore operations and bring the cluster to a stable state after a database restore.
Decide how to handle jobs that were canceled by the HPC Job Scheduler service during restore mode.
Delete the message queue on the Windows Communication Foundation (WCF) broker nodes.
What happens when the HPC Job Scheduler service starts in restore mode?
Every time the HPC Job Scheduler service restarts, it checks the Restore registry key. If the key has a value of 1, then the HPC Job Scheduler service starts in restore mode. After the HPC Job Scheduler starts in restore mode, to help to bring the system to a consistent state, the service cancels all jobs in the database that are in the Submitted, Validating, Queued, or Running states. The scheduler stops all tasks that are actually running on the compute nodes (nodes periodically send status information to the head node about the jobs and tasks that are running, so even tasks that do not have records in the database are stopped).
In Event Viewer, you can see warning events from the SchedulerService that indicate that the service has entered restore mode, how many jobs were canceled in each state, and that the restore is complete. You will also see a warning event for each unrecognized task that the HPC Job Scheduler service stopped (tasks that were running on the cluster, but for which the restored database has no records).
After the HPC Job Scheduler service completes the restore mode steps, it clears the Restore key in the registry, writes a warning event to the system event log to indicate that the restore is complete, and then starts scheduling jobs again. This means that if users submit jobs right after the restore, the HPC Job Scheduler service will attempt to run them.
At this point, the HPC job scheduling database contains three categories of jobs:
Jobs that were Finished, Canceled, or Configuring when the backup was made. These jobs have not been changed.
Jobs that were Submitted, Validating, Queued, or Running when the backup was made. These jobs are now Canceled.
New jobs, in any state, that users submitted after the HPC Job Scheduler service completed the restore mode steps.
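The state changes described above can be modeled in a few lines. The following Python sketch is an illustration only and is not part of HPC Pack. The job IDs are hypothetical, and the set of canceled states is taken directly from the text.

```python
# Illustrative model only; not part of HPC Pack.
# States that the text says are canceled when the scheduler starts in restore mode.
CANCELED_IN_RESTORE = {"Submitted", "Validating", "Queued", "Running"}


def apply_restore_mode(jobs):
    """Return a new {job_id: state} mapping after the restore-mode pass.

    `jobs` maps a job ID to the state recorded in the restored database.
    """
    return {
        job_id: "Canceled" if state in CANCELED_IN_RESTORE else state
        for job_id, state in jobs.items()
    }


# Hypothetical job IDs and states as they might appear in a restored backup.
restored = {101: "Finished", 102: "Running", 103: "Queued", 104: "Configuring"}
after = apply_restore_mode(restored)
print(after)
```

Jobs that were Finished, Canceled, or Configuring pass through unchanged, which matches the first category in the list above.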
Start the job scheduler in restore mode
When you restore the HPC databases, you must set the Restore registry key before restarting the HPC Job Scheduler service. The following procedures are available:
To start the job scheduler in restore mode during a full-system restore
To start the job scheduler in restore mode during a database restore on a single head node
To start the job scheduler in restore mode during a database restore on a head node that is configured for high availability in a failover cluster
To start the job scheduler in restore mode during a full-system restore
After you perform the full-system restore on the head node, start the head node in safe mode.
Set the Restore registry key to indicate to the HPC Job Scheduler service that it should enter restore mode when it restarts by doing one of the following:
In a cluster running at least Windows HPC Server 2008 R2 with SP2, run the Set-HPCClusterProperty PowerShell command with the -RestoreMode parameter:
Start HPC PowerShell. Click Start, point to All Programs, click Microsoft HPC Pack 2008 R2, right-click HPC PowerShell, and then click Run as administrator.
Type the following command:
Set-HPCClusterProperty -RestoreMode:$true
Otherwise, manually set the registry key by typing the following command at an elevated command prompt:
reg add HKLM\Software\Microsoft\HPC /v Restore /t REG_DWORD /d 1 /f
Caution Incorrectly editing the registry may severely damage your system. Before making changes to the registry, back up any valued data on the computer.
Restart the head node in normal mode.
Continue to Verify the restore operations and bring the cluster to a stable state.
To start the job scheduler in restore mode during a database restore on a single head node
Close all instances of HPC Cluster Manager.
Caution Do not continue if HPC Cluster Manager is running. If you restore the HPC databases while HPC Cluster Manager is open, you may not be able to perform node operations in HPC Cluster Manager after you restore the databases.
On the head node, stop and disable the HPC services as follows:
Open an elevated Command Prompt window.
To open an elevated Command Prompt window, click Start, point to All Programs, click Accessories, right-click Command Prompt, and then click Run as administrator.
At the elevated command prompt, type the following commands to stop and disable the HPC services:
sc config hpcscheduler start= disabled
sc config hpcmanagement start= disabled
sc config hpcreporting start= disabled
sc config hpcsdm start= disabled
sc config hpcdiagnostics start= disabled
net stop hpcscheduler
net stop hpcmanagement
net stop hpcreporting
net stop hpcsdm
net stop hpcdiagnostics
If you previously deployed Windows Azure nodes and the state of the nodes changed after you backed up the HPC databases, use the Windows Azure Management Portal to stop the deployment in the Windows Azure hosted service.
To stop the deployment in Windows Azure
In the navigation pane of the Management Portal, click Hosted Services, Storage Accounts & CDN.
Click Hosted Services.
In the group for the hosted service that you used to deploy the Windows Azure nodes, click the deployment. This has the name Deployment for <HostedServiceName>, where <HostedServiceName> is the name of the hosted service.
On the ribbon, in the Deployments group, click Stop. The status of the service deployment changes to Stopped.
If you have not already done so, restore and replace (overwrite) the HPC databases. The exact steps for restoring the databases depend on the backup method that you used, and the location where you saved the backups. For more information, consult the documentation for your backup solution.
For example, if you used SQL Server Management Studio to create a backup, you can right-click each database in SQL Server Management Studio, then click Restore to start the database restore process.
Set the Restore registry key to indicate to the HPC Job Scheduler service that it should enter restore mode when it restarts by doing one of the following:
In a cluster running at least Windows HPC Server 2008 R2 with SP2, run the Set-HPCClusterProperty PowerShell command with the -RestoreMode parameter:
Start HPC PowerShell. Click Start, point to All Programs, click Microsoft HPC Pack 2008 R2, right-click HPC PowerShell, and then click Run as administrator.
Type the following command:
Set-HPCClusterProperty -RestoreMode $true
Otherwise, manually set the registry key by typing the following command at an elevated command prompt:
reg add HKLM\Software\Microsoft\HPC /v Restore /t REG_DWORD /d 1 /f
Caution Incorrectly editing the registry may severely damage your system. Before making changes to the registry, back up any valued data on the computer.
Enable and start the HPC services by typing the following commands at an elevated command prompt:
sc config hpcscheduler start= auto
sc config hpcmanagement start= auto
sc config hpcreporting start= auto
sc config hpcsdm start= auto
sc config hpcdiagnostics start= auto
net start hpcsdm
net start hpcscheduler
net start hpcmanagement
net start hpcreporting
net start hpcdiagnostics
Note
- It may take more than 30 seconds to start the hpcscheduler service. If this happens, you may see a timeout error message. This message is only informational, and it can be safely ignored.
- After the hpcscheduler service starts, it clears the Restore key in the registry.
Continue to Verify the restore operations and bring the cluster to a stable state.
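The stop-and-disable and enable-and-start command sequences in the single head node procedure above follow a regular pattern. As a rough sketch, the following Python helper (hypothetical, not part of HPC Pack) generates the same command lines. It only builds the strings and does not run anything. The service names and the start order, with hpcsdm started first, are copied from the procedure.

```python
# Service names and command forms are taken from the procedure above.
SERVICES = ["hpcscheduler", "hpcmanagement", "hpcreporting", "hpcsdm", "hpcdiagnostics"]
# Start order used in the procedure: hpcsdm is started before the other services.
START_ORDER = ["hpcsdm", "hpcscheduler", "hpcmanagement", "hpcreporting", "hpcdiagnostics"]


def stop_and_disable_commands():
    """Commands that disable every HPC service, then stop each one."""
    disable = [f"sc config {name} start= disabled" for name in SERVICES]
    stop = [f"net stop {name}" for name in SERVICES]
    return disable + stop


def enable_and_start_commands():
    """Commands that set every HPC service to automatic start, then start each one."""
    enable = [f"sc config {name} start= auto" for name in SERVICES]
    start = [f"net start {name}" for name in START_ORDER]
    return enable + start


for line in stop_and_disable_commands() + enable_and_start_commands():
    print(line)
```

Note the space after `start=` in the `sc config` lines. The sc command expects the value as a separate argument, so the space is kept exactly as in the procedure.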
To start the job scheduler in restore mode during a database restore on a head node that is configured for high availability in a failover cluster
Close all instances of HPC Cluster Manager.
Caution Do not continue if HPC Cluster Manager is running. If you restore the HPC databases while HPC Cluster Manager is open, you may not be able to perform node operations in HPC Cluster Manager after you restore the databases.
Stop and disable the HPC services by doing the following:
In Failover Cluster Manager, in the resource group for the failover cluster, take the following resources offline: hpcscheduler, hpcsdm, hpcdiagnostics, and hpcsession.
On each head node computer, open an elevated Command Prompt window. To open an elevated Command Prompt window, click Start, point to All Programs, click Accessories, right-click Command Prompt, and then click Run as administrator.
At the elevated command prompt, type the following commands:
sc config hpcmanagement start= disabled
sc config hpcreporting start= disabled
net stop hpcmanagement
net stop hpcreporting
If you previously deployed Windows Azure nodes and the state of the nodes changed after you backed up the HPC databases, use the Windows Azure Management Portal to stop the deployment in the Windows Azure hosted service.
To stop the deployment in Windows Azure
In the navigation pane of the Management Portal, click Hosted Services, Storage Accounts & CDN.
Click Hosted Services.
In the group for the hosted service that you used to deploy the Windows Azure nodes, click the deployment. This has the name Deployment for <HostedServiceName>, where <HostedServiceName> is the name of the hosted service.
On the ribbon, in the Deployments group, click Stop. The status of the service deployment changes to Stopped.
If you have not already done so, restore and replace (overwrite) the HPC databases. The exact steps for restoring the databases depend on the backup method that you used, and the location where you saved the backups. For more information, consult the documentation for the backup solution that you used.
For example, if you used SQL Server Management Studio to create a backup, you can right-click each database in SQL Server Management Studio, then click Restore to start the database restore process.
On the first head node on which you will enable and start the HPC services, set the Restore registry key to indicate to the HPC Job Scheduler service that it should enter restore mode when it restarts by doing one of the following:
If your cluster is running at least Windows HPC Server 2008 R2 with SP2, run the Set-HPCClusterProperty PowerShell command with the -RestoreMode parameter:
Start HPC PowerShell. Click Start, point to All Programs, click Microsoft HPC Pack 2008 R2, right-click HPC PowerShell, and then click Run as administrator.
Type the following command:
Set-HPCClusterProperty -RestoreMode:$true
Otherwise, manually set the registry key by typing the following command at an elevated command prompt:
reg add HKLM\Software\Microsoft\HPC /v Restore /t REG_DWORD /d 1 /f
Caution Incorrectly editing the registry may severely damage your system. Before making changes to the registry, back up any valued data on the computer.
Make the head node on which you set the Restore registry key the active head node in the failover cluster.
Enable and start the HPC services on the active head node by doing the following:
In Failover Cluster Manager, in the resource group for the failover cluster, bring the following resources online: hpcscheduler, hpcsdm, hpcdiagnostics, and hpcsession.
At an elevated command prompt on each head node computer (starting first on the active head node), type the following commands:
sc config hpcmanagement start= auto
sc config hpcreporting start= auto
net start hpcmanagement
net start hpcreporting
Note
- It may take more than 30 seconds to start the hpcscheduler service. If this happens, you may see a timeout error message. This message is only informational, and it can be safely ignored.
- After the hpcscheduler service starts, it clears the Restore key in the registry.
Continue to Verify the restore operations and bring the cluster to a stable state.
Verify the restore operations and bring the cluster to a stable state
The following procedure describes how to check the event log for restore operations and bring the cluster nodes to a stable state.
To verify the restore operations and bring the cluster to a stable state
On the head node, open Event Viewer.
Click Start, point to Administrative Tools, then click Event Viewer.
In Event Viewer, check the following:
Verify that the HPC Scheduler Service started in restore mode. You should see the following Warning event:
{Warning} [SchedulerService] The scheduler has started in restore mode.
Review the Warning events from the SchedulerService indicating how many jobs were canceled in each state. You should see a list of events similar to the following:
{Warning} [SchedulerService] 5 Running jobs were canceled during restore.
{Warning} [SchedulerService] 5 Queued jobs were canceled during restore.
{Warning} [SchedulerService] 1 Submitted jobs were canceled during restore.
{Warning} [SchedulerService] 0 Validating jobs were canceled during restore.
Review the Warning events for each unrecognized task that the HPC Job Scheduler service stopped (tasks that were running on the cluster, which the restored database does not have records for). You should see a list of events similar to the following:
{Warning} [RC] Task 27.137 is not running node R25-1234A1234 any more. Tries to cancel it
Verify that the HPC Job Scheduler service completed the restore mode steps. You should see the following Warning event:
{Warning} [SchedulerService] Scheduler restore complete.
Restart all the compute nodes, broker nodes, and workstation nodes. If you have not made any configuration changes since the backup, such as node deployment changes, then the restored database should still be aware of all the nodes that are joined to the cluster. If that is the case, you can use the clusrun command to restart all the nodes. If you made configuration changes since the last backup, you may need to manually restart the nodes. In a disaster recovery situation, you will need to redeploy the nodes by using an appropriate deployment method.
To restart all the nodes using the clusrun command, at an elevated command prompt, type:
clusrun /all shutdown -r
The -r parameter of the shutdown command indicates that the computers should restart after shutting down.
Important If the head node has the compute node role or the broker node role enabled, running this command also restarts the head node. If you do not want to do this, you should, at a minimum, restart the hpcmanagement and hpcnodemanager services on the other cluster nodes.
If there are Windows Azure nodes that are online and have a health state of Error, in HPC Cluster Manager, manually stop the Windows Azure nodes.
Warning Ensure that you have already stopped the deployment in the Windows Azure hosted service, as outlined in a previous step. If you do not stop the deployment first in the Windows Azure Management Portal, you will be unable to stop the Windows Azure nodes by using HPC Cluster Manager.
Note If the nodes are deployed by using a node template that includes a policy to start and stop the nodes automatically, you should first edit the node template to configure a policy to start and stop the Windows Azure nodes manually. Then stop the Windows Azure nodes. When the Windows HPC cluster reaches a stable state, you can restart the Windows Azure nodes.
Verify the health of your cluster.
Open HPC Cluster Manager, and run all the diagnostic tests. Go to Node Management to check the state and health of your compute nodes. If you see any errors or warnings, use the test result messages and the operations log to help you troubleshoot and resolve issues.
Use the information in the job queue to help you determine whether or not to requeue any of the canceled jobs. For examples of how to sort the job list in HPC Cluster Manager and use HPC PowerShell, see Filter and sort the job list to see the jobs that were canceled during restore mode.
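If you export the Warning events to text, the Event Viewer checks in this procedure can be summarized programmatically. The following Python sketch is a minimal illustration, assuming the message formats match the example events shown above. Actual event text may differ between versions, and the function name is hypothetical.

```python
import re

# Message formats are copied from the example events in this procedure;
# real event text may differ between versions.
CANCELED_RE = re.compile(r"(\d+) (\w+) jobs were canceled during restore\.")
START_MSG = "The scheduler has started in restore mode."
DONE_MSG = "Scheduler restore complete."


def summarize_restore_events(messages):
    """Report whether restore mode started and completed, and count canceled jobs per state."""
    summary = {"started": False, "completed": False, "canceled": {}}
    for message in messages:
        if START_MSG in message:
            summary["started"] = True
        elif DONE_MSG in message:
            summary["completed"] = True
        else:
            match = CANCELED_RE.search(message)
            if match:
                summary["canceled"][match.group(2)] = int(match.group(1))
    return summary


events = [
    "{Warning} [SchedulerService] The scheduler has started in restore mode.",
    "{Warning} [SchedulerService] 5 Running jobs were canceled during restore.",
    "{Warning} [SchedulerService] 5 Queued jobs were canceled during restore.",
    "{Warning} [SchedulerService] 1 Submitted jobs were canceled during restore.",
    "{Warning} [SchedulerService] 0 Validating jobs were canceled during restore.",
    "{Warning} [SchedulerService] Scheduler restore complete.",
]
summary = summarize_restore_events(events)
print(summary)
```

A summary in which `started` is true but `completed` is false would suggest that the HPC Job Scheduler service has not yet finished the restore mode steps.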
Filter and sort the job list to see the jobs that were canceled during restore mode
After restoring the HPC databases, you need to decide how to handle the jobs that were canceled while the HPC Job Scheduler service was in restore mode. You can use the information in the job queue to help you determine whether to requeue any of the canceled jobs. For example, the following scripts and procedure show how you can sort and filter the job list by using HPC PowerShell or HPC Cluster Manager.
Note |
---|
In Windows HPC Server 2008 R2, if jobs appear stuck in the Canceling state after you restore the database, you can force cancel the jobs. For example, to force cancel all jobs that are in the Canceling state, use the following HPC PowerShell cmdlet:
Get-HpcJob -State Canceling | Stop-HpcJob -Force |
To filter and sort the job list in HPC PowerShell
To list the jobs that were canceled during restore mode; view only the Owner, Priority, ID, SubmitTime, StartTime, and RunTime job properties; sort the output by Owner, Priority, and Submit time; and view the output in table format, use the following script:
Get-HpcJob -State Canceled | where {$_.Error -like "*The scheduler is in restoration*"} | select -Property Owner, Priority, ID, SubmitTime, StartTime, RunTime | sort Owner, Priority, SubmitTime | ft
Alternatively, to group the output from the previous script by job owner, use the following script:
Get-HpcJob -State Canceled | where {$_.Error -like "*The scheduler is in restoration*"} | select -Property Owner, Priority, ID, SubmitTime, StartTime, RunTime | sort Owner, Priority, SubmitTime | ft -GroupBy Owner
To requeue all the canceled jobs that were submitted within the last 24 hours and that have a priority of Highest, use the following script:
$yesterday = [datetime]::Now.AddDays(-1)
Get-HpcJob -State Canceled | where {($_.SubmitTime -gt $yesterday) -and ($_.Priority -eq "Highest")} | Submit-HpcJob
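The selection logic in these scripts can also be tried on plain in-memory records, without a cluster. The following Python sketch is an illustration only. The job records, owner names, and the "Canceled by user" error text are hypothetical, and the field names simply mirror the job properties used in the scripts above.

```python
from datetime import datetime, timedelta

# Substring that the scripts above match in the job's error text.
RESTORE_MARKER = "The scheduler is in restoration"


def canceled_during_restore(jobs):
    """Keep only Canceled jobs whose error text carries the restore marker."""
    return [
        j for j in jobs
        if j["State"] == "Canceled" and RESTORE_MARKER in j["Error"]
    ]


def sort_for_review(jobs):
    """Sort by Owner, Priority, then SubmitTime (Priority is compared as text here)."""
    return sorted(jobs, key=lambda j: (j["Owner"], j["Priority"], j["SubmitTime"]))


def requeue_candidates(jobs, now):
    """Restore-canceled jobs with priority Highest that were submitted in the last 24 hours."""
    cutoff = now - timedelta(days=1)
    return [
        j for j in canceled_during_restore(jobs)
        if j["SubmitTime"] > cutoff and j["Priority"] == "Highest"
    ]


# Hypothetical records; a fixed "now" keeps the example reproducible.
now = datetime(2011, 7, 15, 12, 0)
jobs = [
    {"ID": 1, "Owner": "alice", "Priority": "Normal", "State": "Canceled",
     "SubmitTime": now - timedelta(hours=30), "Error": RESTORE_MARKER},
    {"ID": 2, "Owner": "bob", "Priority": "Highest", "State": "Canceled",
     "SubmitTime": now - timedelta(hours=2), "Error": RESTORE_MARKER},
    {"ID": 3, "Owner": "alice", "Priority": "Highest", "State": "Canceled",
     "SubmitTime": now - timedelta(hours=3), "Error": RESTORE_MARKER},
    {"ID": 4, "Owner": "bob", "Priority": "Highest", "State": "Canceled",
     "SubmitTime": now - timedelta(hours=1), "Error": "Canceled by user"},
]

review_ids = [j["ID"] for j in sort_for_review(canceled_during_restore(jobs))]
requeue_ids = [j["ID"] for j in requeue_candidates(jobs, now)]
print(review_ids, requeue_ids)
```

Unlike the requeue script above, this sketch narrows the requeue candidates to jobs that carry the restore marker, so a job that a user canceled deliberately (job 4 here) is left alone. Whether that extra filter is appropriate depends on your cluster's policies.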
To filter and sort the job list in HPC Cluster Manager
In HPC Cluster Manager, click Job Management.
In Job Management, in the navigation pane, under All Jobs, click Canceled.
Right-click the column headings in the job list, then click Column Chooser.
Use the Column Chooser dialog box to include the following job properties in the list of displayed columns:
Error Message
Owner
Priority
Submit Time
Start Time
Run Time
Click the column headers to sort the job list according to the displayed property values.
Optionally, you can requeue jobs. Select one or more jobs, and then click Requeue Job in the Actions pane.
Delete the message queue on WCF broker nodes
If your HPC cluster contains one or more WCF broker nodes, after you restore the HPC databases, you must delete the HPC message queues in Message Queuing (also known as MSMQ) on all of the WCF broker nodes. If you do not do this, you may be unable to run service-oriented architecture (SOA) jobs on your cluster.
To delete the message queue on a WCF broker node
On a broker node, start Windows PowerShell Modules as an administrator. Click Start, point to Administrative Tools, right-click Windows PowerShell Modules, and then click Run as administrator.
Type the following script:
[System.Reflection.Assembly]::LoadWithPartialName("System.Messaging")
[System.Messaging.MessageQueue]::GetPrivateQueuesByMachine("localhost") | ? {"$($_.FormatName)" -like "*hpc*re*"} | % {[System.Messaging.MessageQueue]::Delete($_.Path)}
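The script deletes every private queue whose format name matches the wildcard pattern `*hpc*re*`. To get a feel for what that pattern selects before you delete anything, the following Python sketch applies an equivalent case-insensitive wildcard match to a few sample names. The queue names are hypothetical and are used only for illustration. Actual queue names on a broker node may look different.

```python
from fnmatch import fnmatchcase

# Wildcard pattern used by the script above.
PATTERN = "*hpc*re*"


def matches_like(name, pattern=PATTERN):
    """Approximate PowerShell's case-insensitive -like operator by lowercasing both sides."""
    return fnmatchcase(name.lower(), pattern.lower())


# Hypothetical queue format names, for illustration only.
names = [
    r"DIRECT=OS:localhost\private$\HPC_12_Requests",
    r"DIRECT=OS:localhost\private$\HPC_12_Responses",
    r"DIRECT=OS:localhost\private$\billing_queue",
]
to_delete = [n for n in names if matches_like(n)]
for name in to_delete:
    print(name)
```

The pattern matches any name that contains "hpc" followed later by "re", so it would also select an unrelated queue that happens to fit. Listing the matches first, as this sketch does, is a cautious way to confirm the selection.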