HPC Pack 2019: slowness on the scheduler due to the DBs
Hi everyone,
We are posting here to get some insight into what is happening on our Windows HPC cluster. We have an on-premises cluster of Windows Server 2019 servers, with 3 head nodes and 136 compute nodes, running HPC Pack 2019 with built-in HA and the DBs hosted externally on Always On instances.
We find that job validation takes too long (it can take 5 minutes even for the simplest task), although the job does eventually get submitted. In the logs we find exceptions like the one below, and simple scheduler queries seem to take tens of seconds to complete. However, we have checked the DBs and they do not seem to be overloaded on CPU, memory, or disk. All HPC Pack DBs (Management, Scheduler, Monitoring, Reporting, etc.) are hosted on an instance with 16 vCPUs and 64 GB of RAM (which is replicated in Always On mode).
- In your experience, is this size appropriate?
- Can you think of any other parameter that may be slowing down the queries?
Any help or tip would be greatly appreciated.
Best Regards,
Alberto.
at Microsoft.Hpc.Scheduler.Store.LocalRowEnumerator.SetProperties(StoreProperty[] properties)..
at Microsoft.Hpc.Scheduler.JobValidatorNew.SingleThread.JobValidatorSingleThread.validateTaskChanges(Boolean expandParametric)..
at Microsoft.Hpc.Scheduler.JobValidatorNew.SingleThread.JobValidatorSingleThread.checkAndValidate()..
at Microsoft.Hpc.Scheduler.JobValidatorNew.SingleThread.JobValidatorSingleThread.validateThreadMain()
12/04/2020 07:48:12.842 w HpcScheduler 6560 3568 [Policy] validateThreadMain in JobValidatorSingleThread suffered an exception, but will retry: Microsoft.Hpc.Scheduler.Properties.SchedulerException: The scheduler server is busy. It cannot handle the client request now. Please try again later. For detailed information please check the scheduler event log on head node. ex System.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out..
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)..
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)..
at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)..
at System.Data.SqlClient.SqlCommand.RunExecuteNonQueryTds(String methodName, Boolean async, Int32 timeout, Boolean asyncWrite)..
at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource`1 completion, String methodName, Boolean sendToPipe, Int32 timeout, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry)..
at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()..
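
To put numbers on how often these timeouts occur, the scheduler log lines could be counted per minute. The following is only a minimal Python sketch and not part of HPC Pack: the line layout (date, time, level, source, two numeric IDs, message) is an assumption inferred from the single log line above, and the function name count_timeouts_per_minute is hypothetical. Stack frame lines do not match the pattern and are ignored.

```python
import re
from collections import Counter

# Assumed layout, inferred from the one log line quoted above:
# "<dd/dd/yyyy> <hh:mm:ss.fff> <level> <source> <id> <id> <message>"
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<minute>\d{2}:\d{2}):\d{2}\.\d{3} "
    r"(?P<level>\w) (?P<source>\S+) \d+ \d+ (?P<message>.*)$"
)

# Text taken from the SqlException message in the log above.
TIMEOUT_MARKER = "Execution Timeout Expired"


def count_timeouts_per_minute(lines):
    """Count log entries mentioning a SQL execution timeout, keyed by 'date hh:mm'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and TIMEOUT_MARKER in match.group("message"):
            # The date is kept as text because its day/month order is ambiguous.
            counts[f"{match.group('date')} {match.group('minute')}"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "12/04/2020 07:48:12.842 w HpcScheduler 6560 3568 [Policy] validateThreadMain "
        "suffered an exception, but will retry: Execution Timeout Expired.",
        "at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()..",
    ]
    for minute, n in sorted(count_timeouts_per_minute(sample).items()):
        print(minute, n)
```

Comparing the busiest minutes from such a count against the DB-side metrics for the same time window might show whether the timeouts line up with anything on the database instance.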
################ -
Alberto García • HPC Architect
alberto.garcia@hpcnow.com
www.hpcnow.com