HPC Pack 2019: slowness on the scheduler due to the DBs

Alberto Garcia Fernandez 26 Reputation points
2020-12-04T11:23:43.107+00:00

Hi everyone,

We are writing to get some insight into what is happening on our Windows HPC cluster. We have an on-premises cluster of Windows Server 2019 servers, with 3 head nodes and 136 compute nodes, running HPC Pack 2019 with built-in HA and the DBs hosted externally on Always On instances.

We find that job validation takes too long (it can take 5 minutes for the simplest of tasks), although the job eventually gets submitted. In the logs we find exceptions like the one below, and simple scheduler queries seem to take tens of seconds to complete. However, we have checked the DBs and they do not seem to be overloaded on CPU, memory, or disk. All HPC Pack DBs (Management, Scheduler, Monitoring, Reporting, etc.) are hosted on an instance with 16 vCPUs and 64 GB of RAM (which is replicated in Always On mode).

  • In your experience, is this size appropriate?
  • Can you think of any other parameter that may be slowing down the queries?
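
To put numbers on the slow queries described above, the round-trip time of a statement can be measured from a head node. What follows is only a minimal sketch, not part of HPC Pack: `time_query` and `summarize` are hypothetical helper names, and the code assumes any Python DB-API connection (for SQL Server that would typically be a `pyodbc` connection). An in-memory SQLite database is used as a stand-in so the sketch runs anywhere.

```python
import sqlite3
import statistics
import time


def time_query(conn, sql, runs=5):
    """Run `sql` several times on a DB-API connection; return latencies in seconds."""
    latencies = []
    for _ in range(runs):
        cur = conn.cursor()
        start = time.perf_counter()
        cur.execute(sql)
        cur.fetchall()  # include result transfer in the measurement
        latencies.append(time.perf_counter() - start)
        cur.close()
    return latencies


def summarize(latencies):
    """Return the min, median and max of the measured latencies."""
    return {
        "min": min(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }


if __name__ == "__main__":
    # Stand-in connection (assumption): against the real cluster this would
    # be a connection to the scheduler database instead of SQLite.
    conn = sqlite3.connect(":memory:")
    print(summarize(time_query(conn, "SELECT 1")))
```

If a trivial statement such as `SELECT 1` is already slow from the head node, the time is probably being lost before the query runs (connection or network); if only the scheduler queries are slow, the query itself is the more likely place to look.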

Any help or tip would be greatly appreciated.
Best Regards,
Alberto.

   at Microsoft.Hpc.Scheduler.Store.LocalRowEnumerator.SetProperties(StoreProperty[] properties)
   at Microsoft.Hpc.Scheduler.JobValidatorNew.SingleThread.JobValidatorSingleThread.validateTaskChanges(Boolean expandParametric)
   at Microsoft.Hpc.Scheduler.JobValidatorNew.SingleThread.JobValidatorSingleThread.checkAndValidate()
   at Microsoft.Hpc.Scheduler.JobValidatorNew.SingleThread.JobValidatorSingleThread.validateThreadMain()
12/04/2020  07:48:12.842  w  HpcScheduler  6560  3568  [Policy] validateThreadMain in JobValidatorSingleThread suffered an exception, but will retry: Microsoft.Hpc.Scheduler.Properties.SchedulerException: The scheduler server is busy. It cannot handle the client request now. Please try again later. For detailed information please check the scheduler event log on head node. ex System.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlCommand.RunExecuteNonQueryTds(String methodName, Boolean async, Int32 timeout, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource1 completion, String methodName, Boolean sendToPipe, Int32 timeout, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry)
   at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
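
To see whether these timeouts cluster at particular times, the scheduler log can be scanned for the "Execution Timeout Expired" message. The sketch below is an illustration under assumptions: the line layout is inferred only from the single log line shown above (date, time, level, source, two numeric IDs whose meaning is not confirmed here, then the message), and `timeouts_per_hour` is a hypothetical helper, not an HPC Pack tool.

```python
import re
from collections import Counter

# Layout inferred from the one log line quoted above (assumption):
# date  time  level  source  id  id  message
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>\w)\s+"
    r"(?P<source>\S+)\s+"
    r"(?P<id1>\d+)\s+(?P<id2>\d+)\s+"
    r"(?P<message>.*)$"
)

TIMEOUT_MARKER = "Execution Timeout Expired"


def timeouts_per_hour(lines):
    """Count log lines containing a SQL execution timeout, keyed by (date, hour)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        # Stack-trace lines ("at ...") do not match the layout and are skipped.
        if match and TIMEOUT_MARKER in match.group("message"):
            hour = match.group("time")[:2]
            counts[(match.group("date"), hour)] += 1
    return counts


if __name__ == "__main__":
    # Abbreviated copy of the log line from this post.
    sample = [
        "12/04/2020  07:48:12.842  w  HpcScheduler  6560  3568  [Policy] "
        "validateThreadMain in JobValidatorSingleThread suffered an exception: "
        "System.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.",
        "   at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()",
    ]
    for (date, hour), count in sorted(timeouts_per_hour(sample).items()):
        print(f"{date} {hour}:00  {count} timeout(s)")
```

Comparing the busiest hours with the job submission pattern or with maintenance activity on the database side may help narrow down when the slowdown occurs.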

Alberto García • HPC Architect
alberto.garcia@hpcnow.com
www.hpcnow.com

