HPC Pack 2016 cluster stuck after database failure: jobs stay queued and cannot be canceled

Question

All HPC jobs got stuck when the database failed. We tried to cancel those jobs, but with no luck. New jobs stay in the Queued state instead of running. We are at a standstill now. We rebooted the database, and it is up and running. We also rebooted the entire cluster (HPC nodes). Any help is appreciated. Thanks.

Accepted Answer

We found that the Resource table contained bad resources created with core -1 and socket -1, and that jobs had been allocated to those bad resources. We also noticed that every resource was marked busy with some job.

We stopped all services, cleared the data, and started the services again. Back in business.
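
For anyone checking for the same symptom, here is a minimal sketch of the check described above: flag resource rows whose core or socket is -1 and list the jobs allocated to them. It works on rows already exported from the database as dictionaries. The column names (resource_id, core_id, socket_id, job_id) and the sample rows are assumptions for illustration only, not the actual HPC Pack schema or data from this cluster.

```python
# Sketch only: column names below are hypothetical, not the real HPC Pack schema.

def find_bad_resources(rows):
    """Return the rows whose core or socket ID is -1."""
    return [r for r in rows if r["core_id"] == -1 or r["socket_id"] == -1]


def jobs_on_bad_resources(rows):
    """Return the sorted, de-duplicated job IDs allocated to bad resources."""
    bad = find_bad_resources(rows)
    return sorted({r["job_id"] for r in bad if r["job_id"] is not None})


# Made-up sample rows standing in for an export of the Resource table.
sample_rows = [
    {"resource_id": 1, "core_id": 0, "socket_id": 0, "job_id": 101},
    {"resource_id": 2, "core_id": -1, "socket_id": -1, "job_id": 102},
    {"resource_id": 3, "core_id": 1, "socket_id": 0, "job_id": None},
    {"resource_id": 4, "core_id": -1, "socket_id": -1, "job_id": 103},
]

bad_ids = [r["resource_id"] for r in find_bad_resources(sample_rows)]
affected_jobs = jobs_on_bad_resources(sample_rows)
print("Bad resources:", bad_ids)
print("Jobs on bad resources:", affected_jobs)
```

This only reports what looks wrong. Any cleanup should be done with all HPC services stopped, as in the answer above, and ideally after backing up the database.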
