HPC Pack 2016 cluster stuck after database failure: jobs stay queued and cannot be canceled

Ravindra Ravu 101 Reputation points
2021-02-25T16:12:08.77+00:00

All HPC jobs got stuck when the database failed. We tried to cancel those jobs, but with no luck. New jobs go into the Queued state and do not run. We are at a standstill now. We rebooted the database, and it is up and running again. We also rebooted the entire cluster (all HPC nodes). Any help is appreciated. Thanks.

Azure Virtual Machines

Accepted answer
  1. Ravindra Ravu 101 Reputation points
    2021-02-26T21:08:49.703+00:00

    We found that the Resource table contained bad resources created with core -1 and socket -1, and that jobs had been allocated to those bad resources. We also noticed that every resource was busy with some job.

    We stopped all services, cleared the data, and started all services again. We are back in business.
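
The check described above (resources with core -1 and socket -1, and the jobs allocated to them) could be sketched roughly as follows. This is only an illustration, not the actual HPC Pack schema or tooling: the `Resource` class, its field names, and the sample rows are all hypothetical, and it assumes the rows of the Resource table have already been exported into Python objects by some other means.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    # Hypothetical stand-in for one row of the Resource table.
    # Field names are assumptions, not the real HPC Pack column names.
    resource_id: int
    node_name: str
    core_id: int
    socket_id: int
    job_id: Optional[int]  # None when no job is allocated


def find_bad_resources(resources: List[Resource]) -> List[Resource]:
    """Return resources whose core or socket ID is negative (such as -1)."""
    return [r for r in resources if r.core_id < 0 or r.socket_id < 0]


def jobs_on_bad_resources(resources: List[Resource]) -> List[int]:
    """Return the sorted, de-duplicated job IDs allocated to bad resources."""
    bad = find_bad_resources(resources)
    return sorted({r.job_id for r in bad if r.job_id is not None})


# Made-up sample rows, for illustration only.
sample = [
    Resource(1, "NODE01", 0, 0, 501),
    Resource(2, "NODE01", -1, -1, 502),
    Resource(3, "NODE02", 1, 0, None),
    Resource(4, "NODE02", -1, -1, 502),
]

bad_ids = [r.resource_id for r in find_bad_resources(sample)]
affected_jobs = jobs_on_bad_resources(sample)
print("Bad resources:", bad_ids)
print("Jobs on bad resources:", affected_jobs)
```

A listing like this only helps confirm the diagnosis. Any cleanup of the database itself should be done with the HPC services stopped, as in the fix described above.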
