Hi,
I was wondering if I am missing some configuration parameters in CycleCloud (or Slurm) to prevent Slurm nodes from being spun up and down for single jobs.
I have disabled autoscaling for now and set nodes to deallocate instead of terminating on stop. If I start a set of nodes before submitting any jobs, they immediately run the jobs at the top of the queue. However, those nodes then deallocate and new nodes (of the same size) spin up to handle the next jobs in the queue.
Given the time it takes to acquire the VMs, is there a way to stop the 'Ready' nodes from deallocating and use them for the next jobs in the queue instead?
Thanks
I was using the default Slurm cluster config, set up via the CycleCloud GUI, but with Slurm v19.
I will try a cluster using v20, see if I get the expected results, and provide an update.
Thanks for the advice.
Matt
@Matt Jackson Can you please share which version of Slurm this is? Is it an entirely CycleCloud-managed cluster with our out-of-the-box configuration, or is it a custom installation? If you've modified slurm.conf, could you share it?
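In case it helps gather that info, running something like the following on the scheduler node should report the version and the power-saving settings Slurm is actually using (assuming the standard Slurm client tools are on the PATH):

    # Print the Slurm version the cluster is running
    sinfo --version

    # Show the power-saving settings currently in effect
    # (SuspendTime, SuspendProgram, ResumeProgram, timeouts, etc.)
    scontrol show config | grep -Ei 'suspend|resume'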
We depend entirely on Slurm for the job allocation, and depending on the configuration and timing there are definitely scenarios where Slurm will spin up a new node rather than reuse the existing node. In general, we have seen the “expected behavior” more than we see Slurm refusing to reuse idle nodes.
Also, Slurm v19 and earlier versions don't reuse idle nodes that are already powered on; this is a known behavior for those versions.
Starting with Slurm v20 this behavior has changed, and idle nodes that are already powered on are reused instead of spinning up new ones.
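For reference, how long an idle, powered-on node stays up before Slurm powers it down is controlled by the power-saving settings in slurm.conf. The snippet below is a generic sketch of those settings, not our out-of-the-box values, and the script paths are placeholders rather than the actual CycleCloud install locations:

    # Seconds a node may sit idle before Slurm calls SuspendProgram to power it down;
    # raising this keeps 'Ready' nodes available for the next jobs in the queue.
    SuspendTime=600
    # Seconds Slurm allows for a power-down / power-up operation to complete.
    SuspendTimeout=300
    ResumeTimeout=1800
    # Scripts that deallocate and start the VMs; CycleCloud ships its own versions
    # (the paths here are placeholders, not the real install locations).
    SuspendProgram=/path/to/suspend.sh
    ResumeProgram=/path/to/resume.sh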
Please let me know the details and I can help further. Thanks.
I did a quick test using v20. It did seem to keep some of the nodes active whilst there were still jobs in the queue.
I am still having some issues with the slow time to spin nodes up and down, which seems to be causing problems with the job submission workflow I am using (Nextflow), but I think this is independent of the original question I raised.
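For anyone with a similar setup, the Nextflow side of this is just the Slurm executor settings in nextflow.config; a rough sketch of the kind of throttling I am looking at (the partition name is a placeholder, not my actual queue) is:

    // nextflow.config: submit tasks through the Slurm executor
    process.executor = 'slurm'
    process.queue    = 'hpc'            // placeholder partition name

    // Throttle how aggressively Nextflow submits and polls Slurm jobs
    executor.queueSize       = 50       // max jobs queued in Slurm at once
    executor.submitRateLimit = '10/1min'
    executor.pollInterval    = '30 sec'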
Thanks again,
Matt