Azure CycleCloud config file does not match the number of CPUs on a node in each cluster

Shannon Cosgrove 1 Reputation point
2022-01-13T19:25:17.77+00:00

Hi everyone.

I am using Slurm to run a script on Azure CycleCloud, and the script is written to use all of the cores. When I run it on the cluster, it only uses half of the cores on the node. The cyclecloud.conf and slurm.conf files specify 16 CPUs instead of the 32 that the nodes have. When I change the conf file to the correct number of CPUs (32), the job still does not run on all of them. If I change the conf file AND remove/rescale the nodes, it also does not run on all of them.

Please let me know if anyone can help. It seems like I need to change the CPU count in whatever is controlling the conf files.

Azure CycleCloud
A Microsoft tool for creating, managing, operating, and optimizing high-performance computing (HPC) and big compute clusters in Azure.

2 answers

  1. vipullag-MSFT 26,396 Reputation points
    2022-01-14T11:39:31.867+00:00

    @Shannon Cosgrove

    Thanks for reaching out to Microsoft Q&A Platform.

    It might be that those machines have hyper-threading enabled, which would be why only half of the core count is being picked up.

    You can try changing cyclecloud.conf so that CPUs matches the node's CPU count, with ThreadsPerCore=1, and then restarting slurmctld.

    But these changes will be removed once you run the remove_nodes/scale command. To verify, run "scontrol show nodes" and check "CPUTot".

    Hope this helps.
    Please 'Accept as answer' if the provided information is helpful, so that it can help others in the community looking for help on similar topics.
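
The check suggested in this answer can also be scripted. The following is a minimal Python sketch, not part of the original answer. `parse_scontrol_node` and `cpu_mismatch` are hypothetical helper names, the sample text copies three lines from the `scontrol show nodes` output posted later in this thread, and the expected count of 32 is the vCPU count of a Standard_F32s_v2.

```python
import re


def parse_scontrol_node(text):
    """Collect KEY=VALUE tokens from one node record of `scontrol show nodes`."""
    # Values are taken up to the next whitespace; keys with empty values are skipped.
    return dict(re.findall(r"(\w+)=(\S+)", text))


def cpu_mismatch(text, expected_cpus):
    """Return (configured CPUTot, True if it differs from expected_cpus)."""
    fields = parse_scontrol_node(text)
    configured = int(fields["CPUTot"])
    return configured, configured != expected_cpus


# Sample lines copied from the output posted in this thread (assumption: one node only).
sample = """\
CPUAlloc=0 CPUTot=16 CPULoad=11.72
RealMemory=62259 AllocMem=0 FreeMem=60022 Sockets=16 Boards=1
State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
"""

if __name__ == "__main__":
    configured, differs = cpu_mismatch(sample, expected_cpus=32)
    print("CPUTot reported by Slurm:", configured)
    print("differs from expected 32:", differs)
```

On a live cluster the text would be the output of `scontrol show nodes` for a single node, captured after slurmctld has been restarted.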


  2. Shannon Cosgrove 1 Reputation point
    2022-01-14T15:22:13.34+00:00

    Hi! Thanks @vipullag-MSFT for your answer. I ran scontrol show nodes, and this is what it shows for this type of machine (Standard_F32s_v2):

    CPUAlloc=0 CPUTot=16 CPULoad=11.72
    AvailableFeatures=cloud
    ActiveFeatures=cloud
    Gres=(null)
    NodeAddr=hpc-pg0-1 NodeHostName=hpc-pg0-1 Port=0 Version=19.05.8
    OS=Linux 5.4.0-1064-azure #67~18.04.1-Ubuntu SMP Wed Nov 10 11:38:21 UTC 2021
    RealMemory=62259 AllocMem=0 FreeMem=60022 Sockets=16 Boards=1
    State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    Partitions=hpc
    BootTime=2022-01-14T02:04:35 SlurmdStartTime=2022-01-14T02:10:58
    CfgTRES=cpu=16,mem=62259M,billing=16
    AllocTRES=
    CapWatts=n/a
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

    I don't see anywhere in here that there are two threads per core or two sockets, so I understand why Slurm is getting confused and only running on half of the cores. But I used multiprocessing.cpu_count(), and the multiprocessing module sees 32 cores. Any idea what I need to change in the Slurm script to accommodate this type of machine?
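
One possible reason the two counts differ: `multiprocessing.cpu_count()` reports the logical CPUs of the machine, not the CPUs the job was given. The sketch below, assuming Python 3 on Linux, sizes the worker count from what Slurm exposes to the job instead. `usable_cpu_count` is a hypothetical helper. `SLURM_CPUS_PER_TASK` and `SLURM_CPUS_ON_NODE` are environment variables that Slurm sets inside a job. Whether the affinity mask is actually restricted depends on how task binding is configured on the cluster, which this thread does not show.

```python
import multiprocessing
import os


def usable_cpu_count(env=None):
    """Return a worker count that respects the Slurm allocation when one is visible."""
    env = os.environ if env is None else env
    # Prefer the per-task request, then the per-node allocation, if Slurm set them.
    for var in ("SLURM_CPUS_PER_TASK", "SLURM_CPUS_ON_NODE"):
        value = env.get(var, "")
        if value.isdigit() and int(value) > 0:
            return int(value)
    # On Linux, the affinity mask lists the CPUs this process may run on.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback: logical CPUs on the machine, which may exceed the allocation.
    return multiprocessing.cpu_count()


if __name__ == "__main__":
    print("logical CPUs on the machine:", multiprocessing.cpu_count())
    print("CPUs usable by this job:", usable_cpu_count())
```

If the second number printed inside a job is 16 while the first is 32, the limit comes from the Slurm node definition rather than from the script, and the pool size passed to `multiprocessing.Pool` would need to follow the allocation.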

