CycleCloud Slurm Cluster cannot seem to communicate with the nodes

Gary Mansell 111 Reputation points
2022-12-09T11:07:38.83+00:00

This is my first stab at setting up Azure HPC using CycleCloud and Slurm, so forgive me for stupid mistakes...

I have built a simple (default) Slurm cluster using CycleCloud and the nodes start/stop OK, but when I run a simple (hostname) job it just hangs at "Completing".

Debug logging seems to suggest that the Slurm Scheduler node cannot communicate with the nodes:

[azccadmin@SlurmCluster-1-scheduler data]$ srun -N2 -n2 -t00:15:00 -Jgrma-hostname hostname.sh
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "slurmcluster-1-hpc-pg0-1"
srun: error: fwd_tree_thread: can't find address for host slurmcluster-1-hpc-pg0-1, check slurm.conf
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "slurmcluster-1-hpc-pg0-2"
srun: error: fwd_tree_thread: can't find address for host slurmcluster-1-hpc-pg0-2, check slurm.conf
srun: error: Task launch for StepId=1.0 failed on node slurmcluster-1-hpc-pg0-2: Can't find an address, check slurm.conf
srun: error: Task launch for StepId=1.0 failed on node slurmcluster-1-hpc-pg0-1: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted

I cannot ping the node from the Scheduler node:

[azccadmin@SlurmCluster-1-scheduler ~]$ ping slurmcluster-1-hpc-pg0-1
ping: slurmcluster-1-hpc-pg0-1: Name or service not known

It seems that the Slurm node name does not match the Azure hostname, so name resolution is broken: Slurm is trying to contact a node name that Azure networking knows nothing about.

Slurm is trying to talk to the node as "slurmcluster-1-hpc-pg0-1", but the hostname of the node is actually "slurmcluster-1-slurmcluster-1-hpc-pg0-1".

And I can ping it with this name from the Scheduler node:

[azccadmin@SlurmCluster-1-scheduler data]$ ping slurmcluster-1-slurmcluster-1-hpc-pg0-1
PING slurmcluster-1-slurmcluster-1-hpc-pg0-1.yr5ran05dk5uzlz13kqz0cq4xe.ax.internal.cloudapp.net (192.168.140.6) 56(84) bytes of data.
64 bytes from slurmcluster-1-slurmcluster-1-hpc-pg0-1.internal.cloudapp.net (192.168.140.6): icmp_seq=1 ttl=64 time=1.77 ms
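
The two ping results above can be reproduced for any list of candidate names with a short script. This is only a sketch: `resolve_status` is a hypothetical helper name, and it assumes Python 3 is available on the scheduler node. It reports which names the local resolver can translate to an address, which is the same lookup that fails in the srun output.

```python
import socket


def resolve_status(names, resolver=socket.gethostbyname):
    """Map each name to its resolved IPv4 address, or None if lookup fails.

    `resolver` is injectable so the function can be exercised without
    touching real DNS; by default it uses the system resolver.
    """
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror ("Name or service not known") is a subclass of OSError
            results[name] = None
    return results
```

Calling `resolve_status(["slurmcluster-1-hpc-pg0-1", "slurmcluster-1-slurmcluster-1-hpc-pg0-1"])` on the scheduler should show `None` for any name that cannot be resolved and an IP address for any name that can.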

I am using CycleCloud 8.3, which I note has a fix for the Slurm NodeName / Azure hostname mismatch, but this still seems to be an issue.

Thanks

Gary

Accepted answer
  1. vipullag-MSFT 24,206 Reputation points Microsoft Employee
    2022-12-12T05:24:50.037+00:00

    @Gary Mansell

    Welcome to Microsoft Q&A Platform, thanks for posting your query here.

    I see that you have already opened issue #105 for this, and the product team is already engaged there to provide a solution.

    Based on the ongoing conversation on that issue, the team has made changes to remove case sensitivity from the DNS check, which is the cause of the problem you are seeing.

    Refer to the changes made in #106.

    Until these changes are approved, a quick workaround is to use (if possible) a cluster name with all lowercase letters. This should solve the issue.
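
To illustrate why a mixed-case cluster name such as "SlurmCluster-1" can trip a case-sensitive check, here is a minimal sketch. It is not the code from the change referenced above; `hostnames_match` is a hypothetical helper, and the assumption is that the check compares an expected node name with the name DNS returns. DNS names are case-insensitive, so both sides are normalised before comparing.

```python
def hostnames_match(expected, actual):
    """Compare two host names the way DNS does: ignoring letter case.

    Hypothetical helper for illustration only. Surrounding whitespace
    and a trailing root dot are also ignored.
    """
    def normalise(name):
        return name.strip().rstrip(".").lower()

    return normalise(expected) == normalise(actual)


# A strict comparison treats these as different names...
strict = "SlurmCluster-1-hpc-pg0-1" == "slurmcluster-1-hpc-pg0-1"
# ...while the case-insensitive comparison treats them as the same host.
relaxed = hostnames_match("SlurmCluster-1-hpc-pg0-1", "slurmcluster-1-hpc-pg0-1")
```

With an all-lowercase cluster name the strict and case-insensitive comparisons agree, which is consistent with the workaround suggested above.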

    Hope this helps.
    If you need further help on this, tag me in a comment.
    If the suggested response helped you resolve your issue, please 'Accept as answer', so that it can help others in the community looking for help on similar topics.

