Slurm scheduler can't be accessed via SSH or Bastion
Hi Support,
I have a scheduler that was messed up by one of our team members, leaving it unstable and inaccessible via SSH or Bastion. I think she was setting up some environment on it but ended up causing this issue.
Originally, I just wanted to restart the scheduler, but I have a few concerns and may need some expert advice.
- If I restart the scheduler, will the jobs currently running on the nodes also be cancelled?
- One observation: when I restart the scheduler, for some reason my /share/home folder is not mounted, which leaves the node inaccessible via SSH.
- If I terminate the scheduler and start it again, my public IP will be replaced by a new one, which I would rather avoid.
Let me know what the best course of action is in my situation.
Azure CycleCloud
-
Markapuram Sudheer Reddy • 1,590 Reputation points • Microsoft External Staff
2025-04-09T18:59:12.7733333+00:00 Hi Jason Gumarang,
What changes were made to the setup, and were any network settings changed?
Could you please confirm whether any changes were made to the configuration files or scripts that manage the scheduler.
-
Jason Gumarang • 20 Reputation points
2025-04-16T11:57:40.1033333+00:00 Hi Support,
Sorry for the late response.
Basically, no changes were made to the scheduler. It seems one of my colleagues accidentally ran their job locally on the scheduler (not through Slurm), which made the scheduler unstable due to low resources.
So nothing was altered in the settings or the network configuration, and there were no changes to the configuration files or scripts either. It is just these two observations that I can't find a way around: retaining the public IP (on terminate) and the /share/home folder (after reboot). When I reboot, the public IP remains, but the /share/home folder is not retained (leaving my users unable to log in, because the home folder is blank and has no SSH keys).
regards,
Jason -
Arko • 1,655 Reputation points • Microsoft External Staff
2025-04-17T05:56:13.1966667+00:00 Hello Jason Gumarang, it appears there is resource exhaustion on the scheduler node, most likely because someone ran a heavy job directly on the scheduler VM instead of submitting it through Slurm. Could you verify this on your end?
And to answer your question: If I restart the scheduler, will current jobs on compute nodes be cancelled?
Ans- No, restarting the scheduler VM does not immediately cancel running jobs on compute nodes. However, the Slurm controller on the scheduler will temporarily lose communication with the compute nodes, and if the downtime is long, the jobs may be marked NODE_FAIL. Once the scheduler comes back up and slurmctld resumes, it should attempt to reconnect with the nodes. Before restarting, I would suggest gracefully stopping slurmctld and bringing it back up cleanly:
sudo systemctl stop slurmctld
sudo reboot
and then, after the reboot: sudo systemctl start slurmctld
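As a quick sanity check after the restart (a sketch, assuming the standard Slurm client tools are installed on the scheduler):
scontrol ping            # confirms slurmctld is up and responding
sinfo                    # node states; watch for down/drained nodes
squeue --states=all      # running jobs should still be listed, not NODE_FAIL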
Your next question: Why is /share/home not mounted after reboot?
Ans- Please check whether the /etc/fstab entry has noauto or is missing x-systemd.automount. If the entry is not there, add it:
<your-nfs-server>:/share/home /share/home nfs defaults,_netdev,x-systemd.automount 0 0
After that, reboot and verify: mount | grep /share/home
If it is still not mounted, mount it manually with sudo mount -a
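If the mount still fails, systemd's view of the unit it generates from fstab can tell you why. A sketch, assuming the default unit naming, where the mount path /share/home becomes the unit share-home.mount:
systemctl status share-home.mount        # current state and last error for the mount
journalctl -u share-home.mount -b        # mount attempts from the current boot
systemctl list-units --type=automount    # confirms the x-systemd.automount unit exists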
Your next question: How to retain the public IP of the scheduler if I terminate or redeploy?
Ans- In Azure, a VM's public IP is dynamic by default, meaning it changes on deallocation or termination. If you want, you can update it to static, or create a new static IP and assign it to your VM's NIC.
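For example, switching the existing public IP resource to static allocation can be done with the Azure CLI (the resource group and public IP names below are placeholders for your own):
az network public-ip update --resource-group my-cyclecloud-rg --name scheduler-pip --allocation-method Static
az network public-ip show --resource-group my-cyclecloud-rg --name scheduler-pip --query "{ip:ipAddress, method:publicIPAllocationMethod}"
-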
Jason Gumarang • 20 Reputation points
2025-04-17T11:46:58.29+00:00 - Resource exhaustion: yes, this is correct. The scheduler ran out of resources.
- Scheduler restart: got it. This is clear.
- Why is /share/home not mounted after reboot? It seems CycleCloud has a way (maybe through Jetpack) to mount my /share/home. Basically, my /home is somehow linked to /share/home. All of my CycleCloud users are synced into /share/home, and the weird part is that /home and /share/home are not symlinked to each other. It seems CycleCloud does not put a persistent entry in my fstab, yet the mount shows up in df -h.
fstab file:
UUID=b2f6e39f-3593-4dd6-a2ab-c020bcfe5189 / xfs defaults 0 0
UUID=c04fc77f-2590-4b64-842b-3ded3a0df30f /boot xfs defaults 0 0
UUID=2822-2755 /boot/efi vfat defaults,uid=0,gid=0,umask=077,shortname=winnt 0 2
/dev/disk/cloud/azure_resource-part1 /mnt auto defaults,nofail,x-systemd.requires=cloud-init.service,_netdev,comment=cloudconfig 0 2
df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 7.8G 0 7.8G 0% /dev
tmpfs 7.9G 0 7.9G 0% /dev/shm
tmpfs 7.9G 777M 7.1G 10% /run
tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
/dev/sdc4 63G 44G 20G 69% /
/dev/sdc3 1014M 277M 738M 28% /boot
/dev/sdc2 200M 5.9M 194M 3% /boot/efi
/dev/sdd1 147G 164K 140G 1% /mnt
/dev/mapper/vg_cyclecloud_builtinsched-lv0 30G 247M 30G 1% /sched
/dev/mapper/vg_cyclecloud_builtinshared-lv0 2.0T 127G 1.9T 7% /shared
I tried adding my logical volume to my fstab and then shut down and started the scheduler. Now the scheduler is just stuck at "waiting for virtual machine" in my CycleCloud dashboard.
-
Arko • 1,655 Reputation points • Microsoft External Staff
2025-04-17T14:03:56.5033333+00:00 Hello Jason Gumarang, correct: in Azure CycleCloud, the /home and /shared/home (or /share/home) relationship is not always a simple symlink. Instead, Jetpack scripts usually mount a logical volume (like /shared) and dynamically bind-mount user directories into /home. This happens during Jetpack runtime configuration, not through /etc/fstab.
Why are /home and /share/home not in fstab?
Ans- CycleCloud uses Jetpack (usually under /opt/cycle/jetpack) to handle the shared filesystem mounts (/shared, /sched), user sync and SSH key injection, and the mounting of /shared/home/<username> into /home/<username> via bind mounts at runtime.
So the absence from your /etc/fstab is expected; it's Jetpack's job.
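You can see this for yourself on a running scheduler. A sketch (the cyclecloud.mounts key is an assumption about the node's configuration naming; adjust as needed):
findmnt /home                                                  # source and options of whatever is mounted at /home
findmnt -R /shared                                             # the logical volume and anything bind-mounted beneath it
sudo /opt/cycle/jetpack/bin/jetpack config cyclecloud.mounts   # Jetpack's view of the configured mounts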
Why did the scheduler VM hang on “Waiting for virtual machine”?
Ans- This likely happened because of a misconfiguration in fstab. If you manually added a logical volume (like /shared) to fstab but did not match the correct device path, or used incorrect options, the OS can fail to boot or hang at mount time, and Jetpack services can get stuck waiting. In Azure CycleCloud this breaks VM startup, because Jetpack waits for dependencies like mount points to be ready, and a failed fstab mount can block systemd from completing boot, making the VM appear "stuck".
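If you do want the volume in fstab, a defensive entry stops a missing or slow device from blocking boot. A sketch using the device path from your df -h output (the xfs type is an assumption; confirm it with sudo blkid /dev/mapper/vg_cyclecloud_builtinshared-lv0):
/dev/mapper/vg_cyclecloud_builtinshared-lv0  /shared  xfs  defaults,nofail,x-systemd.device-timeout=30s  0 0
That said, since Jetpack already manages this mount, duplicating it in fstab is usually unnecessary.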
What is the way to get the scheduler back?
Ans- Start the VM in recovery mode (or attach the disk to another VM). If you can't SSH or connect via the serial console, use the Azure Portal to stop the scheduler VM, detach the OS disk, and attach it to another Linux VM as a data disk.
Mount the OS disk and edit fstab:
sudo mkdir /mnt/scheduler
sudo mount /dev/sdX4 /mnt/scheduler   # replace sdX with the actual disk ID
sudo nano /mnt/scheduler/etc/fstab
Comment out or remove the custom /shared or logical volume mount you added. Save, unmount with sudo umount /mnt/scheduler, and reattach the disk to the original scheduler VM.
Start the scheduler VM. It should now boot normally and reappear in CycleCloud.
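If you would rather script the attach/fix/restore cycle, the Azure CLI vm-repair extension automates the same flow. A sketch with placeholder resource names (it requires az extension add --name vm-repair first):
az vm repair create --resource-group my-cyclecloud-rg --name scheduler-vm --repair-username azureuser --repair-password '<password>'
# SSH to the repair VM, edit /etc/fstab on the attached copy of the OS disk, then swap the fixed disk back:
az vm repair restore --resource-group my-cyclecloud-rg --name scheduler-vm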
-
Arko • 1,655 Reputation points • Microsoft External Staff
2025-04-17T14:13:01.5333333+00:00 Jason Gumarang, hopefully I was able to answer all your questions. If there are no more questions on this, I can combine our conversation and post it as the answer for you to accept, so that anyone facing similar queries on the MS Q&A forum can refer to these suggestions. Thanks
-
Jason Gumarang • 20 Reputation points
2025-04-17T14:41:50.9933333+00:00 Thanks, Support. I will review this information and get back to you by Wednesday next week. I hope you can keep this open, as tomorrow and Monday are holidays here in my region.
-
Arko • 1,655 Reputation points • Microsoft External Staff
2025-04-23T03:51:39.1266667+00:00 Hello Jason Gumarang, hope you got a chance to check out my suggestion.
-
Jason Gumarang • 20 Reputation points
2025-04-23T07:17:51.5266667+00:00 Hi Support,
I got a chance to check on this, and it seems the mitigation takes a lot of effort, which is not feasible at the moment. My biggest issue is that if I reboot the scheduler, my /share/home is gone, so my users are unable to log in and lose their files. If I move to another scheduler, all of my users' installed packages (specifically the Python modules under their environments) are also lost, and they would need to reinstall them (from my understanding). All of the applications that were installed globally would be lost as well.
As of now, I am still lost on what I need to do or configure so that, after the scheduler is rebooted, it boots normally with all of my data intact.