Problem getting GPU solving to work with our Azure CycleCloud / Slurm HPC cluster System

Question

Problem getting GPU solving to work with our Azure CycleCloud / Slurm HPC cluster System

Gary Mansell 136

I am using the Azure CycleCloud 8.4 Marketplace image and it is fully updated, along with Slurm version 22.05.8-1.

I have configured a GPU Enabled Slurm Partition consisting of some NC24sv3 VMs (which have 4x Nvidia Tesla V100 GPUS in each), but the Slurm Scheduler is showing the Partition as "Invalid" when the processing nodes are created at job run time:

User's image

The Azure Slurm config seems to detect and configure the GPU enabled VMs:

User's image

But, the Slurm node is showing as invalid "as the GPU count reported is less (0) than the (4) that is configured":

User's image

The GPU nodes are built from the Azure HPC AlmaLinux8 image, and when I login to them, the NVIDIA driver is installed and reporting the 4x GPU's:

User's image

Azure Cloud-Init configuration for the VMs is as follows:

User's image

So, I am a bit stuck and hoping someone can help me to get the GPU Slurm partition to initialise correctly - can anyone give me any help or suggestions, please?

Gary Mansell 136

I can add, that I have discovered, that if I manually start the GPU enabled node using the command "azslurm resume <nodename>" as root from the scheduler, then the GPU node seems to come up and does not show the error that the number of GPUs does not match what it was expecting and the state is idle instead of invalid:

User's image

So (I may be wrong), but maybe the node is coming up OK when the job is submitted via sbatch, but it is the job submission process which is going wrong and causing the state to then go to invalid?

Here is my job submission script:

#!/bin/bash
## Job name
#SBATCH --job-name=run-grma
#
## File to write standard output and error
#SBATCH --output=run-grma.out
#SBATCH --error=run-grma.err
#
## Run time on the queue
#SBATCH --time=08:00:00
#
## Partition for the cluster (you might not need that)
#SBATCH --partition=gpunc24sv3
#
## Number of nodes
#SBATCH --nodes=1
#
## Number of CPUs per nodes
#SBATCH --ntasks-per-node=8
#
## Hyperthread cores - set this to 1 to disable HT
#SBATCH --cpus-per-task=1
#
## RAM needed per CPU (you might not need that)
## SBATCH --mem-per-cpu=100M
#
## Account to charge the CPU time (you probably don't need that)
## SBATCH --account=XXXX
#
## Enable GPUS for the node
# SBATCH --gpus-per-node=4

## General
module purge

## Initialise Mechanical
if [ -d /shared/apps/ansys_inc/v231/ansys/bin ]
then
    export PATH=$PATH:/shared/apps/ansys_inc/v231/ansys/bin
else
    echo "Failed to Initialise MAPDL"
fi

echo "******************************************"
echo "The value of SLURM_NTASKS variable is = " $SLURM_NTASKS
echo "******************************************"

## Setup Hosts file
SCHEDULER_HOST_FILE=slurm.$SLURM_JOB_ID.hosts
scontrol show hostnames > $SCHEDULER_HOST_FILE

## Run

#ansys231 -p mech_2 -b nolist -s noread -dis -machines=$SCHEDULER_HOST_FILE -np $SLURM_NTASKS -mpi intelmpi -dir /shared/data/FH13 -i rotary_housing_fe_structural_st40.cdb -o rotary_housing_fe_structural_st40.out
ansys231 -naa NVIDIA -na 2 -p mech_2 -b nolist -s noread -dis -machines=$SCHEDULER_HOST_FILE -np $SLURM_NTASKS -mpi openmpi -i input.cdb -o output.out

# Cleanup
rm $SCHEDULER_HOST_FILE

Thanks

Gary Mansell 136 Reputation points

2023-12-20T15:17:27.1333333+00:00

So there is this bug report that might be related: https://github.com/Azure/cyclecloud-slurm/issues/114

Apparently, the Azure Centos7 and AlmaLinux8 HPC images don't work with the NC VMs (which is what I am using) - is this still an issue (it's old, so surely the image could have been fixed by now)?

Further, they say the fix might be to use the NV series VMs - but I thought that these were for interactive GPU (desktop usage), and the NC VMs for CUDA (solving usage)?

Can anyone shed any light on this?
Gary Mansell 136 Reputation points

2023-12-20T16:48:59.8033333+00:00

Using NV series nodes for GPU processing does not work either (I did not think it would as not configured for CUDA processing).

But, NV32ads_A10_v4 SKU (which I have quota for) - at least allows the partition to initialise and run the job, even if it then fails as there is no CUDA.

I now need to request Quota for NC32ads_A10_v4 SKU - so that I can confirm this both allows the partition to initialise with the Azure HPC image for AlmaLinux8 and allows CUDA processing.

Hope this helps others - as I have spent several days trying to solve this (will confirm later when I have quota)!
deherman-MSFT 38,021 Reputation points Microsoft Employee Moderator

2023-12-22T17:14:28.8166667+00:00

@Gary Mansell Sorry for the delay. I am working the internal service team and will provide an update when possible.
deherman-MSFT 38,021 Reputation points Microsoft Employee Moderator

2024-01-03T17:07:40.3466667+00:00

@Gary Mansell

I have spoken to the CycleCloud team and they have asked that you open a support ticket so they can diagnose this further. If you do not have a support plan, please email the following to AzCommunity@microsoft.com and we'll get back to you promptly:

• Subject: "Attn: deherman - "

• Email body: Your Subscription ID

• Email body: A link to this thread so we can validate and expedite the request

If you don't receive a response within 24 hours, please reply to the thread so we can investigate.
Gary Mansell 136 Reputation points

2024-01-04T09:05:22.5166667+00:00

@deherman-MSFT Thanks, for raising this issue with the CycleCloud team, hopefully they can look into this.

I now have additional Quota's and can confirm that if I use the newer GPU enabled VM SKUs (Standard_NC16as_T4_v3 or Standard_NC24ads_A100_v4) - then CycleCloud initialises the GPU partitions using the Almalinux8 HPC image OS without the error that I reported with the NC24rsv3 VM SKUs and I can go on to run Ansys Mechanical GPU solving successfully.

So, it's definitely an issue with the AlmaLinux8 HPC image and the older GPU's not being seen by Slurm when the VM's are initialised. Maybe the latest Nvidia CUDA driver version does not support the older cards now?

Hopefully this information will be useful to others as I have wasted a lot of time on this problem to date.

I cannot raise a support ticket direct with Microsoft as we have to go via our MSP (who knows nothing about CycleCloud). Can I perhaps open a ticket directly through MS with your help?

TBH - in the first instance, I would have thought the CycleCloud team should easily be able to reproduce this themselves by creating a GPU enabled Slurm partition using the AlmaLinux8 HPC image and NCv24rsv3 VM SKU.

Rgds

Gary
Gary Mansell 136 Reputation points

2024-01-05T14:24:41.5066667+00:00

I am mistaken, I can only get the Standard_NC24ads_A100_v4 family VMs to work with the AlmaLinux8 image for GPU Solving and not the Standard_NC16as_T4_v3 family VMs - they give the same invalid state error in Slurm sinfo output after they are initialised as the NC24sv3's...

@deherman-MSFT I am going to need to open an MS tech support case about this, as I need to be able to to GPU solving on more than just Ampere GPUs (they are expensive) and so I would like to open the support case direct with MS support via you rather than via my MSP who know nothing about CycleCloud. I tried emailing azcommunity@microsoft.com for your attn. per your message, but I get a delivery failure.

How should I proceed?

1 answer

Your answer

Gary Mansell 136 Reputation points

2023-12-20T15:17:27.1333333+00:00

So there is this bug report that might be related: https://github.com/Azure/cyclecloud-slurm/issues/114

Apparently, the Azure Centos7 and AlmaLinux8 HPC images don't work with the NC VMs (which is what I am using) - is this still an issue (it's old, so surely the image could have been fixed by now)?

Further, they say the fix might be to use the NV series VMs - but I thought that these were for interactive GPU (desktop usage), and the NC VMs for CUDA (solving usage)?

Can anyone shed any light on this?
Gary Mansell 136 Reputation points

2023-12-20T16:48:59.8033333+00:00

Using NV series nodes for GPU processing does not work either (I did not think it would as not configured for CUDA processing).

But, NV32ads_A10_v4 SKU (which I have quota for) - at least allows the partition to initialise and run the job, even if it then fails as there is no CUDA.

I now need to request Quota for NC32ads_A10_v4 SKU - so that I can confirm this both allows the partition to initialise with the Azure HPC image for AlmaLinux8 and allows CUDA processing.

Hope this helps others - as I have spent several days trying to solve this (will confirm later when I have quota)!
deherman-MSFT 38,021 Reputation points Microsoft Employee Moderator

2023-12-22T17:14:28.8166667+00:00

@Gary Mansell Sorry for the delay. I am working the internal service team and will provide an update when possible.
deherman-MSFT 38,021 Reputation points Microsoft Employee Moderator

2024-01-03T17:07:40.3466667+00:00

@Gary Mansell

I have spoken to the CycleCloud team and they have asked that you open a support ticket so they can diagnose this further. If you do not have a support plan, please email the following to AzCommunity@microsoft.com and we'll get back to you promptly:

• Subject: "Attn: deherman - "

• Email body: Your Subscription ID

• Email body: A link to this thread so we can validate and expedite the request

If you don't receive a response within 24 hours, please reply to the thread so we can investigate.
Gary Mansell 136 Reputation points

2024-01-04T09:05:22.5166667+00:00

@deherman-MSFT Thanks, for raising this issue with the CycleCloud team, hopefully they can look into this.

I now have additional Quota's and can confirm that if I use the newer GPU enabled VM SKUs (Standard_NC16as_T4_v3 or Standard_NC24ads_A100_v4) - then CycleCloud initialises the GPU partitions using the Almalinux8 HPC image OS without the error that I reported with the NC24rsv3 VM SKUs and I can go on to run Ansys Mechanical GPU solving successfully.

So, it's definitely an issue with the AlmaLinux8 HPC image and the older GPU's not being seen by Slurm when the VM's are initialised. Maybe the latest Nvidia CUDA driver version does not support the older cards now?

Hopefully this information will be useful to others as I have wasted a lot of time on this problem to date.

I cannot raise a support ticket direct with Microsoft as we have to go via our MSP (who knows nothing about CycleCloud). Can I perhaps open a ticket directly through MS with your help?

TBH - in the first instance, I would have thought the CycleCloud team should easily be able to reproduce this themselves by creating a GPU enabled Slurm partition using the AlmaLinux8 HPC image and NCv24rsv3 VM SKU.

Rgds

Gary
Gary Mansell 136 Reputation points

2024-01-05T14:24:41.5066667+00:00

I am mistaken, I can only get the Standard_NC24ads_A100_v4 family VMs to work with the AlmaLinux8 image for GPU Solving and not the Standard_NC16as_T4_v3 family VMs - they give the same invalid state error in Slurm sinfo output after they are initialised as the NC24sv3's...

@deherman-MSFT I am going to need to open an MS tech support case about this, as I need to be able to to GPU solving on more than just Ampere GPUs (they are expensive) and so I would like to open the support case direct with MS support via you rather than via my MSP who know nothing about CycleCloud. I tried emailing azcommunity@microsoft.com for your attn. per your message, but I get a delivery failure.

How should I proceed?

Answer 1

Microsoft Support helped me to a fix on this issue - it was NOT a problem with the AlmaLinux8 HPC image as I had first surmised...

There were two issues - firstly yum had updated my CycleCloud to version 8.5 (from 8.4), which meant my Slurm 3.0.1 Cluster templates were out of date wrt. CycleCloud version. I first needed to download the Azure CycleCloud 3.0.5 default template (https://github.com/Azure/cyclecloud-slurm/blob/master/templates/slurm.txt) and merge it with my custom template, and then create a new 8.5 Cluster with the correct Slurm template version.

This was the main cause of the problem - for some reason only Ampere GPU jobs ran with this out of date configuration, which confused things, we don't really know why. But this messed up configuration was causing there to be no gres.conf (outlining the gpus available) to be in /sched, nor linked to in /etc/slurm

Once the cluster template was fixed and new cluster built, I just had to ensure that my slurm job runit script had the following option in it so that there were GPUs available to my job

## Specify the number of GPUs for the task
#SBATCH --gres=gpu:4

Then, the GPU nodes would initialise correctly in Slurm and I could then run jobs against the GPUs.

Share via

Problem getting GPU solving to work with our Azure CycleCloud / Slurm HPC cluster System

1 answer

Your answer