Azure CycleCloud - terrible HPC CFD performance and scaling vs on-prem benchmark?

Gary Mansell 111 Reputation points
2022-12-15T10:42:54.12+00:00

Hi,

I have set up a PoC Azure CycleCloud Slurm cluster to evaluate cost and performance vs on-prem.

Our nearest comparison cluster on-prem has an older CPU but a slightly faster clock speed, and slower InfiniBand performance - so they should be in the same ballpark.

HC44RS – 1 node / 20 cores: 3 hr 20 min (vs 1 hr 45 min on-prem)
HC44RS – 2 nodes / 60 cores: 2 hr 25 min (vs 1 hr 10 min on-prem)

As you can see, the Azure VMs are a factor of 2 slower than our on-prem machines - there must be something wrong here, as I was expecting only a 10-15% virtualisation penalty.

I wanted to try several different HPC node OSes to see if that, or the driver versions, would help - but the only one I can get working is CentOS 7 (the others give a DNS error - see my other post here about that).

I am running my CFD code across the expected number of processes and seeing 98% utilisation of the cores. I am using the Intel 2018.4 MPI that is included in the CentOS 7 HPC OS image.
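
For reference, this is roughly how I can double-check that the ranks land where I expect across the nodes (a minimal sketch, not my actual CFD code; the build/run commands in the comment are just the usual Intel MPI wrappers):

```c
/* placement_check.c - minimal sketch: print which node and logical CPU each
 * MPI rank ends up on.
 * Build (Intel MPI wrapper assumed): mpicc placement_check.c -o placement_check
 * Run across the nodes, e.g.:       mpirun -np 40 ./placement_check
 */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>   /* sched_getcpu() */
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    /* Each rank reports its node and the logical CPU it is currently on.
     * Ranks piling onto one node, or not being pinned, would show up here. */
    printf("rank %3d of %d on %s, cpu %d\n", rank, size, host, sched_getcpu());

    MPI_Finalize();
    return 0;
}
```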

Any suggestions what might be wrong, or what I can do to improve things?

Thanks

Gary

Azure CycleCloud

2 answers

  1. Ravi Kanth Koppala 3,231 Reputation points Microsoft Employee
    2022-12-18T06:59:35.48+00:00

    @Gary Mansell ,

    There are a few potential reasons why the Azure VMs may be performing slower than your on-premises machines:

    1. Virtualization overhead: Virtualization can introduce a performance penalty, but this is usually in the 5-15% range. A factor-of-2 slowdown is far more than virtualization alone should cost, so other factors are likely at play.
    2. Network performance: CFD solvers are very sensitive to interconnect latency and bandwidth. The HC-series VMs do expose 100 Gbps InfiniBand through SR-IOV, but MPI can silently fall back to TCP over the Ethernet NIC if the fabric is not picked up correctly, which would hurt badly at two or more nodes. It is worth confirming that your MPI traffic is actually going over InfiniBand (see the ping-pong sketch after this list).
    3. CPU and memory performance: The on-premises CPUs may still deliver more per-core or per-node throughput even though they are an older generation, and your single-node run already shows the same 2x gap. Running the same benchmark on one on-premises node and one Azure node would show whether per-node compute or memory bandwidth is a factor.
    4. Software and driver versions: It is also worth checking that you are using the same versions of the application, MPI and drivers on both the on-premises and Azure machines. Different versions can have different performance characteristics, so matching them helps ensure a fair comparison.
    5. Other factors: Other factors could be at play, such as differences in the workload, process pinning, or the overall system configuration. It may be helpful to run some additional tests and gather more detailed performance metrics to help identify the root cause of the performance difference.
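
    To rule the fabric in or out quickly, a simple two-rank ping-pong gives a rough latency/bandwidth number without installing anything extra. This is only a minimal sketch (the OSU micro-benchmarks are the more rigorous option), and the build/run commands assume Intel MPI with one rank per node. Very roughly, single-digit microsecond latencies suggest InfiniBand is in use, while tens of microseconds and ~1 GB/s look like a TCP/Ethernet fallback; Intel MPI's I_MPI_DEBUG output should also report which fabric was selected.

```c
/* pingpong.c - rough ping-pong between rank 0 and rank 1 to see whether the
 * inter-node link behaves like InfiniBand or like an Ethernet/TCP fallback.
 * A minimal sketch only, not a replacement for the OSU micro-benchmarks.
 * Build (wrapper assumed):        mpicc -O2 pingpong.c -o pingpong
 * Run, one rank per node (flags assumed): mpirun -np 2 -ppn 1 ./pingpong
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    for (int bytes = 8; bytes <= 4 * 1024 * 1024; bytes *= 8) {
        char *buf = malloc(bytes);
        memset(buf, 0, bytes);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;

        if (rank == 0) {
            double lat_us = dt / (2.0 * iters) * 1e6;       /* one-way latency estimate */
            double bw = (2.0 * iters * bytes) / dt / 1e9;   /* effective bandwidth, GB/s */
            printf("%8d bytes: %8.2f us one-way, %8.2f GB/s\n", bytes, lat_us, bw);
        }
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```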

  2. Gary Mansell 111 Reputation points
    2022-12-19T10:38:10.083+00:00

    Virtualisation - we agree, should be approx 5 to 15% degradation
    Networking - we are using 50 Gbps InfiniBand on-prem vs 100 Gbps (HC44rs) and 200 Gbps (HB120rs_v3) in Azure (so Azure should be much faster)
    CPU Perf - it should not be this: the on-prem clock speed is only ~10% faster and it is an older CPU generation, so ball-park we should be seeing a similar per-node speed vs Azure (see the bandwidth sketch below)
    Software and Driver versions - maybe, but the only image that seems to work is the CentOS 7 HPC image and this has the Intel 2018.4 MPI and drivers (which we also use on-prem).
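
    One check I can still do on the CPU/memory side is a per-node bandwidth comparison, since like most CFD codes ours is presumably sensitive to memory bandwidth. A rough STREAM-triad-style sketch (not the official STREAM benchmark; compiler and OpenMP flags assumed), run with the same thread count and pinning on one on-prem node and one HC44rs node, would show whether the hardware itself accounts for any of the gap:

```c
/* triad.c - STREAM-triad-style memory bandwidth sketch (not the official
 * STREAM benchmark). Build, flags assumed: gcc -O2 -fopenmp triad.c -o triad
 */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (64 * 1024 * 1024)   /* ~1.5 GB across three arrays, well beyond cache */
#define REPS 10

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    /* First-touch initialisation in parallel so pages land on the right NUMA node */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double best = 0.0;
    for (int r = 0; r < REPS; r++) {
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];
        double dt = omp_get_wtime() - t0;

        /* Triad touches three arrays of doubles per pass (two reads + one write) */
        double gbs = 3.0 * N * sizeof(double) / dt / 1e9;
        if (gbs > best) best = gbs;
    }

    printf("best triad bandwidth: %.1f GB/s with %d threads\n",
           best, omp_get_max_threads());
    free(a); free(b); free(c);
    return 0;
}
```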

    I still don't understand the 2x performance difference...
