Azure connectivity problems with InfiniBand/RDMA HBv3 VMs

Antoine Kaufmann 96 Reputation points
2021-08-13T15:11:31.233+00:00

Hello,

I have been experimenting for quite a while with spinning up two HPC VMs on Azure to get basic RDMA applications (libibverbs etc.) to work, but so far with little luck. I successfully spun up HBv3 instances with the Ubuntu HPC 20.04 image, but even basic tests such as ibv_rc_pingpong, ibv_ub_pingpong/... fail (I also tried the send and write variants). The connection negotiation over TCP works (address info shows up on both sides, as shown below), but data operations fail once the retries are exhausted:

Server:
$ ibv_rc_pingpong -d mlx5_ib0
local address: LID 0x03cc, QPN 0x000912, PSN 0xa44ebe, GID ::
remote address: LID 0x03ca, QPN 0x000912, PSN 0x5edfd9, GID ::

Client:
$ ibv_rc_pingpong -d mlx5_ib0 10.0.0.4
local address: LID 0x03ca, QPN 0x000912, PSN 0x5edfd9, GID ::
remote address: LID 0x03cc, QPN 0x000912, PSN 0xa44ebe, GID ::
Failed status transport retry counter exceeded (12) for wr_id 2

Looking at the NIC counters, I also see that most or all packets don't make it to the other side (the picture is similar on both VMs):

$ tail /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/*
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/duplicate_request <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/implied_nak_seq_err <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/lifespan <==
12

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/local_ack_timeout_err <==
126

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/out_of_buffer <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/out_of_sequence <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/packet_seq_err <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/req_cqe_error <==
18

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/req_cqe_flush_error <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/req_remote_access_errors <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/req_remote_invalid_request <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/resp_cqe_error <==
130

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/resp_cqe_flush_error <==
128

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/resp_local_length_error <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/resp_remote_access_errors <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/rnr_nak_retry_err <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/roce_adp_retrans <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/roce_adp_retrans_to <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/roce_slow_restart <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/roce_slow_restart_cnps <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/roce_slow_restart_trans <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/rx_atomic_requests <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/rx_dct_connect <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/rx_read_requests <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/rx_write_requests <==
0

$ tail /sys/class/infiniband/mlx5_ib0/ports/1/counters/*
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/VL15_dropped <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/excessive_buffer_overrun_errors <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/link_downed <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/link_error_recovery <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/local_link_integrity_errors <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/multicast_rcv_packets <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/multicast_xmit_packets <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_rcv_constraint_errors <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_rcv_data <==
72

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_rcv_errors <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_rcv_packets <==
1

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_rcv_remote_physical_errors <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_rcv_switch_relay_errors <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_xmit_constraint_errors <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_xmit_data <==
42456

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_xmit_discards <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_xmit_packets <==
207

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_xmit_wait <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/symbol_error <==
0

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/unicast_rcv_packets <==
1

==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/unicast_xmit_packets <==
207
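
Most of the counters in a dump like this are zero, so it can help to filter it down to the entries that are not. The following is a minimal sketch, assuming the same device name (mlx5_ib0) and port (1) as in the output above; the helper name nonzero_counters is made up for illustration and is not part of any InfiniBand tooling.

```shell
# Hypothetical helper (not from the original post): print "name value"
# for every counter file in a directory whose value is not 0.
nonzero_counters() {
    dir="$1"
    for f in "$dir"/*; do
        # Skip anything that is not a regular file (also covers an unmatched glob).
        [ -f "$f" ] || continue
        val=$(cat "$f" 2>/dev/null) || continue
        case "$val" in
            ''|0) ;;   # empty or zero: not interesting
            *) printf '%s %s\n' "$(basename "$f")" "$val" ;;
        esac
    done
}

# Device and port are taken from the output above; adjust them if yours differ.
port=/sys/class/infiniband/mlx5_ib0/ports/1
nonzero_counters "$port/hw_counters"
nonzero_counters "$port/counters"
```

Running this on both VMs before and after a failed ibv_rc_pingpong run makes it easier to see which counters actually moved, for example the gap between transmitted and received packets.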

And here is how I start the two instances (I have also tried having them in the same placement group, no difference):
az vm create --resource-group MyTestRG --name MyTestVM --image microsoft-dsvm:ubuntu-hpc:2004:20.04.2021051401 --size Standard_HB120rs_v3 --ssh-key-name MyKey --priority Spot --eviction-policy Delete --max-price 1.5 --count 2 --accelerated-networking

The network security group is just set up with the defaults.

Am I missing any obvious configuration step when spinning up the VMs, or any initialization I need to do inside the VMs? Any pointers or suggestions for how to debug further would be welcome.

Thanks,
Antoine

Accepted answer
  1. Antoine Kaufmann 96 Reputation points
    2021-08-16T07:03:21.287+00:00

    Answering my own question for posterity: it turns out the problem was that I did not assign the VMs to the same scale set or availability set. I misread the documentation and assumed that a proximity placement group by itself would be sufficient.

    Once I created the VMs as part of a scale set, everything worked as expected.
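
The answer does not show the command that was used, so the following is only a sketch of what creating the two VMs as one scale set might look like with the Azure CLI. The resource group, image, size, and Spot settings mirror the az vm create command from the question. The scale set name MyTestVMSS, the use of --generate-ssh-keys instead of a named key, and the --orchestration-mode Uniform and --single-placement-group true settings are assumptions added here, not something the answer confirms.

```shell
# Sketch only: create both VMs as instances of a single scale set.
# MyTestVMSS is a placeholder name; most other values mirror the question.
create_hpc_scale_set() {
    # AZ can be overridden for a dry run: AZ=echo prints the arguments
    # instead of calling the Azure CLI.
    "${AZ:-az}" vmss create \
        --resource-group MyTestRG \
        --name MyTestVMSS \
        --image microsoft-dsvm:ubuntu-hpc:2004:20.04.2021051401 \
        --vm-sku Standard_HB120rs_v3 \
        --instance-count 2 \
        --orchestration-mode Uniform \
        --single-placement-group true \
        --priority Spot --eviction-policy Delete --max-price 1.5 \
        --generate-ssh-keys
}
```

Calling create_hpc_scale_set runs the command; AZ=echo create_hpc_scale_set prints it without touching any Azure resources, which is a cheap way to review the flags first.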
