Hello,
I have been experimenting for quite a while with simply spinning up two HPC VMs on Azure to get basic RDMA applications (libibverbs etc.) to work, but so far with little luck. I successfully spun up HBv3 instances with the Ubuntu HPC 20.04 image, but even basic tests such as ibv_rc_pingpong, ibv_ub_pingpong/... fail (I also tried the send and write variants). The connection negotiation over TCP works (address info shows up on both sides, as shown below), but data operations fail after the retries are exhausted:
Server:
$ ibv_rc_pingpong -d mlx5_ib0
local address: LID 0x03cc, QPN 0x000912, PSN 0xa44ebe, GID ::
remote address: LID 0x03ca, QPN 0x000912, PSN 0x5edfd9, GID ::
Client:
$ ibv_rc_pingpong -d mlx5_ib0 10.0.0.4
local address: LID 0x03ca, QPN 0x000912, PSN 0x5edfd9, GID ::
remote address: LID 0x03cc, QPN 0x000912, PSN 0xa44ebe, GID ::
Failed status transport retry counter exceeded (12) for wr_id 2
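For reference, the (12) in that message is the numeric work-completion status. As far as I can tell from the enum order of ibv_wc_status in verbs.h, it maps to IBV_WC_RETRY_EXC_ERR, i.e. the requester never received an ACK. A small lookup sketch (the table name WC_STATUS and the helper are my own, not part of libibverbs):

```python
# Names of enum ibv_wc_status in declaration order (index == numeric value).
# Transcribed by hand from verbs.h; treat as an assumption and check your header.
WC_STATUS = [
    "IBV_WC_SUCCESS",
    "IBV_WC_LOC_LEN_ERR",
    "IBV_WC_LOC_QP_OP_ERR",
    "IBV_WC_LOC_EEC_OP_ERR",
    "IBV_WC_LOC_PROT_ERR",
    "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR",
    "IBV_WC_BAD_RESP_ERR",
    "IBV_WC_LOC_ACCESS_ERR",
    "IBV_WC_REM_INV_REQ_ERR",
    "IBV_WC_REM_ACCESS_ERR",
    "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR",
    "IBV_WC_RNR_RETRY_EXC_ERR",
]


def wc_status_name(code):
    """Return the symbolic name for a numeric status, or a placeholder if unknown."""
    if 0 <= code < len(WC_STATUS):
        return WC_STATUS[code]
    return f"UNKNOWN({code})"
```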
Looking at the NIC counters, I also see that most or all packets don't make it to the other side (the picture is similar on both VMs):
$ tail /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/*
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/duplicate_request <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/implied_nak_seq_err <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/lifespan <==
12
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/local_ack_timeout_err <==
126
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/out_of_buffer <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/out_of_sequence <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/packet_seq_err <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/req_cqe_error <==
18
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/req_cqe_flush_error <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/req_remote_access_errors <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/req_remote_invalid_request <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/resp_cqe_error <==
130
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/resp_cqe_flush_error <==
128
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/resp_local_length_error <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/resp_remote_access_errors <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/rnr_nak_retry_err <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/roce_adp_retrans <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/roce_adp_retrans_to <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/roce_slow_restart <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/roce_slow_restart_cnps <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/roce_slow_restart_trans <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/rx_atomic_requests <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/rx_dct_connect <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/rx_read_requests <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/hw_counters/rx_write_requests <==
0
$ tail /sys/class/infiniband/mlx5_ib0/ports/1/counters/*
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/VL15_dropped <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/excessive_buffer_overrun_errors <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/link_downed <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/link_error_recovery <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/local_link_integrity_errors <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/multicast_rcv_packets <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/multicast_xmit_packets <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_rcv_constraint_errors <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_rcv_data <==
72
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_rcv_errors <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_rcv_packets <==
1
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_rcv_remote_physical_errors <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_rcv_switch_relay_errors <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_xmit_constraint_errors <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_xmit_data <==
42456
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_xmit_discards <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_xmit_packets <==
207
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/port_xmit_wait <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/symbol_error <==
0
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/unicast_rcv_packets <==
1
==> /sys/class/infiniband/mlx5_ib0/ports/1/counters/unicast_xmit_packets <==
207
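Since the raw dumps above are cumulative, it may be easier to look at what changes during a single test run. Below is a minimal sketch of how the counters could be snapshotted before and after a pingpong attempt and then diffed. The function names read_counters and diff_counters are my own, and it assumes every counter is a plain file containing one integer, as in the output above.

```python
from pathlib import Path


def read_counters(directory):
    """Return {counter_name: value} for every file in `directory` that holds an integer."""
    counters = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            counters[path.name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Skip entries that are unreadable or not a plain integer.
            continue
    return counters


def diff_counters(before, after):
    """Return {name: delta} for counters present in both snapshots whose value changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

Usage would be to call read_counters("/sys/class/infiniband/mlx5_ib0/ports/1/hw_counters") (and likewise for the counters directory) on each VM, run the test, call it again, and print diff_counters(before, after).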
And here is how I start the two instances (I have also tried having them in the same placement group, no difference):
az vm create --resource-group MyTestRG --name MyTestVM --image microsoft-dsvm:ubuntu-hpc:2004:20.04.2021051401 --size Standard_HB120rs_v3 --ssh-key-name MyKey --priority Spot --eviction-policy Delete --max-price 1.5 --count 2 --accelerated-networking
The network security group is just set up with the defaults.
Am I missing any obvious configuration when spinning up the VMs, or any initialization I need to do inside the VMs? Any pointers or suggestions for how to debug further would be welcome.
Thanks,
Antoine