How to do RDMA on Azure HC44 VMs?
Hi, I am trying to connect two Azure HC44-16rs VMs in the same scale set through RDMA. The OS image is "OpenLogic:CentOS-HPC:7_9-gen2:7.9.2022040101". However, when I run
rping -s on the server side and
rping -c -a <ServerPublicIP> on the client side, the server side just hangs while the client side prints:
cma event RDMA_CM_EVENT_ADDR_ERROR, error -19 waiting for addr/route resolution state 1
ibv_devinfo's output looks good to me on both sides, like this:
hca_id: mlx5_ib0
    transport:          InfiniBand (0)
    fw_ver:             xx.xx.xxxx
    node_guid:          xxxx:xxxx:xxxx:xxxx
    sys_image_guid:     xxxx:xxxx:xxxx:xxxx
    ...
    vendor_part_id:     4120
    hw_ver:             0x0
    board_id:           MT_0000000010
    phys_port_cnt:      1
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         2
            port_lid:       790
            port_lmc:       0x00
            link_layer:     InfiniBand

hca_id: mlx5_an0
    transport:          InfiniBand (0)
    fw_ver:             xx.xx.xxxx
    node_guid:          xxxx:xxxx:xxxx:xxxx
    sys_image_guid:     0000:0000:0000:0000
    ...
    vendor_part_id:     4118
    hw_ver:             0x80
    board_id:           MSF0010110035
    phys_port_cnt:      1
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     1024 (3)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet
I can ping one VM from the other through the public IP, and I have enabled IPoIB by running the following commands, as suggested here, on both of my VMs:
sudo sed -i -e 's/# OS.EnableRDMA=n/OS.EnableRDMA=y/g' /etc/waagent.conf
sudo systemctl restart waagent
Also, in my C code using librdmacm and libibverbs, the cm_client_id of type struct rdma_cm_id only gives me an RDMA_CM_EVENT_ADDR_ERROR event after rdma_connect is called, while both address and route resolution succeed. The code works fine with another setup I have (server and client connected by Soft-RoCE, addresses specified as IP addresses). Do I need to specify the addresses in another way (e.g., use a GID instead of IP addresses) for an InfiniBand connection, should I have configured the Azure VMs' networking in some special way (I have allowed all inbound and outbound traffic under my scale set's "Networking" tab), or could there be some other problem?
The pseudo-code is like this:
// set server_addr with the server's public ip address
ret = rdma_resolve_addr(cm_client_id, NULL, (struct sockaddr *) &server_addr, 2000);
// get RDMA_CM_EVENT_ADDR_RESOLVED event successfully
rdma_resolve_route(cm_client_id, ...);
// get RDMA_CM_EVENT_ROUTE_RESOLVED event successfully

// allocate protection domain, create completion channel & completion queue,
// request notification on the completion queue and create queue pairs, all succeed
pd = ibv_alloc_pd(cm_client_id->verbs);
comp_channel = ibv_create_comp_channel(cm_client_id->verbs);
comp_queue = ibv_create_cq(cm_client_id->verbs, ...);
ibv_req_notify_cq(comp_queue, ...);
rdma_create_qp(cm_client_id, ...);

// set up struct rdma_conn_param conn_param;
ret = rdma_connect(cm_client_id, &conn_param);
// should get RDMA_CM_EVENT_ESTABLISHED but get RDMA_CM_EVENT_ADDR_ERROR event
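For anyone who wants to reproduce it, a minimal compilable client for just the resolve step would look something like this (the server IP is passed on the command line; the port number is an arbitrary placeholder):

/* build: gcc -o resolve_test resolve_test.c -lrdmacm -libverbs */
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

int main(int argc, char **argv)
{
    struct sockaddr_in server_addr;
    struct rdma_event_channel *ec;
    struct rdma_cm_id *cm_client_id;
    struct rdma_cm_event *event;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <server-ip>\n", argv[0]);
        return 1;
    }

    memset(&server_addr, 0, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_port = htons(20886);   /* arbitrary test port */
    if (inet_pton(AF_INET, argv[1], &server_addr.sin_addr) != 1) {
        fprintf(stderr, "bad address\n");
        return 1;
    }

    ec = rdma_create_event_channel();
    if (!ec || rdma_create_id(ec, &cm_client_id, NULL, RDMA_PS_TCP)) {
        perror("rdma_create_id");
        return 1;
    }

    if (rdma_resolve_addr(cm_client_id, NULL,
                          (struct sockaddr *)&server_addr, 2000)) {
        perror("rdma_resolve_addr");
        return 1;
    }

    /* block until the CM reports either ADDR_RESOLVED or ADDR_ERROR */
    if (rdma_get_cm_event(ec, &event))
        return 1;
    printf("event: %s, status: %d\n",
           rdma_event_str(event->event), event->status);
    rdma_ack_cm_event(event);
    return 0;
}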
Setting up RDMA (Remote Direct Memory Access) on Azure HC44 VMs requires specific configurations and considerations. Here are a few points to check and potential solutions:
Enable RDMA on Azure VMs: Make sure that RDMA is enabled on your Azure VMs. You mentioned that you modified the waagent.conf file and restarted the waagent service, which is the correct step. However, ensure that the changes are taking effect by verifying the /var/lib/waagent/ovf-env.xml file and confirming that OS.EnableRDMA is set to y in /etc/waagent.conf.
Network Configuration: Azure HC-series VMs do RDMA over a dedicated Mellanox InfiniBand fabric (the mlx5_ib0 device with link_layer: InfiniBand in your output), not over the regular Ethernet NIC. Ensure that you have created an Azure Virtual Network (VNet) and associated subnets for your VMs, that the VMs are in the same subnet, and that they can communicate with each other. Note that only the IPoIB interface (ib0) sits on the InfiniBand fabric, and it carries a private address; the public IP is bound to the Ethernet NIC (mlx5_an0), which is not usable for RDMA, so resolving the public IP with rdma_resolve_addr is most likely what produces RDMA_CM_EVENT_ADDR_ERROR (error -19, i.e. ENODEV).
Network Security Group (NSG) Rules: Check the Network Security Group rules for your Azure VMs to ensure that they allow the necessary inbound and outbound traffic for RDMA communication. Ensure that the required ports are open for RDMA, both TCP and UDP. Refer to the official Azure documentation for the specific ports and protocols needed for RDMA.
InfiniBand Driver: Azure HC-series VMs use Mellanox InfiniBand network adapters. Confirm that the InfiniBand driver stack (Mellanox OFED, preinstalled on the CentOS-HPC images) is installed and functioning correctly on your VMs. You can check the driver version with a command such as modinfo mlx5_core (or ofed_info -s if Mellanox OFED is installed).
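For example, a small standalone verbs program (a sketch, not Azure-specific) can list the devices the driver exposes and the link layer of each port; on an HC-series VM you would expect mlx5_ib0 to report InfiniBand and mlx5_an0 to report Ethernet:

/* build: gcc -o list_rdma_devs list_rdma_devs.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num, i;
    struct ibv_device **devs = ibv_get_device_list(&num);

    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        struct ibv_port_attr port;

        if (!ctx)
            continue;
        if (ibv_query_port(ctx, 1, &port) == 0)
            printf("%s: state=%d link_layer=%s\n",
                   ibv_get_device_name(devs[i]), port.state,
                   port.link_layer == IBV_LINK_LAYER_INFINIBAND ?
                       "InfiniBand" : "Ethernet/other");
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}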
Endpoint Discovery: If you build the connection with plain libibverbs (no RDMA CM), the peers exchange a GID (Global Identifier) and QPN (Queue Pair Number) out of band instead of IP addresses; the GID uniquely identifies the port, and the QPN identifies the queue pair. With librdmacm, as in your code, you keep using IP addresses, but the address passed to rdma_resolve_addr must belong to an RDMA-capable interface; on these VMs that is the IPoIB interface ib0 (check its private address with ip addr show ib0), not the public IP. The resulting connection details are delivered in the struct rdma_cm_event returned by rdma_get_cm_event after each step.
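As a minimal sketch of that last point, assuming hypothetical IPoIB addresses 172.16.1.11 (client) and 172.16.1.12 (server) on ib0 and an arbitrary port, the resolve call over the InfiniBand interface could look like this:

/* Sketch only: resolve via the IPoIB interface instead of the public IP.
 * The 172.16.1.x addresses and the port are hypothetical; substitute the
 * private addresses shown by `ip addr show ib0` on each VM. */
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

static int resolve_over_ib(struct rdma_cm_id *cm_client_id)
{
    struct sockaddr_in src, dst;

    memset(&src, 0, sizeof(src));
    memset(&dst, 0, sizeof(dst));
    src.sin_family = AF_INET;
    dst.sin_family = AF_INET;
    inet_pton(AF_INET, "172.16.1.11", &src.sin_addr);   /* local ib0 address  */
    inet_pton(AF_INET, "172.16.1.12", &dst.sin_addr);   /* server ib0 address */
    dst.sin_port = htons(20886);                        /* arbitrary port     */

    /* Passing an explicit source address makes the CM bind to the interface
     * that owns it, i.e. the RDMA-capable IPoIB interface. */
    return rdma_resolve_addr(cm_client_id,
                             (struct sockaddr *)&src,
                             (struct sockaddr *)&dst, 2000);
}

The server side would similarly call rdma_bind_addr on its own ib0 address before rdma_listen, so that both ends of the connection live on the InfiniBand fabric.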
Azure Networking: Confirm that your Azure subscription and Azure region support RDMA. Some regions may not have RDMA capabilities available. Review the Azure documentation and consult with Azure support if needed.
If the issue persists after checking these points, it's recommended to reach out to Microsoft Azure support for further assistance. They can provide more specific guidance and help troubleshoot the RDMA connectivity problem in your Azure environment.