How to do RDMA on Azure HC44 VMs?

Ada Liu 0 Reputation points
2023-05-15T18:42:33.07+00:00

Hi, I am trying to connect two Azure HC44-16rs VMs in the same scale set over RDMA. Both use the OS image "OpenLogic:CentOS-HPC:7_9-gen2:7.9.2022040101". However, when I run rping -s on the server and rping -c -a <ServerPublicIP> on the client, the server keeps waiting while the client prints:

cma event RDMA_CM_EVENT_ADDR_ERROR, error -19 
waiting for addr/route resolution state 1
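
For reference, the status carried by an rdma_cm event is normally a negative errno value, so the -19 above would correspond to ENODEV on Linux. A minimal sketch that decodes such a status, assuming Linux errno numbering (`decode_cm_status` is a hypothetical helper name, not part of librdmacm):

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper: map a negative-errno status (as reported in an
 * rdma_cm event) to the system's error message for that errno. */
const char *decode_cm_status(int status)
{
    assert(status <= 0);          /* CM statuses are 0 or negative errno */
    return strerror(-status);     /* -19 -> message for ENODEV */
}
```

Calling `decode_cm_status(-19)` returns the same string as `strerror(ENODEV)`.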

The ibv_devinfo output looks good to me on both sides:

hca_id: mlx5_ib0
        transport:                      InfiniBand (0)
        fw_ver:                         xx.xx.xxxx
        node_guid:                      xxxx:xxxx:xxxx:xxxx
        sys_image_guid:                 xxxx:xxxx:xxxx:xxxx
        ...
        vendor_part_id:                 4120
        hw_ver:                         0x0
        board_id:                       MT_0000000010
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 2
                        port_lid:               790
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_an0
        transport:                      InfiniBand (0)
        fw_ver:                         xx.xx.xxxx
        node_guid:                      xxxx:xxxx:xxxx:xxxx
        sys_image_guid:                 0000:0000:0000:0000
        ...
        vendor_part_id:                 4118
        hw_ver:                         0x80
        board_id:                       MSF0010110035
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

I can ping each VM from the other through its public IP, and I have enabled IPoIB by running the following commands, as suggested here, on both VMs:

sudo sed -i -e 's/# OS.EnableRDMA=n/OS.EnableRDMA=y/g' /etc/waagent.conf
sudo systemctl restart waagent

Also, in my C code using libibverbs, when I call rdma_connect on cm_client_id (of type struct rdma_cm_id), I get an RDMA_CM_EVENT_ADDR_ERROR event. It only appears after rdma_connect is called; both address resolution and route resolution succeed before that. The same code works on another setup I have (server and client connected by Soft-RoCE, with addresses specified as IP addresses). With an InfiniBand connection, do I need to specify the addresses in another way (e.g., use a UID instead of an IP address)? Or should I have configured the Azure VMs' network in some special way (I have allowed all inbound and outbound traffic under my scale set's "Networking" tab)? Or could there be other problems?

The pseudo-code is like this:

// set server_addr with the server's public ip address
ret = rdma_resolve_addr(cm_client_id, NULL, (struct sockaddr*) &server_addr, 2000);
// get RDMA_CM_EVENT_ADDR_RESOLVED event successfully
rdma_resolve_route(cm_client_id, ...);
// get RDMA_CM_EVENT_ROUTE_RESOLVED event successfully
// allocate the protection domain, create the completion channel and completion queue,
// request notification on the completion queue, and create the queue pair; all succeed
pd = ibv_alloc_pd(cm_client_id->verbs);
comp_channel = ibv_create_comp_channel(cm_client_id->verbs);
comp_queue = ibv_create_cq(cm_client_id->verbs, ...);
ibv_req_notify_cq(comp_queue, ...);
rdma_create_qp(cm_client_id, ...);
// set up struct rdma_conn_param conn_param;
ret = rdma_connect(cm_client_id, &conn_param);
// expected RDMA_CM_EVENT_ESTABLISHED, but the event received is RDMA_CM_EVENT_ADDR_ERROR
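
The "set server_addr" step at the top of the pseudo-code could look roughly like the sketch below. It assumes an IPv4 address; `fill_server_addr` is a hypothetical helper name, and the address and port used with it are placeholders rather than values from my setup:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical helper: fill an IPv4 sockaddr_in that can be passed to
 * rdma_resolve_addr(). Returns 0 on success, -1 if ip is not a valid
 * dotted-quad IPv4 address. */
int fill_server_addr(struct sockaddr_in *out, const char *ip,
                     unsigned short port)
{
    assert(out != NULL && ip != NULL);
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);   /* port in network byte order */
    /* inet_pton returns 1 on success and 0 for a malformed address */
    return inet_pton(AF_INET, ip, &out->sin_addr) == 1 ? 0 : -1;
}
```

The filled structure is then cast to `struct sockaddr *` and passed as the destination address to rdma_resolve_addr, as in the pseudo-code.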

Thanks!

Azure Virtual Machines