How to do RDMA on Azure HC44 VMs?
Hi, I am trying to connect two Azure HC44-16rs VMs in the same scale set through RDMA. The OS image is "OpenLogic:CentOS-HPC:7_9-gen2:7.9.2022040101". However, when I run
rping -s on the server side and
rping -c -a <ServerPublicIP> on the client side, the server side just hangs while the client side prints:
cma event RDMA_CM_EVENT_ADDR_ERROR, error -19 waiting for addr/route resolution state 1
ibv_devinfo's output looks good to me on both sides, like this:
hca_id: mlx5_ib0
    transport:          InfiniBand (0)
    fw_ver:             xx.xx.xxxx
    node_guid:          xxxx:xxxx:xxxx:xxxx
    sys_image_guid:     xxxx:xxxx:xxxx:xxxx
    ...
    vendor_part_id:     4120
    hw_ver:             0x0
    board_id:           MT_0000000010
    phys_port_cnt:      1
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         2
            port_lid:       790
            port_lmc:       0x00
            link_layer:     InfiniBand

hca_id: mlx5_an0
    transport:          InfiniBand (0)
    fw_ver:             xx.xx.xxxx
    node_guid:          xxxx:xxxx:xxxx:xxxx
    sys_image_guid:     0000:0000:0000:0000
    ...
    vendor_part_id:     4118
    hw_ver:             0x80
    board_id:           MSF0010110035
    phys_port_cnt:      1
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     1024 (3)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet
I can ping one VM from the other through the public IP, and I have enabled IPoIB by running the following commands, as suggested here, on both of my VMs:
sudo sed -i -e 's/# OS.EnableRDMA=n/OS.EnableRDMA=y/g' /etc/waagent.conf
sudo systemctl restart waagent
Also, in my C code using librdmacm and libibverbs, the cm_client_id of type struct rdma_cm_id only gives me an RDMA_CM_EVENT_ADDR_ERROR event after rdma_connect is called, while both address and route resolution succeed. The code works fine with another setup I have (server and client connected by Soft-RoCE, addresses specified as IP addresses). Do I need to specify the addresses in another way (e.g., use a GID instead of IP addresses) for an InfiniBand connection, should I have configured the Azure VMs' networking in some special way (I have allowed all inbound and outbound traffic under my scale set's "Networking" tab), or could there be some other problem?
The pseudo-code is like this:
// set server_addr with the server's public ip address
ret = rdma_resolve_addr(cm_client_id, NULL, (struct sockaddr *) &server_addr, 2000);
// get RDMA_CM_EVENT_ADDR_RESOLVED event successfully
rdma_resolve_route(cm_client_id, ...);
// get RDMA_CM_EVENT_ROUTE_RESOLVED event successfully

// allocate protection domain, create completion channel & completion queue,
// request notification on the completion queue and create queue pairs, all succeed
pd = ibv_alloc_pd(cm_client_id->verbs);
comp_channel = ibv_create_comp_channel(cm_client_id->verbs);
comp_queue = ibv_create_cq(cm_client_id->verbs, ...);
ibv_req_notify_cq(comp_queue, ...);
rdma_create_qp(cm_client_id, ...);

// set up struct rdma_conn_param conn_param;
ret = rdma_connect(cm_client_id, &conn_param);
// should get RDMA_CM_EVENT_ESTABLISHED but get RDMA_CM_EVENT_ADDR_ERROR event
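For anyone who wants to reproduce it, a minimal compilable client for just the resolve step would look something like this (the server IP is passed on the command line; the port number is an arbitrary placeholder):

/* build: gcc -o resolve_test resolve_test.c -lrdmacm -libverbs */
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

int main(int argc, char **argv)
{
    struct sockaddr_in server_addr;
    struct rdma_event_channel *ec;
    struct rdma_cm_id *cm_client_id;
    struct rdma_cm_event *event;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <server-ip>\n", argv[0]);
        return 1;
    }

    memset(&server_addr, 0, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_port = htons(20886);   /* arbitrary test port */
    if (inet_pton(AF_INET, argv[1], &server_addr.sin_addr) != 1) {
        fprintf(stderr, "bad address\n");
        return 1;
    }

    ec = rdma_create_event_channel();
    if (!ec || rdma_create_id(ec, &cm_client_id, NULL, RDMA_PS_TCP)) {
        perror("rdma_create_id");
        return 1;
    }

    if (rdma_resolve_addr(cm_client_id, NULL,
                          (struct sockaddr *)&server_addr, 2000)) {
        perror("rdma_resolve_addr");
        return 1;
    }

    /* block until the CM reports either ADDR_RESOLVED or ADDR_ERROR */
    if (rdma_get_cm_event(ec, &event))
        return 1;
    printf("event: %s, status: %d\n",
           rdma_event_str(event->event), event->status);
    rdma_ack_cm_event(event);
    return 0;
}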
Setting up RDMA (Remote Direct Memory Access) on Azure HC44 VMs requires specific configurations and considerations. Here are a few points to check and potential solutions:
Enable RDMA on Azure VMs: Make sure that RDMA is enabled on your Azure VMs. You mentioned that you modified the waagent.conf file and restarted the waagent service, which is the correct step. However, ensure that the changes are taking effect by verifying the /var/lib/waagent/ovf-env.xml file and confirming that OS.EnableRDMA is set to y in /etc/waagent.conf.
Network Configuration: Azure HC-series VMs do RDMA over a dedicated Mellanox InfiniBand fabric (the mlx5_ib0 device with link_layer: InfiniBand in your output), not over the regular Ethernet NIC. Ensure that you have created an Azure Virtual Network (VNet) and associated subnets for your VMs, that the VMs are in the same subnet, and that they can communicate with each other. Note that only the IPoIB interface (ib0) sits on the InfiniBand fabric, and it carries a private address; the public IP is bound to the Ethernet NIC (mlx5_an0), which is not usable for RDMA, so resolving the public IP with rdma_resolve_addr is most likely what produces RDMA_CM_EVENT_ADDR_ERROR (error -19, i.e. ENODEV).
Network Security Group (NSG) Rules: Check the Network Security Group rules for your Azure VMs to ensure that they allow the necessary inbound and outbound traffic for RDMA communication. Ensure that the required ports are open for RDMA, both TCP and UDP. Refer to the official Azure documentation for the specific ports and protocols needed for RDMA.
InfiniBand Driver: Azure HC-series VMs use Mellanox InfiniBand network adapters. Confirm that the InfiniBand driver stack (Mellanox OFED, preinstalled on the CentOS-HPC images) is installed and functioning correctly on your VMs. You can check the driver version with a command such as modinfo mlx5_core (or ofed_info -s if Mellanox OFED is installed).
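For example, a small standalone verbs program (a sketch, not Azure-specific) can list the devices the driver exposes and the link layer of each port; on an HC-series VM you would expect mlx5_ib0 to report InfiniBand and mlx5_an0 to report Ethernet:

/* build: gcc -o list_rdma_devs list_rdma_devs.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num, i;
    struct ibv_device **devs = ibv_get_device_list(&num);

    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        struct ibv_port_attr port;

        if (!ctx)
            continue;
        if (ibv_query_port(ctx, 1, &port) == 0)
            printf("%s: state=%d link_layer=%s\n",
                   ibv_get_device_name(devs[i]), port.state,
                   port.link_layer == IBV_LINK_LAYER_INFINIBAND ?
                       "InfiniBand" : "Ethernet/other");
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}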
Endpoint Discovery: If you build the connection with plain libibverbs (no RDMA CM), the peers exchange a GID (Global Identifier) and QPN (Queue Pair Number) out of band instead of IP addresses; the GID uniquely identifies the port, and the QPN identifies the queue pair. With librdmacm, as in your code, you keep using IP addresses, but the address passed to rdma_resolve_addr must belong to an RDMA-capable interface; on these VMs that is the IPoIB interface ib0 (check its private address with ip addr show ib0), not the public IP. The resulting connection details are delivered in the struct rdma_cm_event returned by rdma_get_cm_event after each step.
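As a minimal sketch of that last point, assuming hypothetical IPoIB addresses 172.16.1.11 (client) and 172.16.1.12 (server) on ib0 and an arbitrary port, the resolve call over the InfiniBand interface could look like this:

/* Sketch only: resolve via the IPoIB interface instead of the public IP.
 * The 172.16.1.x addresses and the port are hypothetical; substitute the
 * private addresses shown by `ip addr show ib0` on each VM. */
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

static int resolve_over_ib(struct rdma_cm_id *cm_client_id)
{
    struct sockaddr_in src, dst;

    memset(&src, 0, sizeof(src));
    memset(&dst, 0, sizeof(dst));
    src.sin_family = AF_INET;
    dst.sin_family = AF_INET;
    inet_pton(AF_INET, "172.16.1.11", &src.sin_addr);   /* local ib0 address  */
    inet_pton(AF_INET, "172.16.1.12", &dst.sin_addr);   /* server ib0 address */
    dst.sin_port = htons(20886);                        /* arbitrary port     */

    /* Passing an explicit source address makes the CM bind to the interface
     * that owns it, i.e. the RDMA-capable IPoIB interface. */
    return rdma_resolve_addr(cm_client_id,
                             (struct sockaddr *)&src,
                             (struct sockaddr *)&dst, 2000);
}

The server side would similarly call rdma_bind_addr on its own ib0 address before rdma_listen, so that both ends of the connection live on the InfiniBand fabric.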
Azure Networking: Confirm that your Azure subscription and Azure region support RDMA. Some regions may not have RDMA capabilities available. Review the Azure documentation and consult with Azure support if needed.
If the issue persists after checking these points, it's recommended to reach out to Microsoft Azure support for further assistance. They can provide more specific guidance and help troubleshoot the RDMA connectivity problem in your Azure environment.