v4.1.5 UCX_NET_DEVICES not selecting TCP devices correctly #12785

@bertiethorpe

Description

Details of the problem

  • OS version (e.g. Linux distro)
    • Rocky Linux release 9.4 (Blue Onyx)
  • Driver version:
    • rdma-core-2404mlnx51-1.2404066.x86_64
    • MLNX_OFED_LINUX-24.04-0.6.6.0

When RoCE is available, setting UCX_NET_DEVICES to target only TCP devices seems to be ignored in favour of some fallback.

I'm running a 2-node IMB-MPI1 PingPong to benchmark RoCE against regular TCP Ethernet.

  • Setting UCX_NET_DEVICES=all or UCX_NET_DEVICES=mlx5_0:1 gives optimal performance and uses RDMA as expected.
  • Setting UCX_NET_DEVICES=eth0, eth1, or anything else still appears to use RoCE, at only slightly higher latency.
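
For comparison, here is a minimal sketch of the two runs I'm comparing (device names are the ones on this system; UCX_TLS is the UCX variable that restricts the transport list, which I'd expect to rule out the RDMA transports entirely):

# RoCE path - behaves as expected:
UCX_NET_DEVICES=mlx5_0:1 mpirun IMB-MPI1 pingpong

# Intended TCP path - the setting this issue is about; UCX_TLS=tcp should
# additionally exclude the RDMA transports:
UCX_NET_DEVICES=eth0 UCX_TLS=tcp mpirun IMB-MPI1 pingpong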

HW information from the ibstat or ibv_devinfo -vv command:

        hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         20.36.1010
        node_guid:                      fa16:3eff:fe4f:f5e9
        sys_image_guid:                 0c42:a103:0003:5d82
        vendor_id:                      0x02c9
        vendor_part_id:                 4124
        hw_ver:                         0x0
        board_id:                       MT_0000000224
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

How ompi is configured, from ompi_info | grep Configure:

 Configured architecture: x86_64-pc-linux-gnu
 Configured by: abuild
 Configured on: Thu Aug  3 14:25:15 UTC 2023
 Configure command line: '--prefix=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5'
                                             '--disable-static' '--enable-builtin-atomics'
                                             '--with-sge' '--enable-mpi-cxx'
                                             '--with-hwloc=/opt/ohpc/pub/libs/hwloc'
                                             '--with-libfabric=/opt/ohpc/pub/mpi/libfabric/1.18.0'
                                             '--with-ucx=/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0'
                                             '--without-verbs' '--with-tm=/opt/pbs/'

Following the advice here, this is apparently due to the higher priority of Open MPI's btl/openib component, but I don't think that can be the case: this build was configured with --without-verbs, and openib does not appear when searching ompi_info | grep btl.

As suggested in the UCX issue, adding -mca pml_ucx_tls any -mca pml_ucx_devices any to my mpirun command fixed the problem, but I was wondering what in the MCA precisely causes this behaviour.
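
One way to check where those two parameters come from is to dump their defaults and descriptions (assuming ompi_info's --param/--level options, which should list every pml/ucx variable at level 9):

ompi_info --param pml ucx --level 9 | grep -E 'pml_ucx_(tls|devices)'

If I read their descriptions correctly, they gate whether pml/ucx considers itself usable, which would explain why overriding them to "any" changes the device selection.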

Here's my batch script:

#!/usr/bin/env bash

#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.out
#SBATCH --exclusive
#SBATCH --partition=standard

module load gnu12 openmpi4 imb

# Device under test: mlx5_0:1 for RoCE, eth0 for plain TCP
export UCX_NET_DEVICES=mlx5_0:1

echo SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST
echo SLURM_JOB_ID: $SLURM_JOB_ID
echo UCX_NET_DEVICES: $UCX_NET_DEVICES

export UCX_LOG_LEVEL=data
# Workaround from the UCX issue: let pml/ucx accept any transport/device
mpirun -mca pml_ucx_tls any -mca pml_ucx_devices any IMB-MPI1 pingpong -iter_policy off
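
With UCX_LOG_LEVEL=data the output is verbose enough that the selected transports should appear by name in the log (an assumption on my part about the log contents), so a quick post-run check of the output file (%x.%j.out, i.e. <jobname>.<jobid>.out) is:

grep -oE 'rc_mlx5|dc_mlx5|ud_mlx5|tcp' <jobname>.<jobid>.out | sort | uniq -c

A predominance of rc_mlx5/dc_mlx5 hits would confirm that RoCE is still being used despite UCX_NET_DEVICES.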
