Details of the problem
- OS version (e.g. Linux distro): Rocky Linux release 9.4 (Blue Onyx)
- Driver version: rdma-core-2404mlnx51-1.2404066.x86_64, MLNX_OFED_LINUX-24.04-0.6.6.0
When RoCE is available, setting UCX_NET_DEVICES to target only TCP devices seems to be ignored in favour of some fallback.
I'm running a 2-node IMB-MPI1 PingPong to benchmark RoCE against regular TCP Ethernet.
Setting UCX_NET_DEVICES=all or mlx5_0:1 gives the optimal performance and uses RDMA as expected.
Setting UCX_NET_DEVICES=eth0, eth1, or anything else still appears to use RoCE, with only slightly higher latency.
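To see what UCX itself detects on a node (independent of OpenMPI), ucx_info can list the available transport/device pairs; a minimal sketch, assuming the ucx_info binary from the same UCX 1.14.0 build is on the PATH:
# List every transport/device pair UCX can use on this node,
# e.g. rc_verbs on mlx5_0:1 for RoCE and tcp on eth0 for plain TCP
ucx_info -d | grep -E 'Transport|Device'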
HW information from ibstat or ibv_devinfo -vv command:
hca_id: mlx5_0
    transport:          InfiniBand (0)
    fw_ver:             20.36.1010
    node_guid:          fa16:3eff:fe4f:f5e9
    sys_image_guid:     0c42:a103:0003:5d82
    vendor_id:          0x02c9
    vendor_part_id:     4124
    hw_ver:             0x0
    board_id:           MT_0000000224
    phys_port_cnt:      1
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     1024 (3)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet
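Since link_layer is Ethernet (i.e. RoCE), it can also help to map the RDMA device to its Ethernet netdev so the UCX_NET_DEVICES values can be compared directly. ibdev2netdev ships with MLNX_OFED; the eth0 mapping shown in the comment is only a hypothetical example:
# Map RDMA devices to their Ethernet interfaces
ibdev2netdev
# hypothetical output: mlx5_0 port 1 ==> eth0 (Up)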
How ompi is configured, from ompi_info | grep Configure:
Configured architecture: x86_64-pc-linux-gnu
Configured by: abuild
Configured on: Thu Aug 3 14:25:15 UTC 2023
Configure command line: '--prefix=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5'
'--disable-static' '--enable-builtin-atomics'
'--with-sge' '--enable-mpi-cxx'
'--with-hwloc=/opt/ohpc/pub/libs/hwloc'
'--with-libfabric=/opt/ohpc/pub/mpi/libfabric/1.18.0'
'--with-ucx=/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0'
'--without-verbs' '--with-tm=/opt/pbs/'
Following the advice from here, this is apparently due to a higher priority of OpenMPI's btl/openib component, but I don't think that can be the case, since OpenMPI was configured with --without-verbs and openib does not show up in ompi_info | grep btl.
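A quick way to double-check which components are actually in play, using standard OpenMPI 4.x commands (the verbose run is only a sketch of how one might confirm which PML gets selected at startup):
ompi_info | grep btl     # openib should be absent when built with --without-verbs
ompi_info | grep pml     # confirm the ucx PML component is built in
mpirun -np 2 --mca pml_base_verbose 10 IMB-MPI1 pingpong   # logs the PML selection decision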
As suggested in the UCX issue, adding -mca pml_ucx_tls any -mca pml_ucx_devices any to my mpirun has fixed the problem, but I was wondering what precisely in the MCA causes this behaviour.
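The defaults of those two MCA parameters can be inspected directly with standard ompi_info usage; I'm not asserting what the defaults are here, which is why checking the local build is the point:
# Show all pml_ucx MCA parameters, including the default values of
# pml_ucx_tls and pml_ucx_devices that the 'any' workaround overrides
ompi_info --param pml ucx --level 9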
Here's my batch script:
#!/usr/bin/env bash
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.out
#SBATCH --exclusive
#SBATCH --partition=standard
module load gnu12 openmpi4 imb
export UCX_NET_DEVICES=mlx5_0:1
echo SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST
echo SLURM_JOB_ID: $SLURM_JOB_ID
echo UCX_NET_DEVICES: $UCX_NET_DEVICES
# Log UCX transport selection for debugging
export UCX_LOG_LEVEL=data
# Workaround from the UCX issue: let pml/ucx consider any transport/device
mpirun -mca pml_ucx_tls any -mca pml_ucx_devices any IMB-MPI1 pingpong -iter_policy off
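For the RoCE-vs-TCP comparison itself, a hedged variant of the mpirun line that pins UCX to one transport family at a time might look like the following; the UCX_TLS values are standard UCX transport names, eth0 is only an example interface, and the pml_ucx workaround flags are kept since they were needed above:
# RoCE / RDMA run: restrict UCX to the ConnectX port and RDMA transports
export UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc,ud,sm,self
mpirun -x UCX_NET_DEVICES -x UCX_TLS -mca pml_ucx_tls any -mca pml_ucx_devices any IMB-MPI1 pingpong -iter_policy off
# Plain TCP run: restrict UCX to the Ethernet interface and the tcp transport
export UCX_NET_DEVICES=eth0 UCX_TLS=tcp,sm,self
mpirun -x UCX_NET_DEVICES -x UCX_TLS -mca pml_ucx_tls any -mca pml_ucx_devices any IMB-MPI1 pingpong -iter_policy off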