I found, by accident, that setting UCX_IB_MLX5_DEVX=no gives a ~2x perf improvement for one customer code on Azure Genoa HBv4 VMs (https://learn.microsoft.com/en-us/azure/virtual-machines/hbv4-series-overview). Each VM has a single 400 Gb/s Mellanox ConnectX-7 NDR NIC.
So far, none of the other codes I've tried shows any sensitivity to this setting.
A very similar ~2x perf boost is seen with this code on a bare-metal Genoa IB cluster (also 400 Gb/s, 4X NDR).
However, on the same bare-metal cluster, Turin nodes show no sensitivity to this setting.
The code mostly uses allreduce and alltoallv. The alltoallv calls are "sparse": 4 ranks either send to all other ranks, or receive from all other ranks. The typical scale of my jobs is ~1k ranks, with no OpenMP.
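For anyone trying to reproduce: one way to apply the setting is to export the variable to the ranks with Open MPI's -x option (a minimal sketch -- the rank count and executable name here are just placeholders; exporting it in the job environment also works, provided the launcher forwards it to the remote ranks):

mpiexec -x UCX_IB_MLX5_DEVX=no --bind-to core -n 1024 ${exe}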
Recently, I observed another behaviour affected by setting UCX_IB_MLX5_DEVX=no.
This is with the NOAA HAFS code (https://github.com/HAFS-community/HAFS).
I have only tried this code on the Azure HBv4 platform.
I'm using recent master branches of PMIx, PRRTE and Open MPI.
I start the job with:
mpiexec --display-map --bind-to core --map-by ppr:88:node:pe=2 \
-n 3072 /usr/bin/env OMP_NUM_THREADS=2 ${exe} : \
-n 32 /usr/bin/env OMP_NUM_THREADS=2 ${exe} : \
-n 240 /usr/bin/env OMP_NUM_THREADS=2 ${exe}
The job hangs somewhere in the UCX layers.
However, if I set UCX_IB_MLX5_DEVX=no, the job runs to completion, with performance comparable to Intel MPI.
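For completeness, the working invocation is essentially the same command with the variable exported (a sketch -- -x is one way to pass it; the added option is the only change relative to the command above):

mpiexec -x UCX_IB_MLX5_DEVX=no --display-map --bind-to core --map-by ppr:88:node:pe=2 \
-n 3072 /usr/bin/env OMP_NUM_THREADS=2 ${exe} : \
-n 32 /usr/bin/env OMP_NUM_THREADS=2 ${exe} : \
-n 240 /usr/bin/env OMP_NUM_THREADS=2 ${exe}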
If I try to use 4 OpenMP threads per rank, i.e.:
mpiexec --display-map --bind-to core --map-by ppr:44:node:pe=4 \
-n 3072 /usr/bin/env OMP_NUM_THREADS=4 ${exe} : \
-n 32 /usr/bin/env OMP_NUM_THREADS=4 ${exe} : \
-n 240 /usr/bin/env OMP_NUM_THREADS=4 ${exe}
the job hangs regardless of whether UCX_IB_MLX5_DEVX is set to yes or no.
I'm not sure what to make of these observations.