Description
Hi,
I’m currently testing SHARP in UCC. I’m using HPC-X v2.23, which includes UCC v1.4.4. My software stack from HPC-X is OSU Micro-Benchmarks, Open MPI v4.1.7, UCC v1.4.4, and UCX v1.19.0, running on a cluster with a SHARP-enabled switch.
My goal is to measure pure SHARP performance for Allreduce, Allgather, Reduce-Scatter, and Broadcast. I believe all of these operations have SHARP implementations in UCC.
I tried to force SHARP for all message sizes, but UCC keeps following its internal score map instead of honoring the forced SHARP selection. For Allreduce and Broadcast, it follows the score map with a 4 KB threshold between UCP and SHARP. For Allgather and Reduce-Scatter, even though the score map threshold is 16 KB, the UCP backend still appears to be used at every message size.
I’m wondering whether my SHARP-forcing settings are incorrect, or whether UCC internally decides between UCP and SHARP (or falls back to UCP only) based on its own selection logic.
I’ve attached my benchmark code and the Allgather output log.
HPC-X setup
export HPCX_HOME=$PWD
source $HPCX_HOME/hpcx-init.sh
hpcx_load
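For reference, after loading HPC-X I verify the component versions as follows (a quick sanity check with the standard info tools, assuming they are on the PATH):
ucx_info -v
ucc_info -v
ompi_info --version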
Benchmark Code
I used the UCC_TL_SHARP_TUNE env var, but the score map did not change.
mpirun -np $1 \
--bind-to core \
--map-by ppr:1:node \
--hostfile ./hostfile-2 \
-x LD_LIBRARY_PATH \
-x SHARP_COLL_ENABLE_SAT=1 \
-x SHARP_COLL_LOG_LEVEL=3 \
-x UCC_CLS=basic \
-x UCC_TL_SHARP_TUNE=allgather:inf \
-x UCC_TL_SHARP_DEVICES=mlx5_0 \
-x UCC_LOG_LEVEL=INFO \
-x UCC_COLL_TRACE=INFO \
-x OMPI_UCC_CL_BASIC_TLS=ucp,sharp \
$HPCX_OSU_DIR/osu_allgather -x 20 -i 200 -m 4:8388608 > log.out
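In case the problem is my score-modifier syntax: my understanding from the UCC README is that modifiers take the form coll_type[:msg_range]:score, so I would expect variants like the following to force SHARP for Allgather and to zero out UCP's score. (UCC_CL_BASIC_TLS is the UCC-native spelling of the TLS list; I'm not certain whether the OMPI_UCC_CL_BASIC_TLS form I used above is actually consumed.)
-x UCC_TL_SHARP_TUNE=allgather:0-inf:inf \
-x UCC_TL_UCP_TUNE=allgather:0 \
-x UCC_CL_BASIC_TLS=ucp,sharp \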
Output - score map
You can see that TL_SHARP is selected for {16K..inf}:
[1765938844.536867] [-:20594:0] ucc_team.c:471 UCC INFO ===== COLL_SCORE_MAP (team_id 32768, size 16) =====
[1765938844.536882] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Allgather:
[1765938844.536882] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..4095}:TL_UCP:10 {4K..16383}:TL_UCP:10 {16K..inf}:TL_SHARP:10
[1765938844.536882] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Cuda: {16K..inf}:TL_SHARP:10
[1765938844.536882] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO CudaManaged: {16K..inf}:TL_SHARP:10
[1765938844.536882] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Rocm: {16K..inf}:TL_SHARP:10
[1765938844.536882] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO RocmManaged: {16K..inf}:TL_SHARP:10
[1765938844.536907] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Allreduce:
[1765938844.536907] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..4094}:CL_HIER:50 {4095..4K}:CL_HIER:50 {4097..inf}:TL_SHARP:10
[1765938844.536907] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Cuda: {0..inf}:TL_SHARP:10
[1765938844.536907] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO CudaManaged: {0..inf}:TL_SHARP:10
[1765938844.536907] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Rocm: {0..4K}:CL_HIER:50 {4097..inf}:TL_SHARP:10
[1765938844.536907] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO RocmManaged: {0..4K}:CL_HIER:50 {4097..inf}:TL_SHARP:10
[1765938844.536924] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Alltoall:
[1765938844.536924] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..2063}:TL_UCP:10 {2064..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
[1765938844.536930] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Alltoallv:
[1765938844.536930] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..inf}:TL_UCP:10
[1765938844.536936] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Barrier:
[1765938844.536936] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..inf}:CL_HIER:50
[1765938844.536936] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Cuda: {0..inf}:TL_SHARP:10
[1765938844.536936] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO CudaManaged: {0..inf}:TL_SHARP:10
[1765938844.536936] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Rocm: {0..inf}:TL_SHARP:10
[1765938844.536936] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO RocmManaged: {0..inf}:TL_SHARP:10
[1765938844.536956] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Bcast:
[1765938844.536956] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..4K}:CL_HIER:50 {4097..32767}:TL_SHARP:10 {32K..inf}:TL_SHARP:10
[1765938844.536956] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Cuda: {0..inf}:TL_SHARP:10
[1765938844.536956] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO CudaManaged: {0..inf}:TL_SHARP:10
[1765938844.536956] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Rocm: {0..4K}:CL_HIER:50 {4097..inf}:TL_SHARP:10
[1765938844.536956] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO RocmManaged: {0..4K}:CL_HIER:50 {4097..inf}:TL_SHARP:10
[1765938844.536971] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Fanin:
[1765938844.536971] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..inf}:TL_UCP:10
[1765938844.536978] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Fanout:
[1765938844.536978] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..inf}:TL_UCP:10
[1765938844.536983] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Gather:
[1765938844.536983] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..inf}:TL_UCP:10
[1765938844.536990] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Gatherv:
[1765938844.536990] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..inf}:TL_UCP:10
[1765938844.536999] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Reduce:
[1765938844.536999] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..4K}:CL_HIER:50 {4097..32767}:TL_UCP:10 {32K..inf}:TL_UCP:10
[1765938844.536999] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Cuda: {0..4K}:CL_HIER:50
[1765938844.536999] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO CudaManaged: {0..4K}:CL_HIER:50
[1765938844.536999] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Rocm: {0..4K}:CL_HIER:50
[1765938844.536999] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO RocmManaged: {0..4K}:CL_HIER:50
[1765938844.537017] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Reduce_scatter:
[1765938844.537017] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..16383}:TL_UCP:10 {16K..inf}:TL_SHARP:10
[1765938844.537017] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Cuda: {16K..inf}:TL_SHARP:10
[1765938844.537017] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO CudaManaged: {16K..inf}:TL_SHARP:10
[1765938844.537017] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Rocm: {16K..inf}:TL_SHARP:10
[1765938844.537017] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO RocmManaged: {16K..inf}:TL_SHARP:10
[1765938844.537032] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Reduce_scatterv:
[1765938844.537032] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..inf}:TL_UCP:10
[1765938844.537037] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Scatterv:
[1765938844.537037] [-:20594:0] ucc_coll_score_map.c:225 UCC INFO Host: {0..inf}:TL_UCP:10
[1765938844.537041] [-:20594:0] ucc_team.c:475 UCC INFO ================================================
Output - large message sizes
Even at 256 KB and 512 KB per-rank message sizes, well above the 16 KB threshold in the score map, coll_init still reports CL_BASIC {TL_UCP} for Allgather:
[1765938844.928066] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Allgather: src={0x14eb5765c000, 262144, int8, Host}, dst={0x14eb47fff000, 4194304, int8, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1765938844.928483] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.928489] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Allgather: src={0x14eb5765c000, 262144, int8, Host}, dst={0x14eb47fff000, 4194304, int8, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1765938844.928906] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.928914] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Allgather: src={0x14eb5765c000, 262144, int8, Host}, dst={0x14eb47fff000, 4194304, int8, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1765938844.929327] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929334] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Allgather: src={0x14eb5765c000, 262144, int8, Host}, dst={0x14eb47fff000, 4194304, int8, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1765938844.929747] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929753] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929766] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Reduce min root 0: src={0x7ffd0a0d89d0, 1, float64, Host}, dst={0x7ffd0a0d89e8, 1, float64, Host}; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929774] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Reduce max root 0: src={0x7ffd0a0d89d0, 1, float64, Host}, dst={0x7ffd0a0d89e0, 1, float64, Host}; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929781] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Reduce sum root 0: src={0x7ffd0a0d89d0, 1, float64, Host}, dst={0x7ffd0a0d89d8, 1, float64, Host}; CL_HIER {TL_UCP}, team_id 32768
262144 415.87
[1765938844.929794] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929799] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929805] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Allgather: src={0x14eb5765c000, 524288, int8, Host}, dst={0x14eb47fff000, 8388608, int8, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1765938844.932148] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.932161] [-:20594:0] ucc_coll.c:301 UCC_COLL INFO coll_init: Allgather: src={0x14eb5765c000, 524288, int8, Host}, dst={0x14eb47fff000, 8388608, int8, Host}; CL_BASIC {TL_UCP}, team_id 32768
Also, here's the ucc_info -s result:
Default CLs scores: basic=10 hier=50
Default TLs scores: mlx5=1 self=50 sharp=30 shm=100 ucp=10
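If it helps, I can re-run with more verbose selection logging. A minimal variation of the mpirun command above (all other flags unchanged; I'm assuming higher SHARP_COLL_LOG_LEVEL values increase SHARP's verbosity):
-x UCC_LOG_LEVEL=DEBUG \
-x UCC_COLL_TRACE=DEBUG \
-x SHARP_COLL_LOG_LEVEL=5 \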
I'm running this on a 16-node allocation with CPU-only hosts (no GPUs). Please let me know if I'm missing anything or should provide additional information.
Here are the log files for Allgather, Allreduce, Broadcast, and Reduce-Scatter. The benchmark code is exactly the same except for the collective name.