
Can't force SHARP with score map tuning #1243

@james-sungjae-lee

Description


Hi,

I’m currently testing SHARP in UCC. I’m using HPC-X v2.23, which includes UCC v1.4.4. My software stack from HPC-X is OSU Micro-Benchmarks, Open MPI v4.1.7, UCC v1.4.4, and UCX v1.19.0, running on a cluster with a SHARP-enabled switch.

My goal is to measure pure SHARP performance for Allreduce, Allgather, Reduce-Scatter, and Broadcast. I believe all of these operations have SHARP implementations in UCC.

I tried to force SHARP for all message sizes, but UCC keeps following its default score map instead of always picking the SHARP backend. For Allreduce and Broadcast, it follows the score map with a 4 KB threshold between UCP and SHARP. For Allgather and Reduce-Scatter, even though the score map threshold is 16 KB, it still seems to always use the UCP backend.

I’m wondering whether this is because my SHARP-forcing settings are incorrect, or because UCC internally decides between UCP and SHARP (or uses only UCP) based on its own selection logic.

I’ve attached my benchmark code and the Allgather output log.

HPC-X setup

export HPCX_HOME=$PWD
source $HPCX_HOME/hpcx-init.sh
hpcx_load
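
As a sanity check that hpcx_load actually exposed the SHARP and UCC installs, I also print the relevant directories. I’m assuming hpcx_load exports HPCX_SHARP_DIR and HPCX_UCC_DIR the same way it exports HPCX_OSU_DIR; please correct me if those variable names are wrong:

# Sanity check: confirm the SHARP and UCC installs from HPC-X are visible
# (HPCX_SHARP_DIR / HPCX_UCC_DIR are assumed by analogy with HPCX_OSU_DIR)
echo "SHARP dir: ${HPCX_SHARP_DIR:-not set}"
echo "UCC dir:   ${HPCX_UCC_DIR:-not set}"
ls "${HPCX_SHARP_DIR:-/nonexistent}/lib" 2>/dev/null | grep -i sharp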

Benchmark Code

I used the UCC_TL_SHARP_TUNE env var, but the score map did not change.

mpirun -np $1 \
    --bind-to core \
    --map-by ppr:1:node \
    --hostfile ./hostfile-2 \
    -x LD_LIBRARY_PATH \
    -x SHARP_COLL_ENABLE_SAT=1 \
    -x SHARP_COLL_LOG_LEVEL=3 \
    -x UCC_CLS=basic \
    -x UCC_TL_SHARP_TUNE=allgather:inf \
    -x UCC_TL_SHARP_DEVICES=mlx5_0 \
    -x UCC_LOG_LEVEL=INFO \
    -x UCC_COLL_TRACE=INFO \
    -x OMPI_UCC_CL_BASIC_TLS=ucp,sharp \
    $HPCX_OSU_DIR/osu_allgather -x 20 -i 200 -m 4:8388608 > log.out
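
One variant I plan to try next is to also demote TL_UCP for allgather, so that TL_SHARP should win at every message size. The per-collective TUNE token format (coll:score, with inf to force and 0 to disable) and the unprefixed UCC_CL_BASIC_TLS variable are just my reading of the UCC docs, so please correct me if this isn’t the intended way:

# Variant: force TL_SHARP and zero out TL_UCP's score for allgather
# (token format "coll:score" is assumed; inf = force, 0 = disable)
export SHARP_COLL_ENABLE_SAT=1
export UCC_CLS=basic
export UCC_CL_BASIC_TLS=ucp,sharp
export UCC_TL_SHARP_TUNE=allgather:inf
export UCC_TL_UCP_TUNE=allgather:0
export UCC_COLL_TRACE=INFO

mpirun -np $1 --bind-to core --map-by ppr:1:node --hostfile ./hostfile-2 \
    -x LD_LIBRARY_PATH -x SHARP_COLL_ENABLE_SAT \
    -x UCC_CLS -x UCC_CL_BASIC_TLS -x UCC_TL_SHARP_TUNE -x UCC_TL_UCP_TUNE \
    -x UCC_COLL_TRACE \
    $HPCX_OSU_DIR/osu_allgather -x 20 -i 200 -m 4:8388608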

Output - score map

You can see that TL_SHARP is selected for Allgather in the {16K..inf} range.

[1765938844.536867] [-:20594:0]        ucc_team.c:471  UCC  INFO  ===== COLL_SCORE_MAP (team_id 32768, size 16) =====
[1765938844.536882] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Allgather:
[1765938844.536882] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..4095}:TL_UCP:10 {4K..16383}:TL_UCP:10 {16K..inf}:TL_SHARP:10 
[1765938844.536882] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Cuda: {16K..inf}:TL_SHARP:10 
[1765938844.536882] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	CudaManaged: {16K..inf}:TL_SHARP:10 
[1765938844.536882] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Rocm: {16K..inf}:TL_SHARP:10 
[1765938844.536882] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	RocmManaged: {16K..inf}:TL_SHARP:10 
[1765938844.536907] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Allreduce:
[1765938844.536907] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..4094}:CL_HIER:50 {4095..4K}:CL_HIER:50 {4097..inf}:TL_SHARP:10 
[1765938844.536907] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Cuda: {0..inf}:TL_SHARP:10 
[1765938844.536907] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	CudaManaged: {0..inf}:TL_SHARP:10 
[1765938844.536907] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Rocm: {0..4K}:CL_HIER:50 {4097..inf}:TL_SHARP:10 
[1765938844.536907] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	RocmManaged: {0..4K}:CL_HIER:50 {4097..inf}:TL_SHARP:10 
[1765938844.536924] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Alltoall:
[1765938844.536924] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..2063}:TL_UCP:10 {2064..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10 
[1765938844.536930] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Alltoallv:
[1765938844.536930] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..inf}:TL_UCP:10 
[1765938844.536936] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Barrier:
[1765938844.536936] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..inf}:CL_HIER:50 
[1765938844.536936] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Cuda: {0..inf}:TL_SHARP:10 
[1765938844.536936] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	CudaManaged: {0..inf}:TL_SHARP:10 
[1765938844.536936] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Rocm: {0..inf}:TL_SHARP:10 
[1765938844.536936] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	RocmManaged: {0..inf}:TL_SHARP:10 
[1765938844.536956] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Bcast:
[1765938844.536956] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..4K}:CL_HIER:50 {4097..32767}:TL_SHARP:10 {32K..inf}:TL_SHARP:10 
[1765938844.536956] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Cuda: {0..inf}:TL_SHARP:10 
[1765938844.536956] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	CudaManaged: {0..inf}:TL_SHARP:10 
[1765938844.536956] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Rocm: {0..4K}:CL_HIER:50 {4097..inf}:TL_SHARP:10 
[1765938844.536956] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	RocmManaged: {0..4K}:CL_HIER:50 {4097..inf}:TL_SHARP:10 
[1765938844.536971] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Fanin:
[1765938844.536971] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..inf}:TL_UCP:10 
[1765938844.536978] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Fanout:
[1765938844.536978] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..inf}:TL_UCP:10 
[1765938844.536983] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Gather:
[1765938844.536983] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..inf}:TL_UCP:10 
[1765938844.536990] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Gatherv:
[1765938844.536990] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..inf}:TL_UCP:10 
[1765938844.536999] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Reduce:
[1765938844.536999] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..4K}:CL_HIER:50 {4097..32767}:TL_UCP:10 {32K..inf}:TL_UCP:10 
[1765938844.536999] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Cuda: {0..4K}:CL_HIER:50 
[1765938844.536999] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	CudaManaged: {0..4K}:CL_HIER:50 
[1765938844.536999] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Rocm: {0..4K}:CL_HIER:50 
[1765938844.536999] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	RocmManaged: {0..4K}:CL_HIER:50 
[1765938844.537017] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Reduce_scatter:
[1765938844.537017] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..16383}:TL_UCP:10 {16K..inf}:TL_SHARP:10 
[1765938844.537017] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Cuda: {16K..inf}:TL_SHARP:10 
[1765938844.537017] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	CudaManaged: {16K..inf}:TL_SHARP:10 
[1765938844.537017] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Rocm: {16K..inf}:TL_SHARP:10 
[1765938844.537017] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	RocmManaged: {16K..inf}:TL_SHARP:10 
[1765938844.537032] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Reduce_scatterv:
[1765938844.537032] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..inf}:TL_UCP:10 
[1765938844.537037] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  Scatterv:
[1765938844.537037] [-:20594:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..inf}:TL_UCP:10 
[1765938844.537041] [-:20594:0]        ucc_team.c:475  UCC  INFO  ================================================
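
From this map, the default SHARP ranges look fixed (a 16K threshold for Allgather and Reduce-scatter, roughly 4K for Allreduce and Bcast), and my allgather:inf setting did not widen them. If the TUNE syntax also accepts an explicit message range together with the score (my assumption of the token format is coll:msg_range:score), I would have expected an override like this to make SHARP cover all sizes:

# Attempted range+score overrides (token format coll:msg_range:score is assumed
# and may not be what UCC actually parses)
export UCC_TL_SHARP_TUNE="allgather:0-inf:inf#allreduce:0-inf:inf#bcast:0-inf:inf#reduce_scatter:0-inf:inf"

Is this the right way to extend the ranges, or are the per-collective thresholds fixed inside the SHARP TL?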

Output - large message sizes

[1765938844.928066] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Allgather: src={0x14eb5765c000, 262144, int8, Host}, dst={0x14eb47fff000, 4194304, int8, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1765938844.928483] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.928489] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Allgather: src={0x14eb5765c000, 262144, int8, Host}, dst={0x14eb47fff000, 4194304, int8, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1765938844.928906] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.928914] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Allgather: src={0x14eb5765c000, 262144, int8, Host}, dst={0x14eb47fff000, 4194304, int8, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1765938844.929327] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929334] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Allgather: src={0x14eb5765c000, 262144, int8, Host}, dst={0x14eb47fff000, 4194304, int8, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1765938844.929747] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929753] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929766] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Reduce min root 0: src={0x7ffd0a0d89d0, 1, float64, Host}, dst={0x7ffd0a0d89e8, 1, float64, Host}; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929774] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Reduce max root 0: src={0x7ffd0a0d89d0, 1, float64, Host}, dst={0x7ffd0a0d89e0, 1, float64, Host}; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929781] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Reduce sum root 0: src={0x7ffd0a0d89d0, 1, float64, Host}, dst={0x7ffd0a0d89d8, 1, float64, Host}; CL_HIER {TL_UCP}, team_id 32768
262144                415.87
[1765938844.929794] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929799] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.929805] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Allgather: src={0x14eb5765c000, 524288, int8, Host}, dst={0x14eb47fff000, 8388608, int8, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1765938844.932148] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Barrier; CL_HIER {TL_UCP}, team_id 32768
[1765938844.932161] [-:20594:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Allgather: src={0x14eb5765c000, 524288, int8, Host}, dst={0x14eb47fff000, 8388608, int8, Host}; CL_BASIC {TL_UCP}, team_id 32768
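
To make sure I’m not misreading the trace, I also counted which TL each Allgather coll_init line reports across the full run:

# Count TL selections for Allgather over the whole attached log
grep 'coll_init: Allgather' log-allgather.txt | grep -o '{TL_[A-Z]*}' | sort | uniq -c

In my log this reports only {TL_UCP}; TL_SHARP never shows up for Allgather at any size.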

Also, here's the ucc_info -s result showing the default CL/TL scores:

Default CLs scores: basic=10 hier=50
Default TLs scores: mlx5=1 self=50 sharp=30 shm=100 ucp=10

I'm running this on a 16-node allocation with CPU-only hosts (no GPUs). Please let me know if I'm missing anything or need to provide any additional information.

Here are the log files for Allgather, Allreduce, Broadcast, and Reduce-Scatter. The benchmark command is exactly the same except for the collective name.

log-allgather.txt

log-allreduce.txt

log-bcast.txt

log-reduce_scatter.txt
