
[BTL/Vader] Can Vader's Single-copy Mechanism Utilize RDMA Loopback Transport? #13223

@vitduck

Description


Background information

I am investigating the performance of the various single-copy transports implemented in the Vader BTL.
The most relevant information I could find is an official blog post from circa 2014:
https://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy

My background is not in networking, so I would appreciate it if you could give me some pointers for better understanding.

Open MPI version

Open MPI: 4.1.7rc1 (gc2cc293)

Open MPI build

$ ompi_info 
...
Configure command line: 
    '--prefix=/apps/compiler/gcc/12.2.0/cudampi/12.3/openmpi/4.1.x'
    '--enable-mpi-cxx' '--with-slurm=/usr'
    '--with-pmi=/usr' '--with-cma'
    '--with-knem=/opt/knem-1.1.4.90mlnx3'
    '--with-xpmem=/apps/common/xpmem/2.7.4/'
    '--with-hcoll=/opt/mellanox/hcoll'
    '--with-ucx=/apps/common/ucx/1.17.0/cuda/12.3'
    '--with-cuda=/apps/cuda/12.3'

System information

  • Operating system/version: 7.9.2009
  • Computer hardware: 2x Xeon(R) Gold 6140
  • Network type: Mellanox HDR100

Methods and Results

  • The intra-node latency and bandwidth were measured with the OSU Micro-Benchmarks v7.5.
  • Each mechanism was selected via the btl_vader_single_copy_mechanism MCA parameter.
  • Vader results are compared with UCX's implementation, which also utilizes cma, knem, and xpmem.
  • The expectation is cico < cma < knem < xpmem < ucx.
mpirun \
  --np 2 \
  --mca pml ^ucx \
  --mca pml_ob1_verbose 100 \
  --mca btl_base_verbose 100 \
  --mca btl self,vader \
  --mca opal_warn_on_missing_libcuda 0 \
  --mca btl_vader_single_copy_mechanism {none|cma|knem|xpmem} \
  $PREFIX/osu_latency H H
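
The four vader columns in the table below were collected by repeating this command once per mechanism; conceptually the runs amount to a loop like the following sketch:

# Sketch of the per-mechanism runs (mechanism names as accepted by
# btl_vader_single_copy_mechanism in this build).
for mech in none cma knem xpmem; do
  mpirun --np 2 \
         --mca pml ^ucx \
         --mca btl self,vader \
         --mca btl_vader_single_copy_mechanism $mech \
         $PREFIX/osu_latency H H
done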

Because the data points cluster together at the small-message limit, please refer to the following table (message size in bytes, latency in µs):

message size (B)  none      cma       knem      xpmem     ucx
1                 0.28      0.27      0.27      0.28      0.27
2                 0.27      0.26      0.27      0.27      0.26
4                 0.27      0.26      0.26      0.27      0.26
8                 0.26      0.26      0.26      0.26      0.26
16                0.27      0.26      0.27      0.27      0.26
32                0.27      0.26      0.27      0.27      0.28
64                0.30      0.29      0.30      0.30      0.28
128               0.36      0.36      0.36      0.35      0.46
256               0.41      0.42      0.43      0.42      0.46
512               0.53      0.55      0.55      0.50      0.55
1024              0.62      0.66      0.65      0.51      0.60
2048              0.77      0.80      0.83      0.53      0.80
4096              1.39      1.48      1.54      0.58      1.16
8192              1.97      1.65      1.63      0.66      1.88
16384             3.35      2.16      2.02      1.04      2.89
32768             5.47      2.88      2.47      1.63      4.50
65536             7.10      4.31      3.33      2.54      7.58
131072            11.05     7.32      5.04      4.40      13.40
262144            19.25     13.17     8.65      8.12      13.29
524288            49.15     27.98     19.59     39.61     19.63
1048576           101.53    80.90     64.44     138.47    41.95
2097152           196.55    211.33    166.92    313.23    120.63
4194304           409.93    444.25    348.34    681.75    246.02

KNEM vs. XPMEM at the large-message limit.

Up to 256 KB, xpmem offers the best latency (8.12 µs at 256 KB), followed by knem (8.65 µs) and cma (13.17 µs).
But at 2 MB and 4 MB, xpmem (681.75 µs at 4 MB) is outperformed by knem (348.34 µs) and even by copy-in/copy-out (409.93 µs).

Looking at UCX's protocol selection output, UCX also prefers knem over xpmem for large messages, e.g.:

[1746173000.065184] [skl02:244412:0]   +--------------------------------+------------------------------------------------------------------------------------------------+
[1746173000.065193] [skl02:244412:0]   | ucp_context_2 intra-node cfg#1 | tagged message by ucp_tag_send*(multi) from host memory                                        |
[1746173000.065198] [skl02:244412:0]   +--------------------------------+-----------------------------------------------+------------------------------------------------+
[1746173000.065203] [skl02:244412:0]   |                          0..92 | eager short                                   | sysv/memory                                    |
[1746173000.065207] [skl02:244412:0]   |                       93..5122 | eager copy-in copy-out                        | sysv/memory                                    |
[1746173000.065212] [skl02:244412:0]   |                    5123..13046 | (?) rendezvous copy from mapped remote memory | xpmem/memory                                   |
[1746173000.065216] [skl02:244412:0]   |                     13047..inf | (?) rendezvous zero-copy read from remote     | 71% on knem/memory and 29% on rc_mlx5/mlx5_0:1 |
[1746173000.065220] [skl02:244412:0]   +--------------------------------+-----------------------------------------------+------------------------------------------------+

Here, UCX splits large messages between knem and rc_mlx5/mlx5_0:1, the latter corresponding to a loopback path through the HCA, if I understand correctly.
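
For reference, the UCX column and the protocol table above come from running the same benchmark through the UCX PML; a rough sketch of such a run (UCX_PROTO_INFO=y asks UCX to print the selection table; depending on the UCX build, UCX_PROTO_ENABLE=y may also be required):

# Rough sketch of the UCX comparison run; -x exports the variable to the ranks.
mpirun \
  --np 2 \
  --mca pml ucx \
  -x UCX_PROTO_INFO=y \
  $PREFIX/osu_latency H H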

So my questions are:

  • I know this is on a case-by-case basis, but is there a fundamental reason for the transition from xpmem to knem?
    Direct memory load/store should offer the best latency, unless an actual memcpy is involved that is slower than a kernel-assisted copy.
    In other words, is this an intrinsic property of the underlying CPU? For instance:

    • Intel CPUs: knem > xpmem
    • AMD CPUs: knem < xpmem
      The above could hypothetically be explained by the ring-bus interconnect on Intel and Infinity Fabric on AMD, respectively.
  • Can Vader utilize RDMA through Mellanox's loopback to improve the latency of large messages?
    Vader's MCA parameters include the following:

MCA btl vader: parameter "btl_vader_cuda_rdma_limit" (current value: "18446744073709551615", data source: default, level: 5 tuner/detail, type: size_t)
MCA btl vader: parameter "btl_vader_rdma_pipeline_send_length" (current value: "32768", data source: default, level: 4 tuner/basic, type: size_t)
MCA btl vader: parameter "btl_vader_rdma_pipeline_frag_size" (current value: "32768", data source: default, level: 4 tuner/basic, type: size_t)
MCA btl vader: parameter "btl_vader_min_rdma_pipeline_size" (current value: "2147483647", data source: default, level: 4 tuner/basic, type: size_t)

The idea of pipelining message fragments via RDMA is similar to UCX's approach, yet I don't see these parameters being used anywhere in the verbose output.
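
For what it's worth, here is a sketch of how I would lower those thresholds to check whether the pipeline path is ever taken intra-node (the values below are arbitrary illustrations, not tuned settings):

# Sketch: shrink the RDMA-pipeline thresholds listed above (arbitrary example
# values) and watch the verbose output for any pipeline/RDMA activity.
mpirun \
  --np 2 \
  --mca pml ^ucx \
  --mca btl self,vader \
  --mca btl_base_verbose 100 \
  --mca btl_vader_min_rdma_pipeline_size 65536 \
  --mca btl_vader_rdma_pipeline_frag_size 131072 \
  $PREFIX/osu_latency H H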
