Background information
I am investigating the performance of the various single-copy transports implemented in the Vader BTL.
The most relevant information I could find is an official announcement circa 2014:
https://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy
My background is not in networking, so I would appreciate it if you could give me some pointers for better understanding.
Open MPI version
Open MPI: 4.1.7rc1 (gc2cc293)
Open MPI build
```
$ ompi_info
...
Configure command line:
  '--prefix=/apps/compiler/gcc/12.2.0/cudampi/12.3/openmpi/4.1.x'
  '--enable-mpi-cxx' '--with-slurm=/usr'
  '--with-pmi=/usr' '--with-cma'
  '--with-knem=/opt/knem-1.1.4.90mlnx3'
  '--with-xpmem=/apps/common/xpmem/2.7.4/'
  '--with-hcoll=/opt/mellanox/hcoll'
  '--with-ucx=/apps/common/ucx/1.17.0/cuda/12.3'
  '--with-cuda=/apps/cuda/12.3'
```
System information
- Operating system/version: 7.9.2009
- Computer hardware: 2x Xeon(R) Gold 6140
- Network type: Mellanox HDR100
Methods and Results
- The intra-node latency and bandwidth measurements were conducted using the OSU Micro-Benchmarks v7.5.
- Each mechanism was selected via `btl_vader_single_copy_mechanism`.
- The Vader results are compared with UCX's implementation, which also utilizes `cma`, `knem`, and `xpmem`.
- The expectation is `cico` < `cma` < `knem` < `xpmem` < `ucx`.
```
mpirun --np 2 \
    --mca pml ^ucx \
    --mca pml_ob1_verbose 100 \
    --mca btl_base_verbose 100 \
    --mca btl self,vader \
    --mca opal_warn_on_missing_libcuda 0 \
    --mca btl_vader_single_copy_mechanism {none|cma|knem|xpmem} \
    $PREFIX/osu_latency H H
```
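For anyone reproducing this, a minimal sketch of how the sweep over the four mechanisms could be scripted is shown below; the verbose flags are dropped here for brevity, and `$PREFIX` is assumed to point at the OSU binaries as in the command above.

```sh
#!/bin/sh
# Sketch: sweep the Vader single-copy mechanisms with osu_latency (host buffers).
for mech in none cma knem xpmem; do
    echo "=== btl_vader_single_copy_mechanism = $mech ==="
    mpirun --np 2 \
        --mca pml ^ucx \
        --mca btl self,vader \
        --mca opal_warn_on_missing_libcuda 0 \
        --mca btl_vader_single_copy_mechanism "$mech" \
        "$PREFIX/osu_latency" H H
done
```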
Because the data points cluster together at the small-message limit, please refer to the following table (latency in µs):
| Message size (bytes) | none | cma | knem | xpmem | ucx |
|---|---|---|---|---|---|
| 1 | 0.28 | 0.27 | 0.27 | 0.28 | 0.27 |
| 2 | 0.27 | 0.26 | 0.27 | 0.27 | 0.26 |
| 4 | 0.27 | 0.26 | 0.26 | 0.27 | 0.26 |
| 8 | 0.26 | 0.26 | 0.26 | 0.26 | 0.26 |
| 16 | 0.27 | 0.26 | 0.27 | 0.27 | 0.26 |
| 32 | 0.27 | 0.26 | 0.27 | 0.27 | 0.28 |
| 64 | 0.30 | 0.29 | 0.30 | 0.30 | 0.28 |
| 128 | 0.36 | 0.36 | 0.36 | 0.35 | 0.46 |
| 256 | 0.41 | 0.42 | 0.43 | 0.42 | 0.46 |
| 512 | 0.53 | 0.55 | 0.55 | 0.50 | 0.55 |
| 1024 | 0.62 | 0.66 | 0.65 | 0.51 | 0.60 |
| 2048 | 0.77 | 0.80 | 0.83 | 0.53 | 0.80 |
| 4096 | 1.39 | 1.48 | 1.54 | 0.58 | 1.16 |
| 8192 | 1.97 | 1.65 | 1.63 | 0.66 | 1.88 |
| 16384 | 3.35 | 2.16 | 2.02 | 1.04 | 2.89 |
| 32768 | 5.47 | 2.88 | 2.47 | 1.63 | 4.50 |
| 65536 | 7.10 | 4.31 | 3.33 | 2.54 | 7.58 |
| 131072 | 11.05 | 7.32 | 5.04 | 4.40 | 13.40 |
| 262144 | 19.25 | 13.17 | 8.65 | 8.12 | 13.29 |
| 524288 | 49.15 | 27.98 | 19.59 | 39.61 | 19.63 |
| 1048576 | 101.53 | 80.90 | 64.44 | 138.47 | 41.95 |
| 2097152 | 196.55 | 211.33 | 166.92 | 313.23 | 120.63 |
| 4194304 | 409.93 | 444.25 | 348.34 | 681.75 | 246.02 |
KNEM vs. XPMEM at the large-message limit
Up to 256 KB, xpmem offers the best latency (8.12 µs), followed by knem (8.65 µs) and cma (13.17 µs), respectively.
But at 2 MB and 4 MB, xpmem (681.75 µs) is outperformed by knem (348.34 µs) and even by cico (409.93 µs).
Looking at UCX's protocol selection, UCX also prefers knem over xpmem for large messages, e.g.:
```
[1746173000.065184] [skl02:244412:0] +--------------------------------+------------------------------------------------------------------------------------------------+
[1746173000.065193] [skl02:244412:0] | ucp_context_2 intra-node cfg#1 | tagged message by ucp_tag_send*(multi) from host memory |
[1746173000.065198] [skl02:244412:0] +--------------------------------+-----------------------------------------------+------------------------------------------------+
[1746173000.065203] [skl02:244412:0] | 0..92                          | eager short                                   | sysv/memory                                    |
[1746173000.065207] [skl02:244412:0] | 93..5122                       | eager copy-in copy-out                        | sysv/memory                                    |
[1746173000.065212] [skl02:244412:0] | 5123..13046                    | (?) rendezvous copy from mapped remote memory | xpmem/memory                                   |
[1746173000.065216] [skl02:244412:0] | 13047..inf                     | (?) rendezvous zero-copy read from remote     | 71% on knem/memory and 29% on rc_mlx5/mlx5_0:1 |
[1746173000.065220] [skl02:244412:0] +--------------------------------+-----------------------------------------------+------------------------------------------------+
```
Here, UCX splits large messages between knem and rc_mlx5/mlx5_0:1, which corresponds to a NIC loopback mechanism, if I understand correctly.
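For completeness, the protocol-selection table above is the kind of report UCX prints when protocol reporting is enabled; a sketch of how to reproduce it is below. The exact environment variable (`UCX_PROTO_INFO=y`) and its availability in this UCX build are assumptions based on recent UCX releases.

```sh
# Sketch (assumption): ask UCX to print its protocol-selection tables while
# running the same benchmark through pml/ucx instead of ob1/vader.
UCX_PROTO_INFO=y mpirun --np 2 --mca pml ucx "$PREFIX/osu_latency" H H
```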
So my questions are:
- I know this is on a case-by-case basis, but is there a fundamental reason for the transition from `xpmem` to `knem`?
  Direct memory load/store should offer the best latency, unless an actual `memcpy` is involved that is slower than a kernel-assisted copy.
  In other words, is this an intrinsic property of the underlying CPU? For instance:
  - Intel CPUs: `knem` > `xpmem`
  - AMD CPUs: `knem` < `xpmem`

  The above could hypothetically be explained by the ring-bus interconnect and Infinity Fabric for Intel and AMD, respectively.
- Can Vader utilize RDMA through Mellanox's loopback to improve the latency of large messages?
  Vader's MCA parameters include the following:
```
MCA btl vader: parameter "btl_vader_cuda_rdma_limit" (current value: "18446744073709551615", data source: default, level: 5 tuner/detail, type: size_t)
MCA btl vader: parameter "btl_vader_rdma_pipeline_send_length" (current value: "32768", data source: default, level: 4 tuner/basic, type: size_t)
MCA btl vader: parameter "btl_vader_rdma_pipeline_frag_size" (current value: "32768", data source: default, level: 4 tuner/basic, type: size_t)
MCA btl vader: parameter "btl_vader_min_rdma_pipeline_size" (current value: "2147483647", data source: default, level: 4 tuner/basic, type: size_t)
```
The idea of pipelining message fragments via RDMA is similar to UCX's approach, yet I don't see these parameters being exercised anywhere in the verbose output.
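For what it's worth, below is a sketch of the kind of run one could use to check whether these parameters take effect at all. The lowered threshold values are arbitrary assumptions for the experiment, not recommendations, and this does not presume that vader will actually route traffic over the NIC.

```sh
# Sketch (assumed values): lower the pipeline thresholds and re-run with full
# verbosity, then grep the output for any RDMA/pipeline activity from btl_vader.
mpirun --np 2 \
    --mca pml ^ucx \
    --mca btl self,vader \
    --mca btl_base_verbose 100 \
    --mca btl_vader_single_copy_mechanism knem \
    --mca btl_vader_rdma_pipeline_send_length 32768 \
    --mca btl_vader_rdma_pipeline_frag_size 32768 \
    --mca btl_vader_min_rdma_pipeline_size 65536 \
    "$PREFIX/osu_latency" H H
```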