Background information
I am investigating the performance of the various single-copy transports implemented in the Vader BTL.
The most relevant information I could find is an official announcement circa 2014:
https://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy
My background is not in networking, so I would appreciate it if you could give me some pointers for better understanding.
Open MPI version
Open MPI: 4.1.7rc1 (gc2cc293)
Open MPI build
```
$ ompi_info
...
Configure command line:
  '--prefix=/apps/compiler/gcc/12.2.0/cudampi/12.3/openmpi/4.1.x'
  '--enable-mpi-cxx' '--with-slurm=/usr'
  '--with-pmi=/usr' '--with-cma'
  '--with-knem=/opt/knem-1.1.4.90mlnx3'
  '--with-xpmem=/apps/common/xpmem/2.7.4/'
  '--with-hcoll=/opt/mellanox/hcoll'
  '--with-ucx=/apps/common/ucx/1.17.0/cuda/12.3'
  '--with-cuda=/apps/cuda/12.3'
```
System information
- Operating system/version: 7.9.2009
- Computer hardware: 2x Xeon(R) Gold 6140
- Network type: Mellanox HDR100
Methods and Results
- The intra-node latency and bandwidth measurements were conducted using the OSU Micro-Benchmarks v7.5.
- Each mechanism was selected via `btl_vader_single_copy_mechanism`.
- The Vader results are compared with UCX's implementation, which also utilizes `cma`, `knem`, and `xpmem`.
- The expectation is `cico` < `cma` < `knem` < `xpmem` < `ucx`.
```
mpirun --np 2 \
    --mca pml ^ucx \
    --mca pml_ob1_verbose 100 \
    --mca btl_base_verbose 100 \
    --mca btl self,vader \
    --mca opal_warn_on_missing_libcuda 0 \
    --mca btl_vader_single_copy_mechanism {none|cma|knem|xpmem} \
    $PREFIX/osu_latency H H
```
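For anyone reproducing this, a minimal sketch of how the sweep over the four mechanisms could be scripted is shown below; the verbose flags are dropped here for brevity, and `$PREFIX` is assumed to point at the OSU binaries as in the command above.

```sh
#!/bin/sh
# Sketch: sweep the Vader single-copy mechanisms with osu_latency (host buffers).
for mech in none cma knem xpmem; do
    echo "=== btl_vader_single_copy_mechanism = $mech ==="
    mpirun --np 2 \
        --mca pml ^ucx \
        --mca btl self,vader \
        --mca opal_warn_on_missing_libcuda 0 \
        --mca btl_vader_single_copy_mechanism "$mech" \
        "$PREFIX/osu_latency" H H
done
```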
Because the data points cluster together at the small-message limit, please refer to the following table (latency in µs):
| Message size (bytes) | none | cma | knem | xpmem | ucx |
|---|---|---|---|---|---|
| 1 | 0.28 | 0.27 | 0.27 | 0.28 | 0.27 |
| 2 | 0.27 | 0.26 | 0.27 | 0.27 | 0.26 |
| 4 | 0.27 | 0.26 | 0.26 | 0.27 | 0.26 |
| 8 | 0.26 | 0.26 | 0.26 | 0.26 | 0.26 |
| 16 | 0.27 | 0.26 | 0.27 | 0.27 | 0.26 |
| 32 | 0.27 | 0.26 | 0.27 | 0.27 | 0.28 |
| 64 | 0.30 | 0.29 | 0.30 | 0.30 | 0.28 |
| 128 | 0.36 | 0.36 | 0.36 | 0.35 | 0.46 |
| 256 | 0.41 | 0.42 | 0.43 | 0.42 | 0.46 |
| 512 | 0.53 | 0.55 | 0.55 | 0.50 | 0.55 |
| 1024 | 0.62 | 0.66 | 0.65 | 0.51 | 0.60 |
| 2048 | 0.77 | 0.80 | 0.83 | 0.53 | 0.80 |
| 4096 | 1.39 | 1.48 | 1.54 | 0.58 | 1.16 |
| 8192 | 1.97 | 1.65 | 1.63 | 0.66 | 1.88 |
| 16384 | 3.35 | 2.16 | 2.02 | 1.04 | 2.89 |
| 32768 | 5.47 | 2.88 | 2.47 | 1.63 | 4.50 |
| 65536 | 7.10 | 4.31 | 3.33 | 2.54 | 7.58 |
| 131072 | 11.05 | 7.32 | 5.04 | 4.40 | 13.40 |
| 262144 | 19.25 | 13.17 | 8.65 | 8.12 | 13.29 |
| 524288 | 49.15 | 27.98 | 19.59 | 39.61 | 19.63 |
| 1048576 | 101.53 | 80.90 | 64.44 | 138.47 | 41.95 |
| 2097152 | 196.55 | 211.33 | 166.92 | 313.23 | 120.63 |
| 4194304 | 409.93 | 444.25 | 348.34 | 681.75 | 246.02 |
KNEM vs. XPMEM at the large-message limit
Up to 256 KB, xpmem offers the best latency (8.12 µs), followed by knem (8.65 µs) and cma (13.17 µs), respectively.
But at 2 MB and 4 MB, xpmem (681.75 µs) is outperformed by knem (348.34 µs) and even by cico (409.93 µs).
Looking at UCX's protocol selection, UCX also prefers knem over xpmem for large messages, e.g.:
```
[1746173000.065184] [skl02:244412:0] +--------------------------------+------------------------------------------------------------------------------------------------+
[1746173000.065193] [skl02:244412:0] | ucp_context_2 intra-node cfg#1 | tagged message by ucp_tag_send*(multi) from host memory |
[1746173000.065198] [skl02:244412:0] +--------------------------------+-----------------------------------------------+------------------------------------------------+
[1746173000.065203] [skl02:244412:0] | 0..92                          | eager short                                   | sysv/memory                                    |
[1746173000.065207] [skl02:244412:0] | 93..5122                       | eager copy-in copy-out                        | sysv/memory                                    |
[1746173000.065212] [skl02:244412:0] | 5123..13046                    | (?) rendezvous copy from mapped remote memory | xpmem/memory                                   |
[1746173000.065216] [skl02:244412:0] | 13047..inf                     | (?) rendezvous zero-copy read from remote     | 71% on knem/memory and 29% on rc_mlx5/mlx5_0:1 |
[1746173000.065220] [skl02:244412:0] +--------------------------------+-----------------------------------------------+------------------------------------------------+
```
Here, UCX splits large messages between knem and rc_mlx5/mlx5_0:1, which corresponds to a NIC loopback mechanism, if I understand correctly.
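For completeness, the protocol-selection table above is the kind of report UCX prints when protocol reporting is enabled; a sketch of how to reproduce it is below. The exact environment variable (`UCX_PROTO_INFO=y`) and its availability in this UCX build are assumptions based on recent UCX releases.

```sh
# Sketch (assumption): ask UCX to print its protocol-selection tables while
# running the same benchmark through pml/ucx instead of ob1/vader.
UCX_PROTO_INFO=y mpirun --np 2 --mca pml ucx "$PREFIX/osu_latency" H H
```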
So my questions are:
- I know this is on a case-by-case basis, but is there a fundamental reason for the transition from `xpmem` to `knem`?
  Direct memory load/store should offer the best latency, unless an actual `memcpy` is involved that is slower than a kernel-assisted copy.
  In other words, is this an intrinsic property of the underlying CPU? For instance:
  - Intel CPUs: `knem` > `xpmem`
  - AMD CPUs: `knem` < `xpmem`

  The above could hypothetically be explained by the ring-bus interconnect and Infinity Fabric for Intel and AMD, respectively.
- Can Vader utilize RDMA through Mellanox's loopback to improve the latency of large messages?
  Vader's MCA parameters include the following:
```
MCA btl vader: parameter "btl_vader_cuda_rdma_limit" (current value: "18446744073709551615", data source: default, level: 5 tuner/detail, type: size_t)
MCA btl vader: parameter "btl_vader_rdma_pipeline_send_length" (current value: "32768", data source: default, level: 4 tuner/basic, type: size_t)
MCA btl vader: parameter "btl_vader_rdma_pipeline_frag_size" (current value: "32768", data source: default, level: 4 tuner/basic, type: size_t)
MCA btl vader: parameter "btl_vader_min_rdma_pipeline_size" (current value: "2147483647", data source: default, level: 4 tuner/basic, type: size_t)
```
The idea of pipelining message fragments via RDMA is similar to UCX's approach, yet I don't see these parameters being exercised anywhere in the verbose output.
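For what it's worth, below is a sketch of the kind of run one could use to check whether these parameters take effect at all. The lowered threshold values are arbitrary assumptions for the experiment, not recommendations, and this does not presume that vader will actually route traffic over the NIC.

```sh
# Sketch (assumed values): lower the pipeline thresholds and re-run with full
# verbosity, then grep the output for any RDMA/pipeline activity from btl_vader.
mpirun --np 2 \
    --mca pml ^ucx \
    --mca btl self,vader \
    --mca btl_base_verbose 100 \
    --mca btl_vader_single_copy_mechanism knem \
    --mca btl_vader_rdma_pipeline_send_length 32768 \
    --mca btl_vader_rdma_pipeline_frag_size 32768 \
    --mca btl_vader_min_rdma_pipeline_size 65536 \
    "$PREFIX/osu_latency" H H
```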