Skip to content

Conversation

@edgargabriel
Copy link
Member

Add support for more functions to the coll/accelerator component. Specifically:

  • MPI_Reduce_scatter was missing as the only reduction operation
  • introduce support for Bcast, Allgather, and Alltoall operations. The max. msg size for which to use the copy-through-CPU mechanism can be controlled by MCA parameters for each function separatly.

To motivate the work on non-reduction operations, the main benefit is for short messages, since communication from GPU buffers has typically a higher latency than communication from CPU buffers.

Here is some data to support the PR, gathered on an MI250X node with 16 GPUs:

MPI_Bcast:

# OSU MPI-ROCM Broadcast Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
# procs                   16         16         32       32
# copy to CPU?            no        yes         no      yes  
1                      12.24       2.84      14.86     3.44
2                      17.19       4.18	     21.37     4.64
4                      17.22       4.13	     21.13     4.65
8                      17.19       4.09	     21.08     4.63
16                     17.23       4.11	     21.11     4.63
32                     17.47       4.18	     28.43     4.71
64                     20.02       4.15	     28.03     4.74
128                    27.54       8.37	     45.03     8.05
256                    28.30      12.67	     44.73    13.38
512                    27.65      32.51	     44.65    28.39
1024                   27.86      57.93	     44.92    64.72
2048                   28.06      27.66	    207.97    34.89
4096                   54.40      32.14	    429.97    36.70
8192                   54.64      45.65	    430.32    57.30
16384                  55.76      56.71	    433.16    67.10
32768                  56.96      84.60	    459.17    93.21
65536                  59.84     141.01	    440.84   162.06
131072                 65.34     256.07	    466.58   279.53
262144                 76.69     474.92	    553.10   524.82
524288                 97.60     803.38	    674.06   955.30
1048576               271.92    2030.49	   1069.67  1844.79

MPI_Allgather:

# OSU MPI-ROCM Allgather Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
# procs                   16        16        32         32
# Copy to CPU?            no       yes        no        yes
1                      44.70      5.86     54.62       6.85
2                      45.41      8.68	   54.67       9.67
4                      45.88      8.76	   54.09       9.71
8                      44.98      8.84	   66.06      10.12
16                     59.17      9.18	   83.69      10.70
32                     72.29      9.69	   99.86      11.57
64                     83.31     10.35	  120.70      12.65
128                    95.16     16.67	  148.10      19.50
256                    95.92     28.52	  148.03      33.09
512                    95.41     72.15	  147.25      86.03
1024                  111.54    144.25	  393.00     146.43
2048                  112.29     84.30	  761.04     105.84
4096                  159.91    111.35	  794.44     132.48
8192                  163.18    164.20	  801.66     188.61
16384                 169.69    277.55	  806.56     326.65
32768                 178.70    544.21	  810.47     622.84
65536                 195.38    891.83	 1661.78    1925.92
131072                226.69   1527.13	 1643.53    9896.87
262144                298.12   6354.89	 1667.39   19147.58
524288                450.23  12872.46	 1910.64   40723.25
1048576               764.79  26134.81	 3752.21   81590.10

MPI_Alltoall:

# OSU MPI-ROCM All-to-All Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
# procs                   16        16        32        32
# copy to CPU?            no       yes        no       yes
1                     102.40     62.82    109.26     39.49
2                     119.86     65.76	  131.27     41.89
4                      70.86     27.01	  135.16     41.77
8                      75.00     26.99	  147.80     41.83
16                     88.56     27.45	  175.73     42.33
32                    112.92     27.56	  243.95     42.88
64                    184.78     27.92	  409.48     44.33
128                   363.13     34.33	  611.12     53.13
256                   484.21     46.77	  948.13     70.85
512                   205.33    135.85	  489.44    235.73
1024                  184.16    196.80	  482.84    295.78
2048                  182.26    154.94	  487.60    296.06
4096                  187.28    163.12	  538.29    311.60
8192                  189.87    184.98	  539.06    346.44
16384                 190.54    232.64	  544.14    431.32
32768                 191.31    336.01	  549.42    533.02
65536                 197.93    492.32	  557.00   1050.14
131072                199.14    889.92	  583.84  11395.43
262144                244.68   7316.06	  843.04  21582.68
524288                389.49  13314.65	 1481.85  40945.10
1048576               697.91  24498.35	 2753.87  80627.56

This data demonstrates that there are benefits of using the copy-to-host approach especially for short messages. Results on different settings might vary, but the user has the option to control the desired behavior using the MCA parameter (and later using #12991 by explicitly specifying which collective operation this component should be registered for.

@edgargabriel edgargabriel force-pushed the topic/coll-accelerator-new-funcs branch from ae0525b to 94a56ab Compare December 30, 2024 17:06
@edgargabriel edgargabriel force-pushed the topic/coll-accelerator-new-funcs branch from 8658265 to 5e0bcd7 Compare December 30, 2024 17:22
/* Using sbufsize here on purpose to ensure symmetric decision for handling of GPU vs
CPU buffers. The two buffer sizes are expected to be the same for pre-defined datatypes,
but could vary due to layout issues/gaps for derived datatypes */
if ((rc > 0) && ((sbufsize * comm_size) <= (size_t)mca_coll_accelerator_alltoall_thresh)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar problem here: at higher process counts we would tend to use the device memory for many small send/recv. I'd expect that the overhead for small transfers becomes more pronounced for small messages at higher process counts. Why do we need to factor in the process count at all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been honestly going back and forth on whether to use the parameter as a per process parameter or a parameter that limits the amount of temporary memory allocated on the CPU side. I am happy to use this as a per-process parameter (i.e. without communicator size playing a role). My original (maybe incorrect) line of thinking had to do with the v-versions of these operations (that this commit doesn't tackle yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the parameter to be a per-process threshold (vs. the overall buffer size)

mca_coll_accelerator_allgather_thresh = 65536;
(void) mca_base_component_var_register(&mca_coll_accelerator_component.super.collm_version,
"allgather_thresh",
"max. overall msg length for which to copy accelerator buffer to CPU for allgather operation",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that overall msg length communicates well that the threshold is compared to send_size*comm_size. See my comment above for my concerns about using the process count.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the word 'overall' with the adjustment of the meaning of the parameter . Thank you for the review!

@edgargabriel edgargabriel force-pushed the topic/coll-accelerator-new-funcs branch from 5e0bcd7 to 167b8b2 Compare January 2, 2025 15:03
@edgargabriel edgargabriel force-pushed the topic/coll-accelerator-new-funcs branch from 167b8b2 to d9af881 Compare January 27, 2025 19:12
Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a needed update. However, I have 2 issues with this PR:

  • conceptually working around the span of the datatype to move data back and forth between the CPU and GPU will ensure we overwrite all the gaps. This leads to unexpected behaviors, if a GPU kernel changes the values in the gaps.
  • we need to be careful to move the correct amount of data, which must include the size of the communicator for some operations.

I don't have a better idea on how to address these. Either we document it clearly, or we use the pack functions and only support homogeneous environments.

}
if ((MPI_IN_PLACE != sbuf) && (rc > 0) &&
(sbufsize <= (size_t)mca_coll_accelerator_allgather_thresh)) {
sbuf1 = (char*)malloc(sbufsize * comm_size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are in an allgather, same buffer goes to all ranks. Why allocating sbufsize * comm_size?

}

if (NULL != rbuf1) {
mca_coll_accelerator_memcpy(rbuf2, rbuf_dev, rbuf1, MCA_ACCELERATOR_NO_DEVICE_ID, rbufsize,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Times comm_size for the h2d tranfer.

goto exit;;
}
if (NULL != rbuf1) {
mca_coll_accelerator_memcpy(rbuf2, rbuf_dev, rbuf1, MCA_ACCELERATOR_NO_DEVICE_ID, rbufsize,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Times comm_size for the h2d transfer ?

edgargabriel and others added 3 commits January 28, 2025 14:14
add support for MPI_Reduce_scatter

Signed-off-by: Edgar Gabriel <[email protected]>
add support for bcast, allgather and alltoall for device buffers using
a temporary buffer on the CPU. The maximum msg length for each operation
for which to use this approach can be controlled through an mca
parameter.

Signed-off-by: Edgar Gabriel <[email protected]>
owner file update to indicate that the component is active and ownership
is shared between NVidia and AMD.

Signed-off-by: Edgar Gabriel <[email protected]>
@edgargabriel edgargabriel force-pushed the topic/coll-accelerator-new-funcs branch from d9af881 to 8731f21 Compare January 28, 2025 14:28
@edgargabriel
Copy link
Member Author

Just to make sure I don't miss other items such as the update of the accelerator_memcpy length after the change of the meaning in the threshold value, I re-run our the validation testsuite and the osu-benchmarks with data validation turned on, and it looks good for the new functions as far as I can tell.

@edgargabriel edgargabriel merged commit c5e02ab into open-mpi:main Jan 29, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants