feat(mpi): implement host staging #208
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Make the implementation of `KokkosComm::mpi::broadcast` with an
execution space parameter the "default". The overload without an exec
space param only forwards to the former with a
`Kokkos::DefaultExecutionSpace{}` instantiation.
Same as for `broadcast`
feat(mpi): add host staging for `iallgather`
feat(mpi): add host staging for in-place `allgather`
The packer type differs depending on whether we are on the GPU-aware path or not: in the GPU-aware path it is templated over the passed View type, while in the host-staged path it is templated over the host-staged View type.
cedricchevalier19 left a comment:
Just some thoughts.
To me the biggest issue with staging is how we manage the memory (the same goes for packing).
I am not convinced that we have to wrap Kokkos mirror functions or deep_copy.
```cpp
auto host_sv = KokkosComm::Impl::stage_for(sv);
auto host_rv = KokkosComm::Impl::stage_for(rv);
```
Not sure whether calling `create_mirror_view_and_copy` and `create_mirror_view` explicitly would be better.
```cpp
                 comm, &req.mpi_request());
// Implicitly extends lifetimes of `host_rv` and `rv` due to lambda capture
req.call_after_mpi_wait([=]() {
  KokkosComm::Impl::copy_back(space, rv, host_rv);
```
Check whether `copy_back` is needed, or whether we can directly use `deep_copy` (which should be a no-op when the two views are the same).
Yes, this corresponds to my question in the PR description:
- For non-contiguous "receive" interfaces, can we directly unpack into the passed view instead of the host-staged view? This would remove a call to `deep_copy`, which I think is smart enough to do the right thing, but I am not sure. E.g.:

```cpp
auto host_rv = KokkosComm::Impl::stage_for(rv);
space.fence("fence host staging before `MPI_Recv`");
// Assume non-contiguous view
auto packed = Packer::allocate_packed_for(space, "packed `MPI_Recv`", host_rv);
space.fence("fence packing before `MPI_Recv`");
MPI_Recv(data_handle(packed.view), packed.count, packed.datatype, src, tag, comm, MPI_STATUS_IGNORE);
// NOTE: Can we unpack directly into `rv` instead of `host_rv`
// and eliminate the subsequent call to `copy_back`?
Packer::unpack_into(space, host_rv, args.view);
KokkosComm::Impl::copy_back(space, rv, host_rv);
space.fence("fence copy back after `MPI_Recv`");
```
I'll refactor it with a direct deep_copy to avoid the (unnecessary) intermediate operation.
```cpp
  space.fence("fence copy back after `MPI_Iallgather`");
});
req.extend_view_lifetime(host_sv);
req.extend_view_lifetime(sv);
```
We don't need to extend the lifetime of `sv` in the case of a host-staged send operation, since the view that is actually sent (and that needs to live long enough) is `host_sv`, not `sv`.
This could be safely removed, but our docs should clearly state what the semantics of KC calls are with respect to view reuse.
In the host-staged case, while `sv` is technically reusable by the user immediately after the KC collective is called, I think it would be better to have the same semantics in both execution paths, and to mandate that `sv` is reusable only after the comm operation completes (via `wait`, `wait_all`, `wait_any`, `test`, etc.).
This also aligns with MPI semantics w.r.t. non-blocking operations.
This PR is based on #205, using the proposed host staging API.
It enables automatic host staging for the MPI backend when the provided MPI implementation is not GPU-aware (controlled via a CMake option defined at config time: `-DKokkosComm_ENABLE_GPU_AWARE_MPI=ON`).

To-do list of interfaces to cover:

P2P:
- `mpi::send`
- `mpi::isend`
- `mpi::recv`
- `mpi::irecv`

Colls:
- `mpi::broadcast`
- `mpi::ibroadcast`
- `mpi::allgather`
- `mpi::iallgather`
- `mpi::allreduce`
- `mpi::iallreduce`
- `mpi::reduce`
- `mpi::ireduce` (depends on feat(mpi): add non-blocking reduce (`mpi:ireduce`) #198)
- `mpi::alltoall`
- `mpi::ialltoall`
- `mpi::inclusive_scan`
- `mpi::exclusive_scan`

Some questions/notes about host staging implementation:
- For non-contiguous "receive" interfaces, can we directly unpack into the passed view instead of the host-staged view? This would remove a call to `deep_copy`, which I think is smart enough to do the right thing, but I am not sure (see the example quoted in the review discussion above).