
feat(mpi): implement host staging#208

Open
dssgabriel wants to merge 17 commits intokokkos:developfrom
dssgabriel:feature/impl-mpi-host-staging

Conversation

@dssgabriel
Collaborator

@dssgabriel dssgabriel commented Jan 20, 2026

This PR is based on #205, using the proposed host staging API.
It enables automatic host staging in the MPI backend when the provided MPI implementation is not GPU-aware, as controlled by a config-time CMake option (-DKokkosComm_ENABLE_GPU_AWARE_MPI=ON).

To-Do list of interfaces to cover

P2P:

  • mpi::send
  • mpi::isend
  • mpi::recv
  • mpi::irecv

Colls:

Some questions/notes about host staging implementation:

  • For non-blocking "send-only" interfaces, do we need to extend the lifetime of the passed view? Technically, only the host-staged view needs to have its lifetime extended until the completion of the communication, not the provided view. E.g.:
    auto host_sv = KokkosComm::Impl::stage_for(sv);
    h.space().fence("fence host staging before `MPI_Isend`");
    
    // Assume contiguous view
    MPI_Isend(data_handle(host_sv), span(host_sv), datatype<MpiSpace, T>(), dest, tag, h.mpi_comm(), &req.mpi_request());
    req.extend_view_lifetime(host_sv);
    
    // NOTE: Is this really needed?
    req.extend_view_lifetime(sv);
  • For non-contiguous "receive" interfaces, can we directly unpack into the passed view instead of the host-staged view? This would remove a call to deep_copy, which I think is smart enough to do the right thing, but I am not sure. E.g.:
    auto host_rv = KokkosComm::Impl::stage_for(rv);
    space.fence("fence host staging before `MPI_Recv`");
    
    // Assume non-contiguous view 
    auto packed = Packer::allocate_packed_for(space, "packed `MPI_Recv`", host_rv);
    space.fence("fence packing before `MPI_Recv`");
    MPI_Recv(data_handle(packed.view), packed.count, packed.datatype, src, tag, comm, MPI_STATUS_IGNORE);
    
    // NOTE: Can we unpack directly into `rv` instead of `host_rv`
    // and eliminate the subsequent call to `copy_back`?
    Packer::unpack_into(space, host_rv, packed.view);
    KokkosComm::Impl::copy_back(space, rv, host_rv);
    
    space.fence("fence copy back after `MPI_Recv`");

Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
@dssgabriel dssgabriel self-assigned this Jan 20, 2026
@dssgabriel dssgabriel added C-enhancement Category: an enhancement or bug fix E-help-wanted Call for participation: help is requested and/or extra attention is needed A-mpi Area: KokkosComm MPI backend implementation labels Jan 20, 2026
@dssgabriel dssgabriel added this to the Version 0.1 milestone Jan 20, 2026
Make the implementation of `KokkosComm::mpi::broadcast` with an
execution space parameter the "default". The overload without an exec
space param only forwards to the former with a
`Kokkos::DefaultExecutionSpace{}` instantiation.

Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
@dssgabriel dssgabriel force-pushed the feature/impl-mpi-host-staging branch from 94f004c to 05ffb54 Compare January 20, 2026 22:38
feat(mpi): add host staging for `iallgather`

feat(mpi): add host staging for in-place `allgather`
The Packer type differs between the GPU-aware and host-staged paths: the former is templated over the passed View type, the latter over the host-staged View type.
Member

@cedricchevalier19 cedricchevalier19 left a comment


Just some thoughts.

To me the biggest issue with staging is how we manage the memory (the same goes for packing).

I am not convinced that we have to wrap Kokkos mirror functions or deep_copy.

Comment on lines +42 to +43
auto host_sv = KokkosComm::Impl::stage_for(sv);
auto host_rv = KokkosComm::Impl::stage_for(rv);
Member


I am not sure that explicitly calling create_mirror_view_and_copy and create_mirror_view would not be better than wrapping them.

comm, &req.mpi_request());
// Implicitly extends lifetimes of `host_rv` and `rv` due to lambda capture
req.call_after_mpi_wait([=]() {
KokkosComm::Impl::copy_back(space, rv, host_rv);
Member


Check whether copy_back is needed or whether we can directly use deep_copy (which should be a no-op when the two views are the same).

Collaborator Author


Yes, this corresponds to my question in the PR description:

  • For non-contiguous "receive" interfaces, can we directly unpack into the passed view instead of the host-staged view? This would remove a call to deep_copy, which I think is smart enough to do the right thing, but I am not sure. E.g.:
auto host_rv = KokkosComm::Impl::stage_for(rv);
space.fence("fence host staging before `MPI_Recv`");

// Assume non-contiguous view 
auto packed = Packer::allocate_packed_for(space, "packed `MPI_Recv`", host_rv);
space.fence("fence packing before `MPI_Recv`");
MPI_Recv(data_handle(packed.view), packed.count, packed.datatype, src, tag, comm, MPI_STATUS_IGNORE);

// NOTE: Can we unpack directly into `rv` instead of `host_rv`
// and eliminate the subsequent call to `copy_back`?
Packer::unpack_into(space, host_rv, packed.view);
KokkosComm::Impl::copy_back(space, rv, host_rv);

space.fence("fence copy back after `MPI_Recv`");

I'll refactor it with a direct deep_copy to avoid the (unnecessary) intermediate operation.

space.fence("fence copy back after `MPI_Iallgather`");
});
req.extend_view_lifetime(host_sv);
req.extend_view_lifetime(sv);
Member


Why do we need this?

Collaborator Author


We don't in the case of a host-staged send operation, since the view that is actually sent (and that needs to live long enough) is host_sv, not sv.
This could be safely removed, but our docs should clearly state what the semantics of KC calls are with respect to view reuse.

In the host-staged case, while sv is technically reusable by the user immediately after the KC collective is called, I think it would be better to have the same semantics in both execution paths, and to mandate that sv is reusable only after the comm operation completes (via wait, wait_all, wait_any, test, etc.).
This also aligns with MPI semantics w.r.t. non-blocking operations.

