feat(mpi): implement host staging #208
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Make the implementation of `KokkosComm::mpi::broadcast` with an
execution space parameter the "default". The overload without an exec
space param only forwards to the former with a
`Kokkos::DefaultExecutionSpace{}` instantiation.
Same as for `broadcast`
feat(mpi): add host staging for `iallgather`
feat(mpi): add host staging for in-place `allgather`
The packer type differs depending on whether we are on the GPU-aware path or not: in the GPU-aware path it is templated over the passed View type, while in the host-staged path it is templated over the host-staged View type.
cedricchevalier19 left a comment:
Just some thoughts.
To me the biggest issue with staging is how we manage the memory (the same goes for packing).
I am not convinced that we have to wrap Kokkos mirror functions or deep_copy.
```cpp
auto host_sv = KokkosComm::Impl::stage_for(sv);
auto host_rv = KokkosComm::Impl::stage_for(rv);
```
Not sure whether calling `create_mirror_view_and_copy` and `create_mirror_view` explicitly would be better.
```cpp
                 comm, &req.mpi_request());
// Implicitly extends lifetimes of `host_rv` and `rv` due to lambda capture
req.call_after_mpi_wait([=]() {
  KokkosComm::Impl::copy_back(space, rv, host_rv);
```
Check whether `copy_back` is needed, or whether we can directly use `deep_copy` (which should be a no-op when the two views are the same).
Yes, this corresponds to my question in the PR description:
- For non-contiguous "receive" interfaces, can we directly unpack into the passed view instead of the host-staged view? This would remove a call to `deep_copy`, which I think is smart enough to do the right thing, but I am not sure. E.g.:

```cpp
auto host_rv = KokkosComm::Impl::stage_for(rv);
space.fence("fence host staging before `MPI_Recv`");
// Assume non-contiguous view
auto packed = Packer::allocate_packed_for(space, "packed `MPI_Recv`", host_rv);
space.fence("fence packing before `MPI_Recv`");
MPI_Recv(data_handle(packed.view), packed.count, packed.datatype, src, tag, comm, MPI_STATUS_IGNORE);
// NOTE: Can we unpack directly into `rv` instead of `host_rv`
// and eliminate the subsequent call to `copy_back`?
Packer::unpack_into(space, host_rv, args.view);
KokkosComm::Impl::copy_back(space, rv, host_rv);
space.fence("fence copy back after `MPI_Recv`");
```
I'll refactor it with a direct deep_copy to avoid the (unnecessary) intermediate operation.
```cpp
  space.fence("fence copy back after `MPI_Iallgather`");
});
req.extend_view_lifetime(host_sv);
req.extend_view_lifetime(sv);
```
We don't need to extend the lifetime of `sv` in the case of a host-staged send operation, since the view that is actually sent (and that needs to live long enough) is `host_sv`, not `sv`.
This could be safely removed, but our docs should clearly state what the semantics of KC calls are with respect to view reuse.
In the host-staged case, while `sv` is technically reusable by the user immediately after the KC collective is called, I think it would be better to have the same semantics in both execution paths, and to mandate that `sv` is reusable only after the comm operation completes (via `wait`, `wait_all`, `wait_any`, `test`, etc.).
This also aligns with MPI semantics w.r.t. non-blocking operations.
This PR is based on #205, using the proposed host staging API.
It enables automatic host staging for the MPI backend when the provided MPI implementation is not GPU-aware (controlled via a CMake option defined at config time: `-DKokkosComm_ENABLE_GPU_AWARE_MPI=ON`).

To-do list of interfaces to cover:

P2P:
- `mpi::send`
- `mpi::isend`
- `mpi::recv`
- `mpi::irecv`

Colls:
- `mpi::broadcast`
- `mpi::ibroadcast`
- `mpi::allgather`
- `mpi::iallgather`
- `mpi::allreduce`
- `mpi::iallreduce`
- `mpi::reduce`
- `mpi::ireduce` (depends on feat(mpi): add non-blocking reduce (`mpi:ireduce`) #198)
- `mpi::alltoall`
- `mpi::ialltoall`
- `mpi::inclusive_scan`
- `mpi::exclusive_scan`

Some questions/notes about host staging implementation:
- For non-contiguous "receive" interfaces, can we directly unpack into the passed view instead of the host-staged view? This would remove a call to `deep_copy`, which I think is smart enough to do the right thing, but I am not sure (see the example quoted in the review discussion above).