refactor!: rename and rewrite Handle impls#220
refactor!: rename and rewrite Handle impls#220dssgabriel wants to merge 30 commits intokokkos:developfrom
Handle impls#220Conversation
Simple wrapper of `int` with UDL for easy creation
Makes the code much easier to read for functions with attributes and long type/param/tparam names.
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Handle implsHandle impls
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
Signed-off-by: Gabriel Dos Santos <gabriel.dossantos@cea.fr>
09142a7 to
fc8687f
Compare
`do_iteration` was taking its variadic arguments by value, causing the compiler to attempt a copy of any argument passed to it. This triggered a compile error when passing a `Communicator`, whose copy constructor is deleted by design. Fixed by taking `Args&&...` and forwarding with `std::forward<Args>(args)...`.
Correctly replace `exec_` with the passed in `exec` parameter.
Directly call the factory `split` instead of factory `duplicate`, since member functions can access their rank without having to recompute it.
f5d286b to
d6461fb
Compare
|
I cannot test on a NCCL system right now, but everything passes for both Open MPI (5.0.9) and MPICH (5.0.0) for the MPI backend (non-CUDA enabled) on my machine. |
For both MPI and NCCL specializations, covers: - Factory functions (`from_raw`, `duplicate`, `split`); - Accessors (`comm`, `size`, `rank`); - Move semantics.
98ec768 to
ae80875
Compare
|
@cwpearson Can you enable the SNL CI so I can fix any outstanding issues before making this ready for review? |
Some drive-by updates: - Removed mentions of `barrier` in the core API - Fixed mentions of ~~`Req`~~ -> `Request`
cedricchevalier19
left a comment
There was a problem hiding this comment.
I am wondering if we can get rid of the non owning aspect for the Communicator.
The code and the semantics may be simpler.
The other thing is the new Communicator should be used by all public functions of Kokkos Comm that use a communicator. So even the advanced functions in ::mpi::, should not take a MPI_Comm.
docs/api/core_recv.cpp
Outdated
| // Create a handle | ||
| KokkosComm::Handle<> handle; // Same as Handle<Execspace, CommSpace> | ||
| // Create a communicator | ||
| auto comm = KokkosComm::Communicator<Co, Ex>::duplicate(raw_comm_handle, exec_space); |
There was a problem hiding this comment.
For later, we should deduce the correct template parameters from the arguments. Or do we make a function that duplicates a Communicator.
There was a problem hiding this comment.
Unfortunately, we cannot deduce without taking an MpiSpace tag as argument, since we cannot define CTAD-like rules for static member functions.
An alternative could be the free function you mention, but I don't know about that. The name would need to reflect that it operates on a communicator, which isn't as clean as having a static member function.
One easy improvement we can make is to default the template parameters of the Communicator:
template <
CommunicationSpace Comm = DefaultCommunicationSpace,
KokkosExecutionSpace Exec = Kokkos::DefaultExecSpace>
struct Communicator;which allows for simpler "default" code:
auto comm = KokkosComm::Communicator<>::duplicate(raw_comm_handle, exec);
TLDR: I think we should keep it for now. I tend to agree, but this makes it harder for existing MPI-based codes to enter the Kokkos Comm "ecosystem". Namely, for libraries or apps that already manage their own MPI communicator (create/dup/split + free), having the Some design suggestions, though:
I agree. I have other PRs coming that clean up the p2p and colls interfaces (as part of #200) which are in line with this. |
This could lead to a compilation error if both MPI and NCCL backends were enabled at the same time, where `DefaultCommunicationSpace` would be defined twice. Now, we ensure it is only defined once. NCCL takes precedence over MPI if it has been enabled.
Update tests and docs. Remove `std::optional` from `from_raw` signature.
f46a6e3 to
72038ae
Compare
|
Tested on my machine with MPI + CUDA and everything works fine. |
Add unit tests for exec() accessor, self-move assignment safety, duplicate independence chain, and sequential splits.
|
I added a few more unit tests for Communicators and got them to run with the NCCL backend on a GPU system. This is ready for review and merge @cedricchevalier19. |
cedricchevalier19
left a comment
There was a problem hiding this comment.
One possible improvement: rename comm to mpi_comm in the internal MPI API.
Not blocking.
| auto req1 = KokkosComm::send(comm, data, dst_rank); | ||
|
|
||
| // Initiate a non-blocking send with a default handle | ||
| auto req2 = send(data, dest); | ||
| // Simulate a blocking send by waiting immediately | ||
| KokkosComm::send(comm, data, dst_rank).wait(); | ||
|
|
||
| // Wait for the requests to complete (assuming a wait function exists) | ||
| // Wait for a request to complete | ||
| KokkosComm::wait(req1); |
There was a problem hiding this comment.
For later, @hmt23 can you confirm this is also correct from a MPI point-of-view?
I think we can add an overload for it, but not outright rename it, or we will lose the "generic" interface between Communicator specializations (i.e., users will need to know which one they are using if they want to retrieve the raw comm handle) |
I was thinking about renaming the parameters, no change in the API. |
b5d7f83 to
9447ef8
Compare
Description
Similar to the recent work on requests (#216), this PR attempts to rename and rewrite the implementations of
Handles, now known asCommunicators, in view of stabilizing the Kokkos Comm API for our first official release.Additional context
Motivation
These changes attempt to stabilize Kokkos Comm's public API for "handles". In particular, it aims at:
Implementation notes
Unified interface across backends:
The central design constraint was that
Communicator<Ex, MpiSpace>andCommunicator<Ex, NcclSpace>must expose identical public interfaces, so that code written against the abstraction does not need to branch on the backend.Factory functions instead of constructors:
Construction is intentionally not possible via public constructors because there are multiple ways to create communicators. Static "factory" member functions (
from_raw,duplicate_from_raw,split_from_raw) were chosen for three reasons:std::optionalto force callers to handle failure without exceptions; andCommunicatorin an invalid state.A builder pattern was considered but rejected — with only three construction modes and minimal configuration, a builder would add object lifetime complexity for little real benefit, and in the long run, could promote backend-specific setter methods that would gradually erode the generic interface.
Non-owning
from_raw:from_rawnever takes ownership of the raw handle it is given. If a user created a raw communicator, they opted out of the abstraction layer and retain responsibility for its lifetime. Allowing optional ownership transfer would reintroduce a boolean parameter at the call site and recreate the exact class of double-free and leak bugs the design is meant to prevent. The ownership table is simple and memorable: onlyduplicateandsplitproduce owned communicators, because only they allocate a new backend handle internally.Explicit copies, move-only by default:
The copy constructor and copy assignment operator are deleted. Duplicating a communicator is an explicit MPI/NCCL operation with observable side effects and a potential for failure; silently copying a
Communicatoras though it were a value type would hide that cost entirely. Users who need a copy callduplicate(), which makes the intent and the cost visible. Move semantics are implemented manually rather than defaulted: the move constructor usesstd::exchangeto null out the source handle and setowned_tofalse, ensuring the moved-from object's destructor is always a no-op and preventing double-free.Changes
->HandleCommunicatoracross the entire code base, including file names, type aliases, and documentationCommunicatorspecializations with a unified interfaceduplicate_from_rawandsplit_from_rawfactory operations for both backends, each accepting either a raw backend handle. Also support duplicating/splitting from an existingKokkosComm::Communicatorusingduplicateandsplitmember functions.from_rawfactory function, which borrows a raw backend handle without taking ownershipexec(),comm(),size(),rank()duplicate()for explicit copiesCommunicatorAPICommunicator, covering factory functions, member functions, accessors, and move semantics for both MPI and NCCL backendsCommunicatorobjectclang-formatconfig to avoid bin-packing arguments, improving readability of functions with long signatures (especially factory functions with attributes, default values, etc.)Drive-by changes
Checklist