This tutorial demonstrates how to plug a **custom collective algorithm** (an AllGather variant) into the MSCCL++ NCCL interposition / algorithm registration path and invoke it transparently via the standard NCCL API (`ncclAllGather`).
The example shows how to:
- Define a device kernel (`allgather`) that uses `PortChannel` device handles to exchange data.
- Wrap that kernel inside an algorithm class (`AllgatherAlgoBuilder`) responsible for:
  - Connection discovery / proxy setup.
  - Context key generation (so contexts can be reused / cached).
  - Launch function binding (the kernel wrapper executed when the NCCL all-gather is called).
- Register the algorithm builder with the global `AlgorithmCollectionBuilder` and install a selector that decides which implementation to return for a given collective request.
- Run a multi-process (multi-rank) test using standard NCCL calls. The user program remains unchanged apart from initialization / registration code.
- (Optionally) Capture the sequence of `ncclAllGather` calls into a CUDA Graph for efficient replay.
Example source directory: `examples/customized-collective-algorithm/`
Key file: `customized_allgather.cu`
From the repository root:

```bash
cd examples/customized-collective-algorithm
make
```

Run (inside a container, you may need root privileges depending on GPU access):

```bash
LD_PRELOAD=<MSCCLPP_INSTALL_DIR>/lib/libmscclpp_nccl.so ./customized_allgather
```

Expected (abbreviated) output on success:

```
GPU 0: bytes 268435456, elapsed 7.35012 ms/iter, BW 109.564 GB/s
Succeed!
Succeed!
```
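The optional CUDA Graph capture mentioned above can follow the standard stream-capture pattern from the CUDA runtime API. The sketch below assumes an already-initialized `ncclComm_t`, a stream, and device buffers; the function and variable names are illustrative, not taken from the example source.

```cuda
// Hedged sketch: capturing repeated ncclAllGather calls into a CUDA Graph.
// Assumes comm, stream, sendbuf, recvbuf, and count are set up elsewhere,
// as in the example; all names here are illustrative.
#include <cuda_runtime.h>
#include <nccl.h>

void allgatherWithGraph(ncclComm_t comm, cudaStream_t stream,
                        const float* sendbuf, float* recvbuf,
                        size_t count, int iters) {
  cudaGraph_t graph;
  cudaGraphExec_t graphExec;

  // Record the NCCL call into a graph instead of executing it immediately.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  ncclAllGather(sendbuf, recvbuf, count, ncclFloat, comm, stream);
  cudaStreamEndCapture(stream, &graph);

  cudaGraphInstantiateWithFlags(&graphExec, graph, 0);

  // Replay the captured sequence without per-call launch overhead.
  for (int i = 0; i < iters; ++i) {
    cudaGraphLaunch(graphExec, stream);
  }
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(graphExec);
  cudaGraphDestroy(graph);
}
```

Because the interposition library implements `ncclAllGather`, the custom algorithm's kernel launch is what gets captured into the graph, so replay benefits apply to the custom path as well.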