This tutorial demonstrates how to plug a **custom collective algorithm** (an AllGather variant) into the MSCCL++ NCCL interposition / algorithm registration path and invoke it transparently via the standard NCCL API (`ncclAllGather`).
The example shows how to:
- Define a device kernel (`allgather`) that uses `PortChannel` device handles to exchange data.
- Wrap that kernel inside an algorithm class (`AllgatherAlgoBuilder`) responsible for:
  - Connection discovery / proxy setup.
  - Context key generation (so contexts can be reused / cached).
  - Launch function binding (the kernel wrapper executed when the NCCL all-gather is called).
- Register the algorithm builder with the global `AlgorithmCollectionBuilder` and install a selector that decides which implementation to return for a given collective request.
- Run a multi-process (multi-rank) test using standard NCCL calls. The user program remains unchanged apart from initialization / registration code.
- (Optionally) Capture the sequence of `ncclAllGather` calls into a CUDA Graph for efficient replay.
Example source directory: `examples/customized-collective-algorithm/`
Key file: `customized_allgather.cu`
From the repository root:

```bash
cd examples/customized-collective-algorithm
make
```

Run (inside a container, you may need root privileges depending on GPU access):

```bash
LD_PRELOAD=<MSCCLPP_INSTALL_DIR>/lib/libmscclpp_nccl.so ./customized_allgather
```

Expected (abbreviated) output on success:

```
GPU 0: bytes 268435456, elapsed 7.35012 ms/iter, BW 109.564 GB/s
Succeed!
Succeed!
```
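The optional CUDA Graph capture mentioned above can follow the standard stream-capture pattern from the CUDA runtime API. The sketch below assumes an already-initialized `ncclComm_t`, a stream, and device buffers; the function and variable names are illustrative, not taken from the example source.

```cuda
// Hedged sketch: capturing repeated ncclAllGather calls into a CUDA Graph.
// Assumes comm, stream, sendbuf, recvbuf, and count are set up elsewhere,
// as in the example; all names here are illustrative.
#include <cuda_runtime.h>
#include <nccl.h>

void allgatherWithGraph(ncclComm_t comm, cudaStream_t stream,
                        const float* sendbuf, float* recvbuf,
                        size_t count, int iters) {
  cudaGraph_t graph;
  cudaGraphExec_t graphExec;

  // Record the NCCL call into a graph instead of executing it immediately.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  ncclAllGather(sendbuf, recvbuf, count, ncclFloat, comm, stream);
  cudaStreamEndCapture(stream, &graph);

  cudaGraphInstantiateWithFlags(&graphExec, graph, 0);

  // Replay the captured sequence without per-call launch overhead.
  for (int i = 0; i < iters; ++i) {
    cudaGraphLaunch(graphExec, stream);
  }
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(graphExec);
  cudaGraphDestroy(graph);
}
```

Because the interposition library implements `ncclAllGather`, the custom algorithm's kernel launch is what gets captured into the graph, so replay benefits apply to the custom path as well.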