MinTCCL is an attempt to implement a minimalistic NCCL in pure Python (Triton + PyTorch), in ~1,200 lines of code. The author envisions it as:
- a development repo that you can easily hack, deploy, and fuse the collectives with other Triton kernels.
- an educational tutorial that you can read through (clean code and comments) to learn how an advanced AlltoAll is built up gradually, starting from the simplest SendRecv.
The structure is too simple to need folders; the core lies in four files:
- `common.py` implements common utilities and configuration via environment variables.
- `primitive.py` implements missing device built-ins (like `syncwarp`) and 128-bit ops.
- `protocol.py` implements NCCL's LL128 protocol primitives; relies on `primitive.py`.
- `topology.py` implements topology and routing algorithms, e.g., ring.
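The core idea behind the LL-family protocols that `protocol.py` implements is that each data word is written together with a step-specific flag word, so the receiver can poll the flag and know the data is valid without a separate barrier. A toy sketch of this idea in pure Python (illustrative only; the class and method names here are hypothetical, not the actual `protocol.py` API):

```python
import itertools

class LLBuffer:
    """Toy model of an LL-style buffer: each slot stores (data, flag).
    The sender writes data and flag as one unit; the receiver spins on
    the flag, so data validity needs no extra synchronization."""

    def __init__(self, nslots):
        self.slots = [(0, 0)] * nslots  # (data, flag) pairs

    def send(self, slot, value, step):
        # Flag == step marks the slot as valid for this communication step.
        self.slots[slot] = (value, step)

    def recv(self, slot, step):
        # Receiver polls until the flag matches the expected step.
        for _ in itertools.count():
            value, flag = self.slots[slot]
            if flag == step:
                return value

buf = LLBuffer(nslots=4)
buf.send(slot=0, value=42, step=1)
print(buf.recv(slot=0, step=1))  # -> 42
```

The real LL128 protocol packs flags into 128-byte lines for bandwidth efficiency, but the validity-via-flag mechanism is the same.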
Based on these four modules, we can build MPI/NCCL Communication Collectives:
- `1_sendrecv_direct.py`: simplest load/store send/recv over a direct link
- `2_sendrecv_indirect.py`: uses the LL128 protocol to support indirect (multi-hop) links
- `3_sendrecv_buff.py`: uses buffer management to reduce memory usage
- `4_broadcast.py`: first collective, using the ring algorithm
- `5_reduce.py`: first computation, and deadlock prevention!
- `6_allgather.py`: first rank-divided algorithm, supporting spatial/temporal
- `7_reducescatter.py`
- `8_allreduce.py`
- `9_alltoall.py`
Each collective above can be run with `mpirun -np 2/4/8 python ...`.
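The ring algorithm that `4_broadcast.py` builds on can be previewed with a toy single-process simulation (pure Python, no GPUs or MPI; the function name is hypothetical, not MinTCCL's API):

```python
def ring_broadcast(world_size, root, chunk):
    """Simulate a ring broadcast: each rank receives from its left
    neighbor and forwards to its right neighbor, so the payload travels
    root -> root+1 -> ... around the ring in world_size - 1 hops."""
    data = {rank: None for rank in range(world_size)}
    data[root] = chunk
    hops = 0
    src = root
    for _ in range(world_size - 1):
        dst = (src + 1) % world_size  # right neighbor in the ring
        data[dst] = data[src]         # one send/recv hop
        src = dst
        hops += 1
    return data, hops

data, hops = ring_broadcast(world_size=4, root=2, chunk="payload")
print(all(v == "payload" for v in data.values()))  # -> True
print(hops)                                        # -> 3
```

On real hardware each hop is a send/recv between neighboring GPUs, and the hops overlap in a pipeline rather than running serially as in this sketch.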
To integrate into your project, simply copy the four core modules and the collectives you are interested in; no installation needed!
The author orders these collectives along a learning curve: each one gradually introduces some new (but not too much) knowledge.
Compared with NCCL, it cuts out:
- Inter-Node Communication with RDMA
- Intra-Node Communication with PCIe (currently NVLink ONLY for enabling P2P)
Compared with other attempts at Triton + communication, e.g., SymmetricMemory (PyTorch), it adds more:
- Protocol and topology support: with the full LL128 protocol, indirect links are supported and you don't need an NVSwitch!
- Usage support: only four dependencies (`cuda-python`/`mpi4py`/`torch`/`triton`), no custom build or bindings!
The foundation of both NCCL and MinTCCL is that NVLink can be programmed via basic ld/st primitives (of PTX) on machines with Unified Memory and GPU P2P Access support, which can map one GPU's GMEM addresses into another GPU. Since such a peer access is no different from reading local GMEM, Triton can, theoretically, also be used for communication, and MinTCCL tries to prove this.
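This idea can be modeled in a few lines of pure Python (a toy single-process sketch, not MinTCCL code): treat both GPUs' buffers as regions of one flat address space, so sending to a peer is just an ordinary load/store loop over the peer's mapped region.

```python
# Toy model of Unified Memory + P2P: both "GPUs" see one flat address
# space; a peer's buffer is just a different base offset, so send/recv
# reduces to plain loads and stores. All names here are illustrative.
memory = [0] * 16                  # unified address space
LOCAL_BASE, PEER_BASE = 0, 8       # each GPU owns an 8-word region

def store(addr, value):
    memory[addr] = value

def load(addr):
    return memory[addr]

def send_to_peer(n):
    # "Send" = store local words into the peer's mapped region;
    # on real hardware the same ld/st pattern traverses NVLink.
    for i in range(n):
        store(PEER_BASE + i, load(LOCAL_BASE + i))

for i in range(8):
    store(LOCAL_BASE + i, i * i)
send_to_peer(8)
print(memory[PEER_BASE:PEER_BASE + 8])  # -> [0, 1, 4, 9, 16, 25, 36, 49]
```

MinTCCL expresses exactly this pattern as Triton kernels, where the "peer base" is a pointer to another GPU's P2P-mapped GMEM.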
- Collectives
  - AllReduce
  - AlltoAll
  - Grouped P2P
- Topology
  - Ring
  - Tree / Binary Tree
  - Hypercube
- Performance Tuning
  - Fine-grained control of SM layout like NCCL
- Arithmetic types for ReduceOps
  - FP16
  - Generic support for FP32/BF16/FP8
- On 8-GPU machines, MinTCCL runs into a deadlock; it seems to be a problem with flag management and is still being debugged.