MinTCCL is an attempt to implement a minimalistic NCCL in pure Python (Triton + PyTorch), in ~1,200 lines of code. The author envisions it as:
- a development repo that you can easily hack, deploy, and fuse the collectives with other Triton kernels.
- an educational tutorial that you can read through (clean code and comments) to learn how an advanced AlltoAll is built up gradually, starting from the simplest SendRecv.
The structure is too simple to need folders; the core lies in four files:
- `common.py` implements common utilities and configuration via environment variables.
- `primitive.py` implements missing device built-ins (like `syncwarp`) and 128-bit ops.
- `protocol.py` implements NCCL's LL128 protocol primitives; relies on `primitive.py`.
- `topology.py` implements topology and routing algorithms, e.g., ring.
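The core idea behind the LL-family protocols that `protocol.py` implements is that each data word is written together with a step-specific flag word, so the receiver can poll the flag and know the data is valid without a separate barrier. A toy sketch of this idea in pure Python (illustrative only; the class and method names here are hypothetical, not the actual `protocol.py` API):

```python
import itertools

class LLBuffer:
    """Toy model of an LL-style buffer: each slot stores (data, flag).
    The sender writes data and flag as one unit; the receiver spins on
    the flag, so data validity needs no extra synchronization."""

    def __init__(self, nslots):
        self.slots = [(0, 0)] * nslots  # (data, flag) pairs

    def send(self, slot, value, step):
        # Flag == step marks the slot as valid for this communication step.
        self.slots[slot] = (value, step)

    def recv(self, slot, step):
        # Receiver polls until the flag matches the expected step.
        for _ in itertools.count():
            value, flag = self.slots[slot]
            if flag == step:
                return value

buf = LLBuffer(nslots=4)
buf.send(slot=0, value=42, step=1)
print(buf.recv(slot=0, step=1))  # -> 42
```

The real LL128 protocol packs flags into 128-byte lines for bandwidth efficiency, but the validity-via-flag mechanism is the same.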
Based on these four modules, we can build MPI/NCCL Communication Collectives:
- `1_sendrecv_direct.py`: simplest load/store send/recv over a direct link
- `2_sendrecv_indirect.py`: uses the LL128 protocol to support indirect (multi-hop) links
- `3_sendrecv_buff.py`: uses buffer management to reduce memory usage
- `4_broadcast.py`: first collective, using the ring algorithm
- `5_reduce.py`: first computation, and deadlock prevention!
- `6_allgather.py`: first rank-divided algorithm, supporting spatial/temporal
- `7_reducescatter.py`
- `8_allreduce.py`
- `9_alltoall.py`
Each collective above can be run with `mpirun -np 2/4/8 python ...`.
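The ring algorithm that `4_broadcast.py` builds on can be previewed with a toy single-process simulation (pure Python, no GPUs or MPI; the function name is hypothetical, not MinTCCL's API):

```python
def ring_broadcast(world_size, root, chunk):
    """Simulate a ring broadcast: each rank receives from its left
    neighbor and forwards to its right neighbor, so the payload travels
    root -> root+1 -> ... around the ring in world_size - 1 hops."""
    data = {rank: None for rank in range(world_size)}
    data[root] = chunk
    hops = 0
    src = root
    for _ in range(world_size - 1):
        dst = (src + 1) % world_size  # right neighbor in the ring
        data[dst] = data[src]         # one send/recv hop
        src = dst
        hops += 1
    return data, hops

data, hops = ring_broadcast(world_size=4, root=2, chunk="payload")
print(all(v == "payload" for v in data.values()))  # -> True
print(hops)                                        # -> 3
```

On real hardware each hop is a send/recv between neighboring GPUs, and the hops overlap in a pipeline rather than running serially as in this sketch.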
To integrate into your project, simply copy the four core modules and the collectives you are interested in; no installation needed!
The author orders these collectives along a learning curve: each one gradually introduces some new (but not too much) knowledge.
Compared with NCCL, it cuts out:
- Inter-Node Communication with RDMA
- Intra-Node Communication with PCIe (currently NVLink ONLY for enabling P2P)
Compared with other attempts at Triton + communication, e.g., SymmetricMemory (PyTorch), it adds more:
- Protocol and topology support: with the full LL128 protocol, indirect links are supported and you don't need an NVSwitch!
- Usage support: only four dependencies (`cuda-python`/`mpi4py`/`torch`/`triton`), no custom build or bindings!
The foundation of both NCCL and MinTCCL is that NVLink can be programmed via basic ld/st primitives (of PTX) on machines with Unified Memory and GPU P2P Access support, which can map one GPU's GMEM addresses into another GPU. Since such a peer access is no different from reading local GMEM, Triton can, theoretically, also be used for communication, and MinTCCL tries to prove this.
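This idea can be modeled in a few lines of pure Python (a toy single-process sketch, not MinTCCL code): treat both GPUs' buffers as regions of one flat address space, so sending to a peer is just an ordinary load/store loop over the peer's mapped region.

```python
# Toy model of Unified Memory + P2P: both "GPUs" see one flat address
# space; a peer's buffer is just a different base offset, so send/recv
# reduces to plain loads and stores. All names here are illustrative.
memory = [0] * 16                  # unified address space
LOCAL_BASE, PEER_BASE = 0, 8       # each GPU owns an 8-word region

def store(addr, value):
    memory[addr] = value

def load(addr):
    return memory[addr]

def send_to_peer(n):
    # "Send" = store local words into the peer's mapped region;
    # on real hardware the same ld/st pattern traverses NVLink.
    for i in range(n):
        store(PEER_BASE + i, load(LOCAL_BASE + i))

for i in range(8):
    store(LOCAL_BASE + i, i * i)
send_to_peer(8)
print(memory[PEER_BASE:PEER_BASE + 8])  # -> [0, 1, 4, 9, 16, 25, 36, 49]
```

MinTCCL expresses exactly this pattern as Triton kernels, where the "peer base" is a pointer to another GPU's P2P-mapped GMEM.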
- Collectives
  - AllReduce
  - AlltoAll
  - Grouped P2P
- Topology
  - Ring
  - Tree / Binary Tree
  - Hypercube
- Performance Tuning
  - Fine-grained control of SM layout like NCCL
- Arithmetic types for ReduceOps
  - FP16
  - Generic support for FP32/BF16/FP8
- On 8-GPU machines, MinTCCL runs into a deadlock; it seems to be a problem with flag management and is still being debugged.