Provides two integration paths for the MSCCL++ DSL:
1. Integration with a customized communication group
2. Integration with the NCCL API

Introduces new Python APIs to support this:
```python
mscclpp.compile  # compile the DSL to a JSON-based execution plan
mscclpp.ExecutionPlanRegistry.register_plan(plan)  # register the compiled plan with the ExecutionPlanRegistry
mscclpp.ExecutionPlanRegistry.set_selector(selector)  # set the selector, which returns the best execution plan based on collective, message size, world size, etc.
```
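For illustration, a selector can be thought of as an ordinary Python callable that maps collective metadata to one of the registered plans. The sketch below is an assumption about the shape of that callback — the argument names, the dict-shaped plans, and the size-range fields are all hypothetical, not the documented interface:

```python
# Hypothetical selector sketch. The parameters (collective, message_size,
# world_size) and the plan dict fields are illustrative assumptions, not
# the real MSCCL++ plan schema.
def select_plan(plans, collective, message_size, world_size):
    """Pick the registered plan matching the collective whose byte range
    covers the message, preferring the most specific (narrowest) range."""
    candidates = [
        p for p in plans
        if p["collective"] == collective
        and p["min_bytes"] <= message_size <= p["max_bytes"]
        and p["world_size"] == world_size
    ]
    if not candidates:
        return None  # fall back to the default implementation
    return min(candidates, key=lambda p: p["max_bytes"] - p["min_bytes"])

plans = [
    {"name": "allreduce_small", "collective": "allreduce",
     "min_bytes": 0, "max_bytes": 1 << 20, "world_size": 8},
    {"name": "allreduce_large", "collective": "allreduce",
     "min_bytes": 1 << 20, "max_bytes": 1 << 32, "world_size": 8},
]
chosen = select_plan(plans, "allreduce", 4096, 8)  # -> the "small" plan
```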
Fix #556
---------
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
MSCCL++ DSL (domain-specific language) enables concise expression of collective algorithms as Python functions.
MSCCL++ offers pythonic utilities to author, JIT-compile, register, and select execution plans. This guide walks through two integration paths: a customized MSCCL++ communicator and NCCL interposition that accelerates existing PyTorch `backend="nccl"` workloads.
## Initial Setup
Run the following from the repository root after completing the basic project setup:
1. Install Python dependencies.
```bash
pip install -r ./python/<requirements_file>
```
Replace `<requirements_file>` with the file that matches your environment (e.g., `requirements_cuda11.txt`, `requirements_cuda12.txt`, or `requirements_rocm6.txt`).
2. Install the module and generate default algorithm plans.
```bash
pip install . && python3 -m mscclpp --install
```
## Integration Options
MSCCL++ DSL integrates into your training or inference workload in two ways:
1. **Custom MSCCL++ Communicator** — directly manage an MSCCL++ communicator and launch collectives with the MSCCL++ executor.
2. **NCCL Interposition** — keep using `backend="nccl"`; MSCCL++ intercepts NCCL calls at runtime for drop-in acceleration.
Both paths follow the same high-level flow:
1. Author (or reuse) a collective algorithm with the MSCCL++ DSL.
2. Compile it into an execution plan.
3. Register the plan with the MSCCL++ runtime.
4. Configure a selector to choose the plan for each collective call.
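To make steps 2–4 concrete, the runtime bookkeeping behaves roughly like the toy registry below. This is a deliberately simplified stand-in that mirrors the shape of `mscclpp.ExecutionPlanRegistry` (register plans, install a selector, look one up per call) — the class, method internals, and plan fields here are illustrative, not the real implementation:

```python
# Toy model of the register/select flow. Mirrors the shape of
# mscclpp.ExecutionPlanRegistry but is NOT the real implementation.
class ToyPlanRegistry:
    def __init__(self):
        self._plans = []
        self._selector = None

    def register_plan(self, plan):
        """Step 3: make a compiled plan available to the runtime."""
        self._plans.append(plan)

    def set_selector(self, selector):
        """Step 4: selector(plans, collective, message_size) -> plan or None."""
        self._selector = selector

    def lookup(self, collective, message_size):
        """Called per collective invocation to pick a plan."""
        if self._selector is None:
            return None
        return self._selector(self._plans, collective, message_size)

registry = ToyPlanRegistry()
registry.register_plan({"name": "allreduce_nvls", "collective": "allreduce"})
registry.set_selector(
    lambda plans, coll, size: next(
        (p for p in plans if p["collective"] == coll), None))
plan = registry.lookup("allreduce", 1 << 20)  # -> the registered plan
```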
Below we show an AllReduce example and then detail each integration option.
### Example: AllReduce in the MSCCL++ DSL
The snippet defines an AllReduce that uses NVLS for intra-node reduce-scatter followed by broadcast.
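The communication pattern that snippet expresses — reduce-scatter, then broadcast — can be checked with a small pure-Python simulation. Note this is *not* DSL syntax; plain lists stand in for device buffers, and the function only demonstrates that the two phases compose into AllReduce semantics:

```python
def allreduce_two_phase(buffers):
    """Simulate reduce-scatter followed by broadcast across ranks.

    buffers[r] is rank r's input list; every rank ends with the
    elementwise sum of all inputs, i.e. AllReduce semantics.
    """
    n = len(buffers)              # number of ranks
    chunk = len(buffers[0]) // n  # each rank owns one equal chunk

    # Phase 1: reduce-scatter -- rank r reduces chunk r from all ranks.
    reduced = [
        [sum(buffers[src][r * chunk + i] for src in range(n))
         for i in range(chunk)]
        for r in range(n)
    ]
    # Phase 2: broadcast -- every rank collects all reduced chunks.
    full = [x for part in reduced for x in part]
    return [list(full) for _ in range(n)]

ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]
out = allreduce_two_phase(ranks)  # each rank ends with [11, 22, 33, 44]
```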
### Integrate with MSCCL++ customized communicator
Use when you want a PyTorch-compatible interface with fine-grained control. You manage the communicator, compile/register DSL plans, and invoke collectives via a thin wrapper. The example below shows an AllReduce built on the MSCCL++ communicator and executor.
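The general shape of that thin wrapper is sketched below with stub objects. The `execute` method name and everything about `_EchoExecutor` are placeholders, not the real MSCCL++ executor API — consult the MSCCL++ Python examples for the actual communicator/executor calls:

```python
# Sketch of a thin collective wrapper over a pre-registered plan.
# The executor interface here is a hypothetical stand-in.
class AllreduceWrapper:
    def __init__(self, executor, plan):
        self._executor = executor   # object exposing execute(plan, tensor)
        self._plan = plan           # compiled execution plan

    def allreduce(self, tensor):
        # Delegate to the executor with the plan chosen at setup time.
        return self._executor.execute(self._plan, tensor)

class _EchoExecutor:
    """Stub executor used only to exercise the wrapper shape."""
    def execute(self, plan, tensor):
        return ("executed", plan, tensor)

w = AllreduceWrapper(_EchoExecutor(), "allreduce_plan")
result = w.allreduce([1.0, 2.0])
```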