Changes from 2 commits
Binary file added asset/build_amd_snapshot.png
Binary file added asset/test_amd_single_node_build.png
59 changes: 3 additions & 56 deletions docs/build.md
@@ -6,11 +6,6 @@
- Torch 2.4.1
- Clang 19

#### For AMD GPU:
- ROCm 6.3.0
- Torch 2.4.1 with ROCm support

Dependencies with other versions may also work, but this is not guaranteed. If you hit any problem when installing, please tell us in Issues.

@@ -26,10 +21,7 @@ Dependencies with other versions may also work well, but this is not guaranteed.
pip3 install black "clang-format==19.1.2" pre-commit ruff yapf==0.43
pip3 install ninja cmake wheel pybind11 cuda-python==12.4 numpy chardet pytest
```
For AMD GPU, use torch with ROCm support and hip-python:
```sh
python3 -m pip install -i https://test.pypi.org/simple hip-python>=6.3.0
```

4. Apply NVSHMEM fix
(Disclaimer: this step exists because of NVSHMEM license requirements; we cannot legally release any modified code or patch.)

@@ -84,8 +76,6 @@ Dependencies with other versions may also work well, but this is not guaranteed.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/llvm-project/build/lib
```

For rocSHMEM on AMD GPU, no explicit build is required, as the build process is integrated into Triton-distributed.

6. Build Triton-distributed
Then you can build Triton-distributed.
```sh
@@ -114,20 +104,13 @@ This example runs on a single node with 8 H800 GPUs.
```sh
bash ./third_party/distributed/launch.sh ./third_party/distributed/distributed/test/nvidia/test_ag_gemm_intra_node.py --case correctness_tma
```
For AMD CDNA3 GPUs:
```sh
bash ./third_party/distributed/launch_amd.sh ./third_party/distributed/distributed/test/amd/test_ag_gemm_intra_node.py 8192 53248 16384
```

#### GEMM ReduceScatter example on single node
This example runs on a single node with 8 H800 GPUs.
```sh
bash ./third_party/distributed/launch.sh ./third_party/distributed/distributed/test/nvidia/test_gemm_rs_multi_node.py 8192 8192 29568
```
For AMD CDNA3 GPUs:
```sh
bash ./third_party/distributed/launch_amd.sh ./third_party/distributed/distributed/test/amd/test_gemm_rs_intra_node.py 8192 3584 14336
```

#### NVSHMEM example in Triton-distributed
```sh
bash ./third_party/distributed/launch.sh ./third_party/distributed/distributed/test/nvidia/test_nvshmem_api.py
@@ -173,40 +156,4 @@ bash ./third_party/distributed/launch.sh ./third_party/distributed/distributed/t
# moe rs
bash ./third_party/distributed/launch.sh ./third_party/distributed/distributed/test/test_moe_reduce_rs_intra_node.py 8192 2048 1536 32 2
bash ./third_party/distributed/launch.sh ./third_party/distributed/distributed/test/test_moe_reduce_rs_intra_node.py 8192 2048 1536 32 2 --check
```

## To use Triton-distributed with the AMD backend:
- Start from the rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.4 Docker container.
#### Steps:
1. Clone the repo
```sh
git clone https://github.com/ByteDance-Seed/Triton-distributed.git
```
2. Update submodules
```sh
cd Triton-distributed/
git submodule update --init --recursive
```
3. Install dependencies
```sh
sudo apt-get update -y
sudo apt install -y libopenmpi-dev
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.3 --no-deps
./third_party/rocshmem_bind/build.sh
python3 -m pip install -i https://test.pypi.org/simple hip-python~=6.3.2  # or match your ROCm version
pip3 install pybind11
```
4. Build Triton-distributed
```sh
pip3 install -e python --verbose --no-build-isolation
```
### Test your installation
#### GEMM ReduceScatter example on single node
```sh
bash ./third_party/distributed/launch_amd.sh ./third_party/distributed/distributed/test/amd/test_ag_gemm_intra_node.py 8192 8192 29568
```
and check for the following (abridged) output:
```sh
torchrun --node_rank=0 --nproc_per_node=8 --nnodes=1 ./third_party/distributed/distributed/test/amd/test_ag_gemm_intra_node.py 8192 8192 29568
✅ Triton and Torch match
```
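The ✅ line means the distributed kernel's result matched a plain Torch reference. As a hypothetical, stdlib-only sketch of the kind of elementwise comparison such tests typically perform (the actual test works on torch tensors, not Python lists):

```python
import math

# Hypothetical stand-in for the test's comparison step: the real test
# compares a Triton-distributed GEMM result against a torch.matmul reference.
def allclose(a, b, rtol=1e-2, atol=1e-3):
    """True when every pair of elements agrees within the tolerances."""
    return all(
        math.isclose(x, y, rel_tol=rtol, abs_tol=atol)
        for x, y in zip(a, b)
    )

triton_out = [1.0001, 2.0002, 2.9999]   # pretend kernel output
torch_ref  = [1.0,    2.0,    3.0]      # pretend reference

print("Triton and Torch match" if allclose(triton_out, torch_ref)
      else "mismatch")  # prints: Triton and Torch match
```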
57 changes: 57 additions & 0 deletions docs/build_amd.md
@@ -0,0 +1,57 @@
# Build Triton-distributed (RocSHMEM)

## Best practice for using Triton-distributed on AMD GPUs

- ROCm 6.3.3
- torch 2.6/2.8 (torch 2.6 has major improvements and is compatible with SGLang and vLLM)
- Python 3.12.8
- MI300X/MI325X
> Collaborator: Is it ok to mention MI300X? @wenlei-bao please check this.

> Collaborator: No, please use CDNA3. Also, can you make it compatible with ROCm 6.3.0 and torch 2.5.1? This is the version we mostly used.

> Reply: sure


Dependencies with other versions may also work, but this is not guaranteed. If you hit any problem when installing, please tell us in Issues.

## Setup without Docker

1. Make sure torch with ROCm support is installed for ROCm SDK 6.3.3
> Collaborator: Here we want it compatible with 6.3.0.

> Author (@yiakwy-xpu-ml-framework-team, May 22, 2025): Since no major changes were introduced, the build should work with 6.3.0. I can update the version number.

```sh
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3
```
2. Install Open MPI for the ROCm SDK
```sh
sudo apt-get update -y && \
sudo apt install -y libopenmpi-dev
```
3. Install other dependencies
```sh
python3 -m pip install -i https://test.pypi.org/simple hip-python~=6.3.3  # or match your installed ROCm version
pip3 install pybind11
```
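The hip-python pin must track the installed ROCm SDK version. As a small illustration (this helper is hypothetical, not part of the repo), the mapping from a ROCm version string to the pip requirement specifier used above can be sketched as:

```python
# Hypothetical helper: derive the hip-python pip requirement from a ROCm
# SDK version string, matching the "~=major.minor.patch" pin used above.
def hip_python_requirement(rocm_version: str) -> str:
    parts = rocm_version.strip().split(".")
    if not parts or not all(p.isdigit() for p in parts):
        raise ValueError(f"unrecognized ROCm version: {rocm_version!r}")
    # Pad to three components: "~=6.3.3" allows 6.3.x patch releases
    # but excludes 6.4, per PEP 440 compatible-release semantics.
    while len(parts) < 3:
        parts.append("0")
    return "hip-python~={}.{}.{}".format(*parts[:3])

print(hip_python_requirement("6.3.3"))  # hip-python~=6.3.3
print(hip_python_requirement("6.3"))    # hip-python~=6.3.0
```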

#### Warning: installing inside an existing Docker container

Make sure the following repositories are marked as safe directories so that submodules can be cloned:

```sh
# /workspace/3rdparty/ points to the parent folder where you cloned `Triton-distributed`
git config --global --add safe.directory /workspace/3rdparty/Triton-distributed/3rdparty/rocshmem
git config --global --add safe.directory /workspace/3rdparty/Triton-distributed/3rdparty/triton
git config --global --add safe.directory /workspace/3rdparty/Triton-distributed

git submodule update --init --recursive
```

> Collaborator (@KnowingNothing, May 16, 2025): What is /workspace/3rdparty? I think this is not a general instruction, only needed in a personal docker container.

> Author: I replaced it with the environment variable TRITON_DIST_HOME, just for convenience of description. Since the non-docker build process relies on git submodules, this information is better appended:
>
> ```sh
> export TRITON_DIST_HOME=$(readlink -f `pwd`)
>
> git config --global --add safe.directory $TRITON_DIST_HOME/Triton-distributed/3rdparty/rocshmem
> git config --global --add safe.directory $TRITON_DIST_HOME/Triton-distributed/3rdparty/triton
> git config --global --add safe.directory $TRITON_DIST_HOME/Triton-distributed
>
> git submodule update --init --recursive
> ```

> Collaborator: Yeah, I don't think we want this specific setup config.

## Build

```sh
python3 python/setup.py build_ext
```

![build_amd](../asset/build_amd_snapshot.png)

## Test

Currently only the single-node build is supported; multi-node builds will be supported soon.

- Single-node test
```sh
bash ./scripts/launch_amd.sh python/triton_dist/test/amd/test_ag_gemm_intra_node.py 8192 8192 29568
```

![build_amd](../asset/test_amd_single_node_build.png)
5 changes: 4 additions & 1 deletion python/build_helpers.py
@@ -46,5 +46,8 @@ def copy_apply_patches():
for file in files:
source_file = os.path.join(root, file)
target_file = os.path.join(target_dir, file)
shutil.copy2(source_file, target_file)
try:
shutil.copy2(source_file, target_file)
except Exception:
shutil.copyfile(source_file, target_file)
print(f"Copied {source_file} to {target_file}")
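The try/except above guards against filesystems where `shutil.copy2` cannot preserve metadata (e.g. some container bind mounts reject chmod/utime), falling back to a plain content copy. A self-contained sketch of the same pattern (the repo's version catches any `Exception`; `OSError` is usually the relevant failure):

```python
import os
import shutil
import tempfile

def robust_copy(src: str, dst: str) -> None:
    """Copy src to dst, preserving metadata when possible.

    Falls back to a content-only copy when metadata copying fails,
    e.g. on filesystems that reject chmod/utime.
    """
    try:
        shutil.copy2(src, dst)      # contents + metadata (mtime, mode)
    except OSError:
        shutil.copyfile(src, dst)   # contents only

# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "a.txt")
    dst = os.path.join(d, "b.txt")
    with open(src, "w") as f:
        f.write("patch contents")
    robust_copy(src, dst)
    with open(dst) as f:
        print(f.read())  # prints: patch contents
```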
3 changes: 3 additions & 0 deletions shmem/rocshmem_bind/scripts/build_rshm_ipc_single.sh
@@ -9,6 +9,8 @@ else
install_path=$1
fi

hip_cmake_path="/opt/rocm/lib/cmake/hip;/opt/rocm/lib/cmake/rocprim;/opt/rocm/lib/cmake/rocthrust"
> Collaborator: Is this path necessary? Can we assume there is /opt/rocm all the time?

> Author (@yiakwy-xpu-ml-framework-team, May 19, 2025): These are not standard cmake prefix paths. They may be added in the container, but I cannot ensure that. If you remove them, the build could abort unexpectedly.

> Collaborator (@wenlei-bao, May 19, 2025): Not sure about this. What error do you hit without it? @YellowHCH do you remember this?

> Contributor: I don't encounter this problem. Maybe it's because of the different docker image used? @yiakwy-xpu-ml-framework-team Could you share the image you used?

> Contributor: @yiakwy-xpu-ml-framework-team We have successfully built our project on MI300 using the Docker image rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.4, and it functions as expected. I will test the Docker image you suggested later.

> Collaborator: @yiakwy-xpu-ml-framework-team As @YellowHCH said, it works without this line change, so can you please update and remove it? Thanks.


src_path=$(dirname "$(realpath $0)")/../../../3rdparty/rocshmem/

cmake \
@@ -29,6 +31,7 @@ cmake \
-DUSE_SINGLE_NODE=ON \
-DUSE_HOST_SIDE_HDP_FLUSH=OFF \
-DBUILD_LOCAL_GPU_TARGET_ONLY=ON \
-DCMAKE_PREFIX_PATH="$hip_cmake_path" \
$src_path
cmake --build . --parallel
cmake --install .