update amd build doc #25
base: main
Changes from 2 commits
ad5ec67
13c85c3
b79c734
b4c5dba
c7d7529
yiakwy-xpu-ml-framework-team marked this conversation as resolved.
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| # Build Triton-distributed (RocSHMEM) | ||
|
|
||
| ## Best practices for using Triton-distributed on AMD GPUs | ||
|
|
||
| - ROCm 6.3.3 | ||
| - torch-2.6/2.8 (torch-2.6 has major improvement, compatible with SGLang, vLLM) | ||
| - python3.12.8 | ||
| - MI300X/MI325X | ||
|
Collaborator
Is it ok to mention MI300X? @wenlei-bao please check this.
Collaborator
No, please use CDNA3.
Author
sure |
||
|
|
||
| Other dependency versions may also work, but this is not guaranteed. If you run into problems during installation, please open an issue. | ||
|
|
||
| ## Setup without docker | ||
|
|
||
| 1. make sure torch-rocm is installed for ROCm SDK 6.3.3 | ||
|
Collaborator
Here we want it compatible with 6.3.0.
Author
Since no major changes were introduced, the build should work with 6.3.0. I can update the version number. |
||
| ```sh | ||
| pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3 | ||
| ``` | ||
| 2. install ompi for ROCm SDK | ||
| ``` | ||
| sudo apt-get update -y && \ | ||
| sudo apt install -y libopenmpi-dev | ||
| ``` | ||
| 3. install other dependencies | ||
| ``` | ||
| python3 -m pip install -i https://test.pypi.org/simple hip-python~=6.3.3 # or whatever ROCm version you have | ||
| pip3 install pybind11 | ||
| ``` | ||
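Since the review asks for compatibility across ROCm 6.3.x, a pre-install check could compare only the major.minor part of the installed version against the `rocm6.3` wheel index. The following is a minimal sketch: `rocm_major_minor` is a hypothetical helper, not part of this repository, and reading the version from `/opt/rocm/.info/version` is an assumption about the local install layout.

```sh
# Hypothetical helper (not part of Triton-distributed): reduce a ROCm version
# string such as "6.3.3" or "6.3.3-74" to its "major.minor" prefix.
rocm_major_minor() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Example use, assuming the version file exists at this path on your system:
#   installed=$(cat /opt/rocm/.info/version)
#   [ "$(rocm_major_minor "$installed")" = "6.3" ] || echo "expected ROCm 6.3.x" >&2
```

Comparing only the prefix keeps the check valid for both 6.3.0 and 6.3.3, which is the range discussed in this review.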
|
|
||
| #### Warning: installing inside an existing Docker container | ||
|
|
||
| Make sure the following repositories are marked as safe directories so that Git can clone their submodules: | ||
|
|
||
| ``` | ||
| # /workspace/3rdparty/ is the parent folder into which you cloned `Triton-distributed` | ||
| git config --global --add safe.directory /workspace/3rdparty/Triton-distributed/3rdparty/rocshmem | ||
|
||
| git config --global --add safe.directory /workspace/3rdparty/Triton-distributed/3rdparty/triton | ||
| git config --global --add safe.directory /workspace/3rdparty/Triton-distributed | ||
|
|
||
| git submodule update --init --recursive | ||
| ``` | ||
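The three `safe.directory` entries follow one pattern: the repository root plus its two submodule checkouts. A minimal sketch that derives them from a single root path, so the commands need no hand editing when the clone lives elsewhere. `list_safe_dirs` and `mark_safe_dirs` are hypothetical names, not scripts shipped in this repository.

```sh
# Hypothetical helper: print the repository root and the two submodule paths
# (rocshmem and triton), one per line.
list_safe_dirs() {
    repo_root="${1%/}"   # drop a single trailing slash, if any
    printf '%s\n' \
        "$repo_root" \
        "$repo_root/3rdparty/rocshmem" \
        "$repo_root/3rdparty/triton"
}

# Hypothetical wrapper: register each path as a Git safe directory.
# Note that this edits your global Git configuration.
mark_safe_dirs() {
    list_safe_dirs "$1" | while IFS= read -r dir; do
        git config --global --add safe.directory "$dir"
    done
}
```

For the layout used in this document, the call would be `mark_safe_dirs /workspace/3rdparty/Triton-distributed`, followed by `git submodule update --init --recursive`.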
|
|
||
| ## Build | ||
|
|
||
| > python3 python/setup.py build_ext | ||
|
|
||
|  | ||
yiakwy-xpu-ml-framework-team marked this conversation as resolved.
|
||
|
|
||
| ## Test | ||
|
|
||
| Currently only single-node builds are supported; multi-node builds will be supported soon. | ||
|
|
||
| - Single node test | ||
| ``` | ||
| bash ./scripts/launch_amd.sh python/triton_dist/test/amd/test_ag_gemm_intra_node.py 8192 8192 29568 | ||
| ``` | ||
|
|
||
|  | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -9,6 +9,8 @@ else | |
| install_path=$1 | ||
| fi | ||
|
|
||
| hip_cmake_path="/opt/rocm/lib/cmake/hip;/opt/rocm/lib/cmake/rocprim;/opt/rocm/lib/cmake/rocthrust" | ||
|
Collaborator
Is this path necessary? Can we assume there is /opt/rocm all the time?
Author
These are not standard CMake prefix paths. They may have been added in the container, but I cannot be sure of that. If you remove them, the build could abort unexpectedly.
Collaborator
Not sure about this. What error do you hit without it? @YellowHCH do you remember this?
Contributor
I haven't encountered this problem. Maybe it's because a different Docker image was used? @yiakwy-xpu-ml-framework-team Could you share the image you used?
Author
@YellowHCH I used this dockerfile,
which was built upon the standard rocm6.3 image. The SDK is updated to 6.3.3. :
Contributor
@yiakwy-xpu-ml-framework-team We have successfully built our project on MI300 using the Docker image
Collaborator
@yiakwy-xpu-ml-framework-team As @YellowHCH said, it works without this line change, so can you please update the PR and remove it? Thanks. |
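One way to reconcile the two positions in this thread, sketched under assumptions: build the prefix list only from directories that actually exist, so images that lack these paths are unaffected. `build_hip_cmake_path` is a hypothetical helper, and falling back from `ROCM_PATH` to `/opt/rocm` is an assumption about the environment, not something the build script does today.

```sh
# Hypothetical helper: join the CMake package directories for hip, rocprim and
# rocthrust that exist under a ROCm root into a ';'-separated list.
# The root is taken from $1, then $ROCM_PATH, then /opt/rocm (assumed default).
build_hip_cmake_path() {
    rocm_root="${1:-${ROCM_PATH:-/opt/rocm}}"
    result=""
    for pkg in hip rocprim rocthrust; do
        dir="$rocm_root/lib/cmake/$pkg"
        if [ -d "$dir" ]; then
            # Prepend the separator only when the list is already non-empty.
            result="${result:+$result;}$dir"
        fi
    done
    printf '%s\n' "$result"
}
```

The script could then set `hip_cmake_path=$(build_hip_cmake_path)` and pass `-DCMAKE_PREFIX_PATH="$hip_cmake_path"` only when the result is non-empty.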
||
|
|
||
| src_path=$(dirname "$(realpath $0)")/../../../3rdparty/rocshmem/ | ||
|
|
||
| cmake \ | ||
|
|
@@ -29,6 +31,7 @@ cmake \ | |
| -DUSE_SINGLE_NODE=ON \ | ||
| -DUSE_HOST_SIDE_HDP_FLUSH=OFF \ | ||
| -DBUILD_LOCAL_GPU_TARGET_ONLY=ON \ | ||
| -DCMAKE_PREFIX_PATH="$hip_cmake_path" \ | ||
| $src_path | ||
| cmake --build . --parallel | ||
| cmake --install . | ||