update amd build doc#25
yiakwy-xpu-ml-framework-team wants to merge 5 commits into ByteDance-Seed:main from
@KnowingNothing could you have a look at it?
install_path=$1
fi

hip_cmake_path="/opt/rocm/lib/cmake/hip;/opt/rocm/lib/cmake/rocprim;/opt/rocm/lib/cmake/rocthrust"
Is this path necessary? Can we assume there is /opt/rocm all the time?
These are not standard CMake prefix paths. They may have been added in the container, but I cannot guarantee that. If you remove them, the build could abort unexpectedly.
Not sure about this. What error do you hit without it? @YellowHCH do you remember this?
I don't encounter this problem. Maybe it's because a different Docker image was used? @yiakwy-xpu-ml-framework-team Could you share the image you used?
@YellowHCH I used this Dockerfile:
https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.rocm
It is built on the standard rocm6.3 image, with the SDK updated to 6.3.3:
https://github.com/yiakwy-xpu-ml-framework-team/Tools-dockerhub/blob/main/rocm/update_sdk_6.3.3.sh
@yiakwy-xpu-ml-framework-team We have successfully built our project on MI300 using the Docker image rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.4, and it functions as expected. I will proceed to test the Docker image you suggested at a later time.
@yiakwy-xpu-ml-framework-team As @YellowHCH said, it works without this line change, so can you please update the PR and remove it? Thanks.
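One way to reconcile the two positions in this thread is to add the extra prefix entries only when the directories actually exist, so images without them fall back to CMake's defaults. The following is a minimal sketch, not the project's build script: the helper name `build_hip_cmake_path` is hypothetical, and honoring a `ROCM_PATH` environment variable before falling back to `/opt/rocm` is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): build a ';'-separated
# CMake prefix list from the ROCm package directories that actually exist.
# Assumption: ROCM_PATH, if set, points at the ROCm install root.
build_hip_cmake_path() {
  rocm_root="${1:-${ROCM_PATH:-/opt/rocm}}"
  result=""
  for pkg in hip rocprim rocthrust; do
    dir="${rocm_root}/lib/cmake/${pkg}"
    if [ -d "${dir}" ]; then
      # Append with a ';' separator only when result is already non-empty.
      result="${result:+${result};}${dir}"
    fi
  done
  printf '%s' "${result}"
}

hip_cmake_path="$(build_hip_cmake_path)"
if [ -n "${hip_cmake_path}" ]; then
  echo "extra CMake prefix path: ${hip_cmake_path}"
else
  echo "no ROCm CMake package directories found; relying on CMake defaults"
fi
```

With this shape the line is harmless on images where the packages are already discoverable, and it still helps on images where they are not.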
- ROCm 6.3.3
- torch-2.6/2.8 (torch-2.6 has major improvement, compatible with SGLang, vLLM)
- python3.12.8
- MI300X/MI325X
Is it ok to mention MI300X? @wenlei-bao please check this.
No, please use CDNA3.
Also, can you make it compatible with ROCm 6.3.0 and torch 2.5.1? These are the versions we mostly use.
docs/build_amd.md
Outdated
# /workspace/3rdparty/ point to the parent folder you cloned for `Triton-distributed`
git config --global --add safe.directory /workspace/3rdparty/Triton-distributed/3rdparty/rocshmem
What is /workspace/3rdparty? I don't think this is a general instruction; it is only needed in a personal Docker container.
I replaced it with the environment variable TRITON_DIST_HOME, just for convenience of description. Since the non-Docker build process relies on git submodules, this information is better included.
# TRITON_DIST_HOME is the parent folder that contains the Triton-distributed clone
export TRITON_DIST_HOME="$(readlink -f "$(pwd)")"
git config --global --add safe.directory "$TRITON_DIST_HOME/Triton-distributed/3rdparty/rocshmem"
git config --global --add safe.directory "$TRITON_DIST_HOME/Triton-distributed/3rdparty/triton"
git config --global --add safe.directory "$TRITON_DIST_HOME/Triton-distributed"
git submodule update --init --recursive
Yeah, I don't think we want this specific setup config.
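If a step like this were kept as an optional note for container users, it could be made idempotent so that repeated runs do not pile up duplicate `safe.directory` entries. This is a sketch under assumptions: `add_safe_directory` is a hypothetical helper, the three paths are the ones proposed above, and nothing runs unless `TRITON_DIST_HOME` is set.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add a path to git's global safe.directory list
# only if that exact path is not already registered.
add_safe_directory() {
  dir="$1"
  if ! git config --global --get-all safe.directory | grep -Fxq -- "${dir}"; then
    git config --global --add safe.directory "${dir}"
  fi
}

# Only touch the global git config when TRITON_DIST_HOME is set
# (the parent folder of the Triton-distributed clone, as proposed above).
if [ -n "${TRITON_DIST_HOME:-}" ]; then
  repo="${TRITON_DIST_HOME}/Triton-distributed"
  for path in "${repo}" "${repo}/3rdparty/triton" "${repo}/3rdparty/rocshmem"; do
    add_safe_directory "${path}"
  done
fi
```

Guarding on the variable keeps the snippet inert for users who build outside such a container, which is the concern raised in this thread.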
## Setup without docker

1. make sure torch-rocm is installed for ROCm SDK 6.3.3
Here we want it compatible with 6.3.0.
Since no major changes were introduced, the build should work with 6.3.0. I can update the version number.
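Rather than pinning the doc to one exact SDK release, the setup step could state a minimum and check it. The sketch below illustrates that idea only; the `version_ge` helper is hypothetical, it relies on GNU `sort -V`, and the location of the version file (`.info/version` under the ROCm root, with `ROCM_PATH` honored if set) is an assumption to verify against your install.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B.
# Relies on GNU `sort -V` (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum taken from the review thread (ROCm 6.3.0).
min_rocm="6.3.0"
# Assumption: the ROCm install records its version in this file.
rocm_version_file="${ROCM_PATH:-/opt/rocm}/.info/version"

if [ -r "${rocm_version_file}" ]; then
  rocm_version="$(head -n 1 "${rocm_version_file}")"
  # Drop any build suffix such as "-NN" before comparing.
  rocm_version="${rocm_version%%-*}"
  if version_ge "${rocm_version}" "${min_rocm}"; then
    echo "ROCm ${rocm_version} satisfies >= ${min_rocm}"
  else
    echo "ROCm ${rocm_version} is older than ${min_rocm}" >&2
  fi
else
  echo "ROCm version file not found at ${rocm_version_file}; skipping check"
fi
```

A minimum-version check of this kind would accept both 6.3.0 and 6.3.3, which matches the expectation in this thread that the build works on either.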
Update AMD build doc
The current build is broken after refactor.
Note
Verified