
update amd build doc #25

Open
yiakwy-xpu-ml-framework-team wants to merge 5 commits into ByteDance-Seed:main from yiakwy-xpu-ml-framework-team:update_amd_build



@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team commented May 16, 2025

Update AMD build doc

The current build is broken after the refactor.

Note

  • The HIP CMake path needed to build rocSHMEM has been added correctly:

ROCm/rocSHMEM#130

  • "cp -Rrf --preserve" may cause permission issues in a remote non-privileged container
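
A minimal sketch of one way to sidestep the permission issue noted above, assuming GNU cp and assuming the failure comes from restoring file ownership (the PR does not state the exact cause). With no attribute list, GNU cp's `--preserve` defaults to mode, ownership, and timestamps; naming only mode and timestamps means no ownership change is attempted. The helper name `copy_tree` is hypothetical and not part of the PR.

```shell
# Hypothetical helper (not part of the PR): copy a directory tree while
# preserving mode and timestamps but NOT ownership, so no chown is attempted.
# Assumes GNU cp, which accepts an attribute list for --preserve.
copy_tree() {
  src="$1"
  dst="$2"
  mkdir -p "$dst" || return 1
  # "$src"/. copies the contents of src, including dotfiles, into dst.
  cp -R --preserve=mode,timestamps "$src"/. "$dst"/
}
```

Called as `copy_tree ./build/out "$install_path"`, where both paths are placeholders for whatever the build script actually copies.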

Verified

test_amd_single_node_build


CLAassistant commented May 16, 2025

CLA assistant check
All committers have signed the CLA.

Author

yiakwy-xpu-ml-framework-team commented May 16, 2025

@KnowingNothing could you have a look at it?

cc @CRobeck @knwng

install_path=$1
fi

hip_cmake_path="/opt/rocm/lib/cmake/hip;/opt/rocm/lib/cmake/rocprim;/opt/rocm/lib/cmake/rocthrust"
Collaborator

Is this path necessary? Can we assume there is /opt/rocm all the time?

Author

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team May 19, 2025

These are not standard CMake prefix paths. They may already be set up in the container, but I cannot guarantee that. If you remove them, the build could abort unexpectedly.
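
One possible middle ground between hard-coding `/opt/rocm` and dropping the paths is to derive the list from the environment and keep only directories that exist. The sketch below is an illustration, not the PR's code: the helper name `build_hip_cmake_path` is hypothetical, and falling back from `ROCM_PATH` to `/opt/rocm` is an assumption about how the container is set up.

```shell
# Hypothetical helper: build a semicolon-separated CMake prefix list from
# whichever ROCm package config directories actually exist under a root.
# Root resolution order (an assumption): first argument, $ROCM_PATH, /opt/rocm.
build_hip_cmake_path() {
  rocm_root="${1:-${ROCM_PATH:-/opt/rocm}}"
  result=""
  for pkg in hip rocprim rocthrust; do
    dir="${rocm_root%/}/lib/cmake/${pkg}"
    if [ -d "$dir" ]; then
      # Append with a ';' separator only when result is already non-empty.
      result="${result:+${result};}${dir}"
    fi
  done
  printf '%s\n' "$result"
}
```

With this, `hip_cmake_path="$(build_hip_cmake_path)"` yields an empty string on images that lack those directories, leaving CMake to its default search.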

Collaborator

@wenlei-bao wenlei-bao May 19, 2025

Not sure about this. What error do you hit without it? @YellowHCH do you remember this?

Contributor

I don't encounter this problem. Maybe it's because a different Docker image was used? @yiakwy-xpu-ml-framework-team Could you share the image you used?

Contributor

@yiakwy-xpu-ml-framework-team We have successfully built our project on MI300 using the Docker image rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.4, and it functions as expected. I will proceed to test the Docker image you suggested at a later time.

Collaborator

@yiakwy-xpu-ml-framework-team As @YellowHCH said, it works without this line change, so can you please update and remove it? Thanks.

- ROCm 6.3.3
- torch-2.6/2.8 (torch-2.6 has major improvement, compatible with SGLang, vLLM)
- python3.12.8
- MI300X/MI325X
Collaborator

Is it ok to mention MI300X? @wenlei-bao please check this.

Collaborator

No, please use CDNA3.
Also, can you make it compatible with ROCm 6.3.0 and torch 2.5.1? These are the versions we mostly use.


sure


```
# /workspace/3rdparty/ point to the parent folder you cloned for `Triton-distributed`
git config --global --add safe.directory /workspace/3rdparty/Triton-distributed/3rdparty/rocshmem
```
Collaborator

@KnowingNothing KnowingNothing May 16, 2025

What is /workspace/3rdparty? I think this is not a general instruction; it is only needed in a personal Docker container.


I replaced it with the environment variable TRITON_DIST_HOME, just for convenience of description. Since the non-Docker build process relies on git submodules, this information is better included.

export TRITON_DIST_HOME=$(readlink -f `pwd`)

git config --global --add safe.directory $TRITON_DIST_HOME/Triton-distributed/3rdparty/rocshmem
git config --global --add safe.directory $TRITON_DIST_HOME/Triton-distributed/3rdparty/triton
git config --global --add safe.directory $TRITON_DIST_HOME/Triton-distributed

git submodule update --init --recursive
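
For context, Git refuses to operate in a repository owned by a different user unless its path is listed under `safe.directory`, which is why the commands above can be needed when the checkout is mounted into a container. The sketch below only generates the three paths from a single root so they are not repeated by hand; the helper name `list_safe_directories` is hypothetical, and the submodule locations are taken from the commands above.

```shell
# Hypothetical helper: print the repository root and the two submodule
# paths named in the commands above, one per line.
list_safe_directories() {
  repo_root="${1%/}"   # drop a single trailing slash, if any
  printf '%s\n' \
    "$repo_root" \
    "$repo_root/3rdparty/rocshmem" \
    "$repo_root/3rdparty/triton"
}

# Example use (writes to the global git config, so it is left commented out):
#   list_safe_directories "$TRITON_DIST_HOME/Triton-distributed" |
#     while IFS= read -r dir; do
#       git config --global --add safe.directory "$dir"
#     done
```

Whether this belongs in the general build doc or only in a container-specific note is the open question in this thread.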

Collaborator

Yeah, I don't think we want this specific setup config.



## Setup without docker

1. make sure torch-rocm is installed for ROCm SDK 6.3.3
Collaborator

Here we want it compatible with 6.3.0.

Author

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team May 22, 2025

Since no major changes were introduced, the build should work with 6.3.0. I can update the version number.
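
A build script could check the minimum version directly instead of relying on the number written in the doc. The sketch below is an illustration under stated assumptions: the helper name `version_ge` is hypothetical, it relies on `sort -V` (available in GNU coreutils), and the location of the installed ROCm version file in the usage comment is an assumption about the install layout.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal to
# version $2. Uses sort -V (version sort); the smaller of the two versions
# sorts first, so $1 >= $2 exactly when the first sorted line equals $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use (assumes the ROCm install records its version in
# /opt/rocm/.info/version; adjust the path if your layout differs):
#   version_ge "$(cat /opt/rocm/.info/version)" 6.3.0 ||
#     echo "ROCm older than 6.3.0" >&2
```

The same helper works for the torch requirement, for example `version_ge "$torch_version" 2.5.1`, where `torch_version` is a placeholder for however the script obtains it.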



5 participants