stellaraccident
Collaborator

* Includes a couple of workarounds that are unfortunate but would be hard to fix at the root in one step:
  * Some of the environment variables needed to locate ROCm for the PyTorch build (which shouldn't be necessary at all, but c'est la vie for now) conflict badly with the clang driver heuristics for locating device bitcode. The workaround is to also set `HIP_DEVICE_LIB_PATH` manually and curse at the stars about removing all of these legacy special vars.
  * PyTorch at head uses rocm-smi-lib for distributed, and including those headers does not advertise the transitive include dirs for the sysdeps, so libdrm is not found (presumably most old ROCm installs treated it as a system library, whereas we vendor it and need to propagate its header path through `find_package`). The workaround is to manually add the rocm_sysdeps include and lib dirs.
* Adds a `--build-triton` / `--no-build-triton` flag pair for ergonomics when iterating.
* Unconditionally sets `USE_ROCM=ON` in all paths, not just on Windows: new branches require this.
* Adds `_rocm_init.py` for the ROCm wheel bootstrapping logic described in `docs/packaging/python_packaging.md`, and updates that doc to match how it actually landed in PyTorch.
* Adds docs indicating that it is valid to check out the PyTorch `nightly` branch, which tracks the most recent pytorch.org nightly build.
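
For readers unfamiliar with paired on/off flags, here is a minimal sketch of how a `--build-triton` / `--no-build-triton` pair can be declared in Python with `argparse.BooleanOptionalAction` (Python 3.9+). The parser description, the `make_parser` helper, and the `None` default meaning "decide automatically" are assumptions for illustration, not necessarily how the build script in this PR declares the flag.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Build a parser with a paired on/off flag (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Hypothetical wheel build driver")
    # BooleanOptionalAction registers both --build-triton and
    # --no-build-triton from a single declaration.
    parser.add_argument(
        "--build-triton",
        action=argparse.BooleanOptionalAction,
        default=None,  # Assumption: None means "let the script decide".
        help="Force building (or skipping) triton while iterating",
    )
    return parser


def describe(argv: list) -> str:
    """Return a short description of the triton decision for the given argv."""
    args = make_parser().parse_args(argv)
    if args.build_triton is None:
        return "auto"
    return "build" if args.build_triton else "skip"
```

Calling `describe(["--no-build-triton"])` reports that triton is skipped, while an empty argument list leaves the decision to the script. Keeping a tri-state default lets the script distinguish "user said no" from "user said nothing".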

This should be NFC for every other PyTorch build. A follow-up needs to add an actual head-on-head nightly build pipeline.

Tested: Local build with gfx94X wheels produced a working torch/torchaudio/torchvision install. I wasn't actually running on a 942 system, so it didn't do much beyond that, but it did import and let me create tensors. The Radeon ROCm wheels are known to be broken today due to a CK bug, so I can pick those up tomorrow.
@stellaraccident
Collaborator Author

Addressed ergonomic comments and verified locally that the resulting torch+triton build installs and functions.

@stellaraccident stellaraccident merged commit ce2d1b8 into main Jul 2, 2025
4 checks passed
@stellaraccident stellaraccident deleted the pytorch_nightly_build branch July 2, 2025 23:57
@github-project-automation github-project-automation bot moved this from TODO to Done in TheRock Triage Jul 2, 2025
ScottTodd added a commit that referenced this pull request Jul 10, 2025
Progress on #827.

Follow-up to #959, expanding support
on Windows.

Without this I get a warning from `build_prod_wheels.py` on Windows:
```
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=False -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=D:\b\pytorch_v2.7.0\torch -DCMAKE_PREFIX_PATH=D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages;D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\_rocm_sdk_devel\lib\cmake -DPython_EXECUTABLE=D:\projects\TheRock\external-builds\pytorch\.venv\Scripts\python.exe -DTORCH_BUILD_VERSION=2.7.0a0+rocmsdk20250709 -DUSE_FLASH_ATTENTION=0 -DUSE_GLOO=OFF -DUSE_KINETO=OFF -DUSE_MEM_EFF_ATTENTION=0 -DUSE_NUMPY=True -DUSE_ROCM=ON D:\b\pytorch_v2.7.0
cmake --build . --target install --config Release
rocm version 7.0.0.dev0:
  PYTHON VERSION: 3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)]
  CMAKE_PREFIX_PATH = D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\_rocm_sdk_devel\lib\cmake
  BIN = D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\_rocm_sdk_devel\bin
  ROCM_HOME = D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\_rocm_sdk_devel
  PATH = ...
  Using default PYTORCH_ROCM_ARCH from rocm-sdk targets: gfx1100;gfx1101;gfx1102
WARNING: Default location of device libs not found. Relying on clang heuristics which are known to be buggy in this configuration
--- Not building triton (no --triton-dir)
  Default PYTORCH_BUILD_VERSION: 2.7.0a0+rocmsdk20250709
--- PYTORCH_EXTRA_INSTALL_REQUIREMENTS = rocm[libraries]==7.0.0.dev0
```

Followed by errors late into the build:
```
[6566/7081] Linking CXX shared library bin\torch_cpu.dll
[6567/7081] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/torch_hip_generated_Sleep.hip.obj
FAILED: caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/torch_hip_generated_Sleep.hip.obj D:/b/pytorch_v2.7.0/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/torch_hip_generated_Sleep.hip.obj 
C:\Windows\system32\cmd.exe /C "cd /D D:\b\pytorch_v2.7.0\build\caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\hip && D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\cmake\data\bin\cmake.exe -E make_directory D:/b/pytorch_v2.7.0/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/. && D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\cmake\data\bin\cmake.exe -D verbose:BOOL=OFF -D build_configuration:STRING=RELEASE -D generated_file:STRING=D:/b/pytorch_v2.7.0/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/./torch_hip_generated_Sleep.hip.obj -P D:/b/pytorch_v2.7.0/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/torch_hip_generated_Sleep.hip.obj.cmake"
clang: warning: argument unused during compilation: '--offload-compress' [-Wunused-command-line-argument]
clang: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library
clang: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library
clang: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library
failed to execute:""D:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/lib/llvm/bin\clang.exe"   --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 -O3  -c -x hip "D:/b/pytorch_v2.7.0/aten/src/ATen/hip/Sleep.hip" -o "D:/b/pytorch_v2.7.0/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/./torch_hip_generated_Sleep.hip.obj" --offload-compress -fclang-abi-compat=17 -DUSE_ROCM -D__HIP_PLATFORM_AMD__ -DTORCH_HIP_BUILD_MAIN_LIB -DROCM_ON_WINDOWS -DROCM_VERSION=85772 -DTORCH_HIP_VERSION=605 -DONNX_ML=1 -DONNXIFI_ENABLE_EXT=1 -DONNX_NAMESPACE=onnx_torch -D_CRT_SECURE_NO_DEPRECATE=1 -DUSE_EXTERNAL_MZCRC -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DEXPORT_AOTI_FUNCTIONS -DWIN32_LEAN_AND_MEAN -D_UCRT_LEGACY_INFINITY -DNOMINMAX -DUSE_MIMALLOC -DUSE_PROF_API=1 -DAT_PER_OPERATOR_HEADERS -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_AMD__ -DROCM_USE_FLOAT16 -D__HIP_PLATFORM_AMD__ -DFMT_HEADER_ONLY=1 -fms-runtime-lib=dll -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DTORCH_HIP_VERSION=605 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIPBLAS_V2 -fms-extensions -Wno-ignored-attributes -fno-gpu-rdc -ID:/b/pytorch_v2.7.0/build/aten/src -ID:/b/pytorch_v2.7.0/aten/src -ID:/b/pytorch_v2.7.0/build -ID:/b/pytorch_v2.7.0 -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/b/pytorch_v2.7.0/third_party/protobuf/src -ID:/b/pytorch_v2.7.0/third_party/XNNPACK/include -ID:/b/pytorch_v2.7.0/third_party/ittapi/include -ID:/b/pytorch_v2.7.0/cmake/../third_party/eigen -ID:/b/pytorch_v2.7.0/third_party/onnx -ID:/b/pytorch_v2.7.0/build/third_party/onnx -ID:/b/pytorch_v2.7.0/torch/include -ID:/b/pytorch_v2.7.0/third_party/ideep/include -ID:/b/pytorch_v2.7.0/nlohmann -ID:/b/pytorch_v2.7.0/INTERFACE 
-ID:/b/pytorch_v2.7.0/third_party/nlohmann/include -ID:/b/pytorch_v2.7.0/third_party/mimalloc/include -I/include -I/hcc/include -I/rocblas/include -I/hipsparse/include -I/include/rccl/ -ID:/b/pytorch_v2.7.0/aten/src/THH -ID:/b/pytorch_v2.7.0/aten/src/ATen/hip -ID:/b/pytorch_v2.7.0/aten/src/ATen/../../../third_party/composable_kernel/include -ID:/b/pytorch_v2.7.0/aten/src/ATen/../../../third_party/composable_kernel/library/include -ID:/b/pytorch_v2.7.0/build/caffe2/aten/src/ATen/composable_kernel -ID:/b/pytorch_v2.7.0/third_party/fmt/include -ID:/b/pytorch_v2.7.0/aten/src -ID:/b/pytorch_v2.7.0/build/caffe2/aten/src -ID:/b/pytorch_v2.7.0/build/aten/src -ID:/b/pytorch_v2.7.0/aten/src -ID:/b/pytorch_v2.7.0/aten/src/ATen/.. -ID:/b/pytorch_v2.7.0/c10/hip/../.. -ID:/b/pytorch_v2.7.0/build -ID:/b/pytorch_v2.7.0/c10/../ -ID:/b/pytorch_v2.7.0/build -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/b/pytorch_v2.7.0/torch/csrc/api -ID:/b/pytorch_v2.7.0/torch/csrc/api/include -ID:/b/pytorch_v2.7.0/third_party/protobuf/src -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include 
-ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include/hiprand -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include/rocrand -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/b/pytorch_v2.7.0/build/aten/src -ID:/b/pytorch_v2.7.0/aten/src -ID:/b/pytorch_v2.7.0/build -ID:/b/pytorch_v2.7.0 -ID:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/_rocm_sdk_devel/include -ID:/b/pytorch_v2.7.0/third_party/protobuf/src -ID:/b/pytorch_v2.7.0/third_party/XNNPACK/include -ID:/b/pytorch_v2.7.0/third_party/ittapi/include -ID:/b/pytorch_v2.7.0/cmake/../third_party/eigen -ID:/b/pytorch_v2.7.0/third_party/onnx -ID:/b/pytorch_v2.7.0/build/third_party/onnx -ID:/b/pytorch_v2.7.0/torch/include -ID:/b/pytorch_v2.7.0/third_party/ideep/include -ID:/b/pytorch_v2.7.0/nlohmann -ID:/b/pytorch_v2.7.0/INTERFACE -ID:/b/pytorch_v2.7.0/third_party/nlohmann/include -ID:/b/pytorch_v2.7.0/third_party/mimalloc/include"

CMake Error at torch_hip_generated_Sleep.hip.obj.cmake:200 (message):
  Error generating file
  D:/b/pytorch_v2.7.0/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/./torch_hip_generated_Sleep.hip.obj
```

This commit also sets `env` entries using an explicit `str()` instead of a raw
`Path`, since passing a `Path` led to other errors.
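
As a rough illustration of the two points above (forcing the device library location and stringifying `env` entries), here is a minimal Python sketch. The `build_env` helper and the `lib/llvm/amdgcn/bitcode` subdirectory are assumptions made for the example, not confirmed by this commit message; `ROCM_HOME` and `HIP_DEVICE_LIB_PATH` are the variable names that appear in the log and PR description.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def build_env(rocm_dir: Path, base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with ROCm entries added as plain strings."""
    env = dict(os.environ if base_env is None else base_env)

    # Convert Path objects explicitly. Assigning a Path to os.environ raises
    # TypeError, and subprocess on Windows rejects non-string env values.
    env["ROCM_HOME"] = str(rocm_dir)

    # Assumed layout (hypothetical): device bitcode under lib/llvm/amdgcn/bitcode.
    device_lib_dir = rocm_dir / "lib" / "llvm" / "amdgcn" / "bitcode"
    if device_lib_dir.is_dir():
        # Point clang at the device libs directly instead of relying on
        # its search heuristics.
        env["HIP_DEVICE_LIB_PATH"] = str(device_lib_dir)
    else:
        print("WARNING: device libs not found; leaving HIP_DEVICE_LIB_PATH unset")
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run`. Returning a fresh dict rather than mutating `os.environ` keeps the workaround scoped to the build subprocess.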