@Kewen12
Contributor

@Kewen12 Kewen12 commented Jul 1, 2025

Added a GPU test job limit to make this cache file consistent with the current buildbot config: https://github.com/llvm/llvm-zorg/blob/main/buildbot/osuosl/master/config/builders.py#L2027C31-L2027C77

@llvmbot
Member

llvmbot commented Jul 1, 2025

@llvm/pr-subscribers-offload

@llvm/pr-subscribers-backend-amdgpu

Author: None (Kewen12)

Changes

Added GPU test job limit to make it consistent with current config https://github.com/llvm/llvm-zorg/blob/main/buildbot/osuosl/master/config/builders.py#L2027C31-L2027C77


Full diff: https://github.com/llvm/llvm-project/pull/146611.diff

1 Files Affected:

  • (modified) offload/cmake/caches/AMDGPULibcBot.cmake (+1)
diff --git a/offload/cmake/caches/AMDGPULibcBot.cmake b/offload/cmake/caches/AMDGPULibcBot.cmake
index 728dfe3f0a3f1..a772043c79669 100644
--- a/offload/cmake/caches/AMDGPULibcBot.cmake
+++ b/offload/cmake/caches/AMDGPULibcBot.cmake
@@ -18,3 +18,4 @@ set(CLANG_DEFAULT_RTLIB "compiler-rt" STRING "")
 
 set(LLVM_RUNTIME_TARGETS default;amdgcn-amd-amdhsa CACHE STRING "")
 set(RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES "compiler-rt;libc" CACHE STRING "")
+set(RUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_TEST_JOBS 4 CACHE STRING "")
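
As a rough illustration of what a test job limit can do under the Ninja generator, here is a minimal, self-contained CMake sketch that caps how many test targets run concurrently by using a job pool. It is not taken from the libc build: `SKETCH_GPU_TEST_JOBS`, `sketch_gpu_test_pool`, `check-sketch`, and the `fake_gpu_test_*` targets are hypothetical names, and whether `LIBC_GPU_TEST_JOBS` is implemented this way is an assumption.

```cmake
cmake_minimum_required(VERSION 3.15)
project(gpu_test_jobs_sketch NONE)

# Hypothetical stand-in for LIBC_GPU_TEST_JOBS (an assumption, not the real option).
set(SKETCH_GPU_TEST_JOBS 4 CACHE STRING "Maximum number of GPU tests run at once")

# Declare a job pool whose depth equals the job limit. Job pools are only
# honored by the Ninja generators; other generators ignore this property.
set_property(GLOBAL PROPERTY JOB_POOLS
             sketch_gpu_test_pool=${SKETCH_GPU_TEST_JOBS})

# Umbrella target that depends on every fake test below.
add_custom_target(check-sketch)

foreach(idx RANGE 1 8)
  # Each fake test is assigned to the pool, so the pool depth bounds how many
  # of these commands Ninja runs at the same time.
  add_custom_target(fake_gpu_test_${idx}
    COMMAND ${CMAKE_COMMAND} -E echo "running fake GPU test ${idx}"
    JOB_POOL sketch_gpu_test_pool
    VERBATIM)
  add_dependencies(check-sketch fake_gpu_test_${idx})
endforeach()
```

Configured with `cmake -G Ninja` and built with `cmake --build <build-dir> --target check-sketch`, Ninja should run at most four of the fake tests at a time, independent of the `-j` value used for the rest of the build.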

@Kewen12
Contributor Author

Kewen12 commented Jul 1, 2025

I don't have write access. I would appreciate it if you could help review, @jplehr @jhuber6. TIA!

Contributor

@jhuber6 jhuber6 left a comment

Still not entirely sure this is safe, but it's worth a shot. Maybe they fixed some HSA bugs since I last checked.

@jhuber6 jhuber6 merged commit 2b16af8 into llvm:main Jul 2, 2025
10 checks passed
@Kewen12
Contributor Author

Kewen12 commented Jul 2, 2025

Thanks for the help @jhuber6! I might not have full context here. Do you mean enabling this flag may not be safe?

@Kewen12 Kewen12 deleted the cmake-parallel-run-limits branch July 2, 2025 03:48
@jhuber6
Contributor

jhuber6 commented Jul 2, 2025

> Thanks for the help @jhuber6! I might not have full context here. Do you mean enabling this flag may not be safe?

Yes, the HSA runtime would routinely crash when many of these tests were run in parallel. I poked at it through https://github.com/jhuber6/hsa_test a while back and pretty much just found that loading binaries in parallel would crash depending on the machine.

@shiltian
Contributor

shiltian commented Jul 2, 2025

"routinely crash" love it :-D

@shiltian
Contributor

shiltian commented Jul 2, 2025

> pretty much just found that loading binaries in parallel would crash depending on the machine

This sounds like a loader issue. CC @kzhuravl

@jhuber6
Contributor

jhuber6 commented Jul 2, 2025

> > pretty much just found that loading binaries in parallel would crash depending on the machine
>
> This sounds like a loader issue. CC @kzhuravl

Who knows, maybe they fixed it. I haven't checked in a while.

@jplehr
Contributor

jplehr commented Jul 2, 2025

We've been running this config on the current libc bot for about six months now (ROCm 6.2 and ROCm 6.3) and have not seen spurious failures in that time.
So I guess something has improved.

5 participants