[libc][gpu] Disable loop unrolling in the throughput benchmark loop #153971
@llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-libc

Author: Leandro Lacerda (leandrolcampos)

Changes: This patch makes GPU throughput benchmark results more comparable across targets by disabling loop unrolling in the benchmark loop by default. It also adds an opt-in switch for users who want to study instruction-level parallelism (ILP) effects.

Leaving unrolling entirely to the optimizer makes apples-to-apples comparisons uneven (e.g., libc vs. vendor). Disabling unrolling by default yields fairer, more consistent numbers; users can still opt in to unrolling to probe peak ILP.

Full diff: https://github.com/llvm/llvm-project/pull/153971.diff

3 Files Affected:
diff --git a/libc/benchmarks/gpu/CMakeLists.txt b/libc/benchmarks/gpu/CMakeLists.txt
index 6ca134b12a479..9e57d8e4590d6 100644
--- a/libc/benchmarks/gpu/CMakeLists.txt
+++ b/libc/benchmarks/gpu/CMakeLists.txt
@@ -2,6 +2,8 @@ add_subdirectory(timing)
add_custom_target(gpu-benchmark)
+option(LIBC_GPU_BENCHMARKS_ALLOW_UNROLL "Allow compiler loop unrolling in throughput loops" OFF)
+
function(add_benchmark benchmark_name)
cmake_parse_arguments(
"BENCHMARK"
@@ -14,6 +16,12 @@ function(add_benchmark benchmark_name)
if(NOT libc.src.time.clock IN_LIST TARGET_LLVMLIBC_ENTRYPOINTS)
message(FATAL_ERROR "target does not support clock")
endif()
+
+ set(benchmark_extra_flags "")
+ if(NOT LIBC_GPU_BENCHMARKS_ALLOW_UNROLL)
+ list(APPEND benchmark_extra_flags "-DLIBC_GPU_BENCHMARKS_DISABLE_UNROLL=1")
+ endif()
+
add_libc_hermetic(
${benchmark_name}
IS_GPU_BENCHMARK
@@ -26,6 +34,7 @@ function(add_benchmark benchmark_name)
${BENCHMARK_UNPARSED_ARGUMENTS}
COMPILE_OPTIONS
-flto
+ ${benchmark_extra_flags}
)
get_fq_target_name(${benchmark_name} fq_target_name)
set(fq_build_target_name ${fq_target_name}.__build__)
diff --git a/libc/benchmarks/gpu/timing/amdgpu/timing.h b/libc/benchmarks/gpu/timing/amdgpu/timing.h
index b4a174f729817..5c1d3a0582d45 100644
--- a/libc/benchmarks/gpu/timing/amdgpu/timing.h
+++ b/libc/benchmarks/gpu/timing/amdgpu/timing.h
@@ -117,6 +117,10 @@ throughput_baseline(const cpp::array<T, N> &inputs) {
asm("" ::"s"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (auto input : inputs) {
asm("" ::"v"(input));
result = input;
@@ -146,6 +150,10 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs) {
asm("" ::"s"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (auto input : inputs) {
asm("" ::"v"(input));
result = f(input);
@@ -174,6 +182,10 @@ static LIBC_INLINE uint64_t throughput_baseline(
asm("" ::"s"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (size_t i = 0; i < N; i++) {
T x = inputs1[i];
T y = inputs2[i];
@@ -206,6 +218,10 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs1,
asm("" ::"s"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (size_t i = 0; i < N; i++) {
T x = inputs1[i];
T y = inputs2[i];
diff --git a/libc/benchmarks/gpu/timing/nvptx/timing.h b/libc/benchmarks/gpu/timing/nvptx/timing.h
index 0c93a67129b8d..e671e378c9e2e 100644
--- a/libc/benchmarks/gpu/timing/nvptx/timing.h
+++ b/libc/benchmarks/gpu/timing/nvptx/timing.h
@@ -106,6 +106,10 @@ throughput_baseline(const cpp::array<T, N> &inputs) {
asm("" ::"llr"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (auto input : inputs) {
asm("" ::"r"(input));
result = input;
@@ -135,6 +139,10 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs) {
asm("" ::"llr"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (auto input : inputs) {
asm("" ::"r"(input));
result = f(input);
@@ -163,6 +171,10 @@ static LIBC_INLINE uint64_t throughput_baseline(
asm("" ::"llr"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (size_t i = 0; i < N; i++) {
T x = inputs1[i];
T y = inputs2[i];
@@ -195,6 +207,10 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs1,
asm("" ::"llr"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (size_t i = 0; i < N; i++) {
T x = inputs1[i];
T y = inputs2[i];
Here's what I get on my NVIDIA GeForce RTX 4070 Laptop GPU:

[1/4] Running hermetic test libc.benchmarks.gpu.src.ctype.isalnum_benchmark
Running Suite: LlvmLibcIsAlNumGpuBenchmark
Benchmark | Cycles (Mean) | Stddev | Min | Max | Iterations | Threads |
------------------------------------------------------------------------------------------------------
IsAlnum | 53 | 0 | 53 | 53 | 11904 | 64 |
IsAlnumSingleThread | 53 | 0 | 53 | 53 | 186 | 1 |
IsAlnumSingleWave | 53 | 0 | 53 | 53 | 5952 | 32 |
IsAlnumCapital | 53 | 0 | 53 | 53 | 11904 | 64 |
IsAlnumNotAlnum | 43 | 0 | 43 | 43 | 11904 | 64 |
[2/4] Running hermetic test libc.benchmarks.gpu.src.ctype.isalpha_benchmark
Running Suite: LlvmLibcIsAlphaGpuBenchmark
Benchmark | Cycles (Mean) | Stddev | Min | Max | Iterations | Threads |
------------------------------------------------------------------------------------------------------
IsAlpha | 53 | 0 | 53 | 53 | 186 | 1 |
[3/4] Running hermetic test libc.benchmarks.gpu.src.math.sin_benchmark
Running Suite: LlvmLibcSinGpuBenchmark
Benchmark | Cycles (Mean) | Stddev | Min | Max | Iterations | Threads |
------------------------------------------------------------------------------------------------------
Sin_1 | 3122 | 153 | 2933 | 3607 | 2735008 | 32 |
Sin_128 | 2696 | 15 | 2651 | 2739 | 17024 | 32 |
Sin_1024 | 2881 | 5 | 2872 | 2890 | 1344 | 32 |
Sin_4096 | 2895 | 2 | 2891 | 2899 | 352 | 32 |
SinTwoPi_1 | 2219 | 12 | 2204 | 2517 | 24032 | 32 |
SinTwoPi_128 | 2047 | 2 | 2044 | 2051 | 1344 | 32 |
SinTwoPi_1024 | 2253 | 0 | 2253 | 2254 | 576 | 32 |
SinTwoPi_4096 | 2272 | 0 | 2272 | 2272 | 352 | 32 |
SinTwoPow30_1 | 3135 | 17 | 3111 | 3364 | 8480 | 32 |
SinTwoPow30_128 | 2734 | 1 | 2732 | 2736 | 352 | 32 |
SinTwoPow30_1024 | 2940 | 0 | 2940 | 2941 | 352 | 32 |
SinTwoPow30_4096 | 2958 | 0 | 2958 | 2959 | 352 | 32 |
SinVeryLarge_1 | 2858 | 16 | 2823 | 3093 | 8480 | 32 |
SinVeryLarge_128 | 2402 | 2 | 2398 | 2406 | 352 | 32 |
SinVeryLarge_1024 | 2599 | 0 | 2599 | 2600 | 352 | 32 |
SinVeryLarge_4096 | 2615 | 0 | 2615 | 2615 | 352 | 32 |
NvSin_1 | 2522 | 69 | 2261 | 2880 | 5952 | 32 |
NvSin_128 | 1826 | 2 | 1824 | 1830 | 576 | 32 |
NvSin_1024 | 2035 | 0 | 2035 | 2036 | 352 | 32 |
NvSin_4096 | 2053 | 0 | 2053 | 2053 | 352 | 32 |
NvSinTwoPi_1 | 1107 | 1 | 1104 | 1108 | 2880 | 32 |
NvSinTwoPi_128 | 891 | 0 | 891 | 891 | 352 | 32 |
NvSinTwoPi_1024 | 1102 | 0 | 1101 | 1102 | 352 | 32 |
NvSinTwoPi_4096 | 1122 | 0 | 1122 | 1122 | 352 | 32 |
NvSinTwoPow30_1 | 1106 | 1 | 1105 | 1108 | 1344 | 32 |
NvSinTwoPow30_128 | 891 | 0 | 891 | 891 | 352 | 32 |
NvSinTwoPow30_1024 | 1101 | 0 | 1101 | 1101 | 352 | 32 |
NvSinTwoPow30_4096 | 1122 | 0 | 1122 | 1122 | 352 | 32 |
NvSinVeryLarge_1 | 2497 | 23 | 2251 | 2845 | 12032 | 32 |
NvSinVeryLarge_128 | 1790 | 1 | 1789 | 1792 | 576 | 32 |
NvSinVeryLarge_1024 | 1999 | 0 | 1999 | 1999 | 352 | 32 |
NvSinVeryLarge_4096 | 2019 | 0 | 2019 | 2019 | 352 | 32 |
Sinf_1 | 2201 | 170 | 1522 | 2400 | 507776 | 32 |
Sinf_128 | 1872 | 13 | 1830 | 1898 | 2880 | 32 |
Sinf_1024 | 2056 | 5 | 2047 | 2068 | 1984 | 32 |
Sinf_4096 | 2093 | 3 | 2088 | 2098 | 352 | 32 |
SinfTwoPi_1 | 1442 | 11 | 1426 | 1759 | 33856 | 32 |
SinfTwoPi_128 | 1126 | 1 | 1125 | 1129 | 352 | 32 |
SinfTwoPi_1024 | 1314 | 0 | 1314 | 1315 | 352 | 32 |
SinfTwoPi_4096 | 1350 | 0 | 1350 | 1350 | 352 | 32 |
SinfTwoPow30_1 | 1088 | 10 | 1080 | 1162 | 1984 | 32 |
SinfTwoPow30_128 | 771 | 1 | 771 | 774 | 1984 | 32 |
SinfTwoPow30_1024 | 961 | 0 | 960 | 962 | 352 | 32 |
SinfTwoPow30_4096 | 997 | 0 | 997 | 997 | 352 | 32 |
SinfVeryLarge_1 | 1925 | 14 | 1869 | 2282 | 24032 | 32 |
SinfVeryLarge_128 | 1598 | 1 | 1598 | 1600 | 352 | 32 |
SinfVeryLarge_1024 | 1788 | 0 | 1787 | 1789 | 352 | 32 |
SinfVeryLarge_4096 | 1824 | 0 | 1824 | 1824 | 352 | 32 |
NvSinf_1 | 1024 | 6 | 1019 | 1043 | 1984 | 32 |
NvSinf_128 | 742 | 0 | 742 | 744 | 576 | 32 |
NvSinf_1024 | 932 | 0 | 932 | 933 | 352 | 32 |
NvSinf_4096 | 967 | 0 | 967 | 967 | 352 | 32 |
NvSinfTwoPi_1 | 162 | 3 | 162 | 497 | 362464 | 32 |
NvSinfTwoPi_128 | 107 | 0 | 107 | 109 | 2880 | 32 |
NvSinfTwoPi_1024 | 297 | 0 | 297 | 297 | 352 | 32 |
NvSinfTwoPi_4096 | 334 | 0 | 334 | 334 | 352 | 32 |
NvSinfTwoPow30_1 | 1026 | 11 | 1018 | 1281 | 33856 | 32 |
NvSinfTwoPow30_128 | 742 | 0 | 741 | 742 | 896 | 32 |
NvSinfTwoPow30_1024 | 931 | 0 | 931 | 931 | 352 | 32 |
NvSinfTwoPow30_4096 | 967 | 0 | 967 | 967 | 352 | 32 |
NvSinfVeryLarge_1 | 1003 | 1 | 1000 | 1004 | 1984 | 32 |
NvSinfVeryLarge_128 | 723 | 0 | 723 | 723 | 352 | 32 |
NvSinfVeryLarge_1024 | 913 | 0 | 913 | 913 | 352 | 32 |
NvSinfVeryLarge_4096 | 949 | 0 | 949 | 949 | 352 | 32 |
[4/4] Running hermetic test libc.benchmarks.gpu.src.math.atan2_benchmark
Running Suite: LlvmLibcAtan2GpuBenchmark
Benchmark | Cycles (Mean) | Stddev | Min | Max | Iterations | Threads |
------------------------------------------------------------------------------------------------------
Atan2_1 | 4082 | 954 | 1892 | 5271 | 24032 | 32 |
Atan2_128 | 3852 | 80 | 3531 | 4112 | 131648 | 32 |
Atan2_1024 | 4083 | 31 | 3991 | 4150 | 2880 | 32 |
Atan2_4096 | 4080 | 16 | 4058 | 4111 | 576 | 32 |
Atan2TwoPi_1 | 2738 | 16 | 2728 | 3162 | 24032 | 32 |
Atan2TwoPi_128 | 2511 | 2 | 2508 | 2515 | 352 | 32 |
Atan2TwoPi_1024 | 2743 | 0 | 2742 | 2743 | 352 | 32 |
Atan2TwoPi_4096 | 2744 | 0 | 2744 | 2745 | 352 | 32 |
Atan2TwoPow30_1 | 2734 | 15 | 2721 | 3148 | 24032 | 32 |
Atan2TwoPow30_128 | 2517 | 2 | 2512 | 2525 | 1344 | 32 |
Atan2TwoPow30_1024 | 2743 | 0 | 2743 | 2744 | 352 | 32 |
Atan2TwoPow30_4096 | 2744 | 0 | 2744 | 2744 | 352 | 32 |
Atan2Large_1 | 3570 | 382 | 1125 | 3882 | 131648 | 32 |
Atan2Large_128 | 3352 | 37 | 3280 | 3421 | 1984 | 32 |
Atan2Large_1024 | 3578 | 10 | 3554 | 3601 | 1984 | 32 |
Atan2Large_4096 | 3576 | 6 | 3566 | 3586 | 576 | 32 |
NvAtan2_1 | 2909 | 38 | 2866 | 3339 | 17024 | 32 |
NvAtan2_128 | 2801 | 2 | 2798 | 2805 | 352 | 32 |
NvAtan2_1024 | 3040 | 1 | 3039 | 3041 | 352 | 32 |
NvAtan2_4096 | 3041 | 1 | 3040 | 3042 | 352 | 32 |
NvAtan2TwoPi_1 | 2032 | 13 | 2032 | 2386 | 24032 | 32 |
NvAtan2TwoPi_128 | 1945 | 1 | 1945 | 1947 | 352 | 32 |
NvAtan2TwoPi_1024 | 2185 | 0 | 2184 | 2185 | 352 | 32 |
NvAtan2TwoPi_4096 | 2185 | 0 | 2185 | 2186 | 352 | 32 |
NvAtan2TwoPow30_1 | 2032 | 8 | 2032 | 2184 | 12032 | 32 |
NvAtan2TwoPow30_128 | 1945 | 1 | 1945 | 1951 | 896 | 32 |
NvAtan2TwoPow30_1024 | 2184 | 0 | 2184 | 2184 | 352 | 32 |
NvAtan2TwoPow30_4096 | 2185 | 0 | 2185 | 2186 | 352 | 32 |
NvAtan2Large_1 | 2032 | 12 | 2032 | 2359 | 24032 | 32 |
NvAtan2Large_128 | 1945 | 1 | 1945 | 1951 | 896 | 32 |
NvAtan2Large_1024 | 2184 | 0 | 2184 | 2185 | 352 | 32 |
NvAtan2Large_4096 | 2185 | 0 | 2185 | 2185 | 352 | 32 |
✅ With the latest revision this PR passed the C/C++ code formatter.
Unsure if we need an option for this, as long as it's consistent behavior.
I removed the option.
This patch makes GPU throughput benchmark results more comparable across targets by disabling loop unrolling in the benchmark loop.
Motivation:
- For `sin`, the generated PTX shows the `throughput` loop unrolled 8x at `N=128` (one iteration advances the input pointer by 64 bytes = 8 doubles), interleaving eight independent chains before the back-edge. This hides latency and significantly reduces cycles/call as the batch size `N` grows: `sin` dropped from ~3,100 cycles/call at `N=1` to ~360 at `N=128`. After enforcing `#pragma clang loop unroll(disable)`, results stabilized (e.g., from ~3,100 cycles/call at `N=1` to ~2,700 at `N=128`).
- The vendor `sin` path did not exhibit a similar drop in our measurements, and the PTX appears as compact internal calls rather than a long FMA chain, leaving less ILP for the outer loop to extract.

What this change does:
- Adds `#pragma clang loop unroll(disable)` to the GPU `throughput()` loop in both NVPTX and AMDGPU backends.

Leaving unrolling entirely to the optimizer makes apples-to-apples comparisons uneven (e.g., libc vs. vendor). Disabling unrolling yields fairer, more consistent numbers.