[libc][gpu] Disable loop unrolling in the throughput benchmark loop #153971
@llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-libc

Author: Leandro Lacerda (leandrolcampos)

Changes: This patch makes GPU throughput benchmark results more comparable across targets by disabling loop unrolling in the benchmark loop by default. It also adds an opt-in switch for users who want to study instruction-level parallelism (ILP) effects.

Leaving unrolling entirely to the optimizer makes apples-to-apples comparisons uneven (e.g., libc vs. vendor). Disabling unrolling by default yields fairer, more consistent numbers; users can still opt in to unrolling to probe peak ILP.

Full diff: https://github.com/llvm/llvm-project/pull/153971.diff

3 Files Affected:
diff --git a/libc/benchmarks/gpu/CMakeLists.txt b/libc/benchmarks/gpu/CMakeLists.txt
index 6ca134b12a479..9e57d8e4590d6 100644
--- a/libc/benchmarks/gpu/CMakeLists.txt
+++ b/libc/benchmarks/gpu/CMakeLists.txt
@@ -2,6 +2,8 @@ add_subdirectory(timing)
add_custom_target(gpu-benchmark)
+option(LIBC_GPU_BENCHMARKS_ALLOW_UNROLL "Allow compiler loop unrolling in throughput loops" OFF)
+
function(add_benchmark benchmark_name)
cmake_parse_arguments(
"BENCHMARK"
@@ -14,6 +16,12 @@ function(add_benchmark benchmark_name)
if(NOT libc.src.time.clock IN_LIST TARGET_LLVMLIBC_ENTRYPOINTS)
message(FATAL_ERROR "target does not support clock")
endif()
+
+ set(benchmark_extra_flags "")
+ if(NOT LIBC_GPU_BENCHMARKS_ALLOW_UNROLL)
+ list(APPEND benchmark_extra_flags "-DLIBC_GPU_BENCHMARKS_DISABLE_UNROLL=1")
+ endif()
+
add_libc_hermetic(
${benchmark_name}
IS_GPU_BENCHMARK
@@ -26,6 +34,7 @@ function(add_benchmark benchmark_name)
${BENCHMARK_UNPARSED_ARGUMENTS}
COMPILE_OPTIONS
-flto
+ ${benchmark_extra_flags}
)
get_fq_target_name(${benchmark_name} fq_target_name)
set(fq_build_target_name ${fq_target_name}.__build__)
diff --git a/libc/benchmarks/gpu/timing/amdgpu/timing.h b/libc/benchmarks/gpu/timing/amdgpu/timing.h
index b4a174f729817..5c1d3a0582d45 100644
--- a/libc/benchmarks/gpu/timing/amdgpu/timing.h
+++ b/libc/benchmarks/gpu/timing/amdgpu/timing.h
@@ -117,6 +117,10 @@ throughput_baseline(const cpp::array<T, N> &inputs) {
asm("" ::"s"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (auto input : inputs) {
asm("" ::"v"(input));
result = input;
@@ -146,6 +150,10 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs) {
asm("" ::"s"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (auto input : inputs) {
asm("" ::"v"(input));
result = f(input);
@@ -174,6 +182,10 @@ static LIBC_INLINE uint64_t throughput_baseline(
asm("" ::"s"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (size_t i = 0; i < N; i++) {
T x = inputs1[i];
T y = inputs2[i];
@@ -206,6 +218,10 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs1,
asm("" ::"s"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (size_t i = 0; i < N; i++) {
T x = inputs1[i];
T y = inputs2[i];
diff --git a/libc/benchmarks/gpu/timing/nvptx/timing.h b/libc/benchmarks/gpu/timing/nvptx/timing.h
index 0c93a67129b8d..e671e378c9e2e 100644
--- a/libc/benchmarks/gpu/timing/nvptx/timing.h
+++ b/libc/benchmarks/gpu/timing/nvptx/timing.h
@@ -106,6 +106,10 @@ throughput_baseline(const cpp::array<T, N> &inputs) {
asm("" ::"llr"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (auto input : inputs) {
asm("" ::"r"(input));
result = input;
@@ -135,6 +139,10 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs) {
asm("" ::"llr"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (auto input : inputs) {
asm("" ::"r"(input));
result = f(input);
@@ -163,6 +171,10 @@ static LIBC_INLINE uint64_t throughput_baseline(
asm("" ::"llr"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (size_t i = 0; i < N; i++) {
T x = inputs1[i];
T y = inputs2[i];
@@ -195,6 +207,10 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs1,
asm("" ::"llr"(start));
T result{};
+
+ #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+ #pragma clang loop unroll(disable)
+ #endif
for (size_t i = 0; i < N; i++) {
T x = inputs1[i];
T y = inputs2[i];
Here's what I get on my NVIDIA GeForce RTX 4070 Laptop GPU:

[1/4] Running hermetic test libc.benchmarks.gpu.src.ctype.isalnum_benchmark
Running Suite: LlvmLibcIsAlNumGpuBenchmark
Benchmark | Cycles (Mean) | Stddev | Min | Max | Iterations | Threads |
------------------------------------------------------------------------------------------------------
IsAlnum | 53 | 0 | 53 | 53 | 11904 | 64 |
IsAlnumSingleThread | 53 | 0 | 53 | 53 | 186 | 1 |
IsAlnumSingleWave | 53 | 0 | 53 | 53 | 5952 | 32 |
IsAlnumCapital | 53 | 0 | 53 | 53 | 11904 | 64 |
IsAlnumNotAlnum | 43 | 0 | 43 | 43 | 11904 | 64 |
[2/4] Running hermetic test libc.benchmarks.gpu.src.ctype.isalpha_benchmark
Running Suite: LlvmLibcIsAlphaGpuBenchmark
Benchmark | Cycles (Mean) | Stddev | Min | Max | Iterations | Threads |
------------------------------------------------------------------------------------------------------
IsAlpha | 53 | 0 | 53 | 53 | 186 | 1 |
[3/4] Running hermetic test libc.benchmarks.gpu.src.math.sin_benchmark
Running Suite: LlvmLibcSinGpuBenchmark
Benchmark | Cycles (Mean) | Stddev | Min | Max | Iterations | Threads |
------------------------------------------------------------------------------------------------------
Sin_1 | 3122 | 153 | 2933 | 3607 | 2735008 | 32 |
Sin_128 | 2696 | 15 | 2651 | 2739 | 17024 | 32 |
Sin_1024 | 2881 | 5 | 2872 | 2890 | 1344 | 32 |
Sin_4096 | 2895 | 2 | 2891 | 2899 | 352 | 32 |
SinTwoPi_1 | 2219 | 12 | 2204 | 2517 | 24032 | 32 |
SinTwoPi_128 | 2047 | 2 | 2044 | 2051 | 1344 | 32 |
SinTwoPi_1024 | 2253 | 0 | 2253 | 2254 | 576 | 32 |
SinTwoPi_4096 | 2272 | 0 | 2272 | 2272 | 352 | 32 |
SinTwoPow30_1 | 3135 | 17 | 3111 | 3364 | 8480 | 32 |
SinTwoPow30_128 | 2734 | 1 | 2732 | 2736 | 352 | 32 |
SinTwoPow30_1024 | 2940 | 0 | 2940 | 2941 | 352 | 32 |
SinTwoPow30_4096 | 2958 | 0 | 2958 | 2959 | 352 | 32 |
SinVeryLarge_1 | 2858 | 16 | 2823 | 3093 | 8480 | 32 |
SinVeryLarge_128 | 2402 | 2 | 2398 | 2406 | 352 | 32 |
SinVeryLarge_1024 | 2599 | 0 | 2599 | 2600 | 352 | 32 |
SinVeryLarge_4096 | 2615 | 0 | 2615 | 2615 | 352 | 32 |
NvSin_1 | 2522 | 69 | 2261 | 2880 | 5952 | 32 |
NvSin_128 | 1826 | 2 | 1824 | 1830 | 576 | 32 |
NvSin_1024 | 2035 | 0 | 2035 | 2036 | 352 | 32 |
NvSin_4096 | 2053 | 0 | 2053 | 2053 | 352 | 32 |
NvSinTwoPi_1 | 1107 | 1 | 1104 | 1108 | 2880 | 32 |
NvSinTwoPi_128 | 891 | 0 | 891 | 891 | 352 | 32 |
NvSinTwoPi_1024 | 1102 | 0 | 1101 | 1102 | 352 | 32 |
NvSinTwoPi_4096 | 1122 | 0 | 1122 | 1122 | 352 | 32 |
NvSinTwoPow30_1 | 1106 | 1 | 1105 | 1108 | 1344 | 32 |
NvSinTwoPow30_128 | 891 | 0 | 891 | 891 | 352 | 32 |
NvSinTwoPow30_1024 | 1101 | 0 | 1101 | 1101 | 352 | 32 |
NvSinTwoPow30_4096 | 1122 | 0 | 1122 | 1122 | 352 | 32 |
NvSinVeryLarge_1 | 2497 | 23 | 2251 | 2845 | 12032 | 32 |
NvSinVeryLarge_128 | 1790 | 1 | 1789 | 1792 | 576 | 32 |
NvSinVeryLarge_1024 | 1999 | 0 | 1999 | 1999 | 352 | 32 |
NvSinVeryLarge_4096 | 2019 | 0 | 2019 | 2019 | 352 | 32 |
Sinf_1 | 2201 | 170 | 1522 | 2400 | 507776 | 32 |
Sinf_128 | 1872 | 13 | 1830 | 1898 | 2880 | 32 |
Sinf_1024 | 2056 | 5 | 2047 | 2068 | 1984 | 32 |
Sinf_4096 | 2093 | 3 | 2088 | 2098 | 352 | 32 |
SinfTwoPi_1 | 1442 | 11 | 1426 | 1759 | 33856 | 32 |
SinfTwoPi_128 | 1126 | 1 | 1125 | 1129 | 352 | 32 |
SinfTwoPi_1024 | 1314 | 0 | 1314 | 1315 | 352 | 32 |
SinfTwoPi_4096 | 1350 | 0 | 1350 | 1350 | 352 | 32 |
SinfTwoPow30_1 | 1088 | 10 | 1080 | 1162 | 1984 | 32 |
SinfTwoPow30_128 | 771 | 1 | 771 | 774 | 1984 | 32 |
SinfTwoPow30_1024 | 961 | 0 | 960 | 962 | 352 | 32 |
SinfTwoPow30_4096 | 997 | 0 | 997 | 997 | 352 | 32 |
SinfVeryLarge_1 | 1925 | 14 | 1869 | 2282 | 24032 | 32 |
SinfVeryLarge_128 | 1598 | 1 | 1598 | 1600 | 352 | 32 |
SinfVeryLarge_1024 | 1788 | 0 | 1787 | 1789 | 352 | 32 |
SinfVeryLarge_4096 | 1824 | 0 | 1824 | 1824 | 352 | 32 |
NvSinf_1 | 1024 | 6 | 1019 | 1043 | 1984 | 32 |
NvSinf_128 | 742 | 0 | 742 | 744 | 576 | 32 |
NvSinf_1024 | 932 | 0 | 932 | 933 | 352 | 32 |
NvSinf_4096 | 967 | 0 | 967 | 967 | 352 | 32 |
NvSinfTwoPi_1 | 162 | 3 | 162 | 497 | 362464 | 32 |
NvSinfTwoPi_128 | 107 | 0 | 107 | 109 | 2880 | 32 |
NvSinfTwoPi_1024 | 297 | 0 | 297 | 297 | 352 | 32 |
NvSinfTwoPi_4096 | 334 | 0 | 334 | 334 | 352 | 32 |
NvSinfTwoPow30_1 | 1026 | 11 | 1018 | 1281 | 33856 | 32 |
NvSinfTwoPow30_128 | 742 | 0 | 741 | 742 | 896 | 32 |
NvSinfTwoPow30_1024 | 931 | 0 | 931 | 931 | 352 | 32 |
NvSinfTwoPow30_4096 | 967 | 0 | 967 | 967 | 352 | 32 |
NvSinfVeryLarge_1 | 1003 | 1 | 1000 | 1004 | 1984 | 32 |
NvSinfVeryLarge_128 | 723 | 0 | 723 | 723 | 352 | 32 |
NvSinfVeryLarge_1024 | 913 | 0 | 913 | 913 | 352 | 32 |
NvSinfVeryLarge_4096 | 949 | 0 | 949 | 949 | 352 | 32 |
[4/4] Running hermetic test libc.benchmarks.gpu.src.math.atan2_benchmark
Running Suite: LlvmLibcAtan2GpuBenchmark
Benchmark | Cycles (Mean) | Stddev | Min | Max | Iterations | Threads |
------------------------------------------------------------------------------------------------------
Atan2_1 | 4082 | 954 | 1892 | 5271 | 24032 | 32 |
Atan2_128 | 3852 | 80 | 3531 | 4112 | 131648 | 32 |
Atan2_1024 | 4083 | 31 | 3991 | 4150 | 2880 | 32 |
Atan2_4096 | 4080 | 16 | 4058 | 4111 | 576 | 32 |
Atan2TwoPi_1 | 2738 | 16 | 2728 | 3162 | 24032 | 32 |
Atan2TwoPi_128 | 2511 | 2 | 2508 | 2515 | 352 | 32 |
Atan2TwoPi_1024 | 2743 | 0 | 2742 | 2743 | 352 | 32 |
Atan2TwoPi_4096 | 2744 | 0 | 2744 | 2745 | 352 | 32 |
Atan2TwoPow30_1 | 2734 | 15 | 2721 | 3148 | 24032 | 32 |
Atan2TwoPow30_128 | 2517 | 2 | 2512 | 2525 | 1344 | 32 |
Atan2TwoPow30_1024 | 2743 | 0 | 2743 | 2744 | 352 | 32 |
Atan2TwoPow30_4096 | 2744 | 0 | 2744 | 2744 | 352 | 32 |
Atan2Large_1 | 3570 | 382 | 1125 | 3882 | 131648 | 32 |
Atan2Large_128 | 3352 | 37 | 3280 | 3421 | 1984 | 32 |
Atan2Large_1024 | 3578 | 10 | 3554 | 3601 | 1984 | 32 |
Atan2Large_4096 | 3576 | 6 | 3566 | 3586 | 576 | 32 |
NvAtan2_1 | 2909 | 38 | 2866 | 3339 | 17024 | 32 |
NvAtan2_128 | 2801 | 2 | 2798 | 2805 | 352 | 32 |
NvAtan2_1024 | 3040 | 1 | 3039 | 3041 | 352 | 32 |
NvAtan2_4096 | 3041 | 1 | 3040 | 3042 | 352 | 32 |
NvAtan2TwoPi_1 | 2032 | 13 | 2032 | 2386 | 24032 | 32 |
NvAtan2TwoPi_128 | 1945 | 1 | 1945 | 1947 | 352 | 32 |
NvAtan2TwoPi_1024 | 2185 | 0 | 2184 | 2185 | 352 | 32 |
NvAtan2TwoPi_4096 | 2185 | 0 | 2185 | 2186 | 352 | 32 |
NvAtan2TwoPow30_1 | 2032 | 8 | 2032 | 2184 | 12032 | 32 |
NvAtan2TwoPow30_128 | 1945 | 1 | 1945 | 1951 | 896 | 32 |
NvAtan2TwoPow30_1024 | 2184 | 0 | 2184 | 2184 | 352 | 32 |
NvAtan2TwoPow30_4096 | 2185 | 0 | 2185 | 2186 | 352 | 32 |
NvAtan2Large_1 | 2032 | 12 | 2032 | 2359 | 24032 | 32 |
NvAtan2Large_128 | 1945 | 1 | 1945 | 1951 | 896 | 32 |
NvAtan2Large_1024 | 2184 | 0 | 2184 | 2185 | 352 | 32 |
NvAtan2Large_4096 | 2185 | 0 | 2185 | 2185 | 352 | 32 |
✅ With the latest revision this PR passed the C/C++ code formatter.
Unsure if we need an option for this, as long as it's consistent behavior.
I removed the option.
This patch makes GPU throughput benchmark results more comparable across targets by disabling loop unrolling in the benchmark loop.
Motivation:
- For `sin`, the generated PTX shows the `throughput` loop unrolled 8x at `N=128` (one iteration advances the input pointer by 64 bytes = 8 doubles), interleaving eight independent chains before the back-edge. This hides latency and significantly reduces cycles/call as the batch size `N` grows: `sin` dropped from ~3,100 cycles/call at `N=1` to ~360 at `N=128`. After enforcing `#pragma clang loop unroll(disable)`, results stabilized (e.g., from ~3,100 cycles/call at `N=1` to ~2,700 at `N=128`).
- The vendor `sin` path did not exhibit a similar drop in our measurements, and the PTX appears as compact internal calls rather than a long FMA chain, leaving less ILP for the outer loop to extract.

What this change does:
- Adds `#pragma clang loop unroll(disable)` to the GPU `throughput()` loop in both NVPTX and AMDGPU backends.

Leaving unrolling entirely to the optimizer makes apples-to-apples comparisons uneven (e.g., libc vs. vendor). Disabling unrolling yields fairer, more consistent numbers.