@leandrolcampos leandrolcampos commented Aug 16, 2025

This patch makes GPU throughput benchmark results more comparable across targets by disabling loop unrolling in the benchmark loop.

Motivation:

  • PTX (post-LTO) evidence on NVPTX: for libc sin, the generated PTX shows the throughput loop unrolled 8x at N=128 (one iteration advances the input pointer by 64 bytes = 8 doubles), interleaving eight independent chains before the back-edge. This hides latency and significantly reduces cycles/call as the batch size N grows.
  • Observed scaling (NVPTX measurements): with unrolling enabled, sin dropped from ~3,100 cycles/call at N=1 to ~360 at N=128. With #pragma clang loop unroll(disable) enforced, results are far more stable across batch sizes (e.g., ~3,100 cycles/call at N=1 versus ~2,700 at N=128).
  • libdevice contrast: the libdevice sin path did not exhibit a similar drop in our measurements, and the PTX appears as compact internal calls rather than a long FMA chain, leaving less ILP for the outer loop to extract.
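
As a rough check on the figures quoted above, the unroll factor can be read off the pointer stride per loop iteration, and the scaling can be summarized as a ratio of cycles/call. A minimal sketch; both helper names are hypothetical and not part of the benchmark code, and it assumes 8-byte doubles:

```cpp
#include <cstddef>

// Hypothetical helper: number of elements consumed per loop iteration,
// inferred from how far the input pointer advances (in bytes).
constexpr std::size_t unroll_factor(std::size_t stride_bytes,
                                    std::size_t elem_size) {
  return stride_bytes / elem_size;
}

// Hypothetical helper: ratio between cycles/call at N=1 and at a larger N.
constexpr double scaling_ratio(double cycles_n1, double cycles_n) {
  return cycles_n1 / cycles_n;
}

// A 64-byte stride over 8-byte doubles corresponds to 8 elements per
// iteration, i.e. the 8x unroll described above.
static_assert(unroll_factor(64, 8) == 8, "expected an 8x unroll");
```

With the approximate numbers from the description, scaling_ratio(3100, 360) is about 8.6 with unrolling enabled, versus about 1.15 for scaling_ratio(3100, 2700) with unrolling disabled.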

What this change does:

  • Applies #pragma clang loop unroll(disable) to the GPU throughput() and throughput_baseline() loops in both the NVPTX and AMDGPU timing headers.

Leaving unrolling entirely to the optimizer skews what should be apples-to-apples comparisons (e.g., libc vs. vendor implementations). Disabling unrolling yields fairer, more consistent numbers.
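
For readers unfamiliar with the pragma, here is a minimal, portable sketch of how the hint attaches to a batch loop. It is not the code from timing.h: run_batch and SKETCH_NO_UNROLL are hypothetical names, and the timing reads and inline-asm barriers used by the real throughput() are omitted.

```cpp
#include <array>
#include <cstddef>

// Emit the Clang loop hint only when compiling with Clang; other compilers
// may warn about an unknown pragma.
#if defined(__clang__)
#define SKETCH_NO_UNROLL _Pragma("clang loop unroll(disable)")
#else
#define SKETCH_NO_UNROLL
#endif

// Hypothetical stand-in for the benchmark's throughput() loop: applies f to
// every input in order and returns the last result.
template <typename F, typename T, std::size_t N>
T run_batch(F f, const std::array<T, N> &inputs) {
  T result{};
  // The hint applies to the loop statement that immediately follows it.
  SKETCH_NO_UNROLL
  for (std::size_t i = 0; i < N; ++i) {
    result = f(inputs[i]);
  }
  return result;
}
```

The pragma is a request to the optimizer for that one loop only; it does not change what the loop computes, so calling run_batch with a doubling function on {1, 2, 3, 4} still returns 8.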

llvmbot commented Aug 16, 2025

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-libc

Author: Leandro Lacerda (leandrolcampos)

Changes

This patch makes GPU throughput benchmark results more comparable across targets by disabling loop unrolling in the benchmark loop by default. It also adds an opt-in switch for users who want to study instruction-level parallelism (ILP) effects.

Motivation:

  • PTX (post-LTO) evidence on NVPTX: for libc sin, the generated PTX shows the throughput loop unrolled 8x at N=128 (one iteration advances the input pointer by 64 bytes = 8 doubles), interleaving eight independent chains before the back-edge. This hides latency and significantly reduces cycles/call as the batch size N grows.
  • Observed scaling (NVPTX measurements): with unrolling enabled, sin dropped from ~3,100 cycles/call at N=1 to ~360 at N=128. With #pragma clang loop unroll(disable) enforced, results are far more stable across batch sizes (e.g., ~3,100 cycles/call at N=1 versus ~2,700 at N=128).
  • libdevice contrast: the libdevice sin path did not exhibit a similar drop in our measurements, and the PTX appears as compact internal calls rather than a long FMA chain, leaving less ILP for the outer loop to extract.

What this change does:

  • Applies #pragma clang loop unroll(disable) to the GPU throughput() and throughput_baseline() loops in both the NVPTX and AMDGPU timing headers.
  • Adds a build switch to re-enable unrolling for ILP studies: LIBC_GPU_BENCHMARKS_ALLOW_UNROLL (default: OFF).

Leaving unrolling entirely to the optimizer skews what should be apples-to-apples comparisons (e.g., libc vs. vendor implementations). Disabling unrolling by default yields fairer, more consistent numbers; users can still opt in to unrolling to probe peak ILP.


Full diff: https://github.com/llvm/llvm-project/pull/153971.diff

3 Files Affected:

  • (modified) libc/benchmarks/gpu/CMakeLists.txt (+9)
  • (modified) libc/benchmarks/gpu/timing/amdgpu/timing.h (+16)
  • (modified) libc/benchmarks/gpu/timing/nvptx/timing.h (+16)
diff --git a/libc/benchmarks/gpu/CMakeLists.txt b/libc/benchmarks/gpu/CMakeLists.txt
index 6ca134b12a479..9e57d8e4590d6 100644
--- a/libc/benchmarks/gpu/CMakeLists.txt
+++ b/libc/benchmarks/gpu/CMakeLists.txt
@@ -2,6 +2,8 @@ add_subdirectory(timing)
 
 add_custom_target(gpu-benchmark)
 
+option(LIBC_GPU_BENCHMARKS_ALLOW_UNROLL "Allow compiler loop unrolling in throughput loops" OFF)
+
 function(add_benchmark benchmark_name)
   cmake_parse_arguments(
     "BENCHMARK"
@@ -14,6 +16,12 @@ function(add_benchmark benchmark_name)
   if(NOT libc.src.time.clock IN_LIST TARGET_LLVMLIBC_ENTRYPOINTS)
     message(FATAL_ERROR "target does not support clock")
   endif()
+
+  set(benchmark_extra_flags "")
+  if(NOT LIBC_GPU_BENCHMARKS_ALLOW_UNROLL)
+    list(APPEND benchmark_extra_flags "-DLIBC_GPU_BENCHMARKS_DISABLE_UNROLL=1")
+  endif()
+
   add_libc_hermetic(
     ${benchmark_name}
     IS_GPU_BENCHMARK
@@ -26,6 +34,7 @@ function(add_benchmark benchmark_name)
     ${BENCHMARK_UNPARSED_ARGUMENTS}
     COMPILE_OPTIONS
       -flto
+      ${benchmark_extra_flags}
   )
   get_fq_target_name(${benchmark_name} fq_target_name)
   set(fq_build_target_name ${fq_target_name}.__build__)
diff --git a/libc/benchmarks/gpu/timing/amdgpu/timing.h b/libc/benchmarks/gpu/timing/amdgpu/timing.h
index b4a174f729817..5c1d3a0582d45 100644
--- a/libc/benchmarks/gpu/timing/amdgpu/timing.h
+++ b/libc/benchmarks/gpu/timing/amdgpu/timing.h
@@ -117,6 +117,10 @@ throughput_baseline(const cpp::array<T, N> &inputs) {
   asm("" ::"s"(start));
 
   T result{};
+
+  #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+  #pragma clang loop unroll(disable)
+  #endif
   for (auto input : inputs) {
     asm("" ::"v"(input));
     result = input;
@@ -146,6 +150,10 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs) {
   asm("" ::"s"(start));
 
   T result{};
+
+  #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+  #pragma clang loop unroll(disable)
+  #endif
   for (auto input : inputs) {
     asm("" ::"v"(input));
     result = f(input);
@@ -174,6 +182,10 @@ static LIBC_INLINE uint64_t throughput_baseline(
   asm("" ::"s"(start));
 
   T result{};
+
+  #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+  #pragma clang loop unroll(disable)
+  #endif
   for (size_t i = 0; i < N; i++) {
     T x = inputs1[i];
     T y = inputs2[i];
@@ -206,6 +218,10 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs1,
   asm("" ::"s"(start));
 
   T result{};
+
+  #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+  #pragma clang loop unroll(disable)
+  #endif
   for (size_t i = 0; i < N; i++) {
     T x = inputs1[i];
     T y = inputs2[i];
diff --git a/libc/benchmarks/gpu/timing/nvptx/timing.h b/libc/benchmarks/gpu/timing/nvptx/timing.h
index 0c93a67129b8d..e671e378c9e2e 100644
--- a/libc/benchmarks/gpu/timing/nvptx/timing.h
+++ b/libc/benchmarks/gpu/timing/nvptx/timing.h
@@ -106,6 +106,10 @@ throughput_baseline(const cpp::array<T, N> &inputs) {
   asm("" ::"llr"(start));
 
   T result{};
+
+  #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+  #pragma clang loop unroll(disable)
+  #endif
   for (auto input : inputs) {
     asm("" ::"r"(input));
     result = input;
@@ -135,6 +139,10 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs) {
   asm("" ::"llr"(start));
 
   T result{};
+
+  #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+  #pragma clang loop unroll(disable)
+  #endif
   for (auto input : inputs) {
     asm("" ::"r"(input));
     result = f(input);
@@ -163,6 +171,10 @@ static LIBC_INLINE uint64_t throughput_baseline(
   asm("" ::"llr"(start));
 
   T result{};
+
+  #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+  #pragma clang loop unroll(disable)
+  #endif
   for (size_t i = 0; i < N; i++) {
     T x = inputs1[i];
     T y = inputs2[i];
@@ -195,6 +207,10 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs1,
   asm("" ::"llr"(start));
 
   T result{};
+
+  #if defined(LIBC_GPU_BENCHMARKS_DISABLE_UNROLL)
+  #pragma clang loop unroll(disable)
+  #endif
   for (size_t i = 0; i < N; i++) {
     T x = inputs1[i];
     T y = inputs2[i];

@leandrolcampos

Here's what I get on my NVIDIA GeForce RTX 4070 Laptop GPU.

[1/4] Running hermetic test libc.benchmarks.gpu.src.ctype.isalnum_benchmark
Running Suite: LlvmLibcIsAlNumGpuBenchmark
Benchmark                |  Cycles (Mean) |   Stddev |     Min |     Max |     Iterations |  Threads |
------------------------------------------------------------------------------------------------------
IsAlnum                  |             53 |        0 |      53 |      53 |          11904 |       64 |
IsAlnumSingleThread      |             53 |        0 |      53 |      53 |            186 |        1 |
IsAlnumSingleWave        |             53 |        0 |      53 |      53 |           5952 |       32 |
IsAlnumCapital           |             53 |        0 |      53 |      53 |          11904 |       64 |
IsAlnumNotAlnum          |             43 |        0 |      43 |      43 |          11904 |       64 |
[2/4] Running hermetic test libc.benchmarks.gpu.src.ctype.isalpha_benchmark
Running Suite: LlvmLibcIsAlphaGpuBenchmark
Benchmark                |  Cycles (Mean) |   Stddev |     Min |     Max |     Iterations |  Threads |
------------------------------------------------------------------------------------------------------
IsAlpha                  |             53 |        0 |      53 |      53 |            186 |        1 |
[3/4] Running hermetic test libc.benchmarks.gpu.src.math.sin_benchmark
Running Suite: LlvmLibcSinGpuBenchmark
Benchmark                |  Cycles (Mean) |   Stddev |     Min |     Max |     Iterations |  Threads |
------------------------------------------------------------------------------------------------------
Sin_1                    |           3122 |      153 |    2933 |    3607 |        2735008 |       32 |
Sin_128                  |           2696 |       15 |    2651 |    2739 |          17024 |       32 |
Sin_1024                 |           2881 |        5 |    2872 |    2890 |           1344 |       32 |
Sin_4096                 |           2895 |        2 |    2891 |    2899 |            352 |       32 |
SinTwoPi_1               |           2219 |       12 |    2204 |    2517 |          24032 |       32 |
SinTwoPi_128             |           2047 |        2 |    2044 |    2051 |           1344 |       32 |
SinTwoPi_1024            |           2253 |        0 |    2253 |    2254 |            576 |       32 |
SinTwoPi_4096            |           2272 |        0 |    2272 |    2272 |            352 |       32 |
SinTwoPow30_1            |           3135 |       17 |    3111 |    3364 |           8480 |       32 |
SinTwoPow30_128          |           2734 |        1 |    2732 |    2736 |            352 |       32 |
SinTwoPow30_1024         |           2940 |        0 |    2940 |    2941 |            352 |       32 |
SinTwoPow30_4096         |           2958 |        0 |    2958 |    2959 |            352 |       32 |
SinVeryLarge_1           |           2858 |       16 |    2823 |    3093 |           8480 |       32 |
SinVeryLarge_128         |           2402 |        2 |    2398 |    2406 |            352 |       32 |
SinVeryLarge_1024        |           2599 |        0 |    2599 |    2600 |            352 |       32 |
SinVeryLarge_4096        |           2615 |        0 |    2615 |    2615 |            352 |       32 |
NvSin_1                  |           2522 |       69 |    2261 |    2880 |           5952 |       32 |
NvSin_128                |           1826 |        2 |    1824 |    1830 |            576 |       32 |
NvSin_1024               |           2035 |        0 |    2035 |    2036 |            352 |       32 |
NvSin_4096               |           2053 |        0 |    2053 |    2053 |            352 |       32 |
NvSinTwoPi_1             |           1107 |        1 |    1104 |    1108 |           2880 |       32 |
NvSinTwoPi_128           |            891 |        0 |     891 |     891 |            352 |       32 |
NvSinTwoPi_1024          |           1102 |        0 |    1101 |    1102 |            352 |       32 |
NvSinTwoPi_4096          |           1122 |        0 |    1122 |    1122 |            352 |       32 |
NvSinTwoPow30_1          |           1106 |        1 |    1105 |    1108 |           1344 |       32 |
NvSinTwoPow30_128        |            891 |        0 |     891 |     891 |            352 |       32 |
NvSinTwoPow30_1024       |           1101 |        0 |    1101 |    1101 |            352 |       32 |
NvSinTwoPow30_4096       |           1122 |        0 |    1122 |    1122 |            352 |       32 |
NvSinVeryLarge_1         |           2497 |       23 |    2251 |    2845 |          12032 |       32 |
NvSinVeryLarge_128       |           1790 |        1 |    1789 |    1792 |            576 |       32 |
NvSinVeryLarge_1024      |           1999 |        0 |    1999 |    1999 |            352 |       32 |
NvSinVeryLarge_4096      |           2019 |        0 |    2019 |    2019 |            352 |       32 |
Sinf_1                   |           2201 |      170 |    1522 |    2400 |         507776 |       32 |
Sinf_128                 |           1872 |       13 |    1830 |    1898 |           2880 |       32 |
Sinf_1024                |           2056 |        5 |    2047 |    2068 |           1984 |       32 |
Sinf_4096                |           2093 |        3 |    2088 |    2098 |            352 |       32 |
SinfTwoPi_1              |           1442 |       11 |    1426 |    1759 |          33856 |       32 |
SinfTwoPi_128            |           1126 |        1 |    1125 |    1129 |            352 |       32 |
SinfTwoPi_1024           |           1314 |        0 |    1314 |    1315 |            352 |       32 |
SinfTwoPi_4096           |           1350 |        0 |    1350 |    1350 |            352 |       32 |
SinfTwoPow30_1           |           1088 |       10 |    1080 |    1162 |           1984 |       32 |
SinfTwoPow30_128         |            771 |        1 |     771 |     774 |           1984 |       32 |
SinfTwoPow30_1024        |            961 |        0 |     960 |     962 |            352 |       32 |
SinfTwoPow30_4096        |            997 |        0 |     997 |     997 |            352 |       32 |
SinfVeryLarge_1          |           1925 |       14 |    1869 |    2282 |          24032 |       32 |
SinfVeryLarge_128        |           1598 |        1 |    1598 |    1600 |            352 |       32 |
SinfVeryLarge_1024       |           1788 |        0 |    1787 |    1789 |            352 |       32 |
SinfVeryLarge_4096       |           1824 |        0 |    1824 |    1824 |            352 |       32 |
NvSinf_1                 |           1024 |        6 |    1019 |    1043 |           1984 |       32 |
NvSinf_128               |            742 |        0 |     742 |     744 |            576 |       32 |
NvSinf_1024              |            932 |        0 |     932 |     933 |            352 |       32 |
NvSinf_4096              |            967 |        0 |     967 |     967 |            352 |       32 |
NvSinfTwoPi_1            |            162 |        3 |     162 |     497 |         362464 |       32 |
NvSinfTwoPi_128          |            107 |        0 |     107 |     109 |           2880 |       32 |
NvSinfTwoPi_1024         |            297 |        0 |     297 |     297 |            352 |       32 |
NvSinfTwoPi_4096         |            334 |        0 |     334 |     334 |            352 |       32 |
NvSinfTwoPow30_1         |           1026 |       11 |    1018 |    1281 |          33856 |       32 |
NvSinfTwoPow30_128       |            742 |        0 |     741 |     742 |            896 |       32 |
NvSinfTwoPow30_1024      |            931 |        0 |     931 |     931 |            352 |       32 |
NvSinfTwoPow30_4096      |            967 |        0 |     967 |     967 |            352 |       32 |
NvSinfVeryLarge_1        |           1003 |        1 |    1000 |    1004 |           1984 |       32 |
NvSinfVeryLarge_128      |            723 |        0 |     723 |     723 |            352 |       32 |
NvSinfVeryLarge_1024     |            913 |        0 |     913 |     913 |            352 |       32 |
NvSinfVeryLarge_4096     |            949 |        0 |     949 |     949 |            352 |       32 |
[4/4] Running hermetic test libc.benchmarks.gpu.src.math.atan2_benchmark
Running Suite: LlvmLibcAtan2GpuBenchmark
Benchmark                |  Cycles (Mean) |   Stddev |     Min |     Max |     Iterations |  Threads |
------------------------------------------------------------------------------------------------------
Atan2_1                  |           4082 |      954 |    1892 |    5271 |          24032 |       32 |
Atan2_128                |           3852 |       80 |    3531 |    4112 |         131648 |       32 |
Atan2_1024               |           4083 |       31 |    3991 |    4150 |           2880 |       32 |
Atan2_4096               |           4080 |       16 |    4058 |    4111 |            576 |       32 |
Atan2TwoPi_1             |           2738 |       16 |    2728 |    3162 |          24032 |       32 |
Atan2TwoPi_128           |           2511 |        2 |    2508 |    2515 |            352 |       32 |
Atan2TwoPi_1024          |           2743 |        0 |    2742 |    2743 |            352 |       32 |
Atan2TwoPi_4096          |           2744 |        0 |    2744 |    2745 |            352 |       32 |
Atan2TwoPow30_1          |           2734 |       15 |    2721 |    3148 |          24032 |       32 |
Atan2TwoPow30_128        |           2517 |        2 |    2512 |    2525 |           1344 |       32 |
Atan2TwoPow30_1024       |           2743 |        0 |    2743 |    2744 |            352 |       32 |
Atan2TwoPow30_4096       |           2744 |        0 |    2744 |    2744 |            352 |       32 |
Atan2Large_1             |           3570 |      382 |    1125 |    3882 |         131648 |       32 |
Atan2Large_128           |           3352 |       37 |    3280 |    3421 |           1984 |       32 |
Atan2Large_1024          |           3578 |       10 |    3554 |    3601 |           1984 |       32 |
Atan2Large_4096          |           3576 |        6 |    3566 |    3586 |            576 |       32 |
NvAtan2_1                |           2909 |       38 |    2866 |    3339 |          17024 |       32 |
NvAtan2_128              |           2801 |        2 |    2798 |    2805 |            352 |       32 |
NvAtan2_1024             |           3040 |        1 |    3039 |    3041 |            352 |       32 |
NvAtan2_4096             |           3041 |        1 |    3040 |    3042 |            352 |       32 |
NvAtan2TwoPi_1           |           2032 |       13 |    2032 |    2386 |          24032 |       32 |
NvAtan2TwoPi_128         |           1945 |        1 |    1945 |    1947 |            352 |       32 |
NvAtan2TwoPi_1024        |           2185 |        0 |    2184 |    2185 |            352 |       32 |
NvAtan2TwoPi_4096        |           2185 |        0 |    2185 |    2186 |            352 |       32 |
NvAtan2TwoPow30_1        |           2032 |        8 |    2032 |    2184 |          12032 |       32 |
NvAtan2TwoPow30_128      |           1945 |        1 |    1945 |    1951 |            896 |       32 |
NvAtan2TwoPow30_1024     |           2184 |        0 |    2184 |    2184 |            352 |       32 |
NvAtan2TwoPow30_4096     |           2185 |        0 |    2185 |    2186 |            352 |       32 |
NvAtan2Large_1           |           2032 |       12 |    2032 |    2359 |          24032 |       32 |
NvAtan2Large_128         |           1945 |        1 |    1945 |    1951 |            896 |       32 |
NvAtan2Large_1024        |           2184 |        0 |    2184 |    2185 |            352 |       32 |
NvAtan2Large_4096        |           2185 |        0 |    2185 |    2185 |            352 |       32 |

github-actions bot commented Aug 16, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@jhuber6 jhuber6 left a comment

Unsure if we need an option for this, as long as it's consistent behavior.

@leandrolcampos leandrolcampos changed the title from "[libc][gpu] Disable loop unrolling in the throughput benchmark loop by default" to "[libc][gpu] Disable loop unrolling in the throughput benchmark loop" Aug 16, 2025
@leandrolcampos

Unsure if we need an option for this, as long as it's consistent behavior.

I removed the option.

@jhuber6 jhuber6 enabled auto-merge (squash) August 16, 2025 20:07
@jhuber6 jhuber6 merged commit 75bf739 into llvm:main Aug 16, 2025
17 of 19 checks passed