Commit 75bf739

[libc][gpu] Disable loop unrolling in the throughput benchmark loop (#153971)
This patch makes GPU throughput benchmark results more comparable across targets by disabling loop unrolling in the benchmark loop.

Motivation:
* PTX (post-LTO) evidence on NVPTX: for libc `sin`, the generated PTX shows the `throughput` loop unrolled 8x at `N=128` (one iteration advances the input pointer by 64 bytes = 8 doubles), interleaving eight independent chains before the back-edge. This hides latency and significantly reduces cycles/call as the batch size `N` grows.
* Observed scaling (NVPTX measurements): with unrolling enabled, `sin` dropped from ~3100 cycles/call at `N=1` to ~360 at `N=128`. After enforcing `#pragma clang loop unroll(disable)`, results stabilized (e.g., ~3100 cycles/call at `N=1` and ~2700 at `N=128`).
* libdevice contrast: the libdevice `sin` path did not exhibit a similar drop in our measurements, and the PTX appears as compact internal calls rather than a long FMA chain, leaving less ILP for the outer loop to extract.

What this change does:
* Applies `#pragma clang loop unroll(disable)` to the GPU `throughput()` and `throughput_baseline()` loops in both the NVPTX and AMDGPU timing headers.

Leaving unrolling entirely to the optimizer skews comparisons that are meant to be apples-to-apples (e.g., libc vs. vendor), because the optimizer may unroll one implementation's loop more aggressively than the other's. Disabling unrolling yields fairer, more consistent numbers.
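The pattern described above can be sketched on the host as follows. This is a minimal illustration, not the code from `timing.h`: `poly` and `run_no_unroll` are hypothetical names, `std::array` stands in for `cpp::array`, and the `__clang__` guard is an assumption added so that other compilers skip the Clang-specific pragma. The `static_assert` restates the unroll-factor arithmetic from the message (64 bytes per iteration, 8-byte doubles).

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the function under test (not libc `sin`).
inline double poly(double x) { return x * x + 1.0; }

// Unroll factor implied by the PTX observation: one loop iteration
// advanced the input pointer by 64 bytes, and each input is a double.
constexpr std::size_t kBytesPerIteration = 64;
constexpr std::size_t kUnrollFactor = kBytesPerIteration / sizeof(double);
static_assert(kUnrollFactor == 8, "64 bytes per iteration = 8 doubles");

// Applies f to every input with unrolling disabled under Clang.
// The pragma only affects code generation; the returned value is the
// result of the last call, with or without unrolling.
template <typename F, std::size_t N>
double run_no_unroll(F f, const std::array<double, N> &inputs) {
  double result{};
#if defined(__clang__)
#pragma clang loop unroll(disable)
#endif
  for (double input : inputs)
    result = f(input);
  return result;
}
```

With the pragma in place, Clang is asked to keep one call per loop iteration, so the optimizer should not interleave calls from neighboring iterations to hide latency.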
1 parent 3acb679 commit 75bf739

2 files changed: +16, -0 lines changed

libc/benchmarks/gpu/timing/amdgpu/timing.h

Lines changed: 8 additions & 0 deletions
@@ -117,6 +117,8 @@ throughput_baseline(const cpp::array<T, N> &inputs) {
   asm("" ::"s"(start));
 
   T result{};
+
+#pragma clang loop unroll(disable)
   for (auto input : inputs) {
     asm("" ::"v"(input));
     result = input;
@@ -146,6 +148,8 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs) {
   asm("" ::"s"(start));
 
   T result{};
+
+#pragma clang loop unroll(disable)
   for (auto input : inputs) {
     asm("" ::"v"(input));
     result = f(input);
@@ -174,6 +178,8 @@ static LIBC_INLINE uint64_t throughput_baseline(
   asm("" ::"s"(start));
 
   T result{};
+
+#pragma clang loop unroll(disable)
   for (size_t i = 0; i < N; i++) {
     T x = inputs1[i];
     T y = inputs2[i];
@@ -206,6 +212,8 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs1,
   asm("" ::"s"(start));
 
   T result{};
+
+#pragma clang loop unroll(disable)
   for (size_t i = 0; i < N; i++) {
     T x = inputs1[i];
     T y = inputs2[i];

libc/benchmarks/gpu/timing/nvptx/timing.h

Lines changed: 8 additions & 0 deletions
@@ -106,6 +106,8 @@ throughput_baseline(const cpp::array<T, N> &inputs) {
   asm("" ::"llr"(start));
 
   T result{};
+
+#pragma clang loop unroll(disable)
   for (auto input : inputs) {
     asm("" ::"r"(input));
     result = input;
@@ -135,6 +137,8 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs) {
   asm("" ::"llr"(start));
 
   T result{};
+
+#pragma clang loop unroll(disable)
   for (auto input : inputs) {
     asm("" ::"r"(input));
     result = f(input);
@@ -163,6 +167,8 @@ static LIBC_INLINE uint64_t throughput_baseline(
   asm("" ::"llr"(start));
 
   T result{};
+
+#pragma clang loop unroll(disable)
   for (size_t i = 0; i < N; i++) {
     T x = inputs1[i];
     T y = inputs2[i];
@@ -195,6 +201,8 @@ static LIBC_INLINE uint64_t throughput(F f, const cpp::array<T, N> &inputs1,
   asm("" ::"llr"(start));
 
   T result{};
+
+#pragma clang loop unroll(disable)
   for (size_t i = 0; i < N; i++) {
     T x = inputs1[i];
     T y = inputs2[i];
