[libc] Improve GPU benchmarking #153512
Conversation
Preliminary Results (NVIDIA GeForce RTX 4070 Laptop GPU)
[1/4] Running hermetic test libc.benchmarks.gpu.src.ctype.isalnum_benchmark
Running Suite: LlvmLibcIsAlNumGpuBenchmark
Benchmark | Cycles | Min | Max | Iterations | Time / Iteration | Stddev | Threads |
--------------------------------------------------------------------------------------------------------------
IsAlnum | 53 | 53 | 53 | 156 | 3 us | 0 | 64 |
IsAlnumSingleThread | 53 | 53 | 53 | 157 | 3 us | 0 | 1 |
IsAlnumSingleWave | 53 | 53 | 53 | 155 | 3 us | 0 | 32 |
IsAlnumCapital | 53 | 53 | 53 | 157 | 3 us | 0 | 64 |
IsAlnumNotAlnum | 43 | 43 | 43 | 163 | 3 us | 0 | 64 |
[2/4] Running hermetic test libc.benchmarks.gpu.src.ctype.isalpha_benchmark
Running Suite: LlvmLibcIsAlphaGpuBenchmark
Benchmark | Cycles | Min | Max | Iterations | Time / Iteration | Stddev | Threads |
--------------------------------------------------------------------------------------------------------------
IsAlpha | 53 | 53 | 53 | 156 | 3 us | 0 | 1 |
[3/4] Running hermetic test libc.benchmarks.gpu.src.math.sin_benchmark
Running Suite: LlvmLibcSinGpuBenchmark
Benchmark | Cycles | Min | Max | Iterations | Time / Iteration | Stddev | Threads |
--------------------------------------------------------------------------------------------------------------
Sin_1 | 3087 | 2946 | 3637 | 202 | 17 us | 159 | 32 |
Sin_128 | 362 | 354 | 372 | 26 | 64 us | 5 | 32 |
Sin_1024 | 352 | 348 | 358 | 23 | 405 us | 2 | 32 |
Sin_4096 | 359 | 358 | 361 | 7 | 1 ms | 1 | 32 |
SinTwoPi_1 | 2205 | 2186 | 2506 | 29 | 17 us | 56 | 32 |
SinTwoPi_128 | 262 | 259 | 267 | 10 | 52 us | 2 | 32 |
SinTwoPi_1024 | 271 | 271 | 275 | 16 | 319 us | 0 | 32 |
SinTwoPi_4096 | 280 | 280 | 281 | 9 | 1 ms | 0 | 32 |
SinTwoPow30_1 | 3104 | 3086 | 3174 | 28 | 18 us | 16 | 32 |
SinTwoPow30_128 | 348 | 345 | 352 | 9 | 60 us | 1 | 32 |
SinTwoPow30_1024 | 358 | 357 | 359 | 7 | 380 us | 0 | 32 |
SinTwoPow30_4096 | 366 | 366 | 367 | 6 | 1 ms | 0 | 32 |
SinVeryLarge_1 | 2827 | 2788 | 3069 | 29 | 17 us | 46 | 32 |
SinVeryLarge_128 | 316 | 313 | 318 | 14 | 57 us | 1 | 32 |
SinVeryLarge_1024 | 316 | 315 | 320 | 16 | 348 us | 1 | 32 |
SinVeryLarge_4096 | 324 | 323 | 325 | 15 | 1 ms | 0 | 32 |
NvSin_1 | 2507 | 2262 | 2890 | 39 | 15 us | 95 | 32 |
NvSin_128 | 1862 | 1858 | 1870 | 5 | 145 us | 4 | 32 |
NvSin_1024 | 2066 | 2066 | 2068 | 5 | 1 ms | 0 | 32 |
NvSin_4096 | 2085 | 2085 | 2085 | 4 | 4 ms | 0 | 32 |
NvSinTwoPi_1 | 1103 | 1102 | 1105 | 35 | 14 us | 0 | 32 |
NvSinTwoPi_128 | 925 | 925 | 927 | 7 | 82 us | 0 | 32 |
NvSinTwoPi_1024 | 1134 | 1134 | 1134 | 4 | 665 us | 0 | 32 |
NvSinTwoPi_4096 | 1153 | 1153 | 1153 | 4 | 2 ms | 0 | 32 |
NvSinTwoPow30_1 | 1103 | 1102 | 1104 | 35 | 14 us | 0 | 32 |
NvSinTwoPow30_128 | 925 | 925 | 925 | 7 | 82 us | 0 | 32 |
NvSinTwoPow30_1024 | 1134 | 1134 | 1134 | 4 | 668 us | 0 | 32 |
NvSinTwoPow30_4096 | 1153 | 1153 | 1153 | 4 | 2 ms | 0 | 32 |
NvSinVeryLarge_1 | 2493 | 2470 | 2795 | 38 | 15 us | 50 | 32 |
NvSinVeryLarge_128 | 1827 | 1827 | 1829 | 5 | 141 us | 0 | 32 |
NvSinVeryLarge_1024 | 2033 | 2033 | 2034 | 5 | 1 ms | 0 | 32 |
NvSinVeryLarge_4096 | 2050 | 2050 | 2050 | 4 | 4 ms | 0 | 32 |
Sinf_1 | 2190 | 1524 | 2396 | 527 | 14 us | 174 | 32 |
Sinf_128 | 239 | 229 | 247 | 26 | 40 us | 4 | 32 |
Sinf_1024 | 241 | 236 | 249 | 8 | 233 us | 3 | 32 |
Sinf_4096 | 259 | 258 | 261 | 8 | 905 us | 1 | 32 |
SinfTwoPi_1 | 1447 | 1430 | 1753 | 39 | 14 us | 49 | 32 |
SinfTwoPi_128 | 147 | 146 | 149 | 19 | 34 us | 0 | 32 |
SinfTwoPi_1024 | 146 | 145 | 148 | 13 | 183 us | 0 | 32 |
SinfTwoPi_4096 | 165 | 165 | 167 | 23 | 704 us | 0 | 32 |
SinfTwoPow30_1 | 1084 | 1078 | 1163 | 35 | 14 us | 13 | 32 |
SinfTwoPow30_128 | 102 | 101 | 104 | 32 | 32 us | 0 | 32 |
SinfTwoPow30_1024 | 102 | 102 | 103 | 25 | 164 us | 0 | 32 |
SinfTwoPow30_4096 | 121 | 121 | 123 | 17 | 645 us | 0 | 32 |
SinfVeryLarge_1 | 1930 | 1870 | 2268 | 34 | 15 us | 59 | 32 |
SinfVeryLarge_128 | 205 | 205 | 207 | 18 | 38 us | 0 | 32 |
SinfVeryLarge_1024 | 205 | 205 | 207 | 10 | 218 us | 0 | 32 |
SinfVeryLarge_4096 | 224 | 224 | 226 | 14 | 845 us | 0 | 32 |
NvSinf_1 | 1020 | 1016 | 1032 | 37 | 13 us | 5 | 32 |
NvSinf_128 | 786 | 786 | 788 | 7 | 76 us | 0 | 32 |
NvSinf_1024 | 974 | 969 | 976 | 17 | 588 us | 2 | 32 |
NvSinf_4096 | 1008 | 1008 | 1009 | 4 | 2 ms | 0 | 32 |
NvSinfTwoPi_1 | 164 | 162 | 505 | 145 | 13 us | 28 | 32 |
NvSinfTwoPi_128 | 141 | 141 | 143 | 15 | 33 us | 0 | 32 |
NvSinfTwoPi_1024 | 330 | 330 | 331 | 7 | 272 us | 0 | 32 |
NvSinfTwoPi_4096 | 364 | 364 | 365 | 6 | 1 ms | 0 | 32 |
NvSinfTwoPow30_1 | 1024 | 1016 | 1272 | 64 | 14 us | 31 | 32 |
NvSinfTwoPow30_128 | 776 | 776 | 776 | 7 | 73 us | 0 | 32 |
NvSinfTwoPow30_1024 | 968 | 966 | 969 | 7 | 504 us | 1 | 32 |
NvSinfTwoPow30_4096 | 1002 | 1002 | 1002 | 4 | 1 ms | 0 | 32 |
NvSinfVeryLarge_1 | 1003 | 1001 | 1026 | 39 | 13 us | 3 | 32 |
NvSinfVeryLarge_128 | 758 | 758 | 758 | 9 | 60 us | 0 | 32 |
NvSinfVeryLarge_1024 | 950 | 950 | 951 | 4 | 478 us | 0 | 32 |
NvSinfVeryLarge_4096 | 983 | 983 | 984 | 4 | 1 ms | 0 | 32 |
[4/4] Running hermetic test libc.benchmarks.gpu.src.math.atan2_benchmark
Running Suite: LlvmLibcAtan2GpuBenchmark
Benchmark | Cycles | Min | Max | Iterations | Time / Iteration | Stddev | Threads |
--------------------------------------------------------------------------------------------------------------
Atan2_1 | 4082 | 1894 | 5241 | 723 | 14 us | 953 | 32 |
Atan2_128 | 2520 | 2454 | 2580 | 21 | 165 us | 33 | 32 |
Atan2_1024 | 2745 | 2723 | 2768 | 11 | 1 ms | 13 | 32 |
Atan2_4096 | 2750 | 2739 | 2761 | 11 | 5 ms | 6 | 32 |
Atan2TwoPi_1 | 2749 | 2731 | 3160 | 36 | 14 us | 69 | 32 |
Atan2TwoPi_128 | 1072 | 1065 | 1097 | 10 | 82 us | 8 | 32 |
Atan2TwoPi_1024 | 1302 | 1301 | 1304 | 4 | 668 us | 1 | 32 |
Atan2TwoPi_4096 | 1303 | 1303 | 1303 | 4 | 2 ms | 0 | 32 |
Atan2TwoPow30_1 | 2744 | 2729 | 3177 | 39 | 13 us | 70 | 32 |
Atan2TwoPow30_128 | 1075 | 1069 | 1101 | 10 | 84 us | 8 | 32 |
Atan2TwoPow30_1024 | 1302 | 1302 | 1304 | 4 | 677 us | 0 | 32 |
Atan2TwoPow30_4096 | 1303 | 1303 | 1304 | 4 | 2 ms | 0 | 32 |
Atan2Large_1 | 3577 | 1125 | 3888 | 142 | 14 us | 361 | 32 |
Atan2Large_128 | 1810 | 1770 | 1841 | 12 | 124 us | 17 | 32 |
Atan2Large_1024 | 2053 | 2050 | 2057 | 5 | 973 us | 2 | 32 |
Atan2Large_4096 | 2051 | 2047 | 2054 | 8 | 3 ms | 2 | 32 |
NvAtan2_1 | 2911 | 2866 | 3324 | 56 | 14 us | 64 | 32 |
NvAtan2_128 | 2838 | 2834 | 2849 | 6 | 180 us | 5 | 32 |
NvAtan2_1024 | 3075 | 3075 | 3077 | 4 | 1 ms | 0 | 32 |
NvAtan2_4096 | 3076 | 3076 | 3076 | 4 | 5 ms | 0 | 32 |
NvAtan2TwoPi_1 | 2040 | 2032 | 2382 | 42 | 13 us | 53 | 32 |
NvAtan2TwoPi_128 | 1980 | 1979 | 1993 | 9 | 130 us | 4 | 32 |
NvAtan2TwoPi_1024 | 2219 | 2219 | 2219 | 4 | 1 ms | 0 | 32 |
NvAtan2TwoPi_4096 | 2219 | 2219 | 2219 | 4 | 4 ms | 0 | 32 |
NvAtan2TwoPow30_1 | 2035 | 2032 | 2183 | 38 | 13 us | 24 | 32 |
NvAtan2TwoPow30_128 | 1980 | 1979 | 1993 | 9 | 132 us | 4 | 32 |
NvAtan2TwoPow30_1024 | 2218 | 2218 | 2219 | 5 | 1 ms | 0 | 32 |
NvAtan2TwoPow30_4096 | 2219 | 2219 | 2219 | 4 | 4 ms | 0 | 32 |
NvAtan2Large_1 | 2039 | 2032 | 2356 | 41 | 13 us | 49 | 32 |
NvAtan2Large_128 | 1980 | 1979 | 1998 | 11 | 132 us | 5 | 32 |
NvAtan2Large_1024 | 2218 | 2218 | 2219 | 4 | 1 ms | 0 | 32 |
NvAtan2Large_4096 | 2219 | 2219 | 2220 | 4 | 4 ms | 0 | 32 |
Here's what I get on my AMD GPU.
@llvm/pr-subscribers-libc

Author: Leandro Lacerda (leandrolcampos)

Changes

This patch improves the GPU benchmarking in this way:
TODO (before merge)
Follow-ups (future PRs)
Patch is 34.73 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/153512.diff

13 Files Affected:
diff --git a/libc/benchmarks/gpu/CMakeLists.txt b/libc/benchmarks/gpu/CMakeLists.txt
index 6ec64bf270b53..ce3b0228c2076 100644
--- a/libc/benchmarks/gpu/CMakeLists.txt
+++ b/libc/benchmarks/gpu/CMakeLists.txt
@@ -22,8 +22,6 @@ function(add_benchmark benchmark_name)
${BENCHMARK_LINK_LIBRARIES}
DEPENDS
libc.src.stdio.printf
- libc.src.stdlib.srand
- libc.src.stdlib.rand
${BENCHMARK_DEPENDS}
${BENCHMARK_UNPARSED_ARGUMENTS}
COMPILE_OPTIONS
@@ -64,8 +62,6 @@ add_unittest_framework_library(
libc.src.__support.FPUtil.sqrt
libc.src.__support.fixedvector
libc.src.time.clock
- libc.src.stdlib.rand
- libc.src.stdlib.srand
libc.benchmarks.gpu.timing.timing
libc.src.stdio.printf
)
diff --git a/libc/benchmarks/gpu/LibcGpuBenchmark.cpp b/libc/benchmarks/gpu/LibcGpuBenchmark.cpp
index 57ff5b9fdb846..28a4ebfc6df19 100644
--- a/libc/benchmarks/gpu/LibcGpuBenchmark.cpp
+++ b/libc/benchmarks/gpu/LibcGpuBenchmark.cpp
@@ -1,4 +1,5 @@
#include "LibcGpuBenchmark.h"
+#include "hdr/stdint_proxy.h"
#include "src/__support/CPP/algorithm.h"
#include "src/__support/CPP/array.h"
#include "src/__support/CPP/atomic.h"
@@ -9,7 +10,6 @@
#include "src/__support/macros/config.h"
#include "src/__support/time/gpu/time_utils.h"
#include "src/stdio/printf.h"
-#include "src/stdlib/srand.h"
namespace LIBC_NAMESPACE_DECL {
namespace benchmarks {
@@ -139,10 +139,8 @@ void print_header() {
void Benchmark::run_benchmarks() {
uint64_t id = gpu::get_thread_id();
- if (id == 0) {
+ if (id == 0)
print_header();
- LIBC_NAMESPACE::srand(gpu::processor_clock());
- }
gpu::sync_threads();
@@ -163,70 +161,72 @@ void Benchmark::run_benchmarks() {
gpu::sync_threads();
}
-BenchmarkResult benchmark(const BenchmarkOptions &options,
- cpp::function<uint64_t(void)> wrapper_func) {
+BenchmarkResult
+benchmark(const BenchmarkOptions &options,
+ const cpp::function<uint64_t(uint32_t)> &wrapper_func) {
BenchmarkResult result;
RuntimeEstimationProgression rep;
- uint32_t total_iterations = 0;
uint32_t iterations = options.initial_iterations;
+
if (iterations < 1u)
iterations = 1;
uint32_t samples = 0;
uint64_t total_time = 0;
- uint64_t best_guess = 0;
- uint64_t cycles_squared = 0;
uint64_t min = UINT64_MAX;
uint64_t max = 0;
- uint64_t overhead = UINT64_MAX;
- int overhead_iterations = 10;
- for (int i = 0; i < overhead_iterations; i++)
- overhead = cpp::min(overhead, LIBC_NAMESPACE::overhead());
+ uint32_t call_index = 0;
for (int64_t time_budget = options.max_duration; time_budget >= 0;) {
- uint64_t sample_cycles = 0;
- const clock_t start = static_cast<double>(clock());
- for (uint32_t i = 0; i < iterations; i++) {
- auto wrapper_intermediate = wrapper_func();
- uint64_t current_result = wrapper_intermediate - overhead;
+ RefinableRuntimeEstimator sample_estimator;
+
+ const clock_t start = clock();
+ while (sample_estimator.get_iterations() < iterations) {
+ auto current_result = wrapper_func(call_index++);
max = cpp::max(max, current_result);
min = cpp::min(min, current_result);
- sample_cycles += current_result;
+ sample_estimator.update(current_result);
}
const clock_t end = clock();
+
const clock_t duration_ns =
((end - start) * 1000 * 1000 * 1000) / CLOCKS_PER_SEC;
total_time += duration_ns;
time_budget -= duration_ns;
samples++;
- cycles_squared += sample_cycles * sample_cycles;
- total_iterations += iterations;
- const double change_ratio =
- rep.compute_improvement({iterations, sample_cycles});
- best_guess = rep.current_estimation;
+ const double change_ratio = rep.compute_improvement(sample_estimator);
if (samples >= options.max_samples || iterations >= options.max_iterations)
break;
+
+ const auto total_iterations = rep.get_estimator().get_iterations();
+
if (total_time >= options.min_duration && samples >= options.min_samples &&
total_iterations >= options.min_iterations &&
change_ratio < options.epsilon)
break;
- iterations *= options.scaling_factor;
+ iterations = static_cast<uint32_t>(iterations * options.scaling_factor);
}
- result.cycles = best_guess;
- result.standard_deviation = fputil::sqrt<double>(
- static_cast<double>(cycles_squared) / total_iterations -
- static_cast<double>(best_guess * best_guess));
+
+ const auto &estimator = rep.get_estimator();
+ result.cycles = static_cast<uint64_t>(estimator.get_mean());
+ result.standard_deviation = estimator.get_stddev();
+
result.min = min;
result.max = max;
result.samples = samples;
- result.total_iterations = total_iterations;
- result.total_time = total_time / total_iterations;
+
+ result.total_iterations = estimator.get_iterations();
+ if (result.total_iterations > 0)
+ result.total_time = total_time / result.total_iterations;
+ else
+ result.total_time = 0;
+
return result;
-};
+}
} // namespace benchmarks
} // namespace LIBC_NAMESPACE_DECL
diff --git a/libc/benchmarks/gpu/LibcGpuBenchmark.h b/libc/benchmarks/gpu/LibcGpuBenchmark.h
index a6cf62dd30ce5..c4088d90f80fa 100644
--- a/libc/benchmarks/gpu/LibcGpuBenchmark.h
+++ b/libc/benchmarks/gpu/LibcGpuBenchmark.h
@@ -4,14 +4,15 @@
#include "benchmarks/gpu/BenchmarkLogger.h"
#include "benchmarks/gpu/timing/timing.h"
#include "hdr/stdint_proxy.h"
+#include "src/__support/CPP/algorithm.h"
#include "src/__support/CPP/array.h"
#include "src/__support/CPP/functional.h"
#include "src/__support/CPP/limits.h"
#include "src/__support/CPP/string_view.h"
#include "src/__support/CPP/type_traits.h"
#include "src/__support/FPUtil/FPBits.h"
+#include "src/__support/FPUtil/sqrt.h"
#include "src/__support/macros/config.h"
-#include "src/stdlib/rand.h"
#include "src/time/clock.h"
namespace LIBC_NAMESPACE_DECL {
@@ -30,40 +31,82 @@ struct BenchmarkOptions {
double scaling_factor = 1.4;
};
-struct Measurement {
+class RefinableRuntimeEstimator {
uint32_t iterations = 0;
- uint64_t elapsed_cycles = 0;
-};
-
-class RefinableRuntimeEstimation {
- uint64_t total_cycles = 0;
- uint32_t total_iterations = 0;
+ uint64_t sum_of_cycles = 0;
+ uint64_t sum_of_squared_cycles = 0;
public:
- uint64_t update(const Measurement &M) {
- total_cycles += M.elapsed_cycles;
- total_iterations += M.iterations;
- return total_cycles / total_iterations;
+ void update(uint64_t cycles) noexcept {
+ iterations += 1;
+ sum_of_cycles += cycles;
+ sum_of_squared_cycles += cycles * cycles;
+ }
+
+ void update(const RefinableRuntimeEstimator &other) noexcept {
+ iterations += other.iterations;
+ sum_of_cycles += other.sum_of_cycles;
+ sum_of_squared_cycles += other.sum_of_squared_cycles;
}
+
+ double get_mean() const noexcept {
+ if (iterations == 0)
+ return 0.0;
+
+ return static_cast<double>(sum_of_cycles) / iterations;
+ }
+
+ double get_variance() const noexcept {
+ if (iterations == 0)
+ return 0.0;
+
+ const double num = static_cast<double>(iterations);
+ const double sum_x = static_cast<double>(sum_of_cycles);
+ const double sum_x2 = static_cast<double>(sum_of_squared_cycles);
+
+ const double mean_of_squares = sum_x2 / num;
+ const double mean = sum_x / num;
+ const double mean_squared = mean * mean;
+ const double variance = mean_of_squares - mean_squared;
+
+ return variance < 0.0 ? 0.0 : variance;
+ }
+
+ double get_stddev() const noexcept {
+ return fputil::sqrt<double>(get_variance());
+ }
+
+ uint32_t get_iterations() const noexcept { return iterations; }
};
// Tracks the progression of the runtime estimation
class RuntimeEstimationProgression {
- RefinableRuntimeEstimation rre;
+ RefinableRuntimeEstimator estimator;
+ double current_mean = 0.0;
public:
- uint64_t current_estimation = 0;
+ const RefinableRuntimeEstimator &get_estimator() const noexcept {
+ return estimator;
+ }
- double compute_improvement(const Measurement &M) {
- const uint64_t new_estimation = rre.update(M);
- double ratio =
- (static_cast<double>(current_estimation) / new_estimation) - 1.0;
+ double
+ compute_improvement(const RefinableRuntimeEstimator &sample_estimator) {
+ if (sample_estimator.get_iterations() == 0)
+ return 1.0;
- // Get absolute value
+ estimator.update(sample_estimator);
+
+ const double new_mean = estimator.get_mean();
+ if (current_mean == 0.0 || new_mean == 0.0) {
+ current_mean = new_mean;
+ return 1.0;
+ }
+
+ double ratio = (current_mean / new_mean) - 1.0;
if (ratio < 0)
- ratio *= -1;
+ ratio = -ratio;
- current_estimation = new_estimation;
+ current_mean = new_mean;
return ratio;
}
};
@@ -78,17 +121,18 @@ struct BenchmarkResult {
clock_t total_time = 0;
};
-BenchmarkResult benchmark(const BenchmarkOptions &options,
- cpp::function<uint64_t(void)> wrapper_func);
+BenchmarkResult
+benchmark(const BenchmarkOptions &options,
+ const cpp::function<uint64_t(uint32_t)> &wrapper_func);
class Benchmark {
- const cpp::function<uint64_t(void)> func;
+ const cpp::function<uint64_t(uint32_t)> func;
const cpp::string_view suite_name;
const cpp::string_view test_name;
const uint32_t num_threads;
public:
- Benchmark(cpp::function<uint64_t(void)> func, char const *suite_name,
+ Benchmark(cpp::function<uint64_t(uint32_t)> func, char const *suite_name,
char const *test_name, uint32_t num_threads)
: func(func), suite_name(suite_name), test_name(test_name),
num_threads(num_threads) {
@@ -109,63 +153,135 @@ class Benchmark {
}
};
-// We want our random values to be approximately
-// Output: a random number with the exponent field between min_exp and max_exp,
-// i.e. 2^min_exp <= |real_value| < 2^(max_exp + 1),
-// Caveats:
-// -EXP_BIAS corresponding to denormal values,
-// EXP_BIAS + 1 corresponding to inf or nan.
+class RandomGenerator {
+ uint64_t state;
+
+ static LIBC_INLINE uint64_t splitmix64(uint64_t x) noexcept {
+ x += 0x9E3779B97F4A7C15ULL;
+ x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
+ x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
+ x = (x ^ (x >> 31));
+ return x ? x : 0x9E3779B97F4A7C15ULL;
+ }
+
+public:
+ explicit LIBC_INLINE RandomGenerator(uint64_t seed) noexcept
+ : state(splitmix64(seed)) {}
+
+ LIBC_INLINE uint64_t next64() noexcept {
+ uint64_t x = state;
+ x ^= x >> 12;
+ x ^= x << 25;
+ x ^= x >> 27;
+ state = x;
+ return x * 0x2545F4914F6CDD1DULL;
+ }
+
+ LIBC_INLINE uint32_t next32() noexcept {
+ return static_cast<uint32_t>(next64() >> 32);
+ }
+};
+
+// We want random floating-point values whose *unbiased* exponent e is
+// approximately uniform in [min_exp, max_exp]. That is,
+// 2^min_exp <= |value| < 2^(max_exp + 1).
+// Caveats / boundaries:
+// - e = -EXP_BIAS ==> subnormal range (biased exponent = 0). We ensure a
+// non-zero mantissa so we don't accidentally produce 0.
+// - e in [1 - EXP_BIAS, EXP_BIAS] ==> normal numbers.
+// - e = EXP_BIAS + 1 ==> Inf/NaN. We do not include it by default; max_exp
+// defaults to EXP_BIAS.
template <typename T>
static T
-get_rand_input(int max_exp = LIBC_NAMESPACE::fputil::FPBits<T>::EXP_BIAS,
- int min_exp = -LIBC_NAMESPACE::fputil::FPBits<T>::EXP_BIAS) {
+get_rand_input(RandomGenerator &rng,
+ int min_exp = -LIBC_NAMESPACE::fputil::FPBits<T>::EXP_BIAS,
+ int max_exp = LIBC_NAMESPACE::fputil::FPBits<T>::EXP_BIAS) {
using FPBits = LIBC_NAMESPACE::fputil::FPBits<T>;
-
- // Required to correctly instantiate FPBits for floats and doubles.
- using RandType = typename cpp::conditional_t<(cpp::is_same_v<T, double>),
- uint64_t, uint32_t>;
- RandType bits;
- if constexpr (cpp::is_same_v<T, uint64_t>)
- bits = (static_cast<uint64_t>(LIBC_NAMESPACE::rand()) << 32) |
- static_cast<uint64_t>(LIBC_NAMESPACE::rand());
- else
- bits = LIBC_NAMESPACE::rand();
- double scale =
- static_cast<double>(max_exp - min_exp + 1) / (2 * FPBits::EXP_BIAS + 1);
- FPBits fp(bits);
- fp.set_biased_exponent(
- static_cast<uint32_t>(fp.get_biased_exponent() * scale + min_exp));
- return fp.get_val();
+ using Storage = typename FPBits::StorageType;
+
+ // Sanitize and clamp requested range to what the format supports
+ if (min_exp > max_exp) {
+ auto tmp = min_exp;
+ min_exp = max_exp;
+ max_exp = tmp;
+ };
+ min_exp = cpp::max(min_exp, -FPBits::EXP_BIAS);
+ max_exp = cpp::min(max_exp, FPBits::EXP_BIAS);
+
+ // Sample unbiased exponent e uniformly in [min_exp, max_exp] without modulo
+ // bias
+ auto sample_in_range = [&](uint64_t r) -> int32_t {
+ const uint64_t range = static_cast<uint64_t>(
+ static_cast<int64_t>(max_exp) - static_cast<int64_t>(min_exp) + 1);
+ const uint64_t threshold = (-range) % range;
+ while (r < threshold)
+ r = rng.next64();
+ return static_cast<int32_t>(min_exp + static_cast<int64_t>(r % range));
+ };
+ const int32_t e = sample_in_range(rng.next64());
+
+ // Start from random bits to get random sign and mantissa
+ FPBits xbits([&] {
+ if constexpr (cpp::is_same_v<T, double>)
+ return FPBits(rng.next64());
+ else
+ return FPBits(rng.next32());
+ }());
+
+ if (e == -FPBits::EXP_BIAS) {
+ // Subnormal: biased exponent must be 0; ensure mantissa != 0 to avoid 0
+ xbits.set_biased_exponent(Storage(0));
+ if (xbits.get_mantissa() == Storage(0))
+ xbits.set_mantissa(Storage(1));
+ } else {
+ // Normal: biased exponent in [1, 2 * FPBits::EXP_BIAS]
+ const int32_t biased = e + FPBits::EXP_BIAS;
+ xbits.set_biased_exponent(static_cast<Storage>(biased));
+ }
+ return xbits.get_val();
}
template <typename T> class MathPerf {
- using FPBits = fputil::FPBits<T>;
- using StorageType = typename FPBits::StorageType;
- static constexpr StorageType UIntMax =
- cpp::numeric_limits<StorageType>::max();
+ static LIBC_INLINE uint64_t make_seed(uint64_t base_seed, uint64_t salt) {
+ const uint64_t tid = gpu::get_thread_id();
+ return base_seed ^ (salt << 32) ^ (tid * 0x9E3779B97F4A7C15ULL);
+ }
public:
+ // Returns cycles-per-call (lower is better)
template <size_t N = 1>
- static uint64_t run_throughput_in_range(T f(T), int min_exp, int max_exp) {
+ static uint64_t run_throughput_in_range(T f(T), int min_exp, int max_exp,
+ uint32_t call_index) {
cpp::array<T, N> inputs;
+
+ uint64_t base_seed = static_cast<uint64_t>(call_index);
+ uint64_t salt = static_cast<uint64_t>(N);
+ RandomGenerator rng(make_seed(base_seed, salt));
+
for (size_t i = 0; i < N; ++i)
- inputs[i] = get_rand_input<T>(min_exp, max_exp);
+ inputs[i] = get_rand_input<T>(rng, min_exp, max_exp);
uint64_t total_time = LIBC_NAMESPACE::throughput(f, inputs);
return total_time / N;
}
- // Throughput benchmarking for functions that take 2 inputs.
+ // Returns cycles-per-call (lower is better)
template <size_t N = 1>
static uint64_t run_throughput_in_range(T f(T, T), int arg1_min_exp,
int arg1_max_exp, int arg2_min_exp,
- int arg2_max_exp) {
+ int arg2_max_exp,
+ uint32_t call_index) {
cpp::array<T, N> inputs1;
cpp::array<T, N> inputs2;
+
+ uint64_t base_seed = static_cast<uint64_t>(call_index);
+ uint64_t salt = static_cast<uint64_t>(N);
+ RandomGenerator rng(make_seed(base_seed, salt));
+
for (size_t i = 0; i < N; ++i) {
- inputs1[i] = get_rand_input<T>(arg1_min_exp, arg1_max_exp);
- inputs2[i] = get_rand_input<T>(arg2_min_exp, arg2_max_exp);
+ inputs1[i] = get_rand_input<T>(rng, arg1_min_exp, arg1_max_exp);
+ inputs2[i] = get_rand_input<T>(rng, arg2_min_exp, arg2_max_exp);
}
uint64_t total_time = LIBC_NAMESPACE::throughput(f, inputs1, inputs2);
@@ -193,4 +309,5 @@ template <typename T> class MathPerf {
#define SINGLE_WAVE_BENCHMARK(SuiteName, TestName, Func) \
BENCHMARK_N_THREADS(SuiteName, TestName, Func, \
LIBC_NAMESPACE::gpu::get_lane_size())
-#endif
+
+#endif // LLVM_LIBC_BENCHMARKS_LIBC_GPU_BENCHMARK_H
diff --git a/libc/benchmarks/gpu/src/ctype/CMakeLists.txt b/libc/benchmarks/gpu/src/ctype/CMakeLists.txt
index f277624dbb901..77e2bbe538b1f 100644
--- a/libc/benchmarks/gpu/src/ctype/CMakeLists.txt
+++ b/libc/benchmarks/gpu/src/ctype/CMakeLists.txt
@@ -7,6 +7,7 @@ add_benchmark(
SRCS
isalnum_benchmark.cpp
DEPENDS
+ libc.hdr.stdint_proxy
libc.src.ctype.isalnum
LOADER_ARGS
--threads 64
@@ -19,5 +20,6 @@ add_benchmark(
SRCS
isalpha_benchmark.cpp
DEPENDS
+ libc.hdr.stdint_proxy
libc.src.ctype.isalpha
)
diff --git a/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp b/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp
index ffa5a99860bfc..28b1ee52c8dfa 100644
--- a/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp
+++ b/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp
@@ -1,8 +1,9 @@
#include "benchmarks/gpu/LibcGpuBenchmark.h"
+#include "hdr/stdint_proxy.h"
#include "src/ctype/isalnum.h"
-uint64_t BM_IsAlnum() {
+uint64_t BM_IsAlnum(uint32_t /*call_index*/) {
char x = 'c';
return LIBC_NAMESPACE::latency(LIBC_NAMESPACE::isalnum, x);
}
@@ -12,13 +13,13 @@ SINGLE_THREADED_BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnumSingleThread,
SINGLE_WAVE_BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnumSingleWave,
BM_IsAlnum);
-uint64_t BM_IsAlnumCapital() {
+uint64_t BM_IsAlnumCapital(uint32_t /*call_index*/) {
char x = 'A';
return LIBC_NAMESPACE::latency(LIBC_NAMESPACE::isalnum, x);
}
BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnumCapital, BM_IsAlnumCapital);
-uint64_t BM_IsAlnumNotAlnum() {
+uint64_t BM_IsAlnumNotAlnum(uint32_t /*call_index*/) {
char x = '{';
return LIBC_NAMESPACE::latency(LIBC_NAMESPACE::isalnum, x);
}
diff --git a/libc/benchmarks/gpu/src/ctype/isalpha_benchmark.cpp b/libc/benchmarks/gpu/src/ctype/isalpha_benchmark.cpp
index 2038eb89bc77b..bff4edea8b690 100644
--- a/libc/benchmarks/gpu/src/ctype/isalpha_benchmark.cpp
+++ b/libc/benchmarks/gpu/src/ctype/isalpha_benchmark.cpp
@@ -1,8 +1,9 @@
#include "benchmarks/gpu/LibcGpuBenchmark.h"
+#include "hdr/stdint_proxy.h"
#include "src/ctype/isalpha.h"
-uint64_t BM_IsAlpha() {
+uint64_t BM_IsAlpha(uint32_t /*call_index*/) {
char x = 'c';
return LIBC_NAMESPACE::latency(LIBC_NAMESPACE::isalpha, x);
}
diff --git a/libc/benchmarks/gpu/src/math/CMakeLists.txt b/libc/benchmarks/gpu/src/math/CMakeLists.txt
index 7a12ce4e61c9e..8417f23c124a0 100644
--- a/libc/benchmarks/gpu/src/math/CMakeLists.txt
+++ b/libc/benchmarks/gpu/src/math/CMakeLists.txt
@@ -34,11 +34,6 @@ add_benchmark(
libc.hdr.stdint_proxy
libc.src.math.sin
libc.src.math.sinf
- libc.src.stdlib.srand
- libc.src.stdlib.rand
- libc.src.__support.FPUtil.fp_bits
- libc.src.__support.CPP.bit
- libc.src.__support.CPP.array
COMPILE_OPTIONS
${math_benchmark_flags}
LOADER_ARGS
@@ -54,11 +49,6 @@ add_benchmark(
DEPENDS
libc.hdr.stdint_proxy
libc.src.math.atan2
- libc.src.stdlib.srand
- libc.src.stdlib.rand
- libc.src.__support.FPUtil.fp_bits
- libc.src.__support.CPP.bit
- libc.src.__support.CPP.array
COMPILE_OPTIONS
${math_benchmark_flags}
LOADER_ARGS
diff --git a/libc/benchmarks/gpu/src/math/atan2_benchmark.cpp b/libc/benchmarks/gpu/src/math/atan2_benchmark.cpp
index 1f91a9a35c373..82bb0c5d7de49 100644
--- a/libc/benchmarks/gpu/src/math/atan2_benchmark.cpp
+++ b/libc/benchmarks/gpu/src/math/atan2_benchmark.cpp
@@ -1,27 +1,27 @@
#include "benchmarks/gpu/LibcGpuBenchmark.h"
+#include "hdr/stdint_proxy.h"
#include "src/math/atan2.h"
-#include "src/stdlib/rand.h"
#if defined(NVPTX_MATH_FOUND) || defined(AMDGPU_MATH_FOUND)
#include "platform.h"
#endif
-#define BM_TWO_RANDOM_INPUT(T, Func, MIN_EXP, MAX_EXP, N) ...
[truncated]
@llvm/pr-subscribers-backend-amdgpu
LG, thanks for fixing this!
This patch improves the GPU benchmarking in this way:

- Replaces `rand`/`srand` with a deterministic per-thread RNG seeded by `call_index`: reproducible, apples-to-apples libc vs vendor comparisons (see the first sketch below).
- Samples input exponents uniformly in `[min_exp, max_exp]`, clamps the bounds, and skips `Inf`, `NaN`, `-0.0`, and `+0.0`.
- Tracks the mean and standard deviation (`sqrt(E[x^2] − E[x]^2)`) across samples (see the second sketch below).
- `benchmark()` gets cycles-per-call already corrected (no `overhead()` call).
- Updates the benchmark wrappers (take `call_index`, drop `rand/srand`, clean includes).
- Reports `Cycles (Mean)` and `Stddev`.
- Drops the `Time / Iteration` column from the results table: it reported per-thread convergence time (not per-call latency) and was redundant/misleading next to `Cycles (Mean)`.
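For readers who want to poke at the seeding scheme outside the libc tree, here is a minimal, standalone C++ sketch of what the patch describes: splitmix64 scrambles the seed, xorshift64* produces the stream, and the exponent is drawn by rejection sampling to avoid modulo bias. The free functions `make_seed` and `sample_exponent` and the explicit `thread_id` parameter are illustrative stand-ins for the patch's `MathPerf::make_seed` (which calls `gpu::get_thread_id()` internally) and the `sample_in_range` lambda in `get_rand_input`; this is not the libc-internal code.

```cpp
// Standalone sketch (plain C++17) of the deterministic per-call RNG scheme.
#include <cstdint>
#include <cstdio>

struct RandomGenerator {
  uint64_t state;

  // splitmix64: scrambles an arbitrary seed into a well-mixed non-zero state.
  static uint64_t splitmix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    x ^= x >> 31;
    return x ? x : 0x9E3779B97F4A7C15ULL; // never seed xorshift with 0
  }

  explicit RandomGenerator(uint64_t seed) : state(splitmix64(seed)) {}

  // xorshift64*: cheap, stateful 64-bit generator.
  uint64_t next64() {
    uint64_t x = state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    state = x;
    return x * 0x2545F4914F6CDD1DULL;
  }
};

// Mirrors the idea of MathPerf::make_seed: mixing call_index, the batch size
// N, and the thread id gives each (thread, call) pair a distinct but
// reproducible input sequence.
uint64_t make_seed(uint64_t call_index, uint64_t n, uint64_t thread_id) {
  return call_index ^ (n << 32) ^ (thread_id * 0x9E3779B97F4A7C15ULL);
}

// Sample an integer uniformly in [min_exp, max_exp] (assumes min_exp <=
// max_exp) without modulo bias, using the same rejection threshold as the
// patch: (-range) % range.
int32_t sample_exponent(RandomGenerator &rng, int32_t min_exp, int32_t max_exp) {
  const uint64_t range =
      static_cast<uint64_t>(int64_t(max_exp) - int64_t(min_exp) + 1);
  const uint64_t threshold = (-range) % range;
  uint64_t r = rng.next64();
  while (r < threshold)
    r = rng.next64();
  return static_cast<int32_t>(min_exp + int64_t(r % range));
}

int main() {
  // Two generators built from the same (call_index, N, tid) triple agree.
  RandomGenerator a(make_seed(/*call_index=*/7, /*N=*/128, /*thread_id=*/3));
  RandomGenerator b(make_seed(7, 128, 3));
  std::printf("reproducible: %d\n", a.next64() == b.next64());
  std::printf("exponent in [-5, 5]: %d\n", sample_exponent(a, -5, 5));
  return 0;
}
```

Under this scheme, two runs of the same benchmark replay identical inputs, so differences in the reported cycles come from the functions being measured rather than from the input distribution.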
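A second sketch, for the statistics side: keeping `sum(x)` and `sum(x^2)` means per-sample estimators can be merged by simple addition, and the reported standard deviation is `sqrt(E[x^2] − E[x]^2)` with tiny negative variances from floating-point rounding clamped to zero. The `Estimator` struct below is an illustrative stand-in for the patch's `RefinableRuntimeEstimator`, written against the standard library rather than libc internals.

```cpp
// Standalone sketch of the running-statistics aggregation.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Estimator {
  uint32_t iterations = 0;
  uint64_t sum = 0;     // sum of observed cycle counts
  uint64_t sum_sq = 0;  // sum of squared cycle counts

  void update(uint64_t cycles) {
    iterations += 1;
    sum += cycles;
    sum_sq += cycles * cycles;
  }

  double mean() const { return iterations ? double(sum) / iterations : 0.0; }

  double stddev() const {
    if (!iterations)
      return 0.0;
    const double m = double(sum) / iterations;
    const double var = double(sum_sq) / iterations - m * m; // E[x^2] - E[x]^2
    return std::sqrt(var < 0.0 ? 0.0 : var);                // clamp rounding noise
  }
};

int main() {
  // Pretend per-call cycle counts for one sample batch.
  const std::vector<uint64_t> cycles = {53, 53, 53, 54, 52, 53};
  Estimator e;
  for (uint64_t c : cycles)
    e.update(c);
  std::printf("mean = %.2f cycles, stddev = %.2f cycles\n", e.mean(), e.stddev());
  return 0;
}
```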