[libc] Improve GPU benchmarking #153512
Conversation
Preliminary Results (NVIDIA GeForce RTX 4070 Laptop GPU)
[1/4] Running hermetic test libc.benchmarks.gpu.src.ctype.isalnum_benchmark
Running Suite: LlvmLibcIsAlNumGpuBenchmark
Benchmark | Cycles | Min | Max | Iterations | Time / Iteration | Stddev | Threads |
--------------------------------------------------------------------------------------------------------------
IsAlnum | 53 | 53 | 53 | 156 | 3 us | 0 | 64 |
IsAlnumSingleThread | 53 | 53 | 53 | 157 | 3 us | 0 | 1 |
IsAlnumSingleWave | 53 | 53 | 53 | 155 | 3 us | 0 | 32 |
IsAlnumCapital | 53 | 53 | 53 | 157 | 3 us | 0 | 64 |
IsAlnumNotAlnum | 43 | 43 | 43 | 163 | 3 us | 0 | 64 |
[2/4] Running hermetic test libc.benchmarks.gpu.src.ctype.isalpha_benchmark
Running Suite: LlvmLibcIsAlphaGpuBenchmark
Benchmark | Cycles | Min | Max | Iterations | Time / Iteration | Stddev | Threads |
--------------------------------------------------------------------------------------------------------------
IsAlpha | 53 | 53 | 53 | 156 | 3 us | 0 | 1 |
[3/4] Running hermetic test libc.benchmarks.gpu.src.math.sin_benchmark
Running Suite: LlvmLibcSinGpuBenchmark
Benchmark | Cycles | Min | Max | Iterations | Time / Iteration | Stddev | Threads |
--------------------------------------------------------------------------------------------------------------
Sin_1 | 3087 | 2946 | 3637 | 202 | 17 us | 159 | 32 |
Sin_128 | 362 | 354 | 372 | 26 | 64 us | 5 | 32 |
Sin_1024 | 352 | 348 | 358 | 23 | 405 us | 2 | 32 |
Sin_4096 | 359 | 358 | 361 | 7 | 1 ms | 1 | 32 |
SinTwoPi_1 | 2205 | 2186 | 2506 | 29 | 17 us | 56 | 32 |
SinTwoPi_128 | 262 | 259 | 267 | 10 | 52 us | 2 | 32 |
SinTwoPi_1024 | 271 | 271 | 275 | 16 | 319 us | 0 | 32 |
SinTwoPi_4096 | 280 | 280 | 281 | 9 | 1 ms | 0 | 32 |
SinTwoPow30_1 | 3104 | 3086 | 3174 | 28 | 18 us | 16 | 32 |
SinTwoPow30_128 | 348 | 345 | 352 | 9 | 60 us | 1 | 32 |
SinTwoPow30_1024 | 358 | 357 | 359 | 7 | 380 us | 0 | 32 |
SinTwoPow30_4096 | 366 | 366 | 367 | 6 | 1 ms | 0 | 32 |
SinVeryLarge_1 | 2827 | 2788 | 3069 | 29 | 17 us | 46 | 32 |
SinVeryLarge_128 | 316 | 313 | 318 | 14 | 57 us | 1 | 32 |
SinVeryLarge_1024 | 316 | 315 | 320 | 16 | 348 us | 1 | 32 |
SinVeryLarge_4096 | 324 | 323 | 325 | 15 | 1 ms | 0 | 32 |
NvSin_1 | 2507 | 2262 | 2890 | 39 | 15 us | 95 | 32 |
NvSin_128 | 1862 | 1858 | 1870 | 5 | 145 us | 4 | 32 |
NvSin_1024 | 2066 | 2066 | 2068 | 5 | 1 ms | 0 | 32 |
NvSin_4096 | 2085 | 2085 | 2085 | 4 | 4 ms | 0 | 32 |
NvSinTwoPi_1 | 1103 | 1102 | 1105 | 35 | 14 us | 0 | 32 |
NvSinTwoPi_128 | 925 | 925 | 927 | 7 | 82 us | 0 | 32 |
NvSinTwoPi_1024 | 1134 | 1134 | 1134 | 4 | 665 us | 0 | 32 |
NvSinTwoPi_4096 | 1153 | 1153 | 1153 | 4 | 2 ms | 0 | 32 |
NvSinTwoPow30_1 | 1103 | 1102 | 1104 | 35 | 14 us | 0 | 32 |
NvSinTwoPow30_128 | 925 | 925 | 925 | 7 | 82 us | 0 | 32 |
NvSinTwoPow30_1024 | 1134 | 1134 | 1134 | 4 | 668 us | 0 | 32 |
NvSinTwoPow30_4096 | 1153 | 1153 | 1153 | 4 | 2 ms | 0 | 32 |
NvSinVeryLarge_1 | 2493 | 2470 | 2795 | 38 | 15 us | 50 | 32 |
NvSinVeryLarge_128 | 1827 | 1827 | 1829 | 5 | 141 us | 0 | 32 |
NvSinVeryLarge_1024 | 2033 | 2033 | 2034 | 5 | 1 ms | 0 | 32 |
NvSinVeryLarge_4096 | 2050 | 2050 | 2050 | 4 | 4 ms | 0 | 32 |
Sinf_1 | 2190 | 1524 | 2396 | 527 | 14 us | 174 | 32 |
Sinf_128 | 239 | 229 | 247 | 26 | 40 us | 4 | 32 |
Sinf_1024 | 241 | 236 | 249 | 8 | 233 us | 3 | 32 |
Sinf_4096 | 259 | 258 | 261 | 8 | 905 us | 1 | 32 |
SinfTwoPi_1 | 1447 | 1430 | 1753 | 39 | 14 us | 49 | 32 |
SinfTwoPi_128 | 147 | 146 | 149 | 19 | 34 us | 0 | 32 |
SinfTwoPi_1024 | 146 | 145 | 148 | 13 | 183 us | 0 | 32 |
SinfTwoPi_4096 | 165 | 165 | 167 | 23 | 704 us | 0 | 32 |
SinfTwoPow30_1 | 1084 | 1078 | 1163 | 35 | 14 us | 13 | 32 |
SinfTwoPow30_128 | 102 | 101 | 104 | 32 | 32 us | 0 | 32 |
SinfTwoPow30_1024 | 102 | 102 | 103 | 25 | 164 us | 0 | 32 |
SinfTwoPow30_4096 | 121 | 121 | 123 | 17 | 645 us | 0 | 32 |
SinfVeryLarge_1 | 1930 | 1870 | 2268 | 34 | 15 us | 59 | 32 |
SinfVeryLarge_128 | 205 | 205 | 207 | 18 | 38 us | 0 | 32 |
SinfVeryLarge_1024 | 205 | 205 | 207 | 10 | 218 us | 0 | 32 |
SinfVeryLarge_4096 | 224 | 224 | 226 | 14 | 845 us | 0 | 32 |
NvSinf_1 | 1020 | 1016 | 1032 | 37 | 13 us | 5 | 32 |
NvSinf_128 | 786 | 786 | 788 | 7 | 76 us | 0 | 32 |
NvSinf_1024 | 974 | 969 | 976 | 17 | 588 us | 2 | 32 |
NvSinf_4096 | 1008 | 1008 | 1009 | 4 | 2 ms | 0 | 32 |
NvSinfTwoPi_1 | 164 | 162 | 505 | 145 | 13 us | 28 | 32 |
NvSinfTwoPi_128 | 141 | 141 | 143 | 15 | 33 us | 0 | 32 |
NvSinfTwoPi_1024 | 330 | 330 | 331 | 7 | 272 us | 0 | 32 |
NvSinfTwoPi_4096 | 364 | 364 | 365 | 6 | 1 ms | 0 | 32 |
NvSinfTwoPow30_1 | 1024 | 1016 | 1272 | 64 | 14 us | 31 | 32 |
NvSinfTwoPow30_128 | 776 | 776 | 776 | 7 | 73 us | 0 | 32 |
NvSinfTwoPow30_1024 | 968 | 966 | 969 | 7 | 504 us | 1 | 32 |
NvSinfTwoPow30_4096 | 1002 | 1002 | 1002 | 4 | 1 ms | 0 | 32 |
NvSinfVeryLarge_1 | 1003 | 1001 | 1026 | 39 | 13 us | 3 | 32 |
NvSinfVeryLarge_128 | 758 | 758 | 758 | 9 | 60 us | 0 | 32 |
NvSinfVeryLarge_1024 | 950 | 950 | 951 | 4 | 478 us | 0 | 32 |
NvSinfVeryLarge_4096 | 983 | 983 | 984 | 4 | 1 ms | 0 | 32 |
[4/4] Running hermetic test libc.benchmarks.gpu.src.math.atan2_benchmark
Running Suite: LlvmLibcAtan2GpuBenchmark
Benchmark | Cycles | Min | Max | Iterations | Time / Iteration | Stddev | Threads |
--------------------------------------------------------------------------------------------------------------
Atan2_1 | 4082 | 1894 | 5241 | 723 | 14 us | 953 | 32 |
Atan2_128 | 2520 | 2454 | 2580 | 21 | 165 us | 33 | 32 |
Atan2_1024 | 2745 | 2723 | 2768 | 11 | 1 ms | 13 | 32 |
Atan2_4096 | 2750 | 2739 | 2761 | 11 | 5 ms | 6 | 32 |
Atan2TwoPi_1 | 2749 | 2731 | 3160 | 36 | 14 us | 69 | 32 |
Atan2TwoPi_128 | 1072 | 1065 | 1097 | 10 | 82 us | 8 | 32 |
Atan2TwoPi_1024 | 1302 | 1301 | 1304 | 4 | 668 us | 1 | 32 |
Atan2TwoPi_4096 | 1303 | 1303 | 1303 | 4 | 2 ms | 0 | 32 |
Atan2TwoPow30_1 | 2744 | 2729 | 3177 | 39 | 13 us | 70 | 32 |
Atan2TwoPow30_128 | 1075 | 1069 | 1101 | 10 | 84 us | 8 | 32 |
Atan2TwoPow30_1024 | 1302 | 1302 | 1304 | 4 | 677 us | 0 | 32 |
Atan2TwoPow30_4096 | 1303 | 1303 | 1304 | 4 | 2 ms | 0 | 32 |
Atan2Large_1 | 3577 | 1125 | 3888 | 142 | 14 us | 361 | 32 |
Atan2Large_128 | 1810 | 1770 | 1841 | 12 | 124 us | 17 | 32 |
Atan2Large_1024 | 2053 | 2050 | 2057 | 5 | 973 us | 2 | 32 |
Atan2Large_4096 | 2051 | 2047 | 2054 | 8 | 3 ms | 2 | 32 |
NvAtan2_1 | 2911 | 2866 | 3324 | 56 | 14 us | 64 | 32 |
NvAtan2_128 | 2838 | 2834 | 2849 | 6 | 180 us | 5 | 32 |
NvAtan2_1024 | 3075 | 3075 | 3077 | 4 | 1 ms | 0 | 32 |
NvAtan2_4096 | 3076 | 3076 | 3076 | 4 | 5 ms | 0 | 32 |
NvAtan2TwoPi_1 | 2040 | 2032 | 2382 | 42 | 13 us | 53 | 32 |
NvAtan2TwoPi_128 | 1980 | 1979 | 1993 | 9 | 130 us | 4 | 32 |
NvAtan2TwoPi_1024 | 2219 | 2219 | 2219 | 4 | 1 ms | 0 | 32 |
NvAtan2TwoPi_4096 | 2219 | 2219 | 2219 | 4 | 4 ms | 0 | 32 |
NvAtan2TwoPow30_1 | 2035 | 2032 | 2183 | 38 | 13 us | 24 | 32 |
NvAtan2TwoPow30_128 | 1980 | 1979 | 1993 | 9 | 132 us | 4 | 32 |
NvAtan2TwoPow30_1024 | 2218 | 2218 | 2219 | 5 | 1 ms | 0 | 32 |
NvAtan2TwoPow30_4096 | 2219 | 2219 | 2219 | 4 | 4 ms | 0 | 32 |
NvAtan2Large_1 | 2039 | 2032 | 2356 | 41 | 13 us | 49 | 32 |
NvAtan2Large_128 | 1980 | 1979 | 1998 | 11 | 132 us | 5 | 32 |
NvAtan2Large_1024 | 2218 | 2218 | 2219 | 4 | 1 ms | 0 | 32 |
NvAtan2Large_4096 | 2219 | 2219 | 2220 | 4 | 4 ms | 0 | 32 |
Here's what I get on my AMD GPU.
@llvm/pr-subscribers-libc

Author: Leandro Lacerda (leandrolcampos)

Changes

This patch improves the GPU benchmarking in this way:
TODO (before merge)
Follow-ups (future PRs)
Patch is 34.73 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/153512.diff

13 Files Affected:
diff --git a/libc/benchmarks/gpu/CMakeLists.txt b/libc/benchmarks/gpu/CMakeLists.txt
index 6ec64bf270b53..ce3b0228c2076 100644
--- a/libc/benchmarks/gpu/CMakeLists.txt
+++ b/libc/benchmarks/gpu/CMakeLists.txt
@@ -22,8 +22,6 @@ function(add_benchmark benchmark_name)
${BENCHMARK_LINK_LIBRARIES}
DEPENDS
libc.src.stdio.printf
- libc.src.stdlib.srand
- libc.src.stdlib.rand
${BENCHMARK_DEPENDS}
${BENCHMARK_UNPARSED_ARGUMENTS}
COMPILE_OPTIONS
@@ -64,8 +62,6 @@ add_unittest_framework_library(
libc.src.__support.FPUtil.sqrt
libc.src.__support.fixedvector
libc.src.time.clock
- libc.src.stdlib.rand
- libc.src.stdlib.srand
libc.benchmarks.gpu.timing.timing
libc.src.stdio.printf
)
diff --git a/libc/benchmarks/gpu/LibcGpuBenchmark.cpp b/libc/benchmarks/gpu/LibcGpuBenchmark.cpp
index 57ff5b9fdb846..28a4ebfc6df19 100644
--- a/libc/benchmarks/gpu/LibcGpuBenchmark.cpp
+++ b/libc/benchmarks/gpu/LibcGpuBenchmark.cpp
@@ -1,4 +1,5 @@
#include "LibcGpuBenchmark.h"
+#include "hdr/stdint_proxy.h"
#include "src/__support/CPP/algorithm.h"
#include "src/__support/CPP/array.h"
#include "src/__support/CPP/atomic.h"
@@ -9,7 +10,6 @@
#include "src/__support/macros/config.h"
#include "src/__support/time/gpu/time_utils.h"
#include "src/stdio/printf.h"
-#include "src/stdlib/srand.h"
namespace LIBC_NAMESPACE_DECL {
namespace benchmarks {
@@ -139,10 +139,8 @@ void print_header() {
void Benchmark::run_benchmarks() {
uint64_t id = gpu::get_thread_id();
- if (id == 0) {
+ if (id == 0)
print_header();
- LIBC_NAMESPACE::srand(gpu::processor_clock());
- }
gpu::sync_threads();
@@ -163,70 +161,72 @@ void Benchmark::run_benchmarks() {
gpu::sync_threads();
}
-BenchmarkResult benchmark(const BenchmarkOptions &options,
- cpp::function<uint64_t(void)> wrapper_func) {
+BenchmarkResult
+benchmark(const BenchmarkOptions &options,
+ const cpp::function<uint64_t(uint32_t)> &wrapper_func) {
BenchmarkResult result;
RuntimeEstimationProgression rep;
- uint32_t total_iterations = 0;
uint32_t iterations = options.initial_iterations;
+
if (iterations < 1u)
iterations = 1;
uint32_t samples = 0;
uint64_t total_time = 0;
- uint64_t best_guess = 0;
- uint64_t cycles_squared = 0;
uint64_t min = UINT64_MAX;
uint64_t max = 0;
- uint64_t overhead = UINT64_MAX;
- int overhead_iterations = 10;
- for (int i = 0; i < overhead_iterations; i++)
- overhead = cpp::min(overhead, LIBC_NAMESPACE::overhead());
+ uint32_t call_index = 0;
for (int64_t time_budget = options.max_duration; time_budget >= 0;) {
- uint64_t sample_cycles = 0;
- const clock_t start = static_cast<double>(clock());
- for (uint32_t i = 0; i < iterations; i++) {
- auto wrapper_intermediate = wrapper_func();
- uint64_t current_result = wrapper_intermediate - overhead;
+ RefinableRuntimeEstimator sample_estimator;
+
+ const clock_t start = clock();
+ while (sample_estimator.get_iterations() < iterations) {
+ auto current_result = wrapper_func(call_index++);
max = cpp::max(max, current_result);
min = cpp::min(min, current_result);
- sample_cycles += current_result;
+ sample_estimator.update(current_result);
}
const clock_t end = clock();
+
const clock_t duration_ns =
((end - start) * 1000 * 1000 * 1000) / CLOCKS_PER_SEC;
total_time += duration_ns;
time_budget -= duration_ns;
samples++;
- cycles_squared += sample_cycles * sample_cycles;
- total_iterations += iterations;
- const double change_ratio =
- rep.compute_improvement({iterations, sample_cycles});
- best_guess = rep.current_estimation;
+ const double change_ratio = rep.compute_improvement(sample_estimator);
if (samples >= options.max_samples || iterations >= options.max_iterations)
break;
+
+ const auto total_iterations = rep.get_estimator().get_iterations();
+
if (total_time >= options.min_duration && samples >= options.min_samples &&
total_iterations >= options.min_iterations &&
change_ratio < options.epsilon)
break;
- iterations *= options.scaling_factor;
+ iterations = static_cast<uint32_t>(iterations * options.scaling_factor);
}
- result.cycles = best_guess;
- result.standard_deviation = fputil::sqrt<double>(
- static_cast<double>(cycles_squared) / total_iterations -
- static_cast<double>(best_guess * best_guess));
+
+ const auto &estimator = rep.get_estimator();
+ result.cycles = static_cast<uint64_t>(estimator.get_mean());
+ result.standard_deviation = estimator.get_stddev();
+
result.min = min;
result.max = max;
result.samples = samples;
- result.total_iterations = total_iterations;
- result.total_time = total_time / total_iterations;
+
+ result.total_iterations = estimator.get_iterations();
+ if (result.total_iterations > 0)
+ result.total_time = total_time / result.total_iterations;
+ else
+ result.total_time = 0;
+
return result;
-};
+}
} // namespace benchmarks
} // namespace LIBC_NAMESPACE_DECL
diff --git a/libc/benchmarks/gpu/LibcGpuBenchmark.h b/libc/benchmarks/gpu/LibcGpuBenchmark.h
index a6cf62dd30ce5..c4088d90f80fa 100644
--- a/libc/benchmarks/gpu/LibcGpuBenchmark.h
+++ b/libc/benchmarks/gpu/LibcGpuBenchmark.h
@@ -4,14 +4,15 @@
#include "benchmarks/gpu/BenchmarkLogger.h"
#include "benchmarks/gpu/timing/timing.h"
#include "hdr/stdint_proxy.h"
+#include "src/__support/CPP/algorithm.h"
#include "src/__support/CPP/array.h"
#include "src/__support/CPP/functional.h"
#include "src/__support/CPP/limits.h"
#include "src/__support/CPP/string_view.h"
#include "src/__support/CPP/type_traits.h"
#include "src/__support/FPUtil/FPBits.h"
+#include "src/__support/FPUtil/sqrt.h"
#include "src/__support/macros/config.h"
-#include "src/stdlib/rand.h"
#include "src/time/clock.h"
namespace LIBC_NAMESPACE_DECL {
@@ -30,40 +31,82 @@ struct BenchmarkOptions {
double scaling_factor = 1.4;
};
-struct Measurement {
+class RefinableRuntimeEstimator {
uint32_t iterations = 0;
- uint64_t elapsed_cycles = 0;
-};
-
-class RefinableRuntimeEstimation {
- uint64_t total_cycles = 0;
- uint32_t total_iterations = 0;
+ uint64_t sum_of_cycles = 0;
+ uint64_t sum_of_squared_cycles = 0;
public:
- uint64_t update(const Measurement &M) {
- total_cycles += M.elapsed_cycles;
- total_iterations += M.iterations;
- return total_cycles / total_iterations;
+ void update(uint64_t cycles) noexcept {
+ iterations += 1;
+ sum_of_cycles += cycles;
+ sum_of_squared_cycles += cycles * cycles;
+ }
+
+ void update(const RefinableRuntimeEstimator &other) noexcept {
+ iterations += other.iterations;
+ sum_of_cycles += other.sum_of_cycles;
+ sum_of_squared_cycles += other.sum_of_squared_cycles;
}
+
+ double get_mean() const noexcept {
+ if (iterations == 0)
+ return 0.0;
+
+ return static_cast<double>(sum_of_cycles) / iterations;
+ }
+
+ double get_variance() const noexcept {
+ if (iterations == 0)
+ return 0.0;
+
+ const double num = static_cast<double>(iterations);
+ const double sum_x = static_cast<double>(sum_of_cycles);
+ const double sum_x2 = static_cast<double>(sum_of_squared_cycles);
+
+ const double mean_of_squares = sum_x2 / num;
+ const double mean = sum_x / num;
+ const double mean_squared = mean * mean;
+ const double variance = mean_of_squares - mean_squared;
+
+ return variance < 0.0 ? 0.0 : variance;
+ }
+
+ double get_stddev() const noexcept {
+ return fputil::sqrt<double>(get_variance());
+ }
+
+ uint32_t get_iterations() const noexcept { return iterations; }
};
// Tracks the progression of the runtime estimation
class RuntimeEstimationProgression {
- RefinableRuntimeEstimation rre;
+ RefinableRuntimeEstimator estimator;
+ double current_mean = 0.0;
public:
- uint64_t current_estimation = 0;
+ const RefinableRuntimeEstimator &get_estimator() const noexcept {
+ return estimator;
+ }
- double compute_improvement(const Measurement &M) {
- const uint64_t new_estimation = rre.update(M);
- double ratio =
- (static_cast<double>(current_estimation) / new_estimation) - 1.0;
+ double
+ compute_improvement(const RefinableRuntimeEstimator &sample_estimator) {
+ if (sample_estimator.get_iterations() == 0)
+ return 1.0;
- // Get absolute value
+ estimator.update(sample_estimator);
+
+ const double new_mean = estimator.get_mean();
+ if (current_mean == 0.0 || new_mean == 0.0) {
+ current_mean = new_mean;
+ return 1.0;
+ }
+
+ double ratio = (current_mean / new_mean) - 1.0;
if (ratio < 0)
- ratio *= -1;
+ ratio = -ratio;
- current_estimation = new_estimation;
+ current_mean = new_mean;
return ratio;
}
};
@@ -78,17 +121,18 @@ struct BenchmarkResult {
clock_t total_time = 0;
};
-BenchmarkResult benchmark(const BenchmarkOptions &options,
- cpp::function<uint64_t(void)> wrapper_func);
+BenchmarkResult
+benchmark(const BenchmarkOptions &options,
+ const cpp::function<uint64_t(uint32_t)> &wrapper_func);
class Benchmark {
- const cpp::function<uint64_t(void)> func;
+ const cpp::function<uint64_t(uint32_t)> func;
const cpp::string_view suite_name;
const cpp::string_view test_name;
const uint32_t num_threads;
public:
- Benchmark(cpp::function<uint64_t(void)> func, char const *suite_name,
+ Benchmark(cpp::function<uint64_t(uint32_t)> func, char const *suite_name,
char const *test_name, uint32_t num_threads)
: func(func), suite_name(suite_name), test_name(test_name),
num_threads(num_threads) {
@@ -109,63 +153,135 @@ class Benchmark {
}
};
-// We want our random values to be approximately
-// Output: a random number with the exponent field between min_exp and max_exp,
-// i.e. 2^min_exp <= |real_value| < 2^(max_exp + 1),
-// Caveats:
-// -EXP_BIAS corresponding to denormal values,
-// EXP_BIAS + 1 corresponding to inf or nan.
+class RandomGenerator {
+ uint64_t state;
+
+ static LIBC_INLINE uint64_t splitmix64(uint64_t x) noexcept {
+ x += 0x9E3779B97F4A7C15ULL;
+ x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
+ x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
+ x = (x ^ (x >> 31));
+ return x ? x : 0x9E3779B97F4A7C15ULL;
+ }
+
+public:
+ explicit LIBC_INLINE RandomGenerator(uint64_t seed) noexcept
+ : state(splitmix64(seed)) {}
+
+ LIBC_INLINE uint64_t next64() noexcept {
+ uint64_t x = state;
+ x ^= x >> 12;
+ x ^= x << 25;
+ x ^= x >> 27;
+ state = x;
+ return x * 0x2545F4914F6CDD1DULL;
+ }
+
+ LIBC_INLINE uint32_t next32() noexcept {
+ return static_cast<uint32_t>(next64() >> 32);
+ }
+};
+
+// We want random floating-point values whose *unbiased* exponent e is
+// approximately uniform in [min_exp, max_exp]. That is,
+// 2^min_exp <= |value| < 2^(max_exp + 1).
+// Caveats / boundaries:
+// - e = -EXP_BIAS ==> subnormal range (biased exponent = 0). We ensure a
+// non-zero mantissa so we don't accidentally produce 0.
+// - e in [1 - EXP_BIAS, EXP_BIAS] ==> normal numbers.
+// - e = EXP_BIAS + 1 ==> Inf/NaN. We do not include it by default; max_exp
+// defaults to EXP_BIAS.
template <typename T>
static T
-get_rand_input(int max_exp = LIBC_NAMESPACE::fputil::FPBits<T>::EXP_BIAS,
- int min_exp = -LIBC_NAMESPACE::fputil::FPBits<T>::EXP_BIAS) {
+get_rand_input(RandomGenerator &rng,
+ int min_exp = -LIBC_NAMESPACE::fputil::FPBits<T>::EXP_BIAS,
+ int max_exp = LIBC_NAMESPACE::fputil::FPBits<T>::EXP_BIAS) {
using FPBits = LIBC_NAMESPACE::fputil::FPBits<T>;
-
- // Required to correctly instantiate FPBits for floats and doubles.
- using RandType = typename cpp::conditional_t<(cpp::is_same_v<T, double>),
- uint64_t, uint32_t>;
- RandType bits;
- if constexpr (cpp::is_same_v<T, uint64_t>)
- bits = (static_cast<uint64_t>(LIBC_NAMESPACE::rand()) << 32) |
- static_cast<uint64_t>(LIBC_NAMESPACE::rand());
- else
- bits = LIBC_NAMESPACE::rand();
- double scale =
- static_cast<double>(max_exp - min_exp + 1) / (2 * FPBits::EXP_BIAS + 1);
- FPBits fp(bits);
- fp.set_biased_exponent(
- static_cast<uint32_t>(fp.get_biased_exponent() * scale + min_exp));
- return fp.get_val();
+ using Storage = typename FPBits::StorageType;
+
+ // Sanitize and clamp requested range to what the format supports
+ if (min_exp > max_exp) {
+ auto tmp = min_exp;
+ min_exp = max_exp;
+ max_exp = tmp;
+ };
+ min_exp = cpp::max(min_exp, -FPBits::EXP_BIAS);
+ max_exp = cpp::min(max_exp, FPBits::EXP_BIAS);
+
+ // Sample unbiased exponent e uniformly in [min_exp, max_exp] without modulo
+ // bias
+ auto sample_in_range = [&](uint64_t r) -> int32_t {
+ const uint64_t range = static_cast<uint64_t>(
+ static_cast<int64_t>(max_exp) - static_cast<int64_t>(min_exp) + 1);
+ const uint64_t threshold = (-range) % range;
+ while (r < threshold)
+ r = rng.next64();
+ return static_cast<int32_t>(min_exp + static_cast<int64_t>(r % range));
+ };
+ const int32_t e = sample_in_range(rng.next64());
+
+ // Start from random bits to get random sign and mantissa
+ FPBits xbits([&] {
+ if constexpr (cpp::is_same_v<T, double>)
+ return FPBits(rng.next64());
+ else
+ return FPBits(rng.next32());
+ }());
+
+ if (e == -FPBits::EXP_BIAS) {
+ // Subnormal: biased exponent must be 0; ensure mantissa != 0 to avoid 0
+ xbits.set_biased_exponent(Storage(0));
+ if (xbits.get_mantissa() == Storage(0))
+ xbits.set_mantissa(Storage(1));
+ } else {
+ // Normal: biased exponent in [1, 2 * FPBits::EXP_BIAS]
+ const int32_t biased = e + FPBits::EXP_BIAS;
+ xbits.set_biased_exponent(static_cast<Storage>(biased));
+ }
+ return xbits.get_val();
}
template <typename T> class MathPerf {
- using FPBits = fputil::FPBits<T>;
- using StorageType = typename FPBits::StorageType;
- static constexpr StorageType UIntMax =
- cpp::numeric_limits<StorageType>::max();
+ static LIBC_INLINE uint64_t make_seed(uint64_t base_seed, uint64_t salt) {
+ const uint64_t tid = gpu::get_thread_id();
+ return base_seed ^ (salt << 32) ^ (tid * 0x9E3779B97F4A7C15ULL);
+ }
public:
+ // Returns cycles-per-call (lower is better)
template <size_t N = 1>
- static uint64_t run_throughput_in_range(T f(T), int min_exp, int max_exp) {
+ static uint64_t run_throughput_in_range(T f(T), int min_exp, int max_exp,
+ uint32_t call_index) {
cpp::array<T, N> inputs;
+
+ uint64_t base_seed = static_cast<uint64_t>(call_index);
+ uint64_t salt = static_cast<uint64_t>(N);
+ RandomGenerator rng(make_seed(base_seed, salt));
+
for (size_t i = 0; i < N; ++i)
- inputs[i] = get_rand_input<T>(min_exp, max_exp);
+ inputs[i] = get_rand_input<T>(rng, min_exp, max_exp);
uint64_t total_time = LIBC_NAMESPACE::throughput(f, inputs);
return total_time / N;
}
- // Throughput benchmarking for functions that take 2 inputs.
+ // Returns cycles-per-call (lower is better)
template <size_t N = 1>
static uint64_t run_throughput_in_range(T f(T, T), int arg1_min_exp,
int arg1_max_exp, int arg2_min_exp,
- int arg2_max_exp) {
+ int arg2_max_exp,
+ uint32_t call_index) {
cpp::array<T, N> inputs1;
cpp::array<T, N> inputs2;
+
+ uint64_t base_seed = static_cast<uint64_t>(call_index);
+ uint64_t salt = static_cast<uint64_t>(N);
+ RandomGenerator rng(make_seed(base_seed, salt));
+
for (size_t i = 0; i < N; ++i) {
- inputs1[i] = get_rand_input<T>(arg1_min_exp, arg1_max_exp);
- inputs2[i] = get_rand_input<T>(arg2_min_exp, arg2_max_exp);
+ inputs1[i] = get_rand_input<T>(rng, arg1_min_exp, arg1_max_exp);
+ inputs2[i] = get_rand_input<T>(rng, arg2_min_exp, arg2_max_exp);
}
uint64_t total_time = LIBC_NAMESPACE::throughput(f, inputs1, inputs2);
@@ -193,4 +309,5 @@ template <typename T> class MathPerf {
#define SINGLE_WAVE_BENCHMARK(SuiteName, TestName, Func) \
BENCHMARK_N_THREADS(SuiteName, TestName, Func, \
LIBC_NAMESPACE::gpu::get_lane_size())
-#endif
+
+#endif // LLVM_LIBC_BENCHMARKS_LIBC_GPU_BENCHMARK_H
diff --git a/libc/benchmarks/gpu/src/ctype/CMakeLists.txt b/libc/benchmarks/gpu/src/ctype/CMakeLists.txt
index f277624dbb901..77e2bbe538b1f 100644
--- a/libc/benchmarks/gpu/src/ctype/CMakeLists.txt
+++ b/libc/benchmarks/gpu/src/ctype/CMakeLists.txt
@@ -7,6 +7,7 @@ add_benchmark(
SRCS
isalnum_benchmark.cpp
DEPENDS
+ libc.hdr.stdint_proxy
libc.src.ctype.isalnum
LOADER_ARGS
--threads 64
@@ -19,5 +20,6 @@ add_benchmark(
SRCS
isalpha_benchmark.cpp
DEPENDS
+ libc.hdr.stdint_proxy
libc.src.ctype.isalpha
)
diff --git a/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp b/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp
index ffa5a99860bfc..28b1ee52c8dfa 100644
--- a/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp
+++ b/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp
@@ -1,8 +1,9 @@
#include "benchmarks/gpu/LibcGpuBenchmark.h"
+#include "hdr/stdint_proxy.h"
#include "src/ctype/isalnum.h"
-uint64_t BM_IsAlnum() {
+uint64_t BM_IsAlnum(uint32_t /*call_index*/) {
char x = 'c';
return LIBC_NAMESPACE::latency(LIBC_NAMESPACE::isalnum, x);
}
@@ -12,13 +13,13 @@ SINGLE_THREADED_BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnumSingleThread,
SINGLE_WAVE_BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnumSingleWave,
BM_IsAlnum);
-uint64_t BM_IsAlnumCapital() {
+uint64_t BM_IsAlnumCapital(uint32_t /*call_index*/) {
char x = 'A';
return LIBC_NAMESPACE::latency(LIBC_NAMESPACE::isalnum, x);
}
BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnumCapital, BM_IsAlnumCapital);
-uint64_t BM_IsAlnumNotAlnum() {
+uint64_t BM_IsAlnumNotAlnum(uint32_t /*call_index*/) {
char x = '{';
return LIBC_NAMESPACE::latency(LIBC_NAMESPACE::isalnum, x);
}
diff --git a/libc/benchmarks/gpu/src/ctype/isalpha_benchmark.cpp b/libc/benchmarks/gpu/src/ctype/isalpha_benchmark.cpp
index 2038eb89bc77b..bff4edea8b690 100644
--- a/libc/benchmarks/gpu/src/ctype/isalpha_benchmark.cpp
+++ b/libc/benchmarks/gpu/src/ctype/isalpha_benchmark.cpp
@@ -1,8 +1,9 @@
#include "benchmarks/gpu/LibcGpuBenchmark.h"
+#include "hdr/stdint_proxy.h"
#include "src/ctype/isalpha.h"
-uint64_t BM_IsAlpha() {
+uint64_t BM_IsAlpha(uint32_t /*call_index*/) {
char x = 'c';
return LIBC_NAMESPACE::latency(LIBC_NAMESPACE::isalpha, x);
}
diff --git a/libc/benchmarks/gpu/src/math/CMakeLists.txt b/libc/benchmarks/gpu/src/math/CMakeLists.txt
index 7a12ce4e61c9e..8417f23c124a0 100644
--- a/libc/benchmarks/gpu/src/math/CMakeLists.txt
+++ b/libc/benchmarks/gpu/src/math/CMakeLists.txt
@@ -34,11 +34,6 @@ add_benchmark(
libc.hdr.stdint_proxy
libc.src.math.sin
libc.src.math.sinf
- libc.src.stdlib.srand
- libc.src.stdlib.rand
- libc.src.__support.FPUtil.fp_bits
- libc.src.__support.CPP.bit
- libc.src.__support.CPP.array
COMPILE_OPTIONS
${math_benchmark_flags}
LOADER_ARGS
@@ -54,11 +49,6 @@ add_benchmark(
DEPENDS
libc.hdr.stdint_proxy
libc.src.math.atan2
- libc.src.stdlib.srand
- libc.src.stdlib.rand
- libc.src.__support.FPUtil.fp_bits
- libc.src.__support.CPP.bit
- libc.src.__support.CPP.array
COMPILE_OPTIONS
${math_benchmark_flags}
LOADER_ARGS
diff --git a/libc/benchmarks/gpu/src/math/atan2_benchmark.cpp b/libc/benchmarks/gpu/src/math/atan2_benchmark.cpp
index 1f91a9a35c373..82bb0c5d7de49 100644
--- a/libc/benchmarks/gpu/src/math/atan2_benchmark.cpp
+++ b/libc/benchmarks/gpu/src/math/atan2_benchmark.cpp
@@ -1,27 +1,27 @@
#include "benchmarks/gpu/LibcGpuBenchmark.h"
+#include "hdr/stdint_proxy.h"
#include "src/math/atan2.h"
-#include "src/stdlib/rand.h"
#if defined(NVPTX_MATH_FOUND) || defined(AMDGPU_MATH_FOUND)
#include "platform.h"
#endif
-#define BM_TWO_RANDOM_INPUT(T, Func, MIN_EXP, MAX_EXP, N) ...
[truncated]
@llvm/pr-subscribers-backend-amdgpu
LG, thanks for fixing this!
This patch improves the GPU benchmarking in this way:

- Replaces `rand`/`srand` with a deterministic per-thread RNG seeded by `call_index`: reproducible, apples-to-apples libc vs vendor comparisons (see the first sketch below).
- Samples input exponents uniformly in `[min_exp, max_exp]`, clamps the bounds, and skips `Inf`, `NaN`, `-0.0`, and `+0.0`.
- Tracks the mean and standard deviation (`sqrt(E[x^2] − E[x]^2)`) across samples (see the second sketch below).
- `benchmark()` gets cycles-per-call already corrected (no `overhead()` call).
- Updates the benchmark wrappers (take `call_index`, drop `rand/srand`, clean includes).
- Reports `Cycles (Mean)` and `Stddev`.
- Drops the `Time / Iteration` column from the results table: it reported per-thread convergence time (not per-call latency) and was redundant/misleading next to `Cycles (Mean)`.
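For readers who want to poke at the seeding scheme outside the libc tree, here is a minimal, standalone C++ sketch of what the patch describes: splitmix64 scrambles the seed, xorshift64* produces the stream, and the exponent is drawn by rejection sampling to avoid modulo bias. The free functions `make_seed` and `sample_exponent` and the explicit `thread_id` parameter are illustrative stand-ins for the patch's `MathPerf::make_seed` (which calls `gpu::get_thread_id()` internally) and the `sample_in_range` lambda in `get_rand_input`; this is not the libc-internal code.

```cpp
// Standalone sketch (plain C++17) of the deterministic per-call RNG scheme.
#include <cstdint>
#include <cstdio>

struct RandomGenerator {
  uint64_t state;

  // splitmix64: scrambles an arbitrary seed into a well-mixed non-zero state.
  static uint64_t splitmix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    x ^= x >> 31;
    return x ? x : 0x9E3779B97F4A7C15ULL; // never seed xorshift with 0
  }

  explicit RandomGenerator(uint64_t seed) : state(splitmix64(seed)) {}

  // xorshift64*: cheap, stateful 64-bit generator.
  uint64_t next64() {
    uint64_t x = state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    state = x;
    return x * 0x2545F4914F6CDD1DULL;
  }
};

// Mirrors the idea of MathPerf::make_seed: mixing call_index, the batch size
// N, and the thread id gives each (thread, call) pair a distinct but
// reproducible input sequence.
uint64_t make_seed(uint64_t call_index, uint64_t n, uint64_t thread_id) {
  return call_index ^ (n << 32) ^ (thread_id * 0x9E3779B97F4A7C15ULL);
}

// Sample an integer uniformly in [min_exp, max_exp] (assumes min_exp <=
// max_exp) without modulo bias, using the same rejection threshold as the
// patch: (-range) % range.
int32_t sample_exponent(RandomGenerator &rng, int32_t min_exp, int32_t max_exp) {
  const uint64_t range =
      static_cast<uint64_t>(int64_t(max_exp) - int64_t(min_exp) + 1);
  const uint64_t threshold = (-range) % range;
  uint64_t r = rng.next64();
  while (r < threshold)
    r = rng.next64();
  return static_cast<int32_t>(min_exp + int64_t(r % range));
}

int main() {
  // Two generators built from the same (call_index, N, tid) triple agree.
  RandomGenerator a(make_seed(/*call_index=*/7, /*N=*/128, /*thread_id=*/3));
  RandomGenerator b(make_seed(7, 128, 3));
  std::printf("reproducible: %d\n", a.next64() == b.next64());
  std::printf("exponent in [-5, 5]: %d\n", sample_exponent(a, -5, 5));
  return 0;
}
```

Under this scheme, two runs of the same benchmark replay identical inputs, so differences in the reported cycles come from the functions being measured rather than from the input distribution.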
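A second sketch, for the statistics side: keeping `sum(x)` and `sum(x^2)` means per-sample estimators can be merged by simple addition, and the reported standard deviation is `sqrt(E[x^2] − E[x]^2)` with tiny negative variances from floating-point rounding clamped to zero. The `Estimator` struct below is an illustrative stand-in for the patch's `RefinableRuntimeEstimator`, written against the standard library rather than libc internals.

```cpp
// Standalone sketch of the running-statistics aggregation.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Estimator {
  uint32_t iterations = 0;
  uint64_t sum = 0;     // sum of observed cycle counts
  uint64_t sum_sq = 0;  // sum of squared cycle counts

  void update(uint64_t cycles) {
    iterations += 1;
    sum += cycles;
    sum_sq += cycles * cycles;
  }

  double mean() const { return iterations ? double(sum) / iterations : 0.0; }

  double stddev() const {
    if (!iterations)
      return 0.0;
    const double m = double(sum) / iterations;
    const double var = double(sum_sq) / iterations - m * m; // E[x^2] - E[x]^2
    return std::sqrt(var < 0.0 ? 0.0 : var);                // clamp rounding noise
  }
};

int main() {
  // Pretend per-call cycle counts for one sample batch.
  const std::vector<uint64_t> cycles = {53, 53, 53, 54, 52, 53};
  Estimator e;
  for (uint64_t c : cycles)
    e.update(c);
  std::printf("mean = %.2f cycles, stddev = %.2f cycles\n", e.mean(), e.stddev());
  return 0;
}
```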