Commit dc6e246

Docs: Put build notes in the end

1 parent 33c6b59 commit dc6e246

3 files changed: +100 −79 lines changed

.vscode/settings.json

Lines changed: 2 additions & 1 deletion

```diff
@@ -137,6 +137,7 @@
     "utility": "cpp",
     "valarray": "cpp",
     "variant": "cpp",
-    "vector": "cpp"
+    "vector": "cpp",
+    "source_location": "cpp"
   }
 }
```

README.md

Lines changed: 88 additions & 75 deletions
```diff
@@ -1,4 +1,6 @@
-# Parallel Reductions Benchmark for CPUs & GPUs
+# Parallel Reductions Benchmark
+
+__For CPUs and GPUs in C++, CUDA, and Rust__
 
 ![Parallel Reductions Benchmark](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/ParallelReductionsBenchmark.jpg?raw=true)
 
```

```diff
@@ -10,14 +12,16 @@ This repository contains several educational examples showcasing the performance
 - Single-threaded but SIMD-accelerated code:
   - SSE, AVX, AVX-512 on x86.
   - NEON and SVE on Arm.
-- OpenMP `reduction` clause.
+- OpenMP `reduction` clause vs manual `omp parallel` scheduling.
 - Thrust with its `thrust::reduce`.
 - CUB with its `cub::DeviceReduce::Sum`.
 - CUDA kernels with and w/out [warp-primitives](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/).
 - CUDA kernels with [Tensor-Core](https://www.nvidia.com/en-gb/data-center/tensor-cores/) acceleration.
 - [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) and cuBLAS strided vector and matrix routines.
 - OpenCL kernels, eight of them.
 - Parallel STL `<algorithm>` in GCC with Intel oneTBB.
+- Reusable thread-pool libraries for C++, like [Taskflow](https://github.com/taskflow/taskflow).
+- Reusable thread-pool libraries for Rust, like [Rayon](https://github.com/rayon-rs/rayon) and [Tokio](https://github.com/tokio-rs/tokio).
 
 Notably:
```

````diff
@@ -35,79 +39,6 @@ Previously, it also included ArrayFire, Halide, and Vulkan queues for SPIR-V ker
 - [CppRussia Talk](https://youtu.be/AA4RI6o0h1U) in Russia in 2019.
 - [JetBrains Talk](https://youtu.be/BUtHOftDm_Y) in Germany & Russia in 2019.
 
-## Build & Run
-
-### C++
-
-This repository is a CMake project designed to be built on Linux with GCC, Clang, or NVCC.
-You may need to install the following dependencies for complete functionality:
-
-```sh
-sudo apt install libblas-dev # For OpenBLAS on Linux
-sudo apt install libnuma1 libnuma-dev # For NUMA allocators on Linux
-sudo apt install cuda-toolkit # This may not be as easy 😈
-```
-
-The following script will, by default, generate a 1GB array of numbers and reduce them using every available backend.
-All the classical Google Benchmark arguments are supported, including `--benchmark_filter=opencl`.
-All the library dependencies, including GTest, GBench, Intel oneTBB, FMT, and Thrust with CUB, will be automatically fetched.
-You are expected to build this on an x86 machine with CUDA drivers installed.
-
-```sh
-cmake -B build_release -D CMAKE_BUILD_TYPE=Release # Generate the build files
-cmake --build build_release --config Release -j # Build the project
-build_release/reduce_bench # Run all benchmarks
-build_release/reduce_bench --benchmark_filter="cuda" # Only CUDA-related
-PARALLEL_REDUCTIONS_LENGTH=1024 build_release/reduce_bench # Set a different input size
-```
-
-Need a more fine-grained control to run only CUDA-based backends?
-
-```sh
-cmake -D CMAKE_CUDA_COMPILER=nvcc -D CMAKE_C_COMPILER=gcc-12 -D CMAKE_CXX_COMPILER=g++-12 -B build_release
-cmake --build build_release --config Release -j
-build_release/reduce_bench --benchmark_filter=cuda
-```
-
-Want to use the non-default Clang distribution on macOS?
-OpenBLAS will be superseded by Apple's `Accelerate.framework`, but LLVM and OpenMP should ideally be pulled from Homebrew:
-
-```sh
-brew install llvm libomp
-cmake -B build_release \
-  -D CMAKE_CXX_COMPILER=$(brew --prefix llvm)/bin/clang++ \
-  -D OpenMP_ROOT=$(brew --prefix llvm) \
-  -D CMAKE_BUILD_RPATH=$(brew --prefix llvm)/lib \
-  -D CMAKE_INSTALL_RPATH=$(brew --prefix llvm)/lib
-cmake --build build_release --config Release -j
-build_release/reduce_bench
-```
-
-To debug or introspect, the procedure is similar:
-
-```sh
-cmake -D CMAKE_CUDA_COMPILER=nvcc -D CMAKE_C_COMPILER=gcc -D CMAKE_CXX_COMPILER=g++ -D CMAKE_BUILD_TYPE=Debug -B build_debug
-cmake --build build_debug --config Debug
-```
-
-And then run your favorite debugger.
-
-Optional backends:
-
-- To enable [Intel OpenCL](https://github.com/intel/compute-runtime/blob/master/README.md) on CPUs: `apt-get install intel-opencl-icd`.
-- To run on integrated Intel GPU, follow [this guide](https://www.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/prerequisites.html).
-
-### Rust
-
-Several basic kernels and CPU-oriented parallel reductions are also implemented in Rust.
-To build and run the Rust code, you need to have the Rust toolchain installed. You can use `rustup` to install it:
-
-```sh
-rustup toolchain install nightly
-cargo +nightly test --release
-cargo +nightly bench
-```
-
 ## Results
 
 Different hardware would yield different results, but the general trends and observations are:
````
````diff
@@ -383,3 +314,85 @@ test rayon ... bench: 42,649 ns/iter (+/- 4,220)
 test tokio ... bench: 83,644 ns/iter (+/- 3,684)
 test smol ... bench: 3,346 ns/iter (+/- 86)
 ```
+
+## Build & Run
+
+### Rust
+
+Several basic kernels and CPU-oriented parallel reductions are also implemented in Rust.
+To build and run the Rust code, you need the Rust toolchain installed. You can use `rustup` to install it:
+
+```sh
+rustup toolchain install nightly
+cargo +nightly test --release
+cargo +nightly bench
+```
+
+### C++
+
+This repository is a CMake project designed to be built on Linux with GCC, Clang, or NVCC.
+You may need to install the following dependencies for complete functionality:
+
+```sh
+sudo apt install libblas-dev # For OpenBLAS on Linux
+sudo apt install libnuma1 libnuma-dev # For NUMA allocators on Linux
+sudo apt install cuda-toolkit # This may not be as easy 😈
+```
+
+The following commands will, by default, generate a 1GB array of numbers and reduce them using every available backend.
+All the classical Google Benchmark arguments are supported, including `--benchmark_filter=opencl`.
+All the library dependencies, including GTest, GBench, Intel oneTBB, FMT, and Thrust with CUB, will be fetched automatically.
+You are expected to build this on an x86 machine with CUDA drivers installed.
+
+```sh
+cmake -B build_release -D CMAKE_BUILD_TYPE=Release # Generate the build files
+cmake --build build_release --config Release -j # Build the project
+build_release/reduce_bench # Run all benchmarks
+build_release/reduce_bench --benchmark_filter="cuda" # Only CUDA-related
+PARALLEL_REDUCTIONS_LENGTH=1024 build_release/reduce_bench # Set a different input size
+```
+
+Need more fine-grained control, to run only CUDA-based backends?
+
+```sh
+cmake -D CMAKE_CUDA_COMPILER=nvcc -D CMAKE_C_COMPILER=gcc-12 -D CMAKE_CXX_COMPILER=g++-12 -B build_release
+cmake --build build_release --config Release -j
+build_release/reduce_bench --benchmark_filter=cuda
+```
+
+Need the opposite, to build & run only CPU-based backends on a CUDA-capable machine?
+
+```sh
+cmake -D USE_INTEL_TBB=1 -D USE_NVIDIA_CCCL=0 -B build_release
+cmake --build build_release --config Release -j
+build_release/reduce_bench --benchmark_filter=unrolled
+```
+
+Want to use a non-default Clang distribution on macOS?
+OpenBLAS will be superseded by Apple's `Accelerate.framework`, but LLVM and OpenMP should ideally be pulled from Homebrew:
+
+```sh
+brew install llvm libomp
+cmake -B build_release \
+  -D CMAKE_CXX_COMPILER=$(brew --prefix llvm)/bin/clang++ \
+  -D OpenMP_ROOT=$(brew --prefix llvm) \
+  -D CMAKE_BUILD_RPATH=$(brew --prefix llvm)/lib \
+  -D CMAKE_INSTALL_RPATH=$(brew --prefix llvm)/lib
+cmake --build build_release --config Release -j
+build_release/reduce_bench
+```
+
+To debug or introspect, the procedure is similar:
+
+```sh
+cmake -D CMAKE_CUDA_COMPILER=nvcc -D CMAKE_C_COMPILER=gcc -D CMAKE_CXX_COMPILER=g++ -D CMAKE_BUILD_TYPE=Debug -B build_debug
+cmake --build build_debug --config Debug
+```
+
+And then run your favorite debugger.
+
+Optional backends:
+
+- To enable [Intel OpenCL](https://github.com/intel/compute-runtime/blob/master/README.md) on CPUs: `apt-get install intel-opencl-icd`.
+- To run on an integrated Intel GPU, follow [this guide](https://www.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/prerequisites.html).
+
````

reduce_bench.cpp

Lines changed: 10 additions & 3 deletions
```diff
@@ -327,11 +327,14 @@ int main(int argc, char **argv) {
         tgt.language_version);
 #endif // defined(__OPENCL__)
 
-    // Memset is only useful as a baseline, but running it will corrupt our buffer
-    // register_("memset", memset_t {}, dataset);
-    // register_("memset/std::threads", threads_gt<memset_t> {}, dataset);
+    // ? Memset is only useful as a baseline, but running it will corrupt our buffer
+    // ? register_("memset", memset_t {}, dataset);
+    // ? register_("memset/std::threads", threads_gt<memset_t> {}, dataset);
 
     // Generic CPU benchmarks
+#if defined(_OPENMP)
+    register_("serial/f32/openmp", openmp_t {}, dataset);
+#endif // defined(_OPENMP)
     register_("unrolled/f32", unrolled_gt<float> {}, dataset);
     register_("unrolled/f64", unrolled_gt<double> {}, dataset);
     register_("std::accumulate/f32", stl_accumulate_gt<float> {}, dataset);
@@ -340,6 +343,10 @@ int main(int argc, char **argv) {
     register_("unrolled/f32/tf::taskflow", taskflow_gt<unrolled_gt<float>> {}, dataset);
     register_("unrolled/f64/av::fork_union", fork_union_gt<unrolled_gt<double>> {}, dataset);
     register_("unrolled/f64/tf::taskflow", taskflow_gt<unrolled_gt<double>> {}, dataset);
+#if defined(USE_INTEL_TBB)
+    register_("unrolled/f32/oneapi::tbb", tbb_gt<unrolled_gt<float>> {}, dataset);
+    register_("unrolled/f64/oneapi::tbb", tbb_gt<unrolled_gt<double>> {}, dataset);
+#endif // defined(USE_INTEL_TBB)
 
     // ! BLAS struggles with zero-strided arguments!
     // ! register_("blas/f32", blas_dot_t {}, dataset);
```
