Commit c340f4c

Docs: List notable features
1 parent: 37221a7

3 files changed: +40 additions, −12 deletions

.vscode/settings.json

Lines changed: 2 additions & 0 deletions
@@ -12,13 +12,15 @@
     "HUGEPAGE",
     "HUGETLB",
     "Kahan",
+    "Lookaside",
     "METALLIB",
     "NUMA",
     "opencl",
     "openmp",
     "shfl",
     "SPIR",
     "STREQUAL",
+    "strided",
     "threadgroup",
     "unseq",
     "Vardanian",

README.md

Lines changed: 15 additions & 5 deletions
@@ -5,21 +5,31 @@
 One of the canonical examples when designing parallel algorithms is implementing parallel tree-like reductions, which is a special case of accumulating a bunch of numbers located in a continuous block of memory.
 In modern C++, most developers would call `std::accumulate(array.begin(), array.end(), 0)`, and in Python, it's just a `sum(array)`.
 Implementing those operations with high utilization in many-core systems is surprisingly non-trivial and depends heavily on the hardware architecture.
-Moreover, on arrays with billions of elements, the default `float` error mounts, and the results become inaccurate unless a [Kahan-like scheme](https://en.wikipedia.org/wiki/Kahan_summation_algorithm) is used.
-
 This repository contains several educational examples showcasing the performance differences between different solutions:
 
 - Single-threaded but SIMD-accelerated code:
   - SSE, AVX, AVX-512 on x86.
   - 🔜 NEON and SVE on Arm.
 - OpenMP `reduction` clause.
 - Thrust with its `thrust::reduce`.
-- CUDA kernels with and w/out warp-reductions.
+- CUB with its `cub::DeviceReduce::Sum`.
+- CUDA kernels with and w/out [warp-primitives](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/).
+- CUDA kernels with [Tensor-Core](https://www.nvidia.com/en-gb/data-center/tensor-cores/) acceleration.
+- [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) and cuBLAS strided vector and matrix routines.
 - OpenCL kernels, eight of them.
 - Parallel STL `<algorithm>` in GCC with Intel oneTBB.
 
-Previously, it also compared ArrayFire, Halide, and Vulkan queues for SPIR-V kernels and SyCL.
-Examples were collected from early 2010s until 2019 and later updated in 2022.
+Notably:
+
+- On arrays with billions of elements, the default `float` error mounts, and the results become inaccurate unless a [Kahan-like scheme](https://en.wikipedia.org/wiki/Kahan_summation_algorithm) is used.
+- To minimize the overhead of [Translation Lookaside Buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) __(TLB)__ misses, the arrays are aligned to the OS page size and are allocated in [huge pages on Linux](https://wiki.debian.org/Hugepages), if possible.
+- To reduce memory access latency on many-core [Non-Uniform Memory Access](https://en.wikipedia.org/wiki/Non-uniform_memory_access) __(NUMA)__ systems, `libnuma` and `pthread` help maximize data affinity.
+- To "hide" latency on wide CPU registers (like `ZMM`), expensive Assembly instructions executed on different [CPU ports](https://easyperf.net/blog/2018/03/21/port-contention#utilizing-full-capacity-of-the-load-instructions) are interleaved.
+
+---
+
+The examples in this repository were originally written in the early 2010s and were updated in 2019, 2022, and 2025.
+Previously, the repository also included ArrayFire, Halide, and Vulkan queues for SPIR-V kernels and SyCL.
 
 - [Lecture Slides](https://drive.google.com/file/d/16AicAl99t3ZZFnza04Wnw_Vuem0w8lc7/view?usp=sharing) from 2019.
 - [CppRussia Talk](https://youtu.be/AA4RI6o0h1U) in Russia in 2019.

reduce_opencl.hpp

Lines changed: 23 additions & 7 deletions
@@ -21,6 +21,10 @@
 
 namespace ashvardanian::reduce {
 
+/**
+ * @brief OpenCL target device information, including its name, driver version,
+ * the number of compute units, and the unique device ID.
+ */
 struct opencl_target_t {
     std::string device_name;
     std::string device_version;
@@ -38,6 +42,12 @@ static int const opencl_max_threads = 12000;
 std::vector<opencl_target_t> opencl_targets();
 char const *opencl_error_name(cl_int) noexcept;
 
+/**
+ * @brief OpenCL kernel wrapper for parallel reductions.
+ *
+ * ! The kernels are loaded from a file and compiled at runtime, so the working
+ * ! directory must match the executable's directory.
+ */
 struct opencl_t {
 
     static constexpr std::size_t kernel_variants_k = 8;
@@ -46,9 +56,9 @@ struct opencl_t {
         "reduce_bi_step", "reduce_unrolled", "reduce_unrolled_fully", "reduce_w_brents_theorem",
     };
 
-    std::size_t const count_items = 0;
-    std::size_t const count_threads = 0;
-    std::size_t const items_per_group = 0;
+    std::size_t count_items = 0;
+    std::size_t count_threads = 0;
+    std::size_t items_per_group = 0;
 
   private:
     cl_context context = NULL;
@@ -68,8 +78,10 @@ struct opencl_t {
     std::vector<float> returned_outputs;
 
   public:
-    opencl_t(float const *b, float const *e, opencl_target_t target, std::size_t items_per_group = 1024,
-             char const *kernel_name_cstr = kernels_k[0])
+    opencl_t() = default;
+    opencl_t( //
+        float const *b, float const *e, opencl_target_t target, std::size_t items_per_group = 1024,
+        char const *kernel_name_cstr = kernels_k[0])
         : count_items(e - b), count_threads((opencl_max_threads / items_per_group) * items_per_group),
           items_per_group(items_per_group) {
         // Load the kernel source code into the array source_str
@@ -139,7 +151,7 @@ struct opencl_t {
         if (status != 0) throw std::logic_error(opencl_error_name(status));
     }
 
-    ~opencl_t() {
+    ~opencl_t() noexcept {
         cl_int status = 0;
         status = clFlush(queue);
         status = clFinish(queue);
@@ -157,7 +169,7 @@ struct opencl_t {
         (void)status;
     }
 
-    float operator()() {
+    float operator()() const {
         cl_int status = 0;
         std::size_t global_ws_offset = 0;
         status = clEnqueueNDRangeKernel( //
@@ -177,6 +189,10 @@ struct opencl_t {
     }
 };
 
+/**
+ * @brief Returns a list of OpenCL target devices.
+ * @return Array of `opencl_target_t` objects.
+ */
 std::vector<opencl_target_t> opencl_targets() {
 
     std::vector<opencl_target_t> result;
