One of the canonical examples when designing parallel algorithms is implementing parallel tree-like reductions, which is a special case of accumulating a bunch of numbers located in a contiguous block of memory.
In modern C++, most developers would call `std::accumulate(array.begin(), array.end(), 0)`, and in Python, it's just a `sum(array)`.
Implementing those operations with high utilization in many-core systems is surprisingly non-trivial and depends heavily on the hardware architecture.
This repository contains several educational examples showcasing the performance differences between various solutions:
- Single-threaded but SIMD-accelerated code:
  - SSE, AVX, AVX-512 on x86.
  - 🔜 NEON and SVE on Arm.
- OpenMP `reduction` clause.
- Thrust with its `thrust::reduce`.
- CUB with its `cub::DeviceReduce::Sum`.
- CUDA kernels with and w/out [warp-primitives](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/).
- CUDA kernels with [Tensor-Core](https://www.nvidia.com/en-gb/data-center/tensor-cores/) acceleration.
- [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) and cuBLAS strided vector and matrix routines.
- OpenCL kernels, eight of them.
- Parallel STL `<algorithm>` in GCC with Intel oneTBB.
Notably:
- on arrays with billions of elements, the default `float` error mounts, and the results become inaccurate unless a [Kahan-like scheme](https://en.wikipedia.org/wiki/Kahan_summation_algorithm) is used.
- to minimize the overhead of [Translation Lookaside Buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) (TLB) misses, the arrays are aligned to the OS page size and are allocated in [huge pages on Linux](https://wiki.debian.org/Hugepages), if possible.
- to reduce the memory access latency on many-core [Non-Uniform Memory Access](https://en.wikipedia.org/wiki/Non-uniform_memory_access) (NUMA) systems, `libnuma` and `pthread` help maximize data affinity.
- to "hide" latency on wide CPU registers (like `ZMM`), expensive Assembly instructions executed on different [CPU ports](https://easyperf.net/blog/2018/03/21/port-contention#utilizing-full-capacity-of-the-load-instructions) are interleaved.
---
The examples in this repository were originally written in the early 2010s and updated in 2019, 2022, and 2025.
Previously, it also included ArrayFire, Halide, and Vulkan queues for SPIR-V kernels and SyCL.
- [Lecture Slides](https://drive.google.com/file/d/16AicAl99t3ZZFnza04Wnw_Vuem0w8lc7/view?usp=sharing) from 2019.
- [CppRussia Talk](https://youtu.be/AA4RI6o0h1U) in Russia in 2019.