
Commit e47b061 (parent 1adfc5f)

added N-dimensional reduce and mapreduce, map, docs, and tests for each. In-place functions now also return the modified argument as in Base. Updated README.

27 files changed: +1409 -63 lines

Project.toml (4 additions, 1 deletion)

@@ -1,7 +1,7 @@
 name = "AcceleratedKernels"
 uuid = "6a4ca0a5-0e36-4168-a932-d9be78d558f1"
 authors = ["Andrei-Leonard Nicusan <leonard@evophase.co.uk> and contributors"]
-version = "0.1.0"
+version = "0.2.0-DEV"

 [deps]
 ArgCheck = "dce04be8-c92d-5529-be00-80e4d2c0e197"
@@ -10,11 +10,14 @@ GPUArraysCore = "46192b85-c4d5-4398-a991-12ede77f4527"
 KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1ce7c"
 Markdown = "d6f4376e-aef5-505a-96c1-9c027394607a"
 Polyester = "f517fe37-dbe3-4b94-8317-1923a5111588"
+Unrolled = "9602ed7d-8fef-5bc8-8597-8f21381861e8"

 [compat]
 ArgCheck = "2.1"
 DocStringExtensions = "0.9"
 GPUArraysCore = "0.1"
 KernelAbstractions = "0.9"
+Markdown = "1.11"
 Polyester = "0.7"
+Unrolled = "0.1.5"
 julia = "1.6.7"

README.md (75 additions, 24 deletions)
@@ -17,12 +17,13 @@ Parallel algorithm building blocks for the Julia ecosystem, targeting multithrea
 - [5. API Examples](#5-api-examples)
 - [5.1. Using Different Backends](#51-using-different-backends)
 - [5.2. `foreachindex`](#52-foreachindex)
-- [5.3. `sort` and friends](#53-sort-and-friends)
-- [5.4. `reduce`](#54-reduce)
-- [5.5. `mapreduce`](#55-mapreduce)
-- [5.6. `accumulate`](#56-accumulate)
-- [5.7. `searchsorted` and friends](#57-searchsorted-and-friends)
-- [5.8. `all` / `any`](#58-all--any)
+- [5.3. `map`](#53-map)
+- [5.4. `sort` and friends](#54-sort-and-friends)
+- [5.5. `reduce`](#55-reduce)
+- [5.6. `mapreduce`](#56-mapreduce)
+- [5.7. `accumulate`](#57-accumulate)
+- [5.8. `searchsorted` and friends](#58-searchsorted-and-friends)
+- [5.9. `all` / `any`](#59-all--any)
 - [6. Custom Structs](#6-custom-structs)
 - [7. Testing](#7-testing)
 - [8. Issues and Debugging](#8-issues-and-debugging)
@@ -72,18 +73,19 @@ Below is an overview of the currently-implemented algorithms, along with some co
 | Function Family                               | AcceleratedKernels.jl Functions                  | Other Common Names                                        |
 | --------------------------------------------- | ------------------------------------------------ | --------------------------------------------------------- |
 | [General Looping](#52-foreachindex)           | `foreachindex`                                   | `Kokkos::parallel_for` `RAJA::forall` `thrust::transform` |
-| [Sorting](#53-sort-and-friends)               | `sort` `sort!`                                   | `sort` `sort_team` `stable_sort`                          |
+| [General Looping](#53-map)                    | `map` `map!`                                     | `thrust::transform`                                       |
+| [Sorting](#54-sort-and-friends)               | `sort` `sort!`                                   | `sort` `sort_team` `stable_sort`                          |
 |                                               | `merge_sort` `merge_sort!`                       |                                                           |
 |                                               | `merge_sort_by_key` `merge_sort_by_key!`         | `sort_team_by_key`                                        |
 |                                               | `sortperm` `sortperm!`                           | `sort_permutation` `index_permutation`                    |
 |                                               | `merge_sortperm` `merge_sortperm!`               |                                                           |
 |                                               | `merge_sortperm_lowmem` `merge_sortperm_lowmem!` |                                                           |
-| [Reduction](#54-reduce)                       | `reduce`                                         | `Kokkos:parallel_reduce` `fold` `aggregate`               |
-| [MapReduce](#55-mapreduce)                    | `mapreduce`                                      | `transform_reduce` `fold`                                 |
-| [Accumulation](#56-accumulate)                | `accumulate` `accumulate!`                       | `prefix_sum` `thrust::scan` `cumsum`                      |
-| [Binary Search](#57-searchsorted-and-friends) | `searchsortedfirst` `searchsortedfirst!`         | `std::lower_bound`                                        |
+| [Reduction](#55-reduce)                       | `reduce`                                         | `Kokkos:parallel_reduce` `fold` `aggregate`               |
+| [MapReduce](#56-mapreduce)                    | `mapreduce`                                      | `transform_reduce` `fold`                                 |
+| [Accumulation](#57-accumulate)                | `accumulate` `accumulate!`                       | `prefix_sum` `thrust::scan` `cumsum`                      |
+| [Binary Search](#58-searchsorted-and-friends) | `searchsortedfirst` `searchsortedfirst!`         | `std::lower_bound`                                        |
 |                                               | `searchsortedlast` `searchsortedlast!`           | `thrust::upper_bound`                                     |
-| [Predicates](#58-all--any)                    | `all` `any`                                      |                                                           |
+| [Predicates](#59-all--any)                    | `all` `any`                                      |                                                           |


 ## 5. API Examples
@@ -214,7 +216,40 @@ Similarly, for performance on the CPU the overhead of spawning threads should be
 ```


-### 5.3. `sort` and friends
+### 5.3. `map`
+Parallel mapping of a function over each element of an iterable via `foreachindex`:
+- `map!` (in-place), `map` (out-of-place)
+
+Function signature:
+```julia
+map!(
+    f, dst::AbstractArray, src::AbstractArray;
+
+    # CPU settings
+    scheduler=:threads,
+    max_tasks=Threads.nthreads(),
+    min_elems=1,
+
+    # GPU settings
+    block_size=256,
+)
+```
+
+Example:
+```julia
+import Metal
+import AcceleratedKernels as AK
+
+x = MtlArray(rand(Float32, 100_000))
+y = similar(x)
+AK.map!(y, x) do x_elem
+    T = typeof(x_elem)
+    T(2) * x_elem + T(1)
+end
+```
+
+
+### 5.4. `sort` and friends
 Sorting algorithms with similar interface and default settings as the Julia Base ones, on GPUs:
 - `sort!` (in-place), `sort` (out-of-place)
 - `sortperm!`, `sortperm`
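The `map!` section added above amounts to an elementwise `dst[i] = f(src[i])` loop over indices, which is exactly what `foreachindex` parallelises. The following plain-Python sketch illustrates only that contract; `map_into` is a hypothetical helper name, not the AcceleratedKernels API:

```python
def map_into(f, dst, src):
    """Elementwise map into a preallocated destination, returning the
    modified destination (mirroring the commit's Base-style convention
    that in-place functions return their modified argument)."""
    assert len(dst) == len(src)
    for i in range(len(src)):   # every index is independent, hence parallelisable
        dst[i] = f(src[i])
    return dst

x = [0.5, 1.0, 2.0]
y = [0.0] * len(x)
print(map_into(lambda v: 2 * v + 1, y, x))  # [2.0, 3.0, 5.0]
```

Because each destination slot is written by exactly one index, the loop body can be dispatched to threads or GPU blocks without synchronisation.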
@@ -277,15 +312,23 @@ AK.sort!(v, temp=temp)
 ```


-### 5.4. `reduce`
+### 5.5. `reduce`
 Apply a custom binary operator reduction on all elements in an iterable; can be used to compute minima, sums, counts, etc.
 - **Other names**: `Kokkos:parallel_reduce`, `fold`, `aggregate`.

+**New in AcceleratedKernels 0.2.0: N-dimensional reductions via the `dims` keyword**
+
 Function signature:
 ```julia
-reduce(op, src::AbstractGPUVector; init,
-       block_size::Int=256, temp::Union{Nothing, AbstractGPUVector}=nothing,
-       switch_below::Int=0)
+reduce(
+    op, src::AbstractGPUArray;
+    init,
+    dims::Union{Nothing, Int}=nothing,
+
+    block_size::Int=256,
+    temp::Union{Nothing, AbstractGPUArray}=nothing,
+    switch_below::Int=0,
+)
 ```

 Example computing a sum:
@@ -307,15 +350,23 @@ end
 Yes, the lambda within the `do` block can equally well be executed on both CPU and GPU, no code changes/duplication required.


-### 5.5. `mapreduce`
+### 5.6. `mapreduce`
 Equivalent to `reduce(op, map(f, iterable))`, without saving the intermediate mapped collection; can be used to e.g. split documents into words (map) and count the frequency thereof (reduce).
 - **Other names**: `transform_reduce`, some `fold` implementations include the mapping function too.

+**New in AcceleratedKernels 0.2.0: N-dimensional reductions via the `dims` keyword**
+
 Function signature:
 ```julia
-mapreduce(f, op, src::AbstractGPUVector; init,
-          block_size::Int=256, temp::Union{Nothing, AbstractGPUVector}=nothing,
-          switch_below::Int=0)
+mapreduce(
+    f, op, src::AbstractGPUArray;
+    init,
+    dims::Union{Nothing, Int}=nothing,
+
+    block_size::Int=256,
+    temp::Union{Nothing, AbstractGPUArray}=nothing,
+    switch_below::Int=0,
+)
 ```

 Example computing the minimum of absolute values:
@@ -330,7 +381,7 @@ AK.mapreduce(abs, (x, y) -> x < y ? x : y, v, init=typemax(Int32))
 As for `reduce`, when there are fewer than `switch_below` elements left to reduce, they can be copied back to the host and we switch to a CPU reduction. The `init` initialiser has to be a neutral element for `op`, i.e. same type as returned from `f` (`f` can change the type of the collection, see the "Custom Structs" section below for an example). The temporary array `temp` needs to have at least `(length(src) + 2 * block_size - 1) ÷ (2 * block_size)` elements and have `eltype(src) === typeof(init)`.


-### 5.6. `accumulate`
+### 5.7. `accumulate`
 Compute accumulated running totals along a sequence by applying a binary operator to all elements up to the current one; often used in GPU programming as a first step in finding / extracting subsets of data.
 - `accumulate!` (in-place), `accumulate` (allocating); inclusive or exclusive.
 - **Other names**: prefix sum, `thrust::scan`, cumulative sum; inclusive (or exclusive) if the first element is included in the accumulation (or not).
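The inclusive/exclusive distinction described in the `accumulate` section above can be pinned down with a sequential reference implementation. This Python sketch illustrates only the semantics, not the parallel block-level scan the package implements:

```python
def accumulate(op, v, init, inclusive=True):
    """Sequential reference for a prefix scan: the inclusive variant's
    first output already contains v[0]; the exclusive variant shifts
    results by one position and starts from init."""
    out = []
    acc = init
    for x in v:
        if inclusive:
            acc = op(acc, x)
            out.append(acc)
        else:
            out.append(acc)       # exclusive: emit the total *before* this element
            acc = op(acc, x)
    return out

print(accumulate(lambda a, b: a + b, [1, 2, 3, 4], 0))                   # [1, 3, 6, 10]
print(accumulate(lambda a, b: a + b, [1, 2, 3, 4], 0, inclusive=False))  # [0, 1, 3, 6]
```

The exclusive scan is the variant typically used for the "extract a subset" pattern the section mentions, since each output gives the destination offset of the corresponding element.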
@@ -359,7 +410,7 @@ AK.accumulate!(+, v, init=0)
 The temporaries `temp_v` and `temp_flags` should both have at least `(length(v) + 2 * block_size - 1) ÷ (2 * block_size)` elements; `eltype(v) === eltype(temp_v)`; the elements in `temp_flags` can be any integers, but `Int8` is used by default to reduce memory usage.


-### 5.7. `searchsorted` and friends
+### 5.8. `searchsorted` and friends
 Find the indices where some elements `x` should be inserted into a sorted sequence `v` to maintain the sorted order. Effectively applying the Julia.Base functions in parallel on a GPU using `foreachindex`.
 - `searchsortedfirst!` (in-place), `searchsortedfirst` (allocating): index of first element in `v` >= `x[j]`.
 - `searchsortedlast!`, `searchsortedlast`: index of last element in `v` <= `x[j]`.
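The two contracts above (index of the first element >= `x[j]`, index of the last element <= `x[j]`, both 1-based) map directly onto Python's `bisect` functions; this sketch shows the per-element lookup that `foreachindex` would apply in parallel, and is only an illustration of the semantics:

```python
from bisect import bisect_left, bisect_right

def searchsortedfirst(v, x):
    """1-based index of the first element of sorted v that is >= x
    (len(v) + 1 if every element is smaller), as described above."""
    return bisect_left(v, x) + 1

def searchsortedlast(v, x):
    """1-based index of the last element of sorted v that is <= x
    (0 if every element is larger)."""
    return bisect_right(v, x)

v = [10, 20, 20, 30]
print(searchsortedfirst(v, 20))  # 2
print(searchsortedlast(v, 20))   # 3
print(searchsortedfirst(v, 35))  # 5 (past the end)
print(searchsortedlast(v, 5))    # 0 (before the start)
```

Each lookup touches only `v` read-only, so running one lookup per element of `x` is trivially parallel.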
@@ -413,7 +464,7 @@ AK.searchsortedfirst!(ix, v, x)
 ```


-### 5.8. `all` / `any`
+### 5.9. `all` / `any`
 Apply a predicate to check if all / any elements in a collection return true. Could be implemented as a reduction, but is better optimised with stopping the search once a false / true is found.
 - **Other names**: not often implemented standalone on GPUs, typically included as part of a reduction.

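The early-exit behaviour described in the `all` / `any` section above (stop scanning once one deciding element is found, instead of reducing the whole collection) can be sketched sequentially; `any_pred` / `all_pred` are illustrative names, not the AcceleratedKernels API:

```python
def any_pred(pred, v):
    """Early-exit any: in a GPU version a shared flag would let all
    blocks stop once one thread finds a True; here the loop returns."""
    for x in v:
        if pred(x):
            return True
    return False

def all_pred(pred, v):
    # all is any with the predicate negated
    return not any_pred(lambda x: not pred(x), v)

v = list(range(1_000_000))
print(any_pred(lambda x: x == 3, v))  # True, after inspecting only 4 elements
print(all_pred(lambda x: x >= 0, v))  # True
print(all_pred(lambda x: x < 10, v))  # False, exits at x == 10
```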
docs/Project.toml (1 addition, 0 deletions)

@@ -1,4 +1,5 @@
 [deps]
+AcceleratedKernels = "6a4ca0a5-0e36-4168-a932-d9be78d558f1"
 Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"

 [compat]

docs/make.jl (1 addition, 0 deletions)

@@ -22,6 +22,7 @@ makedocs(;
     "Manual" =>[
         "Using Different Backends" => "api/using_backends.md",
         "General Loops" => "api/foreachindex.md",
+        "Map" => "api/map.md",
         "Sorting" => "api/sort.md",
         "Reduce" => "api/reduce.md",
         "MapReduce" => "api/mapreduce.md",

docs/src/api/accumulate.md (1 addition, 1 deletion)

@@ -2,5 +2,5 @@

 ```@example
 import AcceleratedKernels as AK # hide
-AK.DocHelpers.readme_section("### 5.6. `accumulate`") # hide
+AK.DocHelpers.readme_section("### 5.7. `accumulate`") # hide
 ```

docs/src/api/binarysearch.md (1 addition, 1 deletion)

@@ -2,5 +2,5 @@

 ```@example
 import AcceleratedKernels as AK # hide
-AK.DocHelpers.readme_section("### 5.7. `searchsorted` and friends") # hide
+AK.DocHelpers.readme_section("### 5.8. `searchsorted` and friends") # hide
 ```

docs/src/api/map.md (6 additions, 0 deletions)

@@ -0,0 +1,6 @@
+### Map
+
+```@example
+import AcceleratedKernels as AK # hide
+AK.DocHelpers.readme_section("### 5.3. `map`") # hide
+```

docs/src/api/mapreduce.md (7 additions, 1 deletion)

@@ -2,5 +2,11 @@

 ```@example
 import AcceleratedKernels as AK # hide
-AK.DocHelpers.readme_section("### 5.5. `mapreduce`") # hide
+AK.DocHelpers.readme_section("### 5.6. `mapreduce`") # hide
+```
+
+---
+
+```@docs
+AcceleratedKernels.mapreduce
 ```

docs/src/api/predicates.md (1 addition, 1 deletion)

@@ -2,7 +2,7 @@

 ```@example
 import AcceleratedKernels as AK # hide
-AK.DocHelpers.readme_section("### 5.8. `all` / `any`") # hide
+AK.DocHelpers.readme_section("### 5.9. `all` / `any`") # hide
 ```

 **Note on the `cooperative` keyword**: some older platforms crash when multiple threads write to the same memory location in a global array (e.g. old Intel Graphics); if all threads were to write the same value, it is well-defined on others (e.g. CUDA F4.2 says "If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined."). This "cooperative" thread behaviour allows for a faster implementation; if you have a platform - the only one I know is Intel UHD Graphics - that crashes, set `cooperative=false` to use a safer `mapreduce`-based implementation.

docs/src/api/reduce.md (7 additions, 1 deletion)

@@ -2,5 +2,11 @@

 ```@example
 import AcceleratedKernels as AK # hide
-AK.DocHelpers.readme_section("### 5.4. `reduce`") # hide
+AK.DocHelpers.readme_section("### 5.5. `reduce`") # hide
+```
+
+---
+
+```@docs
+AcceleratedKernels.reduce
 ```
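For reference, the N-dimensional `reduce` / `mapreduce` semantics this commit documents in the README, together with the stated minimum `temp` length `(length(src) + 2 * block_size - 1) ÷ (2 * block_size)`, can be sketched in plain Python. This is a semantics illustration only, assuming `dims` follows Julia's axis convention; nothing here is the actual GPU implementation:

```python
from functools import reduce as fold

def mapreduce(f, op, src, init, dims=None):
    """Reference semantics for a matrix stored as a list of rows:
    dims=None folds every mapped element; dims=1 reduces down each
    column; dims=2 reduces along each row."""
    if dims is None:
        return fold(op, (f(x) for row in src for x in row), init)
    if dims == 1:
        return [fold(op, map(f, col), init) for col in zip(*src)]
    if dims == 2:
        return [fold(op, map(f, row), init) for row in src]
    raise ValueError("dims must be None, 1 or 2 for a matrix")

def temp_length(n, block_size=256):
    # documented minimum temp size: ceil(n / (2 * block_size))
    return (n + 2 * block_size - 1) // (2 * block_size)

m = [[1, 2, 3],
     [4, 5, 6]]
add = lambda a, b: a + b
print(mapreduce(abs, add, m, 0))          # 21
print(mapreduce(abs, add, m, 0, dims=1))  # [5, 7, 9]
print(mapreduce(abs, add, m, 0, dims=2))  # [6, 15]
print(temp_length(100_000))               # 196
```

Note how `init` is folded once per output slice, which is why the README requires it to be a neutral element for `op`.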
