Added N-dimensional `reduce` and `mapreduce`, `map`, docs, and tests for each. In-place functions now also return the modified argument, as in Base. Updated the README.
Yes, the lambda within the `do` block can equally well be executed on both CPU and GPU; no code changes or duplication required.
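As a minimal sketch of that point (the array setup here is an assumption, not taken from the PR; `AK.foreachindex` with a `do` block is the pattern being referred to):

```julia
import AcceleratedKernels as AK

x = rand(Float32, 1000)      # plain CPU Vector; swap in e.g. a CuArray for GPU execution
y = similar(x)

AK.foreachindex(x) do i      # the same do-block lambda is compiled for whichever
    y[i] = 2 * x[i] + 1      # backend x lives on; no code changes or duplication
end
```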

### 5.6. `mapreduce`
Equivalent to `reduce(op, map(f, iterable))`, without saving the intermediate mapped collection; can be used to e.g. split documents into words (map) and count the frequency thereof (reduce).

- **Other names**: `transform_reduce`; some `fold` implementations include the mapping function too.
**New in AcceleratedKernels 0.2.0: N-dimensional reductions via the `dims` keyword**
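For illustration, a flat reduction (the call shown in the full README) next to a sketch of the new `dims` form; the `dims` usage is an assumption based on the note above, mirroring Base's `mapreduce`:

```julia
import AcceleratedKernels as AK

v = rand(Int32(-100):Int32(100), 100_000)
# Minimum absolute value; init must be a neutral element for the reducing operator
AK.mapreduce(abs, (x, y) -> x < y ? x : y, v, init=typemax(Int32))

# Sketch: per-column sum of absolute values via the dims keyword (assumed API)
m = rand(Float32, 16, 1024)
AK.mapreduce(abs, +, m; init=zero(Float32), dims=1)
```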
As for `reduce`, when there are fewer than `switch_below` elements left to reduce, they can be copied back to the host and we switch to a CPU reduction. The `init` initialiser has to be a neutral element for `op` and of the same type as returned by `f` (`f` can change the type of the collection; see the "Custom Structs" section below for an example). The temporary array `temp` needs to have at least `(length(src) + 2 * block_size - 1) ÷ (2 * block_size)` elements and have `eltype(src) === typeof(init)`.
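A sketch of sizing that temporary by hand; the `temp` and `block_size` keyword names are taken from the paragraph above, and `block_size = 256` is an assumed value:

```julia
import AcceleratedKernels as AK

v = rand(Int32(-100):Int32(100), 100_000)
init = typemax(Int32)
block_size = 256                                              # assumed; match the value you pass
n_temp = (length(v) + 2 * block_size - 1) ÷ (2 * block_size)
temp = similar(v, typeof(init), n_temp)                       # eltype(temp) === typeof(init)
AK.mapreduce(abs, (x, y) -> x < y ? x : y, v;
             init=init, temp=temp, block_size=block_size)
```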

### 5.7. `accumulate`
Compute accumulated running totals along a sequence by applying a binary operator to all elements up to the current one; often used in GPU programming as a first step in finding / extracting subsets of data.

- `accumulate!` (in-place), `accumulate` (allocating); inclusive or exclusive.
- **Other names**: prefix sum, `thrust::scan`, cumulative sum; inclusive (or exclusive) if the first element is included in the accumulation (or not).
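For example, the in-place form (the call itself appears in the full README; the vector setup is an assumption):

```julia
import AcceleratedKernels as AK

v = collect(1:16)              # plain Vector here; GPU arrays (e.g. CuArray) are the main target
AK.accumulate!(+, v, init=0)   # in-place running totals; see the inclusive/exclusive note above
```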
The temporaries `temp_v` and `temp_flags` should both have at least `(length(v) + 2 * block_size - 1) ÷ (2 * block_size)` elements; `eltype(v) === eltype(temp_v)`; the elements in `temp_flags` can be any integers, but `Int8` is used by default to reduce memory usage.
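For instance, sizing those temporaries by hand might look like the sketch below (`block_size = 256` is an assumption; check the docstring for the exact keyword names used to pass them to `accumulate!`):

```julia
v = collect(1:100_000)
block_size = 256                                              # assumed; match the value you pass
n_temp = (length(v) + 2 * block_size - 1) ÷ (2 * block_size)
temp_v = similar(v, n_temp)                                   # eltype(temp_v) === eltype(v)
temp_flags = similar(v, Int8, n_temp)                         # any integer eltype; Int8 saves memory
```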

### 5.8. `searchsorted` and friends
Find the indices where some elements `x` should be inserted into a sorted sequence `v` to maintain the sorted order. This effectively applies the Julia Base functions in parallel on a GPU using `foreachindex`.

- `searchsortedfirst!` (in-place), `searchsortedfirst` (allocating): index of first element in `v` >= `x[j]`.
- `searchsortedlast!`, `searchsortedlast`: index of last element in `v` <= `x[j]`.
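A sketch of the allocating form (the argument order and the returned index vector are assumptions based on the descriptions above):

```julia
import AcceleratedKernels as AK

v = sort(rand(Float32, 10_000))    # sorted sequence to search in
x = rand(Float32, 100)             # elements to locate
ix = AK.searchsortedfirst(v, x)    # assumed: ix[j] is the index of the first element in v >= x[j]
```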
Apply a predicate to check whether all / any elements in a collection return `true`. This could be implemented as a reduction, but is better optimised by stopping the search as soon as a `false` / `true` is found.
- **Other names**: not often implemented standalone on GPUs, typically included as part of a reduction.
**Note on the `cooperative` keyword**: some older platforms crash when multiple threads write to the same memory location in a global array (e.g. old Intel Graphics); on other platforms the behaviour is well-defined if all threads write the same value (e.g. CUDA F4.2 says "If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined."). This "cooperative" thread behaviour allows a faster implementation; if you have a platform that crashes (the only one I know of is Intel UHD Graphics), set `cooperative=false` to use a safer `mapreduce`-based implementation.
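A sketch of how these might be called; the predicate-first form is assumed to mirror Base, and `cooperative` is the keyword described above:

```julia
import AcceleratedKernels as AK

v = rand(Float32, 100_000)
AK.any(x -> x < 0, v)                       # stops the search as soon as a negative value is found
AK.all(x -> x >= 0, v; cooperative=false)   # assumed: safer mapreduce-based path for e.g. Intel UHD Graphics
```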