diff --git a/README.md b/README.md
index 454ce00..2c5c783 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ Parallel algorithm building blocks for the Julia ecosystem, targeting multithrea
 
 ### A Uniform API, Everywhere
 
-Offering standard library functions (e.g., `sort`, `mapreduce`, `accumulate`), higher-order functions (e.g., `sum`, `cumprod`, `any`), and cross-architecture custom loops (`foreachindex`, `foraxes`), AcceleratedKernels.jl lets you write high-performance code once and run it on any supported architecture — no separate or special-cased kernels needed. It’s the classic “write once, run everywhere” principle, but supercharged for modern parallel CPU and GPU computing.
+Offering standard library algorithms (e.g., `sort`, `mapreduce`, `accumulate`), higher-order functions (e.g., `sum`, `cumprod`, `any`), and cross-architecture custom loops (`foreachindex`, `foraxes`), AcceleratedKernels.jl lets you write high-performance code once and run it on all supported architectures — no separate or special-cased kernels needed. It’s the classic “write once, run everywhere” principle, but supercharged for modern parallel CPU and GPU computing.
 
@@ -320,8 +320,7 @@ Help is very welcome for any of the below:
       switch_below=(1, 10, 100, 1000, 10000)
   end
   ```
-- Add performant multithreaded Julia implementations to all algorithms; e.g. `foreachindex` has one, `any` does not.
-  - EDIT: as of v0.2.0, only `sort` needs a multithreaded implementation.
+- We need multithreaded implementations of `sort`, N-dimensional `mapreduce` (in `OhMyThreads.tmapreduce`) and `accumulate` (again, probably in `OhMyThreads`).
 - Any way to expose the warp-size from the backends? Would be useful in reductions.
 - Add a performance regressions runner.
 - **Other ideas?** Post an issue, or open a discussion on the Julia Discourse.
diff --git a/src/predicates.jl b/src/predicates.jl
index 7bb0bbc..23419ba 100644
--- a/src/predicates.jl
+++ b/src/predicates.jl
@@ -57,9 +57,9 @@ it in your application. When only one thread is needed, there is no overhead.
 
 ## GPU
 There are two possible `alg` choices:
-- `ConcurrentWrite()`: the default algorithm, using concurrent writing to a global flag;
-  and uses a global flag to write the result; this is only one platform we are aware of (Intel UHD
-  620 integrated graphics cards) where such writes hang.
+- `ConcurrentWrite()`: the default algorithm, using concurrent writing to a global flag; there is
+  only one platform we are aware of (Intel UHD 620 integrated graphics cards) where multiple
+  threads writing to the same memory location - even if writing the same value - hang the device.
 - `MapReduce(; temp=nothing, switch_below=0)`: a conservative [`mapreduce`](@ref)-based
   implementation which can be used on all platforms, but does not use shortcircuiting
   optimisations. You can set the `temp` and `switch_below` keyword arguments to be forwarded to
@@ -201,9 +201,9 @@ it in your application. When only one thread is needed, there is no overhead.
 
 ## GPU
 There are two possible `alg` choices:
-- `ConcurrentWrite()`: the default algorithm, using concurrent writing to a global flag;
-  and uses a global flag to write the result; this is only one platform we are aware of (Intel UHD
-  620 integrated graphics cards) where such writes hang.
+- `ConcurrentWrite()`: the default algorithm, using concurrent writing to a global flag; there is
+  only one platform we are aware of (Intel UHD 620 integrated graphics cards) where multiple
+  threads writing to the same memory location - even if writing the same value - hang the device.
 - `MapReduce(; temp=nothing, switch_below=0)`: a conservative [`mapreduce`](@ref)-based
   implementation which can be used on all platforms, but does not use shortcircuiting
   optimisations. You can set the `temp` and `switch_below` keyword arguments to be forwarded to
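Note for reviewers: a minimal usage sketch of the two `alg` choices documented in the updated predicate docstrings above. Only the `ConcurrentWrite()` / `MapReduce(; temp, switch_below)` options and the `switch_below` keyword come from this diff; the predicate-first call form `AK.any(pred, v; alg=...)` and the CUDA backend are assumptions for illustration.

```julia
# Sketch only - call signatures are assumed, not taken from this diff.
import AcceleratedKernels as AK
using CUDA

v = CUDA.rand(Float32, 1_000_000)

# Default algorithm: threads concurrently write the result to a global flag,
# allowing early exit once any element satisfies the predicate.
AK.any(x -> x < 1f-6, v)

# Conservative fallback for platforms where concurrent same-value writes hang
# (e.g. Intel UHD 620): a mapreduce-based reduction without short-circuiting.
AK.any(x -> x < 1f-6, v; alg=AK.MapReduce(switch_below=100))
```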