diff --git a/README.md b/README.md
index 454ce00..2c5c783 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ Parallel algorithm building blocks for the Julia ecosystem, targeting multithrea
### A Uniform API, Everywhere
-Offering standard library functions (e.g., `sort`, `mapreduce`, `accumulate`), higher-order functions (e.g., `sum`, `cumprod`, `any`), and cross-architecture custom loops (`foreachindex`, `foraxes`), AcceleratedKernels.jl lets you write high-performance code once and run it on any supported architecture — no separate or special-cased kernels needed. It’s the classic “write once, run everywhere” principle, but supercharged for modern parallel CPU and GPU computing.
+Offering standard library algorithms (e.g., `sort`, `mapreduce`, `accumulate`), higher-order functions (e.g., `sum`, `cumprod`, `any`), and cross-architecture custom loops (`foreachindex`, `foraxes`), AcceleratedKernels.jl lets you write high-performance code once and run it on all supported architectures — no separate or special-cased kernels needed. It’s the classic “write once, run everywhere” principle, but supercharged for modern parallel CPU and GPU computing.
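To make the "write once, run everywhere" claim concrete, here is a minimal sketch using the functions named above; `CuArray` is just one example backend (any supported GPU array, or a plain `Vector` on CPU threads, is meant to work the same way), and the exact calls are a sketch rather than verified output:

```julia
# Minimal sketch of the uniform API; CUDA is used only as an example backend.
# The same code is meant to run on CPU threads if `v` is a plain `Vector`.
import AcceleratedKernels as AK
using CUDA

v = CuArray(rand(Float32, 100_000))

# Cross-architecture custom loop: compiled as a GPU kernel here,
# or run as a multithreaded CPU loop for a `Vector`.
AK.foreachindex(v) do i
    v[i] = 2f0 * v[i] + 1f0
end

# The higher-level algorithms use the exact same call on any backend.
sorted = AK.sort(v)
total = AK.sum(v)
```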
@@ -320,8 +320,7 @@ Help is very welcome for any of the below:
switch_below=(1, 10, 100, 1000, 10000)
end
```
-- Add performant multithreaded Julia implementations to all algorithms; e.g. `foreachindex` has one, `any` does not.
- - EDIT: as of v0.2.0, only `sort` needs a multithreaded implementation.
+- We need performant multithreaded implementations of `sort`, N-dimensional `mapreduce` (probably via `OhMyThreads.tmapreduce`), and `accumulate` (again, probably via `OhMyThreads`); see the sketch below for the flat `mapreduce` case.
- Any way to expose the warp-size from the backends? Would be useful in reductions.
- Add a performance regressions runner.
- **Other ideas?** Post an issue, or open a discussion on the Julia Discourse.
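For the multithreaded `mapreduce` item above, a rough sketch of the flat (all-dimensions) case is shown here, assuming `OhMyThreads.tmapreduce(f, op, itr; init)` with its documented signature; the actual open work is the N-dimensional, per-`dims` version, which this sketch does not attempt:

```julia
# Rough sketch only: a flat multithreaded mapreduce built on OhMyThreads.
# The open task above is the N-dimensional (per-`dims`) reduction, not this part.
using OhMyThreads: tmapreduce

threaded_mapreduce(f, op, src::AbstractArray; init) = tmapreduce(f, op, src; init=init)

# Example: threaded sum of squares.
x = rand(1_000_000)
threaded_mapreduce(abs2, +, x; init=0.0)
```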
diff --git a/src/predicates.jl b/src/predicates.jl
index 7bb0bbc..23419ba 100644
--- a/src/predicates.jl
+++ b/src/predicates.jl
@@ -57,9 +57,9 @@ it in your application. When only one thread is needed, there is no overhead.
## GPU
There are two possible `alg` choices:
-- `ConcurrentWrite()`: the default algorithm, using concurrent writing to a global flag;
- and uses a global flag to write the result; this is only one platform we are aware of (Intel UHD
- 620 integrated graphics cards) where such writes hang.
+- `ConcurrentWrite()`: the default algorithm, which writes the result concurrently to a global
+  flag; the only platform we are aware of where such writes (multiple threads storing the same
+  value to the same memory location) can hang the device is Intel UHD 620 integrated graphics.
- `MapReduce(; temp=nothing, switch_below=0)`: a conservative [`mapreduce`](@ref)-based
implementation which can be used on all platforms, but does not use shortcircuiting
optimisations. You can set the `temp` and `switch_below` keyword arguments to be forwarded to
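As a usage sketch of the two `alg` choices above (assuming this docstring documents `AK.any` with a Base-style predicate-first signature, and that `ConcurrentWrite` and `MapReduce` are reachable under the `AK` prefix; none of this is verified here):

```julia
# Usage sketch for the two `alg` choices; the `AK.any` entry point and the
# predicate-first call are assumed from the surrounding docstring.
import AcceleratedKernels as AK
using CUDA

v = CuArray(rand(Float32, 1_000_000))

# Default: concurrent writes to a global flag, allowing short-circuiting.
AK.any(x -> x > 0.999f0, v; alg=AK.ConcurrentWrite())

# Conservative fallback for platforms where such writes hang (e.g. Intel UHD 620 graphics).
AK.any(x -> x > 0.999f0, v; alg=AK.MapReduce(switch_below=100))
```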
@@ -201,9 +201,9 @@ it in your application. When only one thread is needed, there is no overhead.
## GPU
There are two possible `alg` choices:
-- `ConcurrentWrite()`: the default algorithm, using concurrent writing to a global flag;
- and uses a global flag to write the result; this is only one platform we are aware of (Intel UHD
- 620 integrated graphics cards) where such writes hang.
+- `ConcurrentWrite()`: the default algorithm, which writes the result concurrently to a global
+  flag; the only platform we are aware of where such writes (multiple threads storing the same
+  value to the same memory location) can hang the device is Intel UHD 620 integrated graphics.
- `MapReduce(; temp=nothing, switch_below=0)`: a conservative [`mapreduce`](@ref)-based
implementation which can be used on all platforms, but does not use shortcircuiting
optimisations. You can set the `temp` and `switch_below` keyword arguments to be forwarded to