
Commit 63ec68e

Merge pull request #21 from JuliaGPU/reduce-init-use
typos
2 parents 6661069 + 0091efb commit 63ec68e

File tree: 2 files changed, +8 −9 lines


README.md

Lines changed: 2 additions & 3 deletions
@@ -12,7 +12,7 @@ Parallel algorithm building blocks for the Julia ecosystem, targeting multithrea
 
 
 ### A Uniform API, Everywhere
-Offering standard library functions (e.g., `sort`, `mapreduce`, `accumulate`), higher-order functions (e.g., `sum`, `cumprod`, `any`), and cross-architecture custom loops (`foreachindex`, `foraxes`), AcceleratedKernels.jl lets you write high-performance code once and run it on any supported architecture — no separate or special-cased kernels needed. It’s the classic “write once, run everywhere” principle, but supercharged for modern parallel CPU and GPU computing.
+Offering standard library algorithms (e.g., `sort`, `mapreduce`, `accumulate`), higher-order functions (e.g., `sum`, `cumprod`, `any`), and cross-architecture custom loops (`foreachindex`, `foraxes`), AcceleratedKernels.jl lets you write high-performance code once and run it on all supported architectures — no separate or special-cased kernels needed. It’s the classic “write once, run everywhere” principle, but supercharged for modern parallel CPU and GPU computing.
 
 
 <table>
@@ -320,8 +320,7 @@ Help is very welcome for any of the below:
     switch_below=(1, 10, 100, 1000, 10000)
 end
 ```
-- Add performant multithreaded Julia implementations to all algorithms; e.g. `foreachindex` has one, `any` does not.
-- EDIT: as of v0.2.0, only `sort` needs a multithreaded implementation.
+- We need multithreaded implementations of `sort`, N-dimensional `mapreduce` (in `OhMyThreads.tmapreduce`) and `accumulate` (again, probably in `OhMyThreads`).
 - Any way to expose the warp-size from the backends? Would be useful in reductions.
 - Add a performance regressions runner.
 - **Other ideas?** Post an issue, or open a discussion on the Julia Discourse.
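The rewritten TODO item asks for a multithreaded `mapreduce`. As a rough illustration of the pattern being requested (chunk the input, reduce each chunk on its own thread, then combine the per-thread partials), here is a minimal sketch in Python; it is not AcceleratedKernels.jl code, and the name `tmapreduce` is borrowed from `OhMyThreads` purely for flavour:

```python
# Hypothetical sketch, NOT the package's implementation: a chunked,
# multithreaded mapreduce. Requires `op` to be associative and `init`
# to be an identity element for `op` (it is folded in once per chunk).
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def tmapreduce(f, op, xs, init, nthreads=4):
    n = len(xs)
    if n == 0:
        return init
    # Split the input into one contiguous chunk per thread.
    chunk = max(1, (n + nthreads - 1) // nthreads)
    chunks = [xs[i:i + chunk] for i in range(0, n, chunk)]

    def reduce_chunk(c):
        # Map `f` over the chunk and fold with `op`, seeded with `init`.
        return reduce(lambda acc, x: op(acc, f(x)), c, init)

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        partials = list(pool.map(reduce_chunk, chunks))
    # Combine the per-thread partial results serially.
    return reduce(op, partials, init)
```

For example, `tmapreduce(lambda x: x * x, lambda a, b: a + b, list(range(10)), 0)` computes the sum of squares of 0..9.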

src/predicates.jl

Lines changed: 6 additions & 6 deletions
@@ -57,9 +57,9 @@ it in your application. When only one thread is needed, there is no overhead.
 
 ## GPU
 There are two possible `alg` choices:
-- `ConcurrentWrite()`: the default algorithm, using concurrent writing to a global flag;
-  and uses a global flag to write the result; this is only one platform we are aware of (Intel UHD
-  620 integrated graphics cards) where such writes hang.
+- `ConcurrentWrite()`: the default algorithm, using concurrent writing to a global flag; there is
+  only one platform we are aware of (Intel UHD 620 integrated graphics cards) where multiple
+  threads writing to the same memory location - even if writing the same value - hang the device.
 - `MapReduce(; temp=nothing, switch_below=0)`: a conservative [`mapreduce`](@ref)-based
   implementation which can be used on all platforms, but does not use shortcircuiting
   optimisations. You can set the `temp` and `switch_below` keyword arguments to be forwarded to
@@ -201,9 +201,9 @@ it in your application. When only one thread is needed, there is no overhead.
 
 ## GPU
 There are two possible `alg` choices:
-- `ConcurrentWrite()`: the default algorithm, using concurrent writing to a global flag;
-  and uses a global flag to write the result; this is only one platform we are aware of (Intel UHD
-  620 integrated graphics cards) where such writes hang.
+- `ConcurrentWrite()`: the default algorithm, using concurrent writing to a global flag; there is
+  only one platform we are aware of (Intel UHD 620 integrated graphics cards) where multiple
+  threads writing to the same memory location - even if writing the same value - hang the device.
 - `MapReduce(; temp=nothing, switch_below=0)`: a conservative [`mapreduce`](@ref)-based
   implementation which can be used on all platforms, but does not use shortcircuiting
   optimisations. You can set the `temp` and `switch_below` keyword arguments to be forwarded to
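The two `alg` strategies in this docstring can be illustrated outside the package. Below is a hedged Python sketch (not the package's Julia implementation, and function names are invented): `ConcurrentWrite` corresponds to every matching thread storing `True` to one shared flag, a benign race since all writers store the same value; `MapReduce` corresponds to OR-ing per-element predicate results with no short-circuiting:

```python
# Hypothetical sketch of the two strategies described in the docstring,
# NOT AcceleratedKernels.jl code.
import threading

def any_concurrent_write(pred, xs, nthreads=4):
    flag = [False]  # shared "global flag"
    def worker(chunk):
        for x in chunk:
            if pred(x):
                flag[0] = True  # benign race: every writer stores True
                return          # short-circuit this thread
    chunk = max(1, len(xs) // nthreads + 1)
    threads = [threading.Thread(target=worker, args=(xs[i:i + chunk],))
               for i in range(0, len(xs), chunk)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return flag[0]

def any_mapreduce(pred, xs):
    # Conservative fallback: OR-reduce every element, no short-circuiting.
    acc = False
    for x in xs:
        acc = acc | bool(pred(x))
    return acc
```

In the real package the conservative variant also takes `temp` and `switch_below` keywords forwarded to `mapreduce`; they are omitted here for brevity.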
