optimized mapreduce using sub group shuffle
#383
base: master
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##           master     #383   +/-   ##
=======================================
  Coverage   79.01%   79.01%
=======================================
  Files          12       12
  Lines         672      672
=======================================
  Hits          531      531
  Misses        141      141
=======================================
Force-pushed from 7d14d4a to 7b8ddaa
ref #352

Unfortunately, I don't really see any performance improvement with this; any ideas why? I expected it to be quite a bit faster.
Force-pushed from 7b8ddaa to 48a86d6
On which back-end? For the CPU back-end, I wouldn't expect shuffle intrinsics to yield any speed-up over shared memory, as they're likely emulated using shared storage anyway.
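For reference, here is a minimal host-side sketch of the butterfly pattern a shuffle-based reduction performs. The array indexing stands in for the sub_group_shuffle_down(value, delta) intrinsic (OpenCL C, cl_khr_subgroup_shuffle_relative); the function name and structure are illustrative, not code from this PR. On a CPU back-end, each of these lane exchanges typically round-trips through memory anyway, which is why no win over local memory is expected there.

julia> # hypothetical host-side emulation of a shuffle-down reduction;
       # vals[lane] plays the role of each lane's private register
       function shuffle_down_reduce(vals::Vector{Float32})
           n = length(vals)        # sub-group size, assumed a power of two
           vals = copy(vals)
           delta = n >> 1
           while delta > 0
               for lane in 1:delta
                   # each active lane adds the value held delta lanes above it,
                   # i.e. what sub_group_shuffle_down(val, delta) would fetch
                   vals[lane] += vals[lane + delta]
               end
               delta >>= 1
           end
           return vals[1]          # lane 0 ends up holding the full sum
       end;

julia> shuffle_down_reduce(ones(Float32, 32))
32.0f0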
Yes, this was with the pocl CPU back-end. I thought pocl might have some tricks for optimizing shuffles to be more efficient than local memory, but it seems you're right. Do you have any ideas for getting performance closer to that in #356? At the moment, copying the array to the host, doing the reduction serially, and copying the result back to the device is a lot faster for me than using OpenCL.jl's mapreduce.
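For context, the host round-trip described above fits in a one-liner; this is just a sketch of the workaround (host_sum is an illustrative name), assuming the result is wanted back on the device:

julia> using OpenCL, BenchmarkTools

julia> X′ = CLArray(rand(Float32, 1000, 1000));

julia> # copy to the host, reduce there serially, copy the result back
       host_sum(X; dims) = CLArray(sum(Array(X); dims));

julia> @benchmark OpenCL.synchronize(host_sum($X′; dims = 1))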
@simeonschaub hey! Apologies for the noise, but out of curiosity, I wonder if you can try Mesa's (latest) Rusticl + llvmpipe instead, to see how the performance compares. 👀
I don't have a good intuition yet for which areas PoCL struggles with. Given that you already have a fast version, I'd try incrementally adding complexity to see which "feature" introduces the problem: is it the CartesianIndices, and does that result in more complicated code that breaks some sort of auto-vectorization? Is it the use of atomics vs. multiple reductions? Etc.
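As a template for that bisection, here is a host-only sketch that isolates a single feature, the indexing scheme, with everything else held fixed (function names are illustrative):

julia> using BenchmarkTools

julia> # column sums via a manually computed linear index
       function colsum_linear!(out, X)
           m, n = size(X)
           @inbounds for j in 1:n
               acc = zero(eltype(X))
               for i in 1:m
                   acc += X[(j - 1) * m + i]
               end
               out[j] = acc
           end
           return out
       end;

julia> # the same work expressed through CartesianIndices
       function colsum_cartesian!(out, X)
           @inbounds for I in CartesianIndices(X)
               out[I[2]] += X[I]
           end
           return out
       end;

julia> X = rand(Float32, 1000, 1000);

julia> @benchmark colsum_linear!(out, $X) setup=(out = zeros(Float32, 1000)) evals=1

julia> @benchmark colsum_cartesian!(out, $X) setup=(out = zeros(Float32, 1000)) evals=1

If the two diverge noticeably, the same pair can be ported into the kernel to check whether the gap survives PoCL's code generation.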
For Rusticl + llvmpipe, the shuffle-based implementation is actually a tiny bit faster than the shared-memory one, but both are quite a bit slower than pocl:

julia> X = rand(Float32, 1000, 1000);
julia> cl.platform!("rusticl");
julia> X′ = CLArray(X);
julia> # shared memory based reduction
@benchmark OpenCL.synchronize(sum(X′; dims = 1))
BenchmarkTools.Trial: 408 samples with 1 evaluation per sample.
Range (min … max): 10.122 ms … 17.247 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 12.192 ms ┊ GC (median): 0.00%
Time (mean ± σ): 12.215 ms ± 1.373 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▅██▆▅▂▂▂ ▄ ▁ ▁▁ ▁▁▅▁ ▃▁▇▂▅▃ ▅ ▃
▅███████████▆█▇█▄███▇█████████████▆███▇▅▄▁▄▅▃▄▃▄▃▁▄▅▃▄▃▁▄▁▃ ▅
10.1 ms Histogram: frequency by time 15.7 ms <
Memory estimate: 20.20 KiB, allocs estimate: 228.
julia> # shuffle based reduction
@benchmark OpenCL.synchronize(sum(X′; dims = 1))
BenchmarkTools.Trial: 511 samples with 1 evaluation per sample.
Range (min … max): 7.800 ms … 12.629 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 9.758 ms ┊ GC (median): 0.00%
Time (mean ± σ): 9.784 ms ± 1.105 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▆ ▁▃▃▆▆▂▆▂▃▃▂▂▄ ▁ ▅▅ █▆▃▄▇▂▆
▃▃▃▃██▇█████████████▅▅▇▆█▅█▇▄▅▇▇██▇██▅███████▄▄▅▇▆▂▃▃▃▃▂▃▃ ▅
7.8 ms Histogram: frequency by time 12.1 ms <
Memory estimate: 20.20 KiB, allocs estimate: 228.
julia> cl.platform!("Portable Computing Language");
julia> X′ = CLArray(X);
julia> # shared memory based reduction
@benchmark OpenCL.synchronize(sum(X′; dims = 1))
BenchmarkTools.Trial: 1325 samples with 1 evaluation per sample.
Range (min … max): 2.410 ms … 7.233 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.579 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.756 ms ± 578.443 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▇█▆▆▃▁▁▁
▂▁▂▂▂▂▁▂▂▂▂▄▅██████████▇█▆▇▇▆▆▆▅▇▅▆▅▅▄▄▅▃▃▄▃▃▃▂▃▂▂▃▃▂▁▂▁▂▂▂ ▄
2.41 ms Histogram: frequency by time 5.68 ms <
Memory estimate: 22.55 KiB, allocs estimate: 248.
julia> # shuffle based reduction
@benchmark OpenCL.synchronize(sum(X′; dims = 1))
BenchmarkTools.Trial: 829 samples with 1 evaluation per sample.
Range (min … max): 4.399 ms … 9.598 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.873 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.017 ms ± 593.533 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▇██▆▆▆▆▄▂▃▁ ▂
▂▁▁▃▁▁▂▁▁▁▁▃▄▄▅▇█▇█████████████▆▆▆▅▆▆▆▄▄▄▅▃▄▄▄▃▄▃▃▃▃▂▂▁▄▂▂▂ ▄
4.4 ms Histogram: frequency by time 7.95 ms <
Memory estimate: 22.55 KiB, allocs estimate: 248.
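To reproduce the comparison across back-ends, the platform is switched by name as above; a sketch, assuming cl.platforms() and the .name property behave as in current OpenCL.jl:

julia> using OpenCL, BenchmarkTools

julia> X = rand(Float32, 1000, 1000);

julia> for p in cl.platforms()
           cl.platform!(p.name)
           X′ = CLArray(X)      # re-upload after switching platforms
           display(@benchmark OpenCL.synchronize(sum($X′; dims = 1)))
       end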