Conversation

@simeonschaub
Member

ref #352

Unfortunately, I don't really see any performance improvements with
this; any ideas why? I expected this to be quite a bit faster.

@codecov

codecov bot commented Oct 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.01%. Comparing base (7c4881d) to head (96192c2).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #383   +/-   ##
=======================================
  Coverage   79.01%   79.01%           
=======================================
  Files          12       12           
  Lines         672      672           
=======================================
  Hits          531      531           
  Misses        141      141           


@maleadt
Member

maleadt commented Oct 13, 2025

> Unfortunately, I don't really see any performance improvements with
> this; any ideas why? I expected this to be quite a bit faster.

On which back-end? For the CPU back-end, I wouldn't expect shuffle intrinsics to yield any speed-up over shared memory, as they're likely emulated using shared storage anyway.
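
To make the comparison concrete, the two strategies look roughly like this as raw OpenCL C kernels embedded in Julia (a sketch only, not the kernels this PR generates; the shuffle variant assumes the sub-group extensions, e.g. cl_khr_subgroup_shuffle_relative, are supported by the driver):

using OpenCL

const shuffle_src = """
__kernel void reduce_shuffle(__global const float *in, __global float *out) {
    float val = in[get_global_id(0)];
    // tree reduction within a sub-group: on GPUs this stays in registers,
    // but a CPU driver will typically lower the shuffles to memory anyway
    for (uint delta = get_sub_group_size() / 2; delta > 0; delta /= 2)
        val += sub_group_shuffle_down(val, delta);
    if (get_sub_group_local_id() == 0)
        out[get_group_id(0) * get_num_sub_groups() + get_sub_group_id()] = val;
}"""

const local_src = """
__kernel void reduce_local(__global const float *in, __global float *out,
                           __local float *scratch) {
    uint lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    // tree reduction in local (shared) memory
    for (uint s = get_local_size(0) / 2; s > 0; s /= 2) {
        if (lid < s) scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) out[get_group_id(0)] = scratch[0];
}"""

prog = cl.Program(; source = shuffle_src * "\n" * local_src) |> cl.build!

If the driver implements sub_group_shuffle_down by spilling to a scratch buffer, both kernels end up doing essentially the same memory traffic, which would explain the lack of a speed-up.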

@simeonschaub
Member Author

Yes, this was with the pocl CPU backend. I thought pocl might have some tricks for optimizing shuffles to be more efficient than local memory, but it seems you are right. Do you have any ideas to get performance closer to that of #356? At the moment, copying the array to the host, doing the reduction serially, and copying the result back to the device is a lot faster for me than using OpenCL.jl's mapreduce implementation.
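
For reference, the host-roundtrip version I'm comparing against is essentially the following (a sketch; host_roundtrip is just a name I'm using here):

using OpenCL, BenchmarkTools

X′ = CLArray(rand(Float32, 1000, 1000))

# device → host, serial reduction on the CPU, result back to the device
host_roundtrip(A) = CLArray(sum(Array(A); dims = 1))

@benchmark OpenCL.synchronize(host_roundtrip($X′))  # vs. sum(X′; dims = 1)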

@VarLad
Contributor

VarLad commented Oct 13, 2025

@simeonschaub hey! Apologies for the noise, but out of curiosity, I wonder if you could try Mesa's (latest) Rusticl + llvmpipe instead, to see how the performance compares. 👀
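
(For anyone reproducing this: Rusticl only exposes drivers that are explicitly opted in via an environment variable, so something along these lines should be needed before OpenCL is initialized; the exact driver name may vary between Mesa builds.)

# must be set before the OpenCL platforms are first enumerated
ENV["RUSTICL_ENABLE"] = "llvmpipe"

using OpenCL
cl.platform!("rusticl")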

@maleadt
Member

maleadt commented Oct 14, 2025

> Do you have any ideas to get performance closer to that of #356?

I don't have a good intuition yet for which areas PoCL struggles with. Given that you already have a fast version, I'd try incrementally adding complexity to see which "feature" introduces the problem: is it the CartesianIndices, and does that result in more complicated code that breaks some sort of auto-vectorization? Is it the use of atomics vs. multiple reductions? Something like the sketch below.
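
As an illustration of that approach, here are two toy column-sum kernels that differ only in how they index, written with KernelAbstractions (illustrative only, not the actual mapreduce code; it assumes OpenCL.jl's OpenCLBackend, and colsum_linear!/colsum_cartesian! are made-up names):

using KernelAbstractions, OpenCL

# variant A: one work-item per column, plain linear indexing
@kernel function colsum_linear!(out, @Const(A))
    j = @index(Global, Linear)
    acc = zero(eltype(out))
    for i in 1:size(A, 1)
        acc += A[i, j]
    end
    out[j] = acc
end

# variant B: the same loop, but routed through CartesianIndices
@kernel function colsum_cartesian!(out, @Const(A))
    j = @index(Global, Linear)
    acc = zero(eltype(out))
    for I in CartesianIndices((size(A, 1),))
        acc += A[I[1], j]
    end
    out[j] = acc
end

backend = OpenCLBackend()
A = CLArray(rand(Float32, 1000, 1000))
out = CLArray(zeros(Float32, size(A, 2)))
colsum_linear!(backend)(out, A; ndrange = length(out))
KernelAbstractions.synchronize(backend)

Timing A against B, and then against a variant that combines partial results with atomics, should narrow down which "feature" PoCL fails to optimize.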

@simeonschaub
Member Author

> I wonder if you could try Mesa's (latest) Rusticl + llvmpipe instead, to see how the performance compares.

For Rusticl + llvmpipe, the shuffle-based implementation is actually a tiny bit faster than the shared memory one, but both are quite a bit slower than pocl:

julia> X = rand(Float32, 1000, 1000);

julia> cl.platform!("rusticl");

julia> X′ = CLArray(X);

julia> # shared memory based reduction
       @benchmark OpenCL.synchronize(sum(X′; dims = 1))
BenchmarkTools.Trial: 408 samples with 1 evaluation per sample.
 Range (min … max):  10.122 ms … 17.247 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.192 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.215 ms ±  1.373 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▁▅██▆▅▂▂▂  ▄ ▁ ▁▁    ▁▁▅▁ ▃▁▇▂▅▃ ▅ ▃                       
  ▅███████████▆█▇█▄███▇█████████████▆███▇▅▄▁▄▅▃▄▃▄▃▁▄▅▃▄▃▁▄▁▃ ▅
  10.1 ms         Histogram: frequency by time        15.7 ms <

 Memory estimate: 20.20 KiB, allocs estimate: 228.

julia> # shuffle based reduction
       @benchmark OpenCL.synchronize(sum(X′; dims = 1))
BenchmarkTools.Trial: 511 samples with 1 evaluation per sample.
 Range (min … max):  7.800 ms … 12.629 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.758 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.784 ms ±  1.105 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▆ ▁▃▃▆▆▂▆▂▃▃▂▂▄             ▁ ▅▅ █▆▃▄▇▂▆               
  ▃▃▃▃██▇█████████████▅▅▇▆█▅█▇▄▅▇▇██▇██▅███████▄▄▅▇▆▂▃▃▃▃▂▃▃ ▅
  7.8 ms         Histogram: frequency by time        12.1 ms <

 Memory estimate: 20.20 KiB, allocs estimate: 228.

julia> cl.platform!("Portable Computing Language");

julia> X′ = CLArray(X);

julia> # shared memory based reduction
       @benchmark OpenCL.synchronize(sum(X′; dims = 1))
BenchmarkTools.Trial: 1325 samples with 1 evaluation per sample.
 Range (min … max):  2.410 ms …   7.233 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.579 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.756 ms ± 578.443 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▁▃▇█▆▆▃▁▁▁                                      
  ▂▁▂▂▂▂▁▂▂▂▂▄▅██████████▇█▆▇▇▆▆▆▅▇▅▆▅▅▄▄▅▃▃▄▃▃▃▂▃▂▂▃▃▂▁▂▁▂▂▂ ▄
  2.41 ms         Histogram: frequency by time        5.68 ms <

 Memory estimate: 22.55 KiB, allocs estimate: 248.

julia> # shuffle based reduction
       @benchmark OpenCL.synchronize(sum(X′; dims = 1))
BenchmarkTools.Trial: 829 samples with 1 evaluation per sample.
 Range (min … max):  4.399 ms …   9.598 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.873 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.017 ms ± 593.533 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▂ ▇██▆▆▆▆▄▂▃▁ ▂                              
  ▂▁▁▃▁▁▂▁▁▁▁▃▄▄▅▇█▇█████████████▆▆▆▅▆▆▆▄▄▄▅▃▄▄▄▃▄▃▃▃▃▂▂▁▄▂▂▂ ▄
  4.4 ms          Histogram: frequency by time        7.95 ms <

 Memory estimate: 22.55 KiB, allocs estimate: 248.
