optimized mapreduce using sub group shuffle
#383
base: master
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##           master     #383   +/-   ##
=======================================
  Coverage   79.01%   79.01%
=======================================
  Files          12       12
  Lines         672      672
=======================================
  Hits          531      531
  Misses        141      141
=======================================
Force-pushed from 7d14d4a to 7b8ddaa
ref #352

Unfortunately, I don't really see any performance improvement with this; any ideas why? I expected it to be quite a bit faster.
Force-pushed from 7b8ddaa to 48a86d6
On which back-end? For the CPU back-end, I wouldn't expect shuffle intrinsics to yield any speed-up over shared memory, as they're likely emulated using shared storage anyway.
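For reference, here is a minimal host-side sketch of the butterfly pattern a shuffle-based reduction performs. The array indexing stands in for the sub_group_shuffle_down(value, delta) intrinsic (OpenCL C, cl_khr_subgroup_shuffle_relative); the function name and structure are illustrative, not code from this PR. On a CPU back-end, each of these lane exchanges typically round-trips through memory anyway, which is why no win over local memory is expected there.

julia> # hypothetical host-side emulation of a shuffle-down reduction;
       # vals[lane] plays the role of each lane's private register
       function shuffle_down_reduce(vals::Vector{Float32})
           n = length(vals)        # sub-group size, assumed a power of two
           vals = copy(vals)
           delta = n >> 1
           while delta > 0
               for lane in 1:delta
                   # each active lane adds the value held delta lanes above it,
                   # i.e. what sub_group_shuffle_down(val, delta) would fetch
                   vals[lane] += vals[lane + delta]
               end
               delta >>= 1
           end
           return vals[1]          # lane 0 ends up holding the full sum
       end;

julia> shuffle_down_reduce(ones(Float32, 32))
32.0f0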
Yes, this was with the pocl CPU back-end. I thought pocl might have some tricks for optimizing shuffles to be more efficient than local memory, but it seems you're right. Do you have any ideas for getting performance closer to that in #356? At the moment, copying the array to the host, doing the reduction serially, and copying the result back to the device is a lot faster for me than using OpenCL.jl's mapreduce.
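For context, the host round-trip described above fits in a one-liner; this is just a sketch of the workaround (host_sum is an illustrative name), assuming the result is wanted back on the device:

julia> using OpenCL, BenchmarkTools

julia> X′ = CLArray(rand(Float32, 1000, 1000));

julia> # copy to the host, reduce there serially, copy the result back
       host_sum(X; dims) = CLArray(sum(Array(X); dims));

julia> @benchmark OpenCL.synchronize(host_sum($X′; dims = 1))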
@simeonschaub hey! Apologies for the noise, but out of curiosity, I wonder if you can try Mesa's (latest) Rusticl + llvmpipe instead, to see how the performance compares. 👀
I don't have a good intuition yet for which areas PoCL struggles with. Given that you already have a fast version, I'd try incrementally adding complexity to see which "feature" introduces the problem: is it the CartesianIndices, and does that result in more complicated code that breaks some sort of auto-vectorization? Is it the use of atomics vs. multiple reductions? Etc.
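As a template for that bisection, here is a host-only sketch that isolates a single feature, the indexing scheme, with everything else held fixed (function names are illustrative):

julia> using BenchmarkTools

julia> # column sums via a manually computed linear index
       function colsum_linear!(out, X)
           m, n = size(X)
           @inbounds for j in 1:n
               acc = zero(eltype(X))
               for i in 1:m
                   acc += X[(j - 1) * m + i]
               end
               out[j] = acc
           end
           return out
       end;

julia> # the same work expressed through CartesianIndices
       function colsum_cartesian!(out, X)
           @inbounds for I in CartesianIndices(X)
               out[I[2]] += X[I]
           end
           return out
       end;

julia> X = rand(Float32, 1000, 1000);

julia> @benchmark colsum_linear!(out, $X) setup=(out = zeros(Float32, 1000)) evals=1

julia> @benchmark colsum_cartesian!(out, $X) setup=(out = zeros(Float32, 1000)) evals=1

If the two diverge noticeably, the same pair can be ported into the kernel to check whether the gap survives PoCL's code generation.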
For Rusticl + llvmpipe, the shuffle-based implementation is actually a tiny bit faster than the shared-memory one, but both are quite a bit slower than pocl:

julia> X = rand(Float32, 1000, 1000);
julia> cl.platform!("rusticl");
julia> X′ = CLArray(X);
julia> # shared memory based reduction
@benchmark OpenCL.synchronize(sum(X′; dims = 1))
BenchmarkTools.Trial: 408 samples with 1 evaluation per sample.
Range (min … max): 10.122 ms … 17.247 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 12.192 ms ┊ GC (median): 0.00%
Time (mean ± σ): 12.215 ms ± 1.373 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▅██▆▅▂▂▂ ▄ ▁ ▁▁ ▁▁▅▁ ▃▁▇▂▅▃ ▅ ▃
▅███████████▆█▇█▄███▇█████████████▆███▇▅▄▁▄▅▃▄▃▄▃▁▄▅▃▄▃▁▄▁▃ ▅
10.1 ms Histogram: frequency by time 15.7 ms <
Memory estimate: 20.20 KiB, allocs estimate: 228.
julia> # shuffle based reduction
@benchmark OpenCL.synchronize(sum(X′; dims = 1))
BenchmarkTools.Trial: 511 samples with 1 evaluation per sample.
Range (min … max): 7.800 ms … 12.629 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 9.758 ms ┊ GC (median): 0.00%
Time (mean ± σ): 9.784 ms ± 1.105 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▆ ▁▃▃▆▆▂▆▂▃▃▂▂▄ ▁ ▅▅ █▆▃▄▇▂▆
▃▃▃▃██▇█████████████▅▅▇▆█▅█▇▄▅▇▇██▇██▅███████▄▄▅▇▆▂▃▃▃▃▂▃▃ ▅
7.8 ms Histogram: frequency by time 12.1 ms <
Memory estimate: 20.20 KiB, allocs estimate: 228.
julia> cl.platform!("Portable Computing Language");
julia> X′ = CLArray(X);
julia> # shared memory based reduction
@benchmark OpenCL.synchronize(sum(X′; dims = 1))
BenchmarkTools.Trial: 1325 samples with 1 evaluation per sample.
Range (min … max): 2.410 ms … 7.233 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.579 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.756 ms ± 578.443 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▇█▆▆▃▁▁▁
▂▁▂▂▂▂▁▂▂▂▂▄▅██████████▇█▆▇▇▆▆▆▅▇▅▆▅▅▄▄▅▃▃▄▃▃▃▂▃▂▂▃▃▂▁▂▁▂▂▂ ▄
2.41 ms Histogram: frequency by time 5.68 ms <
Memory estimate: 22.55 KiB, allocs estimate: 248.
julia> # shuffle based reduction
@benchmark OpenCL.synchronize(sum(X′; dims = 1))
BenchmarkTools.Trial: 829 samples with 1 evaluation per sample.
Range (min … max): 4.399 ms … 9.598 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.873 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.017 ms ± 593.533 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▇██▆▆▆▆▄▂▃▁ ▂
▂▁▁▃▁▁▂▁▁▁▁▃▄▄▅▇█▇█████████████▆▆▆▅▆▆▆▄▄▄▅▃▄▄▄▃▄▃▃▃▃▂▂▁▄▂▂▂ ▄
4.4 ms Histogram: frequency by time 7.95 ms <
Memory estimate: 22.55 KiB, allocs estimate: 248.
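To reproduce the comparison across back-ends, the platform is switched by name as above; a sketch, assuming cl.platforms() and the .name property behave as in current OpenCL.jl:

julia> using OpenCL, BenchmarkTools

julia> X = rand(Float32, 1000, 1000);

julia> for p in cl.platforms()
           cl.platform!(p.name)
           X′ = CLArray(X)      # re-upload after switching platforms
           display(@benchmark OpenCL.synchronize(sum($X′; dims = 1)))
       end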