[Do not merge] Switch to GPUArrays.jl reduction implementation #628
| Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:
diff --git a/perf/runbenchmarks.jl b/perf/runbenchmarks.jl
index ba5e0d40..1d7901c5 100644
--- a/perf/runbenchmarks.jl
+++ b/perf/runbenchmarks.jl
@@ -1,6 +1,6 @@
 # benchmark suite execution and codespeed submission
 using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="akreduce")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "akreduce")
 
 using Metal
 
diff --git a/test/runtests.jl b/test/runtests.jl
index 4ee51134..fb376e4f 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -6,7 +6,7 @@ import REPL
 using Test
 
 using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="akreduce")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "akreduce")
 
 # Quit without erroring if Metal loaded without issues on unsupported platforms
 if !Sys.isapple() | 
| Leaving the current  | 
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #628   +/-   ##
=======================================
  Coverage   80.63%   80.63%           
=======================================
  Files          61       61           
  Lines        2722     2722           
=======================================
  Hits         2195     2195           
  Misses        527      527           
☔ View full report in Codecov by Sentry.
Metal Benchmarks
| Benchmark suite | Current: c0eddd1 | Previous: 1942968 | Ratio | 
|---|---|---|---|
| latency/precompile | 9830015416ns | 9844653958ns | 1.00 | 
| latency/ttfp | 3989128875ns | 3972040229ns | 1.00 | 
| latency/import | 1281988208ns | 1275530958.5ns | 1.01 | 
| integration/metaldevrt | 830312.5ns | 828500ns | 1.00 | 
| integration/byval/slices=1 | 1532291.5ns | 1536750ns | 1.00 | 
| integration/byval/slices=3 | 8864917ns | 9632625ns | 0.92 | 
| integration/byval/reference | 1535333ns | 1543583ns | 0.99 | 
| integration/byval/slices=2 | 2554083ns | 2621958.5ns | 0.97 | 
| kernel/indexing | 582792ns | 567792ns | 1.03 | 
| kernel/indexing_checked | 577208ns | 569292ns | 1.01 | 
| kernel/launch | 9042ns | 9208ns | 0.98 | 
| array/construct | 6125ns | 6625ns | 0.92 | 
| array/broadcast | 579250ns | 583375ns | 0.99 | 
| array/random/randn/Float32 | 821167ns | 784333ns | 1.05 | 
| array/random/randn!/Float32 | 622625ns | 623250ns | 1.00 | 
| array/random/rand!/Int64 | 555395.5ns | 547458ns | 1.01 | 
| array/random/rand!/Float32 | 584125ns | 585291ns | 1.00 | 
| array/random/rand/Int64 | 777375ns | 771250ns | 1.01 | 
| array/random/rand/Float32 | 628375ns | 622687ns | 1.01 | 
| array/accumulate/Int64/1d | 1261292ns | 1277104.5ns | 0.99 | 
| array/accumulate/Int64/dims=1 | 1800500ns | 1868333ns | 0.96 | 
| array/accumulate/Int64/dims=2 | 2165958.5ns | 2183625ns | 0.99 | 
| array/accumulate/Int64/dims=1L | 11643104ns | 11737104ns | 0.99 | 
| array/accumulate/Int64/dims=2L | 9718917ns | 9771416.5ns | 0.99 | 
| array/accumulate/Float32/1d | 1141375ns | 1142833ns | 1.00 | 
| array/accumulate/Float32/dims=1 | 1562333.5ns | 1570458ns | 0.99 | 
| array/accumulate/Float32/dims=2 | 1865875ns | 1931625ns | 0.97 | 
| array/accumulate/Float32/dims=1L | 9890916.5ns | 9864375ns | 1.00 | 
| array/accumulate/Float32/dims=2L | 7298500ns | 7308021ns | 1.00 | 
| array/reductions/reduce/Int64/1d | 1077583ns | 1373353.5ns | 0.78 | 
| array/reductions/reduce/Int64/dims=1 | 987500ns | 1069291.5ns | 0.92 | 
| array/reductions/reduce/Int64/dims=2 | 935145.5ns | 1193292ns | 0.78 | 
| array/reductions/reduce/Int64/dims=1L | 2350750ns | 2113062.5ns | 1.11 | 
| array/reductions/reduce/Int64/dims=2L | 2815291ns | 3456458ns | 0.81 | 
| array/reductions/reduce/Float32/1d | 1029750ns | 971625ns | 1.06 | 
| array/reductions/reduce/Float32/dims=1 | 956125ns | 808458ns | 1.18 | 
| array/reductions/reduce/Float32/dims=2 | 870375ns | 768979ns | 1.13 | 
| array/reductions/reduce/Float32/dims=1L | 1659354.5ns | 1739041ns | 0.95 | 
| array/reductions/reduce/Float32/dims=2L | 2781167ns | 1772125ns | 1.57 | 
| array/reductions/mapreduce/Int64/1d | 1000375ns | 1456146ns | 0.69 | 
| array/reductions/mapreduce/Int64/dims=1 | 936083ns | 1074875ns | 0.87 | 
| array/reductions/mapreduce/Int64/dims=2 | 873500ns | 1206417ns | 0.72 | 
| array/reductions/mapreduce/Int64/dims=1L | 2346562.5ns | 2119292ns | 1.11 | 
| array/reductions/mapreduce/Int64/dims=2L | 2844729ns | 3444375ns | 0.83 | 
| array/reductions/mapreduce/Float32/1d | 1045959ns | 990792ns | 1.06 | 
| array/reductions/mapreduce/Float32/dims=1 | 947959ns | 810062.5ns | 1.17 | 
| array/reductions/mapreduce/Float32/dims=2 | 868041.5ns | 761104ns | 1.14 | 
| array/reductions/mapreduce/Float32/dims=1L | 1668167ns | 1740812.5ns | 0.96 | 
| array/reductions/mapreduce/Float32/dims=2L | 2815354.5ns | 1781292ns | 1.58 | 
| array/private/copyto!/gpu_to_gpu | 636791ns | 651375ns | 0.98 | 
| array/private/copyto!/cpu_to_gpu | 795791ns | 805542ns | 0.99 | 
| array/private/copyto!/gpu_to_cpu | 811292ns | 817667ns | 0.99 | 
| array/private/iteration/findall/int | 1657000ns | 1646500ns | 1.01 | 
| array/private/iteration/findall/bool | 1451937.5ns | 1444584ns | 1.01 | 
| array/private/iteration/findfirst/int | 2074750ns | 1754958.5ns | 1.18 | 
| array/private/iteration/findfirst/bool | 1635145.5ns | 1703625ns | 0.96 | 
| array/private/iteration/scalar | 5542583.5ns | 4772500ns | 1.16 | 
| array/private/iteration/logical | 2734958ns | 2536917ns | 1.08 | 
| array/private/iteration/findmin/1d | 1870167ns | 1815666ns | 1.03 | 
| array/private/iteration/findmin/2d | 1891583.5ns | 1431750ns | 1.32 | 
| array/private/copy | 573791.5ns | 538167ns | 1.07 | 
| array/shared/copyto!/gpu_to_gpu | 83750ns | 86375ns | 0.97 | 
| array/shared/copyto!/cpu_to_gpu | 82625ns | 86583ns | 0.95 | 
| array/shared/copyto!/gpu_to_cpu | 91458ns | 84833ns | 1.08 | 
| array/shared/iteration/findall/int | 1643437.5ns | 1609874.5ns | 1.02 | 
| array/shared/iteration/findall/bool | 1471812.5ns | 1464354ns | 1.01 | 
| array/shared/iteration/findfirst/int | 1830375ns | 1377750ns | 1.33 | 
| array/shared/iteration/findfirst/bool | 1385917ns | 1319166ns | 1.05 | 
| array/shared/iteration/scalar | 206917ns | 217500ns | 0.95 | 
| array/shared/iteration/logical | 2750042ns | 2288708.5ns | 1.20 | 
| array/shared/iteration/findmin/1d | 1607895.5ns | 1421750ns | 1.13 | 
| array/shared/iteration/findmin/2d | 1917291.5ns | 1430854.5ns | 1.34 | 
| array/shared/copy | 251042ns | 248666ns | 1.01 | 
| array/permutedims/4d | 2442208ns | 2438438ns | 1.00 | 
| array/permutedims/2d | 1184291.5ns | 1193250ns | 0.99 | 
| array/permutedims/3d | 1737625ns | 1768458ns | 0.98 | 
| metal/synchronization/stream | 19667ns | 19916ns | 0.99 | 
| metal/synchronization/context | 20292ns | 20375ns | 1.00 | 
This comment was automatically generated by workflow using github-action-benchmark.
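For reading the table above, here is a minimal sketch of how the Ratio column appears to be derived: current time divided by previous time, so values below 1.00 mean the PR is faster. The three rows are copied from the table; the `classify` helper and its 5% tolerance are hypothetical and are not part of the benchmark workflow.

```julia
using Printf

# Ratio consistent with the rows shown: current time / previous time.
ratio(current_ns, previous_ns) = current_ns / previous_ns

# Hypothetical helper: the 5% tolerance is an assumption chosen for illustration.
classify(r; tol = 0.05) = r > 1 + tol ? :slower : r < 1 - tol ? :faster : :unchanged

# (benchmark name, current ns, previous ns), copied from the table.
rows = [
    ("array/reductions/reduce/Int64/1d", 1077583.0, 1373353.5),
    ("array/reductions/reduce/Float32/dims=2L", 2781167.0, 1772125.0),
    ("kernel/launch", 9042.0, 9208.0),
]

for (name, cur, prev) in rows
    r = ratio(cur, prev)
    @printf("%-42s %.2f  %s\n", name, r, classify(r))
end
```

Rounded to two digits, the first two rows reproduce the 0.78 and 1.57 reported in the table.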
I think I'd rather we do it in one pass, because the change needs to be made across back-ends.

In any case, despite some regressions the overall performance seems better here than over in CUDA.jl.
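One way to weigh the regressions against the improvements is to summarize the `array/reductions` ratios by element type. The sketch below copies the ratios from the benchmark table; the `geomean` helper and the grouping are illustrative choices, not something the benchmark workflow reports.

```julia
# Ratios copied from the array/reductions rows of the benchmark table, in table order:
# 1d, dims=1, dims=2, dims=1L, dims=2L for reduce, then the same for mapreduce.
int64_ratios   = [0.78, 0.92, 0.78, 1.11, 0.81, 0.69, 0.87, 0.72, 1.11, 0.83]
float32_ratios = [1.06, 1.18, 1.13, 0.95, 1.57, 1.06, 1.17, 1.14, 0.96, 1.58]

# Hypothetical helper: the geometric mean is a common way to average ratios.
geomean(rs) = exp(sum(log, rs) / length(rs))

for (label, rs) in (("Int64", int64_ratios), ("Float32", float32_ratios))
    faster = count(<(1), rs)  # ratio < 1 means the PR is faster than the previous commit
    println(label, ": ", faster, " of ", length(rs), " faster, geometric mean ratio ",
            round(geomean(rs); digits = 2))
end
```

On these rows the Int64 reductions are faster in 8 of 10 cases, while the Float32 reductions are faster only in the two `dims=1L` cases.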
Don't remove the file yet to avoid merge conflict with #627