Add override for muladd and use LLVM intrinsic for fma #3078
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##           master    #3078    +/- ##
==========================================
- Coverage   90.42%   90.41%   -0.01%
==========================================
  Files         141      141
  Lines       11993    11993
==========================================
- Hits        10845    10844       -1
- Misses       1148     1149       +1
CUDA.jl Benchmarks
| Benchmark suite | Current: 95df3a5 | Previous: a79b516 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 101647 ns | 101309 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 76663.5 ns | 76747 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1592899.5 ns | 1585609 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 143993 ns | 143412 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 659749 ns | 657151 ns | 1.00 |
| array/accumulate/Int64/1d | 118555 ns | 118450 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79954 ns | 79685 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1703598 ns | 1694399 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 156789.5 ns | 155494.5 ns | 1.01 |
| array/accumulate/Int64/dims=2L | 961629 ns | 961001 ns | 1.00 |
| array/broadcast | 20765 ns | 20538 ns | 1.01 |
| array/construct | 1358.6 ns | 1298.9 ns | 1.05 |
| array/copy | 18777 ns | 18512 ns | 1.01 |
| array/copyto!/cpu_to_gpu | 213167.5 ns | 213295 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 282266 ns | 284330.5 ns | 0.99 |
| array/copyto!/gpu_to_gpu | 11449 ns | 11273 ns | 1.02 |
| array/iteration/findall/bool | 131889.5 ns | 132165 ns | 1.00 |
| array/iteration/findall/int | 149381 ns | 148572 ns | 1.01 |
| array/iteration/findfirst/bool | 81452 ns | 81324.5 ns | 1.00 |
| array/iteration/findfirst/int | 83235 ns | 83910 ns | 0.99 |
| array/iteration/findmin/1d | 88759 ns | 88268.5 ns | 1.01 |
| array/iteration/findmin/2d | 117181 ns | 116719 ns | 1.00 |
| array/iteration/logical | 200257 ns | 201488.5 ns | 0.99 |
| array/iteration/scalar | 68170.5 ns | 67192 ns | 1.01 |
| array/permutedims/2d | 52622.5 ns | 52378 ns | 1.00 |
| array/permutedims/3d | 52880 ns | 52726 ns | 1.00 |
| array/permutedims/4d | 52451 ns | 51596 ns | 1.02 |
| array/random/rand/Float32 | 13113 ns | 13097 ns | 1.00 |
| array/random/rand/Int64 | 37221 ns | 37319 ns | 1.00 |
| array/random/rand!/Float32 | 8508.666666666666 ns | 8581.666666666666 ns | 0.99 |
| array/random/rand!/Int64 | 34240 ns | 34312 ns | 1.00 |
| array/random/randn/Float32 | 38314.5 ns | 38478.5 ns | 1.00 |
| array/random/randn!/Float32 | 31457 ns | 31422.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 35432 ns | 34936 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1 | 43421 ns | 49501 ns | 0.88 |
| array/reductions/mapreduce/Float32/dims=1L | 51920 ns | 51907 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 56555 ns | 56747.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 69808.5 ns | 69513 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 43206 ns | 43154 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1 | 45255 ns | 43838 ns | 1.03 |
| array/reductions/mapreduce/Int64/dims=1L | 87626 ns | 87668 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 59583 ns | 59424 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 85161.5 ns | 84576 ns | 1.01 |
| array/reductions/reduce/Float32/1d | 35338 ns | 34859 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1 | 39558 ns | 39947.5 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1L | 51799 ns | 51723 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 56709 ns | 56768 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 70238.5 ns | 69769.5 ns | 1.01 |
| array/reductions/reduce/Int64/1d | 42619 ns | 42778 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1 | 42056 ns | 44289 ns | 0.95 |
| array/reductions/reduce/Int64/dims=1L | 87648 ns | 87701 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 59565 ns | 59510 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84977 ns | 84815 ns | 1.00 |
| array/reverse/1d | 18252 ns | 18338 ns | 1.00 |
| array/reverse/1dL | 68749 ns | 68805 ns | 1.00 |
| array/reverse/1dL_inplace | 65972 ns | 65983 ns | 1.00 |
| array/reverse/1d_inplace | 10348.333333333334 ns | 8621.333333333334 ns | 1.20 |
| array/reverse/2d | 20477 ns | 20615 ns | 0.99 |
| array/reverse/2dL | 72544 ns | 72573 ns | 1.00 |
| array/reverse/2dL_inplace | 66010.5 ns | 66098 ns | 1.00 |
| array/reverse/2d_inplace | 10283 ns | 10260 ns | 1.00 |
| array/sorting/1d | 2743775 ns | 2735030 ns | 1.00 |
| array/sorting/2d | 1069335 ns | 1071674 ns | 1.00 |
| array/sorting/by | 3305691 ns | 3313782 ns | 1.00 |
| cuda/synchronization/context/auto | 1191.4 ns | 1186.2 ns | 1.00 |
| cuda/synchronization/context/blocking | 936.9642857142857 ns | 924.0487804878048 ns | 1.01 |
| cuda/synchronization/context/nonblocking | 7326.6 ns | 7835.8 ns | 0.94 |
| cuda/synchronization/stream/auto | 1059 ns | 1041.2 ns | 1.02 |
| cuda/synchronization/stream/blocking | 805.51 ns | 835.7402597402597 ns | 0.96 |
| cuda/synchronization/stream/nonblocking | 7708.299999999999 ns | 7438.2 ns | 1.04 |
| integration/byval/reference | 143965 ns | 144123 ns | 1.00 |
| integration/byval/slices=1 | 145987 ns | 146064 ns | 1.00 |
| integration/byval/slices=2 | 284683 ns | 284754 ns | 1.00 |
| integration/byval/slices=3 | 423085 ns | 423302 ns | 1.00 |
| integration/cudadevrt | 102557 ns | 102654 ns | 1.00 |
| integration/volumerhs | 9442248.5 ns | 9450427 ns | 1.00 |
| kernel/indexing | 13392 ns | 13382 ns | 1.00 |
| kernel/indexing_checked | 14086 ns | 14092 ns | 1.00 |
| kernel/launch | 2201.3333333333335 ns | 2292.8888888888887 ns | 0.96 |
| kernel/occupancy | 659.3625 ns | 675.4013157894736 ns | 0.98 |
| kernel/rand | 17118 ns | 17995 ns | 0.95 |
| latency/import | 3825771360.5 ns | 3823445090 ns | 1.00 |
| latency/precompile | 4583776029 ns | 4598939035 ns | 1.00 |
| latency/ttfp | 4395271001 ns | 4399692793 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Since Julia 0.7 (JuliaLang/julia#22262) we are emitting `muladd(a,b,c)` not as `llvm.fmuladd`, but rather as a sequence of `fmul contract` and `fadd contract` instructions. The reason for that is vectorization of a potential reduction (which is something we ought to investigate in Base, if it is still worthwhile).
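The current lowering can be inspected directly on the CPU. A minimal sketch (the function name `mul3` is just for illustration, and the exact IR varies by Julia/LLVM version):

```julia
using InteractiveUtils  # provides code_llvm

# Inspect the LLVM IR Julia emits for muladd on the CPU. Since
# JuliaLang/julia#22262 this shows separate `fmul contract` and
# `fadd contract` instructions rather than a single `llvm.fmuladd` call.
mul3(a, b, c) = muladd(a, b, c)

ir = sprint(code_llvm, mul3, (Float64, Float64, Float64))
print(ir)
```

The `contract` fast-math flag tells LLVM it *may* fuse the pair into an fma, but it is not required to, and code motion can separate the two instructions.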
@efaulhaber had an example where LLVM helpfully performed some code motion, leading to a torn `contract` pair. Manually using an `fma` there improved performance from 2.908 to 2.765, so ~5% faster. I believe that the motivation in Base is not valid on GPUs, since we benefit much more from the emission of fma than from reduction vectorization.
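For background on why `fma` and a torn multiply-add are not interchangeable: `fma` computes `a*b + c` with a single rounding, while separate instructions round the product first. A small self-contained example of the difference:

```julia
# fma(a, b, c) rounds once; a*b + c rounds the product before the add
# and can lose low-order bits of the exact product.
a = 2.0^27 + 1          # 134217729.0, exactly representable in Float64
b = 2.0^27 + 1
c = -2.0^54

fused = fma(a, b, c)    # exact: (2^27 + 1)^2 - 2^54 = 2^28 + 1
naive = a * b + c       # product rounded to Float64 first; the +1 is lost
```

Here `fused` is `2^28 + 1` while `naive` is `2^28`, so the results differ in the last bit.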
@maleadt, do you recall why we are using `__nv_fma`? LLVM should be able to perform better optimization over the `llvm.fma` intrinsic.
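For illustration, a hedged sketch of what calling the `llvm.fma` intrinsic directly looks like via `Base.llvmcall` (the helper name `llvm_fma` is made up here; this is not CUDA.jl's actual override, just a demonstration that the intrinsic is callable and stays visible to LLVM's optimizer, unlike an opaque `__nv_fma` libdevice call):

```julia
# Sketch: invoke llvm.fma.f64 directly through llvmcall. The tuple form
# takes (declarations, body); %0/%1/%2 are the Julia arguments.
llvm_fma(x::Float64, y::Float64, z::Float64) = Base.llvmcall(
    ("declare double @llvm.fma.f64(double, double, double)",
     """
     %r = call double @llvm.fma.f64(double %0, double %1, double %2)
     ret double %r
     """),
    Float64, Tuple{Float64,Float64,Float64}, x, y, z)

llvm_fma(2.0, 3.0, 1.0)   # same result as fma(2.0, 3.0, 1.0)
```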