
Add override for muladd and use LLVM intrinsic for fma #3078

Open
vchuravy wants to merge 1 commit into master from vc/faster_muladd

Conversation

@vchuravy vchuravy (Member) commented Apr 2, 2026

Since Julia 0.7 (JuliaLang/julia#22262) we emit muladd(a,b,c) not as llvm.fmuladd, but as a sequence of:

%t = fmul contract float %a, %b
%r = fadd contract float %t, %c

The reason for that is to allow vectorization of a potential reduction
(something we ought to revisit in Base to check whether it is still worthwhile).
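To make the distinction concrete: the contract flag merely permits fusion, while fma guarantees a single rounding of a*b + c, so the two can produce different results. A small Julia illustration (values chosen here so that the intermediate product rounds away the interesting bits):

```julia
# fma computes a*b + c with a single rounding, so it can differ from
# the separately rounded fmul/fadd pair.
a = 1.0f0 + Float32(2.0^-13)
b = 1.0f0 - Float32(2.0^-13)
c = -1.0f0

a * b + c     # 0.0f0: the exact product 1 - 2^-26 rounds to 1.0f0
fma(a, b, c)  # -1.4901161f-8 (= -2^-26): the exact answer
```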

@efaulhaber had an example where

h = ...
for ...
  distance ...
  muladd(epsilon, h^2, distance^2)

and LLVM helpfully performed some code motion:

h = ...
t = mul(epsilon, h^2)
for ...
   add(t, distance^2)

This leads to a torn contract pair:

%69 = fmul contract float %"f::#parallel_foreach##10#parallel_foreach##11.fca.0.4.5.2.extract", %68
br label %L619, !dbg !197
;...
 %141 = fadd contract float %69, %140, !dbg !459

Manually using an fma improves performance from 2.908 to 2.765, i.e. ~5% faster.

I believe the motivation in Base does not hold on GPUs, since we benefit much more from emitting fma than from reduction vectorization.
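A minimal sketch of the kind of override proposed here, assuming Base's fma lowers to the llvm.fma intrinsic (the name fast_muladd and the plain-Julia form are illustrative only; device code would hook this in via the usual override machinery rather than a new function):

```julia
# Illustrative sketch: forward muladd to fma so the multiply and the
# add form a single instruction that code motion cannot tear apart.
fast_muladd(a::T, b::T, c::T) where {T<:AbstractFloat} = fma(a, b, c)

fast_muladd(2.0f0, 3.0f0, 4.0f0)  # 10.0f0
```

The trade-off is exactly the one discussed above: fma forces fusion (and a single rounding), whereas the contract pair leaves LLVM free to split the operations for reduction vectorization.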

@maleadt do you recall why we are using __nv_fma? LLVM should be able to perform better optimizations over the llvm.fma intrinsic.

@codecov codecov bot commented Apr 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.41%. Comparing base (a79b516) to head (95df3a5).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3078      +/-   ##
==========================================
- Coverage   90.42%   90.41%   -0.01%     
==========================================
  Files         141      141              
  Lines       11993    11993              
==========================================
- Hits        10845    10844       -1     
- Misses       1148     1149       +1     


@github-actions github-actions bot left a comment

CUDA.jl Benchmarks

Details
Benchmark suite Current: 95df3a5 Previous: a79b516 Ratio
array/accumulate/Float32/1d 101647 ns 101309 ns 1.00
array/accumulate/Float32/dims=1 76663.5 ns 76747 ns 1.00
array/accumulate/Float32/dims=1L 1592899.5 ns 1585609 ns 1.00
array/accumulate/Float32/dims=2 143993 ns 143412 ns 1.00
array/accumulate/Float32/dims=2L 659749 ns 657151 ns 1.00
array/accumulate/Int64/1d 118555 ns 118450 ns 1.00
array/accumulate/Int64/dims=1 79954 ns 79685 ns 1.00
array/accumulate/Int64/dims=1L 1703598 ns 1694399 ns 1.01
array/accumulate/Int64/dims=2 156789.5 ns 155494.5 ns 1.01
array/accumulate/Int64/dims=2L 961629 ns 961001 ns 1.00
array/broadcast 20765 ns 20538 ns 1.01
array/construct 1358.6 ns 1298.9 ns 1.05
array/copy 18777 ns 18512 ns 1.01
array/copyto!/cpu_to_gpu 213167.5 ns 213295 ns 1.00
array/copyto!/gpu_to_cpu 282266 ns 284330.5 ns 0.99
array/copyto!/gpu_to_gpu 11449 ns 11273 ns 1.02
array/iteration/findall/bool 131889.5 ns 132165 ns 1.00
array/iteration/findall/int 149381 ns 148572 ns 1.01
array/iteration/findfirst/bool 81452 ns 81324.5 ns 1.00
array/iteration/findfirst/int 83235 ns 83910 ns 0.99
array/iteration/findmin/1d 88759 ns 88268.5 ns 1.01
array/iteration/findmin/2d 117181 ns 116719 ns 1.00
array/iteration/logical 200257 ns 201488.5 ns 0.99
array/iteration/scalar 68170.5 ns 67192 ns 1.01
array/permutedims/2d 52622.5 ns 52378 ns 1.00
array/permutedims/3d 52880 ns 52726 ns 1.00
array/permutedims/4d 52451 ns 51596 ns 1.02
array/random/rand/Float32 13113 ns 13097 ns 1.00
array/random/rand/Int64 37221 ns 37319 ns 1.00
array/random/rand!/Float32 8508.666666666666 ns 8581.666666666666 ns 0.99
array/random/rand!/Int64 34240 ns 34312 ns 1.00
array/random/randn/Float32 38314.5 ns 38478.5 ns 1.00
array/random/randn!/Float32 31457 ns 31422.5 ns 1.00
array/reductions/mapreduce/Float32/1d 35432 ns 34936 ns 1.01
array/reductions/mapreduce/Float32/dims=1 43421 ns 49501 ns 0.88
array/reductions/mapreduce/Float32/dims=1L 51920 ns 51907 ns 1.00
array/reductions/mapreduce/Float32/dims=2 56555 ns 56747.5 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 69808.5 ns 69513 ns 1.00
array/reductions/mapreduce/Int64/1d 43206 ns 43154 ns 1.00
array/reductions/mapreduce/Int64/dims=1 45255 ns 43838 ns 1.03
array/reductions/mapreduce/Int64/dims=1L 87626 ns 87668 ns 1.00
array/reductions/mapreduce/Int64/dims=2 59583 ns 59424 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 85161.5 ns 84576 ns 1.01
array/reductions/reduce/Float32/1d 35338 ns 34859 ns 1.01
array/reductions/reduce/Float32/dims=1 39558 ns 39947.5 ns 0.99
array/reductions/reduce/Float32/dims=1L 51799 ns 51723 ns 1.00
array/reductions/reduce/Float32/dims=2 56709 ns 56768 ns 1.00
array/reductions/reduce/Float32/dims=2L 70238.5 ns 69769.5 ns 1.01
array/reductions/reduce/Int64/1d 42619 ns 42778 ns 1.00
array/reductions/reduce/Int64/dims=1 42056 ns 44289 ns 0.95
array/reductions/reduce/Int64/dims=1L 87648 ns 87701 ns 1.00
array/reductions/reduce/Int64/dims=2 59565 ns 59510 ns 1.00
array/reductions/reduce/Int64/dims=2L 84977 ns 84815 ns 1.00
array/reverse/1d 18252 ns 18338 ns 1.00
array/reverse/1dL 68749 ns 68805 ns 1.00
array/reverse/1dL_inplace 65972 ns 65983 ns 1.00
array/reverse/1d_inplace 10348.333333333334 ns 8621.333333333334 ns 1.20
array/reverse/2d 20477 ns 20615 ns 0.99
array/reverse/2dL 72544 ns 72573 ns 1.00
array/reverse/2dL_inplace 66010.5 ns 66098 ns 1.00
array/reverse/2d_inplace 10283 ns 10260 ns 1.00
array/sorting/1d 2743775 ns 2735030 ns 1.00
array/sorting/2d 1069335 ns 1071674 ns 1.00
array/sorting/by 3305691 ns 3313782 ns 1.00
cuda/synchronization/context/auto 1191.4 ns 1186.2 ns 1.00
cuda/synchronization/context/blocking 936.9642857142857 ns 924.0487804878048 ns 1.01
cuda/synchronization/context/nonblocking 7326.6 ns 7835.8 ns 0.94
cuda/synchronization/stream/auto 1059 ns 1041.2 ns 1.02
cuda/synchronization/stream/blocking 805.51 ns 835.7402597402597 ns 0.96
cuda/synchronization/stream/nonblocking 7708.299999999999 ns 7438.2 ns 1.04
integration/byval/reference 143965 ns 144123 ns 1.00
integration/byval/slices=1 145987 ns 146064 ns 1.00
integration/byval/slices=2 284683 ns 284754 ns 1.00
integration/byval/slices=3 423085 ns 423302 ns 1.00
integration/cudadevrt 102557 ns 102654 ns 1.00
integration/volumerhs 9442248.5 ns 9450427 ns 1.00
kernel/indexing 13392 ns 13382 ns 1.00
kernel/indexing_checked 14086 ns 14092 ns 1.00
kernel/launch 2201.3333333333335 ns 2292.8888888888887 ns 0.96
kernel/occupancy 659.3625 ns 675.4013157894736 ns 0.98
kernel/rand 17118 ns 17995 ns 0.95
latency/import 3825771360.5 ns 3823445090 ns 1.00
latency/precompile 4583776029 ns 4598939035 ns 1.00
latency/ttfp 4395271001 ns 4399692793 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

