
Add override for muladd and use LLVM intrinsic for fma #3078

Open
vchuravy wants to merge 1 commit into master from vc/faster_muladd

Conversation

@vchuravy vchuravy (Member) commented Apr 2, 2026

Since Julia 0.7 (JuliaLang/julia#22262) we emit muladd(a,b,c) not as llvm.fmuladd, but as a sequence of:

%t = fmul contract float %a, %b
%r = fadd contract float %t, %c

The reason for that is to allow vectorization of a potential reduction
(something we ought to revisit in Base to check whether it is still worthwhile).
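To make the distinction concrete: the contract flag merely permits fusion, while fma guarantees a single rounding of a*b + c, so the two can produce different results. A small Julia illustration (values chosen here so that the intermediate product rounds away the interesting bits):

```julia
# fma computes a*b + c with a single rounding, so it can differ from
# the separately rounded fmul/fadd pair.
a = 1.0f0 + Float32(2.0^-13)
b = 1.0f0 - Float32(2.0^-13)
c = -1.0f0

a * b + c     # 0.0f0: the exact product 1 - 2^-26 rounds to 1.0f0
fma(a, b, c)  # -1.4901161f-8 (= -2^-26): the exact answer
```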

@efaulhaber had an example where

h = ...
for ...
  distance ...
  muladd(epsilon, h^2, distance^2)

and LLVM helpfully performed some code motion:

h = ...
t = mul(epsilon, h^2)
for ...
   add(t, distance^2)

This leads to a torn contract pair:

%69 = fmul contract float %"f::#parallel_foreach##10#parallel_foreach##11.fca.0.4.5.2.extract", %68
br label %L619, !dbg !197
;...
 %141 = fadd contract float %69, %140, !dbg !459

Manually using an fma improves performance from 2.908 to 2.765, i.e. ~5% faster.

I believe the motivation in Base does not hold on GPUs, since we benefit much more from emitting fma than from reduction vectorization.
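A minimal sketch of the kind of override proposed here, assuming Base's fma lowers to the llvm.fma intrinsic (the name fast_muladd and the plain-Julia form are illustrative only; device code would hook this in via the usual override machinery rather than a new function):

```julia
# Illustrative sketch: forward muladd to fma so the multiply and the
# add form a single instruction that code motion cannot tear apart.
fast_muladd(a::T, b::T, c::T) where {T<:AbstractFloat} = fma(a, b, c)

fast_muladd(2.0f0, 3.0f0, 4.0f0)  # 10.0f0
```

The trade-off is exactly the one discussed above: fma forces fusion (and a single rounding), whereas the contract pair leaves LLVM free to split the operations for reduction vectorization.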

@maleadt do you recall why we are using __nv_fma? LLVM should be able to perform better optimizations over the llvm.fma intrinsic.

@codecov codecov bot commented Apr 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.41%. Comparing base (a79b516) to head (95df3a5).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3078      +/-   ##
==========================================
- Coverage   90.42%   90.41%   -0.01%     
==========================================
  Files         141      141              
  Lines       11993    11993              
==========================================
- Hits        10845    10844       -1     
- Misses       1148     1149       +1     


@github-actions github-actions bot left a comment

CUDA.jl Benchmarks

Details
Benchmark suite Current: 95df3a5 Previous: a79b516 Ratio
array/accumulate/Float32/1d 101647 ns 101309 ns 1.00
array/accumulate/Float32/dims=1 76663.5 ns 76747 ns 1.00
array/accumulate/Float32/dims=1L 1592899.5 ns 1585609 ns 1.00
array/accumulate/Float32/dims=2 143993 ns 143412 ns 1.00
array/accumulate/Float32/dims=2L 659749 ns 657151 ns 1.00
array/accumulate/Int64/1d 118555 ns 118450 ns 1.00
array/accumulate/Int64/dims=1 79954 ns 79685 ns 1.00
array/accumulate/Int64/dims=1L 1703598 ns 1694399 ns 1.01
array/accumulate/Int64/dims=2 156789.5 ns 155494.5 ns 1.01
array/accumulate/Int64/dims=2L 961629 ns 961001 ns 1.00
array/broadcast 20765 ns 20538 ns 1.01
array/construct 1358.6 ns 1298.9 ns 1.05
array/copy 18777 ns 18512 ns 1.01
array/copyto!/cpu_to_gpu 213167.5 ns 213295 ns 1.00
array/copyto!/gpu_to_cpu 282266 ns 284330.5 ns 0.99
array/copyto!/gpu_to_gpu 11449 ns 11273 ns 1.02
array/iteration/findall/bool 131889.5 ns 132165 ns 1.00
array/iteration/findall/int 149381 ns 148572 ns 1.01
array/iteration/findfirst/bool 81452 ns 81324.5 ns 1.00
array/iteration/findfirst/int 83235 ns 83910 ns 0.99
array/iteration/findmin/1d 88759 ns 88268.5 ns 1.01
array/iteration/findmin/2d 117181 ns 116719 ns 1.00
array/iteration/logical 200257 ns 201488.5 ns 0.99
array/iteration/scalar 68170.5 ns 67192 ns 1.01
array/permutedims/2d 52622.5 ns 52378 ns 1.00
array/permutedims/3d 52880 ns 52726 ns 1.00
array/permutedims/4d 52451 ns 51596 ns 1.02
array/random/rand/Float32 13113 ns 13097 ns 1.00
array/random/rand/Int64 37221 ns 37319 ns 1.00
array/random/rand!/Float32 8508.666666666666 ns 8581.666666666666 ns 0.99
array/random/rand!/Int64 34240 ns 34312 ns 1.00
array/random/randn/Float32 38314.5 ns 38478.5 ns 1.00
array/random/randn!/Float32 31457 ns 31422.5 ns 1.00
array/reductions/mapreduce/Float32/1d 35432 ns 34936 ns 1.01
array/reductions/mapreduce/Float32/dims=1 43421 ns 49501 ns 0.88
array/reductions/mapreduce/Float32/dims=1L 51920 ns 51907 ns 1.00
array/reductions/mapreduce/Float32/dims=2 56555 ns 56747.5 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 69808.5 ns 69513 ns 1.00
array/reductions/mapreduce/Int64/1d 43206 ns 43154 ns 1.00
array/reductions/mapreduce/Int64/dims=1 45255 ns 43838 ns 1.03
array/reductions/mapreduce/Int64/dims=1L 87626 ns 87668 ns 1.00
array/reductions/mapreduce/Int64/dims=2 59583 ns 59424 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 85161.5 ns 84576 ns 1.01
array/reductions/reduce/Float32/1d 35338 ns 34859 ns 1.01
array/reductions/reduce/Float32/dims=1 39558 ns 39947.5 ns 0.99
array/reductions/reduce/Float32/dims=1L 51799 ns 51723 ns 1.00
array/reductions/reduce/Float32/dims=2 56709 ns 56768 ns 1.00
array/reductions/reduce/Float32/dims=2L 70238.5 ns 69769.5 ns 1.01
array/reductions/reduce/Int64/1d 42619 ns 42778 ns 1.00
array/reductions/reduce/Int64/dims=1 42056 ns 44289 ns 0.95
array/reductions/reduce/Int64/dims=1L 87648 ns 87701 ns 1.00
array/reductions/reduce/Int64/dims=2 59565 ns 59510 ns 1.00
array/reductions/reduce/Int64/dims=2L 84977 ns 84815 ns 1.00
array/reverse/1d 18252 ns 18338 ns 1.00
array/reverse/1dL 68749 ns 68805 ns 1.00
array/reverse/1dL_inplace 65972 ns 65983 ns 1.00
array/reverse/1d_inplace 10348.333333333334 ns 8621.333333333334 ns 1.20
array/reverse/2d 20477 ns 20615 ns 0.99
array/reverse/2dL 72544 ns 72573 ns 1.00
array/reverse/2dL_inplace 66010.5 ns 66098 ns 1.00
array/reverse/2d_inplace 10283 ns 10260 ns 1.00
array/sorting/1d 2743775 ns 2735030 ns 1.00
array/sorting/2d 1069335 ns 1071674 ns 1.00
array/sorting/by 3305691 ns 3313782 ns 1.00
cuda/synchronization/context/auto 1191.4 ns 1186.2 ns 1.00
cuda/synchronization/context/blocking 936.9642857142857 ns 924.0487804878048 ns 1.01
cuda/synchronization/context/nonblocking 7326.6 ns 7835.8 ns 0.94
cuda/synchronization/stream/auto 1059 ns 1041.2 ns 1.02
cuda/synchronization/stream/blocking 805.51 ns 835.7402597402597 ns 0.96
cuda/synchronization/stream/nonblocking 7708.299999999999 ns 7438.2 ns 1.04
integration/byval/reference 143965 ns 144123 ns 1.00
integration/byval/slices=1 145987 ns 146064 ns 1.00
integration/byval/slices=2 284683 ns 284754 ns 1.00
integration/byval/slices=3 423085 ns 423302 ns 1.00
integration/cudadevrt 102557 ns 102654 ns 1.00
integration/volumerhs 9442248.5 ns 9450427 ns 1.00
kernel/indexing 13392 ns 13382 ns 1.00
kernel/indexing_checked 14086 ns 14092 ns 1.00
kernel/launch 2201.3333333333335 ns 2292.8888888888887 ns 0.96
kernel/occupancy 659.3625 ns 675.4013157894736 ns 0.98
kernel/rand 17118 ns 17995 ns 0.95
latency/import 3825771360.5 ns 3823445090 ns 1.00
latency/precompile 4583776029 ns 4598939035 ns 1.00
latency/ttfp 4395271001 ns 4399692793 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

