
Performance issue when doing scalar add for non-traced array #2074

@ptiede

Description


I am seeing a massive performance hit when performing a reduction that combines a traced scalar with a rather large non-traced array. Interestingly, if I split the mapreduce into a separate map followed by a reduction, performance rebounds.

MWE

using Reactant
using BenchmarkTools

# Fused mapreduce: sum with a do-block mapping log(a + Ax) over A
function testmapreduce(a, A)
    return sum(A) do Ax
        log(a + Ax)
    end
end

# Split version: broadcast the map first, then reduce with sum
function testmap_reduce(a, A)
    return sum(log.(a .+ A))
end

ρr = ConcreteRNumber(2.0)
x = rand(16, 16)

f1 = @compile sync=true testmapreduce(ρr, x)
f2 = @compile sync=true testmap_reduce(ρr, x)



@benchmark f1($ρr, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  175.838 μs …   6.051 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     185.833 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   198.689 μs ± 87.604 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄██▇▆▅▄▃▂▁                           ▁▁    ▁                 ▂
  ████████████▇▆▄▃▅▅▃▄▄▁▁▁▁▃▄▃▄▃▃▁▃▄▅▅▇███████▇██▇▇▇▆▇▇▆▆▅▆▅▆▅ █
  176 μs        Histogram: log(frequency) by time       379 μs <

 Memory estimate: 704 bytes, allocs estimate: 16.


@benchmark f2($ρr, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  29.210 μs … 11.455 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     42.670 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.200 μs ± 154.119 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▇█▅▄▁                                                      ▁
  ▇▇█████▇▅▅▅▄▃▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▆▁▃▃▃▃▅▆ █
  29.2 μs       Histogram: log(frequency) by time       259 μs <

 Memory estimate: 704 bytes, allocs estimate: 16.

This gets progressively worse as the array x grows.
Looking at @code_hlo, testmapreduce appears to be unrolled into a giant function, with the array x split into a bunch of 1-element tensors. I don't understand Reactant's internals, so I have no idea why it chose to split the array.
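For reference, this is how I inspected the generated IR (a minimal sketch; the comments describe what I observe in the output, not the literal IR text):

# Compare the lowered HLO of the two versions
@code_hlo testmapreduce(ρr, x)   # unrolled: many scalar ops over 1-element tensors
@code_hlo testmap_reduce(ρr, x)  # stays a single broadcast followed by one reduction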

Compute environment

Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 7950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_VSCODE_REPL = 1
  JULIA_NUM_THREADS = 1

Project.toml

  [6e4b80f9] BenchmarkTools v1.6.3
  [3c362404] Reactant v0.2.193
