Closed
Description
I am seeing a large performance hit when performing a reduction that combines a traced scalar with a fairly large non-traced array. Interestingly, if I split the mapreduce into a separate map followed by a reduction, performance recovers.
MWE
using Reactant
using BenchmarkTools
function testmapreduce(a, A)
    return sum(A) do Ax
        log(a + Ax)
    end
end
function testmap_reduce(a, A)
    return sum(log.(a .+ A))
end
ρr = ConcreteRNumber(2.0)
x = rand(16, 16)
f1 = @compile sync=true testmapreduce(ρr, x)
f2 = @compile sync=true testmap_reduce(ρr, x)
@benchmark f1($ρr, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 175.838 μs … 6.051 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 185.833 μs ┊ GC (median): 0.00%
Time (mean ± σ): 198.689 μs ± 87.604 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄██▇▆▅▄▃▂▁ ▁▁ ▁ ▂
████████████▇▆▄▃▅▅▃▄▄▁▁▁▁▃▄▃▄▃▃▁▃▄▅▅▇███████▇██▇▇▇▆▇▇▆▆▅▆▅▆▅ █
176 μs Histogram: log(frequency) by time 379 μs <
Memory estimate: 704 bytes, allocs estimate: 16.
@benchmark f2($ρr, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 29.210 μs … 11.455 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 42.670 μs ┊ GC (median): 0.00%
Time (mean ± σ): 48.200 μs ± 154.119 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█▅▄▁ ▁
▇▇█████▇▅▅▅▄▃▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▆▁▃▃▃▃▅▆ █
29.2 μs Histogram: log(frequency) by time 259 μs <
Memory estimate: 704 bytes, allocs estimate: 16.
This seems to get worse and worse as the array x gets bigger.
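To put numbers on how the gap changes with array size, something like the following could be run. It is only a sketch: it reuses the two functions from the MWE (with the do-block written as an anonymous function), the sizes in the loop are arbitrary choices, and the timings will depend on the machine and the Reactant version.

```julia
using Reactant
using BenchmarkTools

# Same two formulations as in the MWE.
testmapreduce(a, A) = sum(Ax -> log(a + Ax), A)
testmap_reduce(a, A) = sum(log.(a .+ A))

ρr = ConcreteRNumber(2.0)

# Sizes are arbitrary and only meant to show the trend.
for n in (4, 8, 16, 32)
    x = rand(n, n)
    # Recompile for each size, since the array shape is baked into the compiled function.
    f1 = @compile sync=true testmapreduce(ρr, x)
    f2 = @compile sync=true testmap_reduce(ρr, x)
    # @belapsed returns the minimum elapsed time in seconds.
    t1 = @belapsed $f1($ρr, $x)
    t2 = @belapsed $f2($ρr, $x)
    println("n = $n: mapreduce ", round(t1 * 1e6; digits=1), " μs, ",
            "map then reduce ", round(t2 * 1e6; digits=1), " μs")
end
```

If the fused version really is being unrolled per element, its time should grow roughly with the number of elements, while the split version should stay comparatively flat.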
Looking at the output of @code_hlo, testmapreduce appears to be unrolled into one giant function, and the array x is split into a bunch of 1-element tensors. I don't understand Reactant's internals, so I have no idea why it chose to split the array.
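A rough way to check the unrolling without reading the IR by hand is to compare how much text @code_hlo produces for each formulation as the array grows. This is a sketch under assumptions: `hlo_lines` is a hypothetical helper, and it assumes the module returned by @code_hlo prints as MLIR text when passed to `string`. The line count is only a crude proxy for program size.

```julia
using Reactant

# Same two formulations as in the MWE.
testmapreduce(a, A) = sum(Ax -> log(a + Ax), A)
testmap_reduce(a, A) = sum(log.(a .+ A))

ρr = ConcreteRNumber(2.0)

# Hypothetical helper: count newlines in the printed IR.
# Assumes string(mod) yields the textual form of the module.
hlo_lines(mod) = count(==('\n'), string(mod))

# Sizes are arbitrary and kept small so the unrolled IR stays manageable.
for n in (2, 4, 8)
    x = rand(n, n)
    hlo1 = @code_hlo testmapreduce(ρr, x)
    hlo2 = @code_hlo testmap_reduce(ρr, x)
    println("n = $n: mapreduce IR lines = ", hlo_lines(hlo1),
            ", map then reduce IR lines = ", hlo_lines(hlo2))
end
```

A line count that grows with n^2 for testmapreduce but stays roughly constant for testmap_reduce would be consistent with the per-element unrolling described above.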
Compute environment
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 32 × AMD Ryzen 9 7950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)
Environment:
JULIA_EDITOR = code
JULIA_VSCODE_REPL = 1
JULIA_NUM_THREADS = 1
Project.toml
[6e4b80f9] BenchmarkTools v1.6.3
[3c362404] Reactant v0.2.193