Open
Description
I found that pairwise with SqEuclidean is faster than my own implementation on the CPU but much slower on the GPU. Any idea why, and is there a possible optimization on the Distances.jl side?
MWE:
using Test, BenchmarkTools, Distances, CuArrays
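function pairwise_dot_kai(x)
    d, n = size(x)              # columns of x are the n points
    xixj = x' * x               # n×n Gram matrix of inner products
    xsq = sum(x .^ 2; dims=1)   # 1×n row of squared column norms
    return repeat(xsq, n, 1) + repeat(xsq', 1, n) - 2xixj
end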
pairwise_dot(x) = pairwise(SqEuclidean(), x; dims=2)
xbench = randn(Float32, 784, 200);
@benchmark pairwise_dot_kai(xbench)
BenchmarkTools.Trial:
memory estimate: 1.52 MiB
allocs estimate: 17
--------------
minimum time: 854.227 μs (0.00% GC)
median time: 1.183 ms (0.00% GC)
mean time: 1.361 ms (12.37% GC)
maximum time: 125.259 ms (98.46% GC)
--------------
samples: 3662
evals/sample: 1
@benchmark pairwise_dot(xbench)
BenchmarkTools.Trial:
memory estimate: 166.59 KiB
allocs estimate: 204
--------------
minimum time: 359.751 μs (0.00% GC)
median time: 406.615 μs (0.00% GC)
mean time: 458.925 μs (6.46% GC)
maximum time: 104.066 ms (99.27% GC)
--------------
samples: 10000
evals/sample: 1
xbench = xbench |> cu;
@benchmark pairwise_dot_kai(xbench)
BenchmarkTools.Trial:
memory estimate: 1.20 MiB
allocs estimate: 19424
--------------
minimum time: 19.042 ms (0.00% GC)
median time: 20.028 ms (0.00% GC)
mean time: 21.811 ms (3.62% GC)
maximum time: 52.425 ms (38.23% GC)
--------------
samples: 230
evals/sample: 1
@benchmark pairwise_dot(xbench)
BenchmarkTools.Trial:
memory estimate: 10.99 MiB
allocs estimate: 240635
--------------
minimum time: 453.229 ms (0.00% GC)
median time: 470.074 ms (0.00% GC)
mean time: 474.353 ms (2.67% GC)
maximum time: 499.969 ms (6.04% GC)
--------------
samples: 11
evals/sample: 1
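For reference, the hand-written pairwise_dot_kai in the MWE relies on the usual expansion of the squared Euclidean distance. The symbols here are introduced only for illustration and do not appear in the code: x_i denotes the i-th column of the d×n input X, s is the vector of squared column norms, and 1 is the all-ones vector of length n.

```latex
\begin{aligned}
D_{ij} &= \lVert x_i - x_j \rVert^2
        = \lVert x_i \rVert^2 + \lVert x_j \rVert^2 - 2\, x_i^\top x_j, \\
D      &= \mathbf{1}\, s^\top + s\, \mathbf{1}^\top - 2\, X^\top X,
        \qquad s_i = \lVert x_i \rVert^2 .
\end{aligned}
```

In the code, xsq corresponds to s^T, the two repeat calls build the two rank-one terms, and xixj is X^T X, so the whole computation reduces to one matrix product plus elementwise operations.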