Latency #352

@willtebbutt

Description


With all of the exciting stuff happening with code caching in 1.9, I thought I'd take a look at our latency for some common tasks.

Consider the following code:

using Pkg
pkg"activate ."

# package loading
@time using AbstractGPs, KernelFunctions, Random, LinearAlgebra

# first evaluation
@time begin
    X = randn(5, 25)
    x = ColVecs(X)
    f = GP(SEKernel())
    fx = f(x, 0.1)
    y = rand(fx)
    logpdf(fx, y)

    f_post = posterior(fx, y)
    y_post = rand(f_post(x, 0.1))
end;

# second evaluation
@time begin
    X = randn(5, 25)
    x = ColVecs(X)
    f = GP(SEKernel())
    fx = f(x, 0.1)
    y = rand(fx)
    logpdf(fx, y)

    f_post = posterior(fx, y)
    y_post = rand(f_post(x, 0.1))
end;

It measures the package load time, and the first- and second-evaluation times, of some pretty standard AbstractGPs code.
On 1.9, I see the following results:

# package loading
 1.036674 seconds (1.77 M allocations: 115.412 MiB, 3.65% gc time, 17.33% compilation time)

# first evaluation
 1.934089 seconds (3.70 M allocations: 251.913 MiB, 6.21% gc time, 141.09% compilation time)

# second evaluation
 0.000115 seconds (61 allocations: 112.578 KiB)

Overall, this doesn't seem too bad.

However, we're not taking advantage of pre-compilation anywhere within the JuliaGPs ecosystem, so I wanted to know what would happen if we tried that. To this end, I added the following precompile statements in AbstractGPs:

kernels = [SEKernel(), Matern12Kernel(), Matern32Kernel(), Matern52Kernel()]
xs = [ColVecs(randn(2, 3)), RowVecs(randn(3, 2)), randn(3)]
for k in kernels, x in xs
    precompile(kernelmatrix, (typeof(k), typeof(x)))
    precompile(kernelmatrix, (typeof(k), typeof(x), typeof(x)))
end

for x in xs
    precompile(
        _posterior_computations,
        (typeof(Zeros(5)), Matrix{Float64}, typeof(x), Vector{Float64}),
    )
end

# Pre-compile various AbstractGPs-specific things.
precompile(Diagonal, (typeof(Fill(0.1, 10)), ))
precompile(_rand, (typeof(Random.GLOBAL_RNG), typeof(Zeros(5)), Matrix{Float64}, Int))
precompile(_rand, (Xoshiro, typeof(Zeros(5)), Matrix{Float64}, Int))
precompile(_logpdf, (typeof(Zeros(5)), Matrix{Float64}, Vector{Float64}))

I've tried to add only pre-compile statements for low-level code that doesn't get involved in combinations of things. For example, I don't think it makes sense to add a pre-compile statement for kernelmatrix for a sum of kernels because you'd have to compile a separate method instance for each collection of pairs of kernel types that you ever encountered, and I want to avoid a combinatorial explosion.
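To make the combinatorial concern concrete, here is a sketch of what exhaustively pre-compiling two-kernel sums would look like (the kernel and input lists are just the ones from the snippet above; this is an illustration of what I want to avoid, not a proposal):

```julia
# Hypothetical: covering every two-kernel sum over N base kernels and M input
# types requires N^2 * M precompile statements, since each (k1 + k2, x)
# combination is a distinct method instance of kernelmatrix.
base_kernels = [SEKernel(), Matern12Kernel(), Matern32Kernel(), Matern52Kernel()]
xs = [ColVecs(randn(2, 3)), RowVecs(randn(3, 2)), randn(3)]
for k1 in base_kernels, k2 in base_kernels, x in xs
    precompile(kernelmatrix, (typeof(k1 + k2), typeof(x)))
end
# 4 * 4 * 3 = 48 statements for sums alone, before products, scalings,
# or deeper nesting -- hence restricting precompilation to simple kernels.
```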

_logpdf, _rand, and _posterior_computations are bits of code I've pulled out of logpdf, rand, and posterior that are GP-independent, i.e. they depend only on matrix types etc. This feels fair, because they don't need to be re-compiled for every new kernel that's used, only when the output of kernelmatrix isn't a Matrix{Float64} or whatever.
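The factoring pattern looks roughly like this (a sketch with illustrative names suffixed `_sketch`, not the actual AbstractGPs internals; `mean_and_cov` is part of the AbstractGPs API):

```julia
using LinearAlgebra

# The outer method dispatches on the GP / kernel types; once the mean and
# covariance are materialised as plain arrays, the rest is kernel-independent.
function logpdf_sketch(fx, y::AbstractVector{<:Real})
    m, C = mean_and_cov(fx)                          # kernel-specific part
    return _logpdf_sketch(m, cholesky(Symmetric(C)), y)
end

# Only ever compiled per concrete array type, so a single precompile
# statement covers every kernel whose kernelmatrix returns Matrix{Float64}.
function _logpdf_sketch(m::AbstractVector, C::Cholesky, y::AbstractVector)
    z = C.U' \ (y - m)
    return -(length(y) * log(2π) + logdet(C) + dot(z, z)) / 2
end
```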

Anyway, the results are:

# code loading
 1.083321 seconds (1.89 M allocations: 123.372 MiB, 3.33% gc time, 16.66% compilation time)

# first execution
0.466714 seconds (957.73 k allocations: 65.151 MiB, 3.31% gc time, 137.49% compilation time)

# second execution
 0.000111 seconds (61 allocations: 112.578 KiB)

So it looks like by pre-compiling, we can get a really substantial 4x reduction in time-to-first-inference, or whatever we're calling it.

If you use the slightly more complicated kernel

0.1 * SEKernel() + 0.5 * Matern32Kernel()

you see (without pre-compilation):

# first evaluation
2.236816 seconds (4.05 M allocations: 275.188 MiB, 5.90% gc time, 139.38% compilation time)

# second evaluation
0.000161 seconds (100 allocations: 190.422 KiB)

With pre-compilation you see something like:

# first execution
0.553095 seconds (1.06 M allocations: 72.149 MiB, 5.66% gc time, 138.60% compilation time)

# second execution
0.000157 seconds (95 allocations: 241.000 KiB)

So here we see a similar performance boost: we've pre-compiled all of the code to compute the kernelmatrices for the SEKernel and the Matern32Kernel, so only the code for kernelmatrix of their sum needs to be compiled on the fly.

It does look like there's a small penalty paid in load time, but I think it would typically be substantially outweighed by the compilation savings.

I wonder whether there's a case for adding a for-loop to KernelFunctions that pre-compiles the kernelmatrix, kernelmatrix_diag, etc. methods for each "simple" kernel (by "simple" I basically just mean anything that's not a composite kernel), and adding the kinds of methods I've discussed above to AbstractGPs. It might make the user experience substantially more pleasant 🤷 . I for one would love to have these savings.
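The proposed loop in KernelFunctions might look something like this (a sketch: `simple_kernels` is a hypothetical list that KernelFunctions doesn't currently export, and the set of input types is just the one used above):

```julia
# Hypothetical top-level precompilation loop for KernelFunctions: cover the
# core kernelmatrix methods for every simple (non-composite) kernel against
# the common input representations.
simple_kernels = [
    SEKernel(), Matern12Kernel(), Matern32Kernel(), Matern52Kernel(),
]
inputs = [ColVecs(randn(2, 3)), RowVecs(randn(3, 2)), randn(3)]
for k in simple_kernels, x in inputs
    precompile(kernelmatrix, (typeof(k), typeof(x)))
    precompile(kernelmatrix, (typeof(k), typeof(x), typeof(x)))
    precompile(kernelmatrix_diag, (typeof(k), typeof(x)))
end
```

Because the loop only touches simple kernels, it grows linearly in the number of kernels rather than combinatorially.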
