Latency #352

@willtebbutt

Description


With all of the exciting stuff happening with code caching in 1.9, I thought I'd take a look at our latency for some common tasks.

Consider the following code:

using Pkg
pkg"activate ."

# package loading
@time using AbstractGPs, KernelFunctions, Random, LinearAlgebra

# first evaluation
@time begin
    X = randn(5, 25)
    x = ColVecs(X)
    f = GP(SEKernel())
    fx = f(x, 0.1)
    y = rand(fx)
    logpdf(fx, y)

    f_post = posterior(fx, y)
    y_post = rand(f_post(x, 0.1))
end;

# second evaluation
@time begin
    X = randn(5, 25)
    x = ColVecs(X)
    f = GP(SEKernel())
    fx = f(x, 0.1)
    y = rand(fx)
    logpdf(fx, y)

    f_post = posterior(fx, y)
    y_post = rand(f_post(x, 0.1))
end;

It measures the package load time, and the first- and second-evaluation times, of some pretty standard AbstractGPs code.
On 1.9, I see the following results:

# package loading
 1.036674 seconds (1.77 M allocations: 115.412 MiB, 3.65% gc time, 17.33% compilation time)

# first evaluation
 1.934089 seconds (3.70 M allocations: 251.913 MiB, 6.21% gc time, 141.09% compilation time)

# second evaluation
 0.000115 seconds (61 allocations: 112.578 KiB)

Overall, this doesn't seem too bad.

However, we're not taking advantage of pre-compilation anywhere within the JuliaGPs ecosystem, so I wanted to know what would happen if we tried that. To this end, I added the following precompile statements in AbstractGPs:

kernels = [SEKernel(), Matern12Kernel(), Matern32Kernel(), Matern52Kernel()]
xs = [ColVecs(randn(2, 3)), RowVecs(randn(3, 2)), randn(3)]
for k in kernels, x in xs
    precompile(kernelmatrix, (typeof(k), typeof(x)))
    precompile(kernelmatrix, (typeof(k), typeof(x), typeof(x)))
end

for x in xs
    precompile(
        _posterior_computations,
        (typeof(Zeros(5)), Matrix{Float64}, typeof(x), Vector{Float64}),
    )
end

# Pre-compile various AbstractGPs-specific things.
precompile(Diagonal, (typeof(Fill(0.1, 10)), ))
precompile(_rand, (typeof(Random.GLOBAL_RNG), typeof(Zeros(5)), Matrix{Float64}, Int))
precompile(_rand, (Xoshiro, typeof(Zeros(5)), Matrix{Float64}, Int))
precompile(_logpdf, (typeof(Zeros(5)), Matrix{Float64}, Vector{Float64}))

I've tried to add only pre-compile statements for low-level code that doesn't get involved in combinations of things. For example, I don't think it makes sense to add a pre-compile statement for kernelmatrix for a sum of kernels because you'd have to compile a separate method instance for each collection of pairs of kernel types that you ever encountered, and I want to avoid a combinatorial explosion.
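To make the combinatorial concern concrete, here is a sketch of what exhaustively pre-compiling two-kernel sums would look like (the kernel and input lists are just the ones from the snippet above; this is an illustration of what I want to avoid, not a proposal):

```julia
# Hypothetical: covering every two-kernel sum over N base kernels and M input
# types requires N^2 * M precompile statements, since each (k1 + k2, x)
# combination is a distinct method instance of kernelmatrix.
base_kernels = [SEKernel(), Matern12Kernel(), Matern32Kernel(), Matern52Kernel()]
xs = [ColVecs(randn(2, 3)), RowVecs(randn(3, 2)), randn(3)]
for k1 in base_kernels, k2 in base_kernels, x in xs
    precompile(kernelmatrix, (typeof(k1 + k2), typeof(x)))
end
# 4 * 4 * 3 = 48 statements for sums alone, before products, scalings,
# or deeper nesting -- hence restricting precompilation to simple kernels.
```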

_logpdf, _rand, and _posterior_computations are bits of code I've pulled out of logpdf, rand, and posterior that are GP-independent, i.e. they depend only on matrix types etc. This feels fair, because they don't need to be re-compiled for every new kernel that's used, only when the output of kernelmatrix isn't a Matrix{Float64} or whatever.
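The factoring pattern looks roughly like this (a sketch with illustrative names suffixed `_sketch`, not the actual AbstractGPs internals; `mean_and_cov` is part of the AbstractGPs API):

```julia
using LinearAlgebra

# The outer method dispatches on the GP / kernel types; once the mean and
# covariance are materialised as plain arrays, the rest is kernel-independent.
function logpdf_sketch(fx, y::AbstractVector{<:Real})
    m, C = mean_and_cov(fx)                          # kernel-specific part
    return _logpdf_sketch(m, cholesky(Symmetric(C)), y)
end

# Only ever compiled per concrete array type, so a single precompile
# statement covers every kernel whose kernelmatrix returns Matrix{Float64}.
function _logpdf_sketch(m::AbstractVector, C::Cholesky, y::AbstractVector)
    z = C.U' \ (y - m)
    return -(length(y) * log(2π) + logdet(C) + dot(z, z)) / 2
end
```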

Anyway, the results are:

# code loading
 1.083321 seconds (1.89 M allocations: 123.372 MiB, 3.33% gc time, 16.66% compilation time)

# first execution
0.466714 seconds (957.73 k allocations: 65.151 MiB, 3.31% gc time, 137.49% compilation time)

# second execution
 0.000111 seconds (61 allocations: 112.578 KiB)

So it looks like by pre-compiling, we can get a really substantial 4x reduction in time-to-first-inference, or whatever we're calling it.

If you use the slightly more complicated kernel

0.1 * SEKernel() + 0.5 * Matern32Kernel()

you see (without pre-compilation):

# first evaluation
2.236816 seconds (4.05 M allocations: 275.188 MiB, 5.90% gc time, 139.38% compilation time)

# second evaluation
0.000161 seconds (100 allocations: 190.422 KiB)

With pre-compilation you see something like:

# first execution
0.553095 seconds (1.06 M allocations: 72.149 MiB, 5.66% gc time, 138.60% compilation time)

# second execution
0.000157 seconds (95 allocations: 241.000 KiB)

So here we see a similar performance boost: we've pre-compiled all of the code to compute the kernelmatrices for the SEKernel and the Matern32Kernel, so only the code for kernelmatrix of their sum needs to be compiled on the fly.

It does look like there's a small penalty paid in load time, but I think it would typically be substantially outweighed by the compilation savings.

I wonder whether there's a case for adding a for-loop to KernelFunctions that pre-compiles the kernelmatrix, kernelmatrix_diag, etc. methods for each "simple" kernel (by "simple" I basically just mean anything that's not a composite kernel), and adding the kinds of methods I've discussed above to AbstractGPs. It might make the user experience substantially more pleasant 🤷 . I for one would love to have these savings.
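The proposed loop in KernelFunctions might look something like this (a sketch: `simple_kernels` is a hypothetical list that KernelFunctions doesn't currently export, and the set of input types is just the one used above):

```julia
# Hypothetical top-level precompilation loop for KernelFunctions: cover the
# core kernelmatrix methods for every simple (non-composite) kernel against
# the common input representations.
simple_kernels = [
    SEKernel(), Matern12Kernel(), Matern32Kernel(), Matern52Kernel(),
]
inputs = [ColVecs(randn(2, 3)), RowVecs(randn(3, 2)), randn(3)]
for k in simple_kernels, x in inputs
    precompile(kernelmatrix, (typeof(k), typeof(x)))
    precompile(kernelmatrix, (typeof(k), typeof(x), typeof(x)))
    precompile(kernelmatrix_diag, (typeof(k), typeof(x)))
end
```

Because the loop only touches simple kernels, it grows linearly in the number of kernels rather than combinatorially.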
