
Integrate CliqueTrees.jl #878

Open
samuelsonric wants to merge 4 commits into JuliaStats:main from samuelsonric:main

Conversation

@samuelsonric

@samuelsonric samuelsonric commented Feb 18, 2026

This PR adds CliqueTrees.jl as a backend for computing Cholesky factorizations. It addresses this issue.

AlgebraicJulia/CliqueTrees.jl#14

Here is how to use the new functionality.

using LinearAlgebra
using CliqueTrees
using MixedModels                                                                                       
using MixedModels: dataset, updateL!, _init, _transform!, _logdet                                               

import Metis

M = fit(MixedModel,                                                                                      
    @formula(y ~ 1 + service + (1 | s) + (1 | d)),
    MixedModels.dataset(:insteval);
    progress=false)

updateL!(M)
logdet1 = logdet(M)

F, W = _init(M; alg=METIS())
cholesky!(_transform!(M, F, W))
logdet2 = _logdet(M, F)

The generic solver in CliqueTrees.jl struggles a bit with the unusual problem topology, and it is slower than the custom solver in MixedModels.jl. I am going to try addressing this upstream.

julia> @btime updateL!(M);
  5.006 ms (10 allocations: 240 bytes)

julia> @btime cholesky!(_transform!(M, F, W));
  21.974 ms (38 allocations: 15.82 MiB)

@codecov

codecov bot commented Feb 18, 2026

Codecov Report

❌ Patch coverage is 0% with 417 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.93%. Comparing base (b5ba3ec) to head (90072df).

Files with missing lines Patch % Lines
src/cliquetrees.jl 0.00% 417 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #878      +/-   ##
==========================================
- Coverage   95.60%   85.93%   -9.67%     
==========================================
  Files          38       39       +1     
  Lines        3706     4123     +417     
==========================================
  Hits         3543     3543              
- Misses        163      580     +417     
Flag Coverage Δ
current 85.58% <0.00%> (-9.70%) ⬇️
minimum 85.88% <0.00%> (-9.67%) ⬇️
nightly 85.58% <0.00%> (-9.70%) ⬇️

Flags with carried forward coverage won't be shown.


@palday
Member

palday commented Feb 19, 2026

@samuelsonric Thanks! This looks very interesting.

@dmbates We can go ahead and merge this as-is, then I can take a stab at structuring this as a package extension. Relatedly, should we do this against main or should we do this against the gradients experiments branch?

@dmbates
Collaborator

dmbates commented Feb 20, 2026

@palday I think it would be best to merge as-is. I am having difficulties with the gradients branch and would not want to predict when I will be able to resolve them.

@samuelsonric
Author

A little better now.

julia> @btime cholesky!(_transform!(M, F, W));
  18.942 ms (17 allocations: 7.57 MiB)

@samuelsonric
Author

@dmbates @palday I implemented gradient computation (via selected inversion).

julia> using MixedModels

julia> using MixedModels: _init, _objective_gradient

julia> M = fit(MixedModel,
          @formula(y ~ 1 + service + (1 | s) + (1 | d)),
          MixedModels.dataset(:insteval);
          progress=false);

julia> F, W = _init(M);

julia> @time obj, grad = _objective_gradient(M, F, W)
  0.064559 seconds (151 allocations: 21.424 MiB)
(851977.1778855723, [0.00019941088066843804, 8.61645912664244e-6])

@dmbates
Collaborator

dmbates commented Mar 2, 2026

@palday @samuelsonric I'm embarrassed to need to ask this, but how do I check out a version of the package with these changes before merging? Do I just clone the @samuelsonric fork of the repository?

@samuelsonric
Author

@dmbates:

git fetch origin pull/878/head:pr-878
git checkout pr-878

@dmbates
Collaborator

dmbates commented Mar 2, 2026

It did work to clone the fork of the repository.

The example is a good beginning to show the effectiveness of the approach. Unfortunately, objective evaluation for that particular dataset/model combination is just too fast under the blocked, derivative-free approach. It's a two-parameter model that converges in 42 evaluations of the objective, and on my laptop each evaluation takes less than 5 ms.

julia> @be objective(updateL!($M))    # @be is from Chairmarks.jl
Benchmark: 24 samples with 1 evaluation
 min    4.301 ms (15 allocs: 320 bytes)
 median 4.316 ms (15 allocs: 320 bytes)
 mean   4.322 ms (15 allocs: 320 bytes)
 max    4.404 ms (15 allocs: 320 bytes)

Overall the optimization (after constructing the LinearMixedModel object) takes about 1/5 of a second and allocates less than 25 KiB of additional storage.

julia> @be refit!($M; progress=false)  seconds=5
Benchmark: 25 samples with 1 evaluation
 min    203.901 ms (996 allocs: 23.359 KiB)
 median 204.191 ms (996 allocs: 23.359 KiB)
 mean   204.709 ms (996 allocs: 23.359 KiB)
 max    207.378 ms (996 allocs: 23.359 KiB)

Evaluation of the objective and gradient through the sparse Cholesky takes about 75 ms on my computer with 21 MB of storage allocated.

julia> @be _objective_gradient($M, $F, $W)  seconds=5
Benchmark: 62 samples with 1 evaluation
 min    75.534 ms (137 allocs: 21.334 MiB)
 median 77.777 ms (137 allocs: 21.334 MiB, 0.91% gc time)
 mean   81.691 ms (137 allocs: 21.334 MiB, 3.59% gc time)
 max    147.840 ms (137 allocs: 21.334 MiB, 46.87% gc time)

This is a great contribution and I very much appreciate your work. I am simply pointing out that the blocked approach, which has been honed over a long period of time, is hard to beat.

@dmbates
Collaborator

dmbates commented Mar 2, 2026

@dmbates:

git fetch origin pull/878/head:pr-878
git checkout pr-878

Thank you.

@dmbates
Collaborator

dmbates commented Mar 3, 2026

@samuelsonric I'm wondering if we can provide a utility to make generating a symmetric SparseMatrixCSC version of A easier for you. There is already a sparseL function which returns a LowerTriangular SparseMatrixCSC version of the Cholesky factor L with optional arguments to include or exclude the fixed-effects and response rows and to start from an arbitrary grouping factor. The sparsity pattern of A and Ω = Λ'A Λ + I will be identical. Would having a sparseA function be of use?

@samuelsonric
Author

Hello @dmbates -- yes, it would be helpful. I think that most of the code is just constructing the sparse matrix.

@dmbates
Collaborator

dmbates commented Mar 3, 2026

Hello @dmbates -- yes, it would be helpful. I think that most of the code is just constructing the sparse matrix.

Great. I will do that in a sparseA branch, also possibly including the change mentioned in #879 if I hear back from @palday.

Are there any optional arguments that would help? sparseL has a full::Bool=true argument that determines if only the random effects parts are returned or if the fixed-effects/response rows are appended. sparseL also can take a symbol representing a grouping factor name at which to start but I never use that.

@dmbates dmbates mentioned this pull request Mar 3, 2026
@dmbates
Collaborator

dmbates commented Mar 3, 2026

@samuelsonric Can you take a look at the sparseA function defined in the db/sparseA branch (#880) and see if that would meet your needs? It returns a lower-triangular sparseCSC array that would generally be wrapped as Symmetric(sparseA(...), :L). In case your code just uses the lower triangle I didn't wrap the result in Symmetric.
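A tiny illustration (a toy matrix, not sparseA's actual output) of the wrapping described above: Symmetric(·, :L) stores only the lower triangle and mirrors it on read.

```julia
using LinearAlgebra, SparseArrays

# lower triangle stored explicitly; the upper triangle is implied by the wrapper
L = sparse([4.0 0.0 0.0; 1.0 4.0 0.0; 0.0 1.0 4.0])
A = Symmetric(L, :L)

A[1, 2] == A[2, 1] == 1.0 && issymmetric(A)   # true
```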

@samuelsonric
Author

samuelsonric commented Mar 4, 2026

Hello @dmbates. There is a complication. The ChordalWorkspace object has two fields: blocks and indices.

struct ChordalWorkspace{T}
    blocks::Vector{AbstractMatrix{T}}
    indices::Vector{Vector{Int}}
end

The first field is a workspace; the second is a mapping from the structural nonzeros of M.A into the structural nonzeros of the Cholesky factor. This mapping is then used at various points to relate the two objects.

We build this map in two steps.

  1. First, as we build the sparse matrix S, we also construct a mapping from the nonzeros of M.A into the nonzeros of S.
  2. Next, CliqueTrees.jl provides a mapping from the nonzeros of S into the nonzeros of the Cholesky factor.

indices is constructed by composing the two mappings.

# construct mapping from blocks into Cholesky factor
P = flatindices(F, Symmetric(S, :L))

for blkptr in eachindex(indices)
    index = indices[blkptr]
      
    for i in eachindex(index)
        index[i] = P[index[i]]
    end
end

In order to use your sparseA function, I need the mapping as well.

S, indices = sparseA_with_indices(m; full=true) 

Otherwise, the function looks fine. We only use a triangle (lower or upper).

@dmbates
Collaborator

dmbates commented Mar 4, 2026

@samuelsonric Just so that I am clear, indices should be a vector of integer vectors mapping the non-zeros of each block in A into the positions in rowval and nzval of the SparseMatrixCSC representation S. The length of indices is the number of blocks in the lower triangle of A and the length of each of the elements of indices is the number of non-zeros in the corresponding block of A. Is that correct?

@samuelsonric
Author

@dmbates That is it exactly. Nonzeros in this sense:

_nonzeros(A::Diagonal) = A.diag
_nonzeros(A::Matrix) = vec(A)
_nonzeros(A::UniformBlockDiagonal) = vec(A.data)
_nonzeros(A::BlockedSparse) = nonzeros(A.cscmat)
_nonzeros(A::SparseMatrixCSC) = nonzeros(A)
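Under that reading of the mapping, the scatter step could be sketched as follows. This is a toy example with an invented block, pattern, and index vector, not code from the PR:

```julia
using SparseArrays

# toy stand-ins: one dense 2x2 block of A and a 3x3 sparse S (values zeroed)
block = [1.0 3.0; 2.0 4.0]
S = sparse([1, 2, 1, 2, 3], [1, 1, 2, 2, 3], zeros(5), 3, 3)

# index[k] = position in nonzeros(S) receiving the k-th nonzero of the block;
# this particular mapping is made up to match the toy pattern above
index = [1, 2, 3, 4]

nz = nonzeros(S)
blocknz = vec(block)            # `_nonzeros(A::Matrix) = vec(A)`, as above
for k in eachindex(index)
    nz[index[k]] += blocknz[k]  # accumulate the block's entries into S
end

(S[1, 1], S[2, 2])              # (1.0, 4.0)
```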

Also, a vector of vectors is not the most performant possible representation. You could alternatively return a sparse matrix X such that

view(rowvals(X), nzrange(X, j)) == indices[j]

for all blocks j.
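A minimal sketch of that packing, with a toy indices vector rather than the PR's data:

```julia
using SparseArrays

indices = [[1, 2, 4], [3, 5]]   # toy mapping for two blocks

# pack the vectors into CSC storage: column j of X holds indices[j]
colptr = cumsum(vcat(1, length.(indices)))
rowval = reduce(vcat, indices)
nzval = ones(Bool, length(rowval))   # pattern only; the values are unused
X = SparseMatrixCSC(maximum(rowval), length(indices), colptr, rowval, nzval)

all(j -> view(rowvals(X), nzrange(X, j)) == indices[j], eachindex(indices))  # true
```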

@dmbates
Collaborator

dmbates commented Mar 6, 2026

It seems that for models with a vector-valued random effects term the evaluation of both the objective and the gradient is incorrect.

julia> using Chairmarks, ForwardDiff, MixedModels

julia> using MixedModels: _init, _objective_gradient

julia> include("test/modelcache.jl");

julia> M = last(models(:sleepstudy))
Linear mixed model fit by maximum likelihood
 reaction ~ 1 + days + (1 + days | subj)
   logLik   -2 logLik     AIC       AICc        BIC    
  -875.9697  1751.9393  1763.9393  1764.4249  1783.0971

Variance components:
            Column    Variance Std.Dev.   Corr.
subj     (Intercept)  565.52074 23.78068
         days          32.68242  5.71685 +0.08
Residual              654.94015 25.59180
 Number of obs: 180; levels of grouping factors: 18

  Fixed-effects parameters:
──────────────────────────────────────────────────
                Coef.  Std. Error      z  Pr(>|z|)
──────────────────────────────────────────────────
(Intercept)  251.405      6.6323   37.91    <1e-99
days          10.4673     1.50224   6.97    <1e-11
──────────────────────────────────────────────────

julia> ForwardDiff.gradient(M)
3-element Vector{Float64}:
  0.00014883250043595808
 -0.00027073512577757697
  0.0005646588063541458

julia> F, W = _init(M);

julia> _objective_gradient(M, F, W)
(2175.853705670591, [31.091245400405885, -665.5156915846187, 145.82614684502275])

I haven't looked in detail at the evaluation to see where things might be going wrong and I am not sure it will be worthwhile doing extensive debugging.

We appreciate your effort in exploring this approach with us but, before we go much further down this road, we might consider where using CliqueTrees would fit in with MixedModels. Our goal in MixedModels is to allow for defining, fitting, and post-fit analysis of such models in a performant and space-efficient manner. We're at the point now where the smaller examples that appear in textbooks are fit sufficiently quickly that we don't need to optimize that. When we get to large data sets with somewhat complex models, such as the insteval and movielens examples in https://arxiv.org/pdf/2505.11674, we have to consider both time and memory usage in the fitting process.

In a sense our reordering of the random-effects terms is kind of a "poor man's multifrontal" approach because it reorders the rows and columns of A to reduce the number of non-zeros in L, but only at the block level. If a fill-reducing permutation of S and a multifrontal decomposition could do much better in terms of time or memory usage it would be worthwhile, but I am not sure that is going to be the case. The memory overhead for the various data structures would require a large memory reduction from a fill-reducing permutation to offset it.
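For intuition about what a fill-reducing permutation can buy, here is a toy arrowhead example (not one of the models above). CHOLMOD's `cholesky` accepts a user-supplied ordering via its `perm` keyword, so the two orderings can be compared directly:

```julia
using LinearAlgebra, SparseArrays

# arrowhead matrix: dense first row/column plus a strong diagonal; the natural
# ordering fills the factor completely, eliminating row 1 last gives no fill
n = 100
S = sparse((n + 1.0) * I, n, n)
S[1, 2:n] .= 1.0
S[2:n, 1] .= 1.0

F_nat = cholesky(S; perm=collect(1:n))     # natural ordering: dense factor
F_rev = cholesky(S; perm=collect(n:-1:1))  # reversed ordering: no fill-in

nnz(F_rev) < nnz(F_nat)   # true: far fewer stored nonzeros in the factor
```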

With regard to gradient evaluation, it is not really necessary for models with few parameters, which typically converge in a few hundred evaluations of the objective at most, and where the objective evaluation is fast. @palday wrote MixedModels.jl extensions using ForwardDiff.jl, shown above, and FiniteDiff.jl that can evaluate the gradient.

We haven't found cases where using a gradient-based optimizer makes things faster. A gradient-based optimizer may converge in fewer evaluations of the objective, but not sufficiently fewer to offset the difference in cost between evaluating the objective and gradient versus the objective only. @palday has submitted an abstract to JuliaCon2026 for a presentation discussing this.

You can see in the earlier example in this thread with the @be results that evaluating the objective and gradient takes over 10 times as long, and allocates much more memory, than evaluating just the objective. The best we have been able to do in reducing the number of evaluations with gradient-based optimization methods is about 2/3 of those required for derivative-free optimization, and we would need to do much better than that to offset the increase in time per evaluation.

@samuelsonric
Author

Hi @dmbates. If the performance is too bad, then there is no point, I guess. But it is my strong suspicion that performance can be significantly improved. There are strategies for dealing with topologies like this, and CHOLMOD handles your matrices very, very well.

Another option is to implement selinv directly, using your block structure.

@dmbates
Collaborator

dmbates commented Mar 8, 2026

@samuelsonric Thank you again for your contribution. You mentioned implementing selinv directly using the block structure, which sounds like an interesting idea. I have been trying to do some reading on selected inversion but I may not be going about this effectively. Do you have recommendations of where I should start?

It may be my lack of knowledge or imagination but I am not sure how the selected inversion leads to the evaluation of $\operatorname{tr}(\Omega^{-1}\, \partial\Omega/\partial\Lambda_{ij})$. Am I missing an obvious step here?

@samuelsonric
Author

It may be my lack of knowledge or imagination but I am not sure how the selected inversion leads to the evaluation of $\operatorname{tr}(\Omega^{-1}\, \partial\Omega/\partial\Lambda_{ij})$. Am I missing an obvious step here?

@dmbates the matrix

$$\frac{\partial \Omega}{\partial \Lambda_{ij}}$$

has the same sparsity pattern as $\Omega$. Hence, we only need to evaluate $\Omega^{-1}$ at the structural nonzeros of $\Omega$.
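A toy numerical check of that observation (not the PR's implementation): when $D = \partial\Omega/\partial\Lambda_{ij}$ shares $\Omega$'s pattern, the trace only reads $\Omega^{-1}$ at the structural nonzeros.

```julia
using LinearAlgebra, SparseArrays

Ω = sparse([4.0 1.0 0.0; 1.0 4.0 1.0; 0.0 1.0 4.0])
D = sparse([1.0 2.0 0.0; 2.0 1.0 2.0; 0.0 2.0 1.0])  # same pattern as Ω

Ωinv = inv(Matrix(Ω))   # dense inverse, just for the comparison

# tr(Ω⁻¹ D) = Σ Ω⁻¹[i, j] * D[j, i], summed over Ω's structural nonzeros only
rows, cols, _ = findnz(Ω)
t = sum(Ωinv[i, j] * D[j, i] for (i, j) in zip(rows, cols))

t ≈ tr(Ωinv * Matrix(D))   # true
```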

@dmbates
Collaborator

dmbates commented Mar 9, 2026

@samuelsonric Thank you.
