Conversation
Codecov Report ❌ Patch coverage is

```
@@            Coverage Diff            @@
##             main     #878      +/-  ##
=========================================
- Coverage   95.60%   85.93%    -9.67%
=========================================
  Files          38       39        +1
  Lines        3706     4123      +417
=========================================
  Hits         3543     3543
- Misses        163      580      +417
```
@samuelsonric Thanks! This looks very interesting.

@dmbates We can go ahead and merge this as-is, and then I can take a stab at structuring this as a package extension. Relatedly, should we do this against main or against the gradients experiments branch?
@palday I think it would be best to merge as-is. I am having difficulties with the gradients branch and would not want to predict when I will be able to resolve them.
A little better now.

```julia
julia> @btime cholesky!(_transform!(M, F, W));
  18.942 ms (17 allocations: 7.57 MiB)
```
@dmbates @palday I implemented gradient computation (via selected inversion).

```julia
julia> using MixedModels

julia> using MixedModels: _init, _objective_gradient

julia> M = fit(MixedModel,
           @formula(y ~ 1 + service + (1 | s) + (1 | d)),
           MixedModels.dataset(:insteval);
           progress=false);

julia> F, W = _init(M);

julia> @time obj, grad = _objective_gradient(M, F, W)
  0.064559 seconds (151 allocations: 21.424 MiB)
(851977.1778855723, [0.00019941088066843804, 8.61645912664244e-6])
```
@palday @samuelsonric I am embarrassed to need to ask this, but how do I check out a version of the package with these changes before merging? Do I just clone the @samuelsonric fork of the repository?
It did work to clone the fork of the repository. The example is a good beginning to show the effectiveness of the approach. Unfortunately, that particular dataset/model combination is just too fast in the blocked-evaluation, derivative-free approach. It's a two-parameter model that converges in 42 evaluations of the objective, and on my laptop each evaluation takes less than 5 ms.

```julia
julia> @be objective(updateL!($M)) # @be is from Chairmarks.jl
Benchmark: 24 samples with 1 evaluation
 min    4.301 ms (15 allocs: 320 bytes)
 median 4.316 ms (15 allocs: 320 bytes)
 mean   4.322 ms (15 allocs: 320 bytes)
 max    4.404 ms (15 allocs: 320 bytes)
```

Overall the optimization (after constructing the model) takes about 205 ms.

```julia
julia> @be refit!($M; progress=false) seconds=5
Benchmark: 25 samples with 1 evaluation
 min    203.901 ms (996 allocs: 23.359 KiB)
 median 204.191 ms (996 allocs: 23.359 KiB)
 mean   204.709 ms (996 allocs: 23.359 KiB)
 max    207.378 ms (996 allocs: 23.359 KiB)
```

Evaluation of the objective and gradient through the sparse Cholesky takes about 75 ms on my computer, with about 21 MiB of storage allocated.

```julia
julia> @be _objective_gradient($M, $F, $W) seconds=5
Benchmark: 62 samples with 1 evaluation
 min    75.534 ms (137 allocs: 21.334 MiB)
 median 77.777 ms (137 allocs: 21.334 MiB, 0.91% gc time)
 mean   81.691 ms (137 allocs: 21.334 MiB, 3.59% gc time)
 max    147.840 ms (137 allocs: 21.334 MiB, 46.87% gc time)
```

This is a great contribution and I very much appreciate your work. I am simply pointing out that the blocked approach, which has been honed over a long period of time, is hard to beat.
Thank you.
@samuelsonric I'm wondering if we can provide a utility to make generating a symmetric `SparseMatrixCSC` version of `A` easier for you. There is already a
Hello @dmbates -- yes, it would be helpful. I think that most of the code is just constructing the sparse matrix
Great. I will do that in a

Are there any optional arguments that would help?
@samuelsonric Can you take a look at the
Hello @dmbates. There is a complication. The first field is a workspace; the second is a mapping from the structural nonzeros of the blocks into the Cholesky factor. We build this map in two steps.

```julia
# construct mapping from blocks into Cholesky factor
P = flatindices(F, Symmetric(S, :L))

for blkptr in eachindex(indices)
    index = indices[blkptr]

    for i in eachindex(index)
        index[i] = P[index[i]]
    end
end
```

In order to use your function, we would call

```julia
S, indices = sparseA_with_indices(m; full=true)
```

Otherwise, the function looks fine. We only use a triangle (lower or upper).
@samuelsonric Just so that I am clear,
@dmbates That is it exactly. Nonzeros in the structural sense. Also, a vector of vectors is not the most performant possible representation. You could alternatively return a sparse matrix `X` satisfying

```julia
view(rowvals(X), nzrange(X, j)) == indices[j]
```

for all blocks `j`.
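As a sketch of that alternative representation (the `indices` data below is made up for illustration, not taken from the PR), the per-block index vectors can be packed into the row indices of a `SparseMatrixCSC`, so that each block's map is recoverable as a non-allocating view:

```julia
using SparseArrays

# Illustrative sketch: pack per-block index vectors into the rows of a
# SparseMatrixCSC so that column j carries the flat indices for block j.
indices = [[1, 3], [2], [4, 5, 6]]   # hypothetical per-block index vectors

colptr = ones(Int, length(indices) + 1)
for (j, v) in enumerate(indices)
    colptr[j + 1] = colptr[j] + length(v)
end

rowval = reduce(vcat, indices)
nzval = fill(true, length(rowval))   # values unused; the pattern carries the map

X = SparseMatrixCSC(maximum(rowval), length(indices), colptr, rowval, nzval)

# each block's indices come back as a view, with no new vector allocated
for j in 1:length(indices)
    @assert view(rowvals(X), nzrange(X, j)) == indices[j]
end
```

This keeps all the index data in two flat arrays (`rowval` and `colptr`), which is friendlier to the cache than a `Vector{Vector{Int}}`.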
It seems that for models with a vector-valued random-effects term, the evaluation of both the objective and the gradient is incorrect.

```julia
julia> using Chairmarks, ForwardDiff, MixedModels

julia> using MixedModels: _init, _objective_gradient

julia> include("test/modelcache.jl");

julia> M = last(models(:sleepstudy))
Linear mixed model fit by maximum likelihood
 reaction ~ 1 + days + (1 + days | subj)
    logLik   -2 logLik       AIC        AICc         BIC
 -875.9697   1751.9393   1763.9393   1764.4249   1783.0971

Variance components:
            Column     Variance  Std.Dev.  Corr.
subj     (Intercept)  565.52074  23.78068
         days          32.68242   5.71685  +0.08
Residual              654.94015  25.59180
 Number of obs: 180; levels of grouping factors: 18

  Fixed-effects parameters:
──────────────────────────────────────────────────
                Coef.  Std. Error      z  Pr(>|z|)
──────────────────────────────────────────────────
(Intercept)  251.405      6.6323   37.91    <1e-99
days          10.4673     1.50224   6.97    <1e-11
──────────────────────────────────────────────────

julia> ForwardDiff.gradient(M)
3-element Vector{Float64}:
  0.00014883250043595808
 -0.00027073512577757697
  0.0005646588063541458

julia> F, W = _init(M);

julia> _objective_gradient(M, F, W)
(2175.853705670591, [31.091245400405885, -665.5156915846187, 145.82614684502275])
```

I haven't looked in detail at the evaluation to see where things might be going wrong, and I am not sure it will be worthwhile doing extensive debugging. We appreciate your effort in exploring this approach with us but, before we go much further down this road, we might consider where using CliqueTrees would fit in with MixedModels.

Our goal in MixedModels is to allow for defining, fitting, and post-fit analysis of such models in a performant and space-efficient manner. We're at the point now where the smaller examples that appear in textbooks are fit sufficiently quickly that we don't need to optimize that. When we get to large data sets with somewhat complex models, such as the insteval and movielens examples in https://arxiv.org/pdf/2505.11674, we have to consider both time and memory usage in the fitting process.

In a sense our reordering of the random-effects terms is a kind of "poor man's multifrontal" approach, because it re-orders the rows and columns of A to reduce the number of non-zeros in L, but only at the block level. If a fill-reducing permutation of S and a multifrontal decomposition could do much better in terms of time or memory usage, it would be worthwhile, but I am not sure that is going to be the case. The memory overhead of the various data structures will require a lot of memory reduction from a fill-reducing permutation to offset it.

With regard to gradient evaluation, it is not really necessary for models with few parameters, which typically converge in a few hundred evaluations of the objective at most, and where the objective evaluation is fast. @palday wrote
Hi @dmbates. If the performance is too bad, then there is no point, I guess. But it is my strong suspicion that performance can be significantly improved. There are strategies for dealing with topologies like this, and CHOLMOD handles your matrices very well. Another option is to implement selinv directly, using your block structure.
@samuelsonric Thank you again for your contribution. You mentioned implementing selinv directly using the block structure, which sounds like an interesting idea. I have been trying to do some reading on selected inversion, but I may not be going about this effectively. Do you have recommendations of where I should start? It may be my lack of knowledge or imagination, but I am not sure how the selected inversion leads to the evaluation of
@dmbates The matrix has the same sparsity pattern as
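[Editor's note: the identity that makes a selected inverse sufficient for the gradient of a log-determinant can be checked on a small dense toy problem. Here `S(θ)` is a made-up SPD matrix, not a MixedModels quantity: since d/dθ logdet S(θ) = tr(S⁻¹ dS/dθ), only the entries of S⁻¹ where dS/dθ is structurally nonzero, all of which lie on the pattern of the Cholesky factor, enter the gradient.]

```julia
using LinearAlgebra

# Dense toy check of d/dθ logdet(S(θ)) == tr(S(θ)⁻¹ * dS/dθ).
S(θ) = [2.0 + θ  1.0;
        1.0      3.0]          # hypothetical SPD matrix depending on θ
dS = [1.0  0.0;
      0.0  0.0]                # dS/dθ: a single structural nonzero

θ = 0.5
g_trace = tr(S(θ) \ dS)        # gradient via the trace identity

h = 1e-6                       # central finite difference for comparison
g_fd = (logdet(S(θ + h)) - logdet(S(θ - h))) / (2h)

@assert isapprox(g_trace, g_fd; atol=1e-6)
```

Because `dS` has one nonzero, the trace touches a single entry of `S⁻¹`; in the sparse setting that is the entry a selected inversion would deliver.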
@samuelsonric Thank you.
This PR adds CliqueTrees.jl as a backend for computing Cholesky factorizations. It addresses this issue.
Here is how to use the new functionality.
The generic solver in CliqueTrees.jl struggles a bit with the unusual problem topology, and it is slower than the custom solver in MixedModels.jl. I am going to try addressing this upstream.