Skip to content

Fix big memory allocations in L1GS#238

Merged
termi-official merged 52 commits intomainfrom
l1gs-mem-fix
Feb 16, 2026
Merged

Fix big memory allocations in L1GS#238
termi-official merged 52 commits intomainfrom
l1gs-mem-fix

Conversation

@Abdelrahman912
Copy link
Collaborator

closes #230

@Abdelrahman912
Copy link
Collaborator Author

============================================================
Preconditioner Construction Timings:
============================================================
─────────────────────────────────────────────────────────────────────────────────
                                        Time                    Allocations      
                               ───────────────────────   ────────────────────────
       Tot / % measured:            152ms /  99.9%           6.26MiB / 100.0%    

Section                ncalls     time    %tot     avg     alloc    %tot      avg
─────────────────────────────────────────────────────────────────────────────────
_precompute_blocks          1    152ms  100.0%   152ms   6.26MiB  100.0%  6.26MiB
  kernel execution          1    100ms   65.5%   100ms   8.20KiB    0.1%  8.20KiB
  allocate SLbuffer         1   52.1ms   34.2%  52.1ms   5.22MiB   83.4%  5.22MiB
  isapprox(A, A')           1    464μs    0.3%   464μs   1.00MiB   16.0%  1.00MiB
  kernel setup              1   25.1μs    0.0%  25.1μs      128B    0.0%     128B
  allocate D_Dl1            1   24.6μs    0.0%  24.6μs   25.9KiB    0.4%  25.9KiB
  adapt(backend, _A)        1   40.0ns    0.0%  40.0ns     0.00B    0.0%    0.00B
  synchronize               1   36.0ns    0.0%  36.0ns     0.00B    0.0%    0.00B
─────────────────────────────────────────────────────────────────────────────────
============================================================
ldiv! Timings during solve:
============================================================
───────────────────────────────────────────────────────────────────────────────────────────
                                                  Time                    Allocations      
                                         ───────────────────────   ────────────────────────
            Tot / % measured:                 136ms /  83.0%           4.22MiB /  22.1%    

Section                          ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────────────
ldiv! (CPU)                         120    113ms  100.0%   942μs    955KiB  100.0%  7.96KiB
  _forward_sweep!                   120    113ms   99.7%   939μs    954KiB   99.9%  7.95KiB
    _forward_sweep! (internal)      120    113ms   99.6%   938μs    953KiB   99.8%  7.94KiB
      kernel call                   120    111ms   98.6%   929μs    936KiB   98.0%  7.80KiB
      kernel construction           120    722μs    0.6%  6.01μs   15.0KiB    1.6%     128B
      synchronize                   120   10.6μs    0.0%  87.9ns     0.00B    0.0%    0.00B
      convert size_A                120   4.75μs    0.0%  39.6ns     0.00B    0.0%    0.00B
      @unpack partitioning          120   4.34μs    0.0%  36.2ns     0.00B    0.0%    0.00B
      ndrange calc                  120   4.18μs    0.0%  34.9ns     0.00B    0.0%    0.00B
      @unpack P                     120   3.37μs    0.0%  28.0ns     0.00B    0.0%    0.00B
  y .= x                            120    208μs    0.2%  1.73μs     0.00B    0.0%    0.00B
───────────────────────────────────────────────────────────────────────────────────────────============================================================
Unprec. no. iters: 981, time: 1.05442742
Prec. no. iters: 119, time: 0.136057039

============================================================
Preconditioner Construction Timings:
============================================================
─────────────────────────────────────────────────────────────────────────────────
                                        Time                    Allocations      
                               ───────────────────────   ────────────────────────
       Tot / % measured:            355ms /  99.8%           11.3MiB /  91.5%    

Section                ncalls     time    %tot     avg     alloc    %tot      avg
─────────────────────────────────────────────────────────────────────────────────
_precompute_blocks          1    355ms  100.0%   355ms   10.3MiB  100.0%  10.3MiB
  kernel execution          1    349ms   98.3%   349ms   8.20KiB    0.1%  8.20KiB
  allocate SLbuffer         1   4.19ms    1.2%  4.19ms   7.42MiB   72.0%  7.42MiB
  isapprox(A, A')           1   1.71ms    0.5%  1.71ms   2.85MiB   27.6%  2.85MiB
  kernel setup              1   53.5μs    0.0%  53.5μs      128B    0.0%     128B
  allocate D_Dl1            1   15.2μs    0.0%  15.2μs   30.9KiB    0.3%  30.9KiB
  synchronize               1   37.0ns    0.0%  37.0ns     0.00B    0.0%    0.00B
  adapt(backend, _A)        1   36.0ns    0.0%  36.0ns     0.00B    0.0%    0.00B
─────────────────────────────────────────────────────────────────────────────────
============================================================
ldiv! Timings during solve:
============================================================
───────────────────────────────────────────────────────────────────────────────────────────
                                                  Time                    Allocations      
                                         ───────────────────────   ────────────────────────
            Tot / % measured:                 797ms /  62.8%           18.0MiB /  17.9%    

Section                          ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────────────
ldiv! (CPU)                         415    501ms  100.0%  1.21ms   3.21MiB  100.0%  7.93KiB
  _forward_sweep!                   415    499ms   99.8%  1.20ms   3.21MiB  100.0%  7.93KiB
    _forward_sweep! (internal)      415    499ms   99.7%  1.20ms   3.21MiB  100.0%  7.93KiB
      kernel call                   415    495ms   98.9%  1.19ms   3.16MiB   98.3%  7.80KiB
      kernel construction           415   2.63ms    0.5%  6.33μs   51.9KiB    1.6%     128B
      synchronize                   415   14.9μs    0.0%  36.0ns     0.00B    0.0%    0.00B
      convert size_A                415   12.3μs    0.0%  29.6ns     0.00B    0.0%    0.00B
      @unpack P                     415   12.1μs    0.0%  29.1ns     0.00B    0.0%    0.00B
      ndrange calc                  415   11.2μs    0.0%  27.0ns     0.00B    0.0%    0.00B
      @unpack partitioning          415   11.0μs    0.0%  26.6ns     0.00B    0.0%    0.00B
  y .= x                            415    796μs    0.2%  1.92μs     0.00B    0.0%    0.00B
───────────────────────────────────────────────────────────────────────────────────────────============================================================
Unprec. no. iters: 3067, time: 17.799613799
Prec. no. iters: 414, time: 0.796364037

@termi-official these are the outputs from the two examples in the test_preconditioners. There is the initial time which is done once and for all and there is the ldiv! which is called multiple times. the huge chunk of allocations comes from kernel constructions basically.

Also one thing that might be optimized is isapprox(A, A') but we need a prior info from the user I believe.

@codecov
Copy link

codecov bot commented Nov 29, 2025

Codecov Report

❌ Patch coverage is 95.43651% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.03%. Comparing base (b3b8649) to head (a1bbc86).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...c/solver/linear/preconditioners/l1_gauss_seidel.jl 96.58% 17 Missing ⚠️
ext/ThunderboltGPUArraysExt.jl 0.00% 4 Missing ⚠️
ext/cuda/cuda_operator.jl 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #238      +/-   ##
==========================================
+ Coverage   71.71%   73.03%   +1.32%     
==========================================
  Files          78       78              
  Lines        6515     6865     +350     
==========================================
+ Hits         4672     5014     +342     
- Misses       1843     1851       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@termi-official termi-official left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for investigating Abdelrahman! I left some comments on things which I catched on the fly.

Also take a look at CodeCov, it looks like not all of your code is covered in the CI.

Copy link
Collaborator

@termi-official termi-official left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes Abdelrahman. There are some minor things left tho.

@AbdAlazezAhmed can you confirm that this works now on the ECG?

Copy link
Collaborator

@termi-official termi-official left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still some issues.

  1. ThreadedSparseMatrixCSC support is missing
  2. The preconditioner tanks the symmetry
  3. There is too much of memory usage

Lets try to make the following work

using Thunderbolt
using SparseArrays,OrderedCollections,TimerOutputs,LinearAlgebra

# TimerOutputs.enable_debug_timings(Thunderbolt)
# TimerOutputs.enable_debug_timings(Thunderbolt.Preconditioners)
geo = Hexahedron
nel = 2

size = 2.
signal_strength = 0.04
nel_heart = (6, 6, 6) .* 1 .* nel
nel_torso = (8, 8, 8) .* Int(size) .* nel
ground_vertex = Vec(0.0, 0.0, 0.0)
electrodes = [
    ground_vertex,
    Vec(-size,  0.,  0.), 
    Vec( size,  0.,  0.),
    Vec( 0., -size,  0.),
    Vec( 0.,  size,  0.),
    Vec( 0.,  0., -size),
    Vec( 0.,  0.,  size),
]
electrode_pairs = [[i,1] for i in 2:length(electrodes)]

heart_grid = generate_mesh(geo, nel_heart)
Ferrite.transform_coordinates!(heart_grid, x->Vec{3}(sign.(x) .* x.^2))

κ  = ConstantCoefficient(SymmetricTensor{2,3,Float64}((1.0, 0, 0, 1.0, 0, 1.0)))
κᵢ = AnalyticalCoefficient(
    (x,t) -> norm(x,Inf)  1.0 ? SymmetricTensor{2,3,Float64}((1.0, 0, 0, 1.0, 0, 1.0)) : SymmetricTensor{2,3,Float64}((0.0, 0, 0, 0.0, 0, 0.0)),
    CartesianCoordinateSystem{3}()
)

heart_model = TransientDiffusionModel(
    κᵢ,
    NoStimulationProtocol(), # Poisoning to detecte if we accidentally touch these
    :φₘ
)
heart_fun = semidiscretize(
    heart_model,
    FiniteElementDiscretization(
        Dict(:φₘ => LagrangeCollection{1}()),
        Dirichlet[]
    ),
    heart_grid
)

op = Thunderbolt.setup_assembled_operator(
    Thunderbolt.PerColorAssemblyStrategy(Thunderbolt.PolyesterDevice(8)),
    Thunderbolt.BilinearDiffusionIntegrator(
        κ,
        QuadratureRuleCollection(2),
        :φₘ,
    ),
    Thunderbolt.SparseMatrixCSC,
    heart_fun.dh,
)
Thunderbolt.update_operator!(op, 0.0) # trigger assembly


torso_grid_ = generate_grid(geo, nel_torso, Vec((-size,-size,-size)), Vec((size,size,size)))
addcellset!(torso_grid_, "heart", x->norm(x,Inf)  1.0)
# addcellset!(torso_grid_, "surrounding-tissue", x->norm(x,Inf) ≥ 1.0)
torso_grid_.cellsets["surrounding-tissue"] = OrderedSet([i for i in 1:getncells(torso_grid_) if i  torso_grid_.cellsets["heart"]])
torso_grid = Thunderbolt.to_mesh(torso_grid_)
u = zeros(Thunderbolt.solution_size(heart_fun))
geselowitz_electrodes = [[electrodes[1], electrodes[i]] for i in 2:length(electrodes)]

TimerOutputs.reset_timer!()
@info prod(nel_torso)
geselowitz_ecg = Thunderbolt.Geselowitz1989ECGLeadCache(
    heart_fun,
    torso_grid,
    κᵢ, κ,
    geselowitz_electrodes; 
    ground = OrderedSet([Thunderbolt.get_closest_vertex(ground_vertex, torso_grid)]),
    torso_heart_domain=["heart"],
    ipc                  = LagrangeCollection{1}(),
    qrc                  = QuadratureRuleCollection(3),
    linear_solver        = Thunderbolt.LinearSolve.KrylovJL_CG(precs = (A, p) -> (L1GSPrecBuilder(PolyesterDevice(8))(A, ceil(Int, SparseArrays.size(A,1)/8); isSymA=true), LinearAlgebra.I)),
    # linear_solver        = Thunderbolt.LinearSolve.KrylovJL_GMRES(precs = (A, p) -> (L1GSPrecBuilder(PolyesterDevice(8))(A, ceil(Int, 24)), LinearAlgebra.I)),
    system_matrix_type   = Thunderbolt.SparseMatrixCSR{Float64,Int64}, #Thunderbolt.ThreadedSparseMatrixCSR{Float64,Int64},
    strategy = Thunderbolt.PerColorAssemblyStrategy(Thunderbolt.PolyesterDevice(8)),
)

TimerOutputs.print_timer()

You need to apply these patches.

diff --git a/src/solver/interface.jl b/src/solver/interface.jl
index 8234949c..4279e657 100644
--- a/src/solver/interface.jl
+++ b/src/solver/interface.jl
@@ -148,13 +148,17 @@ update_constraints_block!(f::NullFunction, i::Block, solver_cache::AbstractTimeS
 
 create_system_matrix(T::Type{<:AbstractMatrix}, f::AbstractSemidiscreteFunction) = create_system_matrix(T, f.dh)
 
-function create_system_matrix(::Type{<:ThreadedSparseMatrixCSR{Tv,Ti}}, dh::AbstractDofHandler) where {Tv,Ti}
-    Acsct = transpose(convert(SparseMatrixCSC{Tv,Ti}, allocate_matrix(dh)))
+function create_system_matrix(::Type{<:ThreadedSparseMatrixCSR}, dh::AbstractDofHandler)
+    Acsct = allocate_matrix(SparseMatrixCSR{1,Float64,Int64}, dh)
     return ThreadedSparseMatrixCSR(Acsct)
 end
 
 function create_system_matrix(SpMatType::Type{<:SparseMatrixCSC}, dh::AbstractDofHandler)
-    A = convert(SpMatType, allocate_matrix(dh))
+    A = allocate_matrix(SparseMatrixCSC{Float64,Int64}, dh)
+    return A
+end
+function create_system_matrix(SpMatType::Type{<:SparseMatrixCSR}, dh::AbstractDofHandler)
+    A = allocate_matrix(SparseMatrixCSR{1,Float64,Int64}, dh)
     return A
 end
diff --git a/src/modeling/electrophysiology/ecg.jl b/src/modeling/electrophysiology/ecg.jl
index b0f47170..9db426db 100644
--- a/src/modeling/electrophysiology/ecg.jl
+++ b/src/modeling/electrophysiology/ecg.jl
@@ -402,6 +402,7 @@ function Geselowitz1989ECGLeadCache(
     linear_solver        = LinearSolve.KrylovJL_CG(),
     solution_vector_type = Vector{Float64},
     system_matrix_type   = ThreadedSparseMatrixCSR{Float64,Int64},
+    strategy             = SequentialAssemblyStrategy(SequentialCPUDevice()),
 )
     return Geselowitz1989ECGLeadCache(
         heart_fun,
@@ -416,6 +417,7 @@ function Geselowitz1989ECGLeadCache(
         linear_solver       ,
         solution_vector_type,
         system_matrix_type  ,
+        strategy            ,
     )
 end

η = convert(Tf, η)
D_Dl1 = adapt(backend, zeros(Tf, N)) # D + Dˡ
last_partsize = N - (nparts - 1) * partsize # size of the last partition
SLbuffer_size = (partsize * (partsize - 1) * (nparts-1)) ÷ 2 + last_partsize * (last_partsize - 1) ÷ 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should store the lower/upper triangular parts in some sparse format. I do not have opinions about which one exactly.

(; backend) = partitioning
# The following code is required because there is no assumption on the compatibality of x with the backend.
@timeit_debug "adapt(backend, y)" _y = adapt(backend, y)
@timeit_debug "_forward_sweep!" _forward_sweep!(_y, P)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type of sweep should be ideally a user option -- and must be symmetric if the problem is symmetric, as we tank the symmetry needed for the solver otherwise.

@Abdelrahman912
Copy link
Collaborator Author

# Fields
- `device::AbstractDevice`: The backend used for the preconditioner. More info [AbstractDevice](@ref).
"""

Also, I don't know whether this is done intentionally or simply overlooked, but the reference to AbstractDevice gives 404 (i.e. no api reference to it) even though, devices.jl has already doc strings

@termi-official
Copy link
Collaborator

reference to AbstractDevice gives 404

Yea, I am not sure how to fix this. Nothing worth to spend time on right now tho.

@AbdAlazezAhmed
Copy link
Collaborator

AbdAlazezAhmed commented Dec 9, 2025

reference to AbstractDevice gives 404

Doesn't seem to be referenced in an md file

@Abdelrahman912 Abdelrahman912 marked this pull request as ready for review January 29, 2026 01:36
Comment on lines +58 to +59
G_{SGS} = (D + L^T)^{-1} L (D + L)^{-1} L^T \rightarrow G_{SGS} = G_b G_f\\
x^{(k+1)} = G_{SGS} x^{(k)} + (I - G_{SGS}) A^{-1} b
Copy link
Collaborator

@termi-official termi-official Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look correct. There shouldn't be an inv(A).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's wierd at first glance and not standard but this is required (from my pov after deep thinking) to explain the difference in an easy way (so it is for demo only, I need to write every thing in G)
proof that this works is easy, we know for fact that the update eq is $x^{k+1} = M^{-1}Nx^{k} + M^{-1}b$ eq (4.22 saad's book) and $A=M-N$. sub with $G = I - M^{-1}A$ (between eq 4.23 and 4.24) in my presumed equation you will get
$x^{k+1} = (I-M^{-1}A)x^{k} + (I - I + M^{-1}A)A^{-1}b$
=> $x^{k+1} = (I-M^{-1}(M-N))x^{k} + M^{-1}b$
=> $x^{k+1} = (M^{-1}N)x^{k} + M^{-1}b$ same as eq (4.22)

I can write note to clarify this is for demo and show the standard eq $x^{k+1} = G x^k + f$ with $f = M^{-1}b$
also I can include this proof in the note

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed correct. However, my point here was rather that this does not reflect what is happening in the implementation. We never invert A at some point directly, because we would be done at that point with solving the system. Does that clarify the problem with the explanation given above from the perspective of a new user?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got your point. I 'll change it to sth more convenient.

@termi-official
Copy link
Collaborator

termi-official commented Feb 5, 2026

@AbdAlazezAhmed can you test the ECG again using this branch?

@termi-official termi-official merged commit 552693a into main Feb 16, 2026
9 checks passed
@termi-official termi-official deleted the l1gs-mem-fix branch February 16, 2026 11:04
@termi-official termi-official changed the title Big memory allocations in L1GS Fix big memory allocations in L1GS Feb 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Big memory allocation in the L1GS

3 participants