@dellaert dellaert commented Jan 4, 2026

Overview

This PR introduces significant performance improvements and architectural updates to the MultifrontalSolver, focusing on cache locality, memory efficiency, and parallel granularity.

Key Optimizations:

  • QR Elimination for Leaves:

    • Implemented SolveMode::QrLeaf to use QR factorization directly on leaf cliques with high aspect ratios (rows >> cols) and no prior Hessian factors.
    • Benefit: Avoids explicitly forming normal equations ($A^T A$) for these blocks, improving numerical stability and significantly improving performance for large BAL files.
  • Cache Locality & "Fused" Path:

    • Introduced a fused eliminateInPlace(graph) path that interleaves the Load (filling matrices) and Eliminate (factorization) steps during the post-order traversal.
    • Benefit: Keeps data "hot" in the cache by processing a clique immediately after loading its data, rather than iterating over the entire graph twice. Clear improvement with TBB.
  • Improved Memory Management:

    • Cached Factorization (RSd_): The solver now explicitly caches the elimination result $[R|S|d]$ in a separate VerticalBlockMatrix (RSd_) instead of reusing the accumulator matrix.
    • Benefit: Optimizes memory access patterns during the back-substitution (solve) phase. Almost doubles performance on the Dubrovnik-135 BAL dataset.
  • Parallel Task Tuning:

    • Increased the problem-size threshold for parallel execution in updateSolution (from 10 to 4096).
    • Benefit: Reduces TBB scheduling overhead for small cliques (common in SFM), ensuring they are processed sequentially to maximize cache efficiency.
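
To make the numerical-stability point behind the QR leaf path concrete, here is a minimal sketch on a 3x2 Läuchli-style matrix $A = [1\ 1;\ e\ 0;\ 0\ e]$. It is an illustration only and not code from this PR: both function names are hypothetical, and modified Gram-Schmidt stands in for whatever QR routine the solver actually uses.

```cpp
#include <cassert>
#include <cmath>

// Illustration only, not GTSAM code. A = [1 1; e 0; 0 e].
// Both function names are hypothetical.

// Squared second Cholesky pivot of the explicitly formed normal equations A^T A.
double normalEquationsPivotSquared(double e) {
  assert(e > 0.0);
  const double a11 = 1.0 + e * e;  // (A^T A)(0,0): e*e is rounded away for tiny e
  const double a12 = 1.0;          // (A^T A)(0,1)
  const double a22 = 1.0 + e * e;  // (A^T A)(1,1)
  const double l11 = std::sqrt(a11);
  const double l21 = a12 / l11;
  return a22 - l21 * l21;  // zero means the computed A^T A is singular
}

// R(1,1) from a QR factorization of A, computed with modified Gram-Schmidt.
double qrPivot(double e) {
  assert(e > 0.0);
  const double a1[3] = {1.0, e, 0.0};
  const double a2[3] = {1.0, 0.0, e};
  const double r11 =
      std::sqrt(a1[0] * a1[0] + a1[1] * a1[1] + a1[2] * a1[2]);
  double r12 = 0.0;
  for (int i = 0; i < 3; ++i) r12 += (a1[i] / r11) * a2[i];
  double sumSq = 0.0;
  for (int i = 0; i < 3; ++i) {
    const double v = a2[i] - r12 * (a1[i] / r11);  // remove the q1 component
    sumSq += v * v;
  }
  return std::sqrt(sumSq);
}
```

With `e = 1e-8` in double precision, `normalEquationsPivotSquared` returns exactly 0 because $1 + e^2$ rounds to 1, while `qrPivot` returns roughly `1.41e-8`, close to the true value $e\sqrt{2}$. Working on $A$ directly keeps information that squaring the matrix discards.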

Architectural Changes

  • Precomputed Symbolic Structure:

    • Decoupled symbolic analysis from solver construction via PrecomputedData.
    • Benefit: Will allow creating the solver before linearization.
  • Damping Support:

    • Added direct support for identity and diagonal damping within the clique structure, facilitating efficient Levenberg-Marquardt implementations.
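
As a rough picture of what the two damping variants do to a Hessian block, here is a minimal sketch on a dense row-major matrix. The helper names are hypothetical and are not the SymmetricBlockMatrix methods added in this PR. It also assumes that "diagonal damping" means the Marquardt-style update $H + \lambda\,\mathrm{diag}(H)$, which the description does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers on a dense row-major n x n Hessian block.
// These are NOT the GTSAM SymmetricBlockMatrix methods.

// Identity damping: H <- H + lambda * I.
void addIdentityDamping(std::vector<double>& H, std::size_t n, double lambda) {
  assert(H.size() == n * n);
  for (std::size_t i = 0; i < n; ++i) H[i * n + i] += lambda;
}

// Diagonal damping (assumed Marquardt-style): H <- H + lambda * diag(H).
void addDiagonalDamping(std::vector<double>& H, std::size_t n, double lambda) {
  assert(H.size() == n * n);
  for (std::size_t i = 0; i < n; ++i) H[i * n + i] *= (1.0 + lambda);
}
```

Both variants only touch diagonal entries, which is why they can be applied per clique without changing the sparsity structure.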

Timing and Analysis

Below are the timings on a Mac with TBB enabled (M1 MacBook with only 16 GB of RAM).

The optimizations significantly improve scalability, at the cost of a small regression on very small problems.

Chain benchmarks

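| T    | Before (speedup) | After (speedup) |
|------|------------------|-----------------|
| 10   | 7.79x            | 5.87x           |
| 50   | 6.44x            | 5.54x           |
| 100  | 7.13x            | 7.98x           |
| 500  | 8.70x            | 11.70x          |
| 1000 | 9.60x            | 13.30x          |
| 5000 | 10.62x           | 15.72x          |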

BAL benchmarks

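| Dataset       | Ordering | Before (speedup) | After (speedup) |
|---------------|----------|------------------|-----------------|
| Dubrovnik-16  | Metis    | 2.37x            | 1.97x           |
| Dubrovnik-16  | Schur    | 2.41x            | 1.85x           |
| Dubrovnik-16  | ColAMD   | 2.42x            | 2.00x           |
| Dubrovnik-88  | Metis    | 2.91x            | 3.30x           |
| Dubrovnik-88  | Schur    | 2.88x            | 3.37x           |
| Dubrovnik-88  | ColAMD   | 2.97x            | 3.53x           |
| Dubrovnik-135 | Metis    | 2.41x            | 3.55x           |
| Dubrovnik-135 | Schur    | 2.77x            | 3.54x           |
| Dubrovnik-135 | ColAMD   | 2.27x            | 4.27x           |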
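
For reading these numbers: each entry is the "Speedup" line from the raw logs, which appears to be the Standard GTSAM time divided by the MultifrontalSolver time. The helper below is a hypothetical one-liner that reproduces that arithmetic; it is not part of the benchmark code.

```cpp
#include <cassert>

// Hypothetical helper: speedup as baseline time divided by solver time.
double speedup(double baselineSeconds, double solverSeconds) {
  assert(solverSeconds > 0.0);
  return baselineSeconds / solverSeconds;
}
```

For Dubrovnik-135 with ColAMD, the logs give 9.85178 s / 2.30755 s ≈ 4.27x after and 8.87071 s / 3.90879 s ≈ 2.27x before. Because the GTSAM baseline differs between the two runs, the change in the solver's own time is a useful second view: 3.90879 s / 2.30755 s ≈ 1.69x.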

Timings after MultifrontalSolver optimizations, and before:

After

Processing BAL file: /Users/dellaert/git/github/examples/Data/dubrovnik-16-22106-pre.txt

BAL Benchmark (Burn, iterations=2):
  MultifrontalSolver: 0.210584 s
  Standard GTSAM:     0.441354 s
  Speedup:            2.09586x

BAL Benchmark (Metis, iterations=2):
  MultifrontalSolver: 0.192524 s
  Standard GTSAM:     0.379982 s
  Speedup:            1.97368x

BAL Benchmark (Schur, iterations=2):
  MultifrontalSolver: 0.199722 s
  Standard GTSAM:     0.369379 s
  Speedup:            1.84946x

BAL Benchmark (Colamd, iterations=2):
  MultifrontalSolver: 0.190323 s
  Standard GTSAM:     0.379801 s
  Speedup:            1.99557x

Processing BAL file: /Users/dellaert/git/github/examples/Data/dubrovnik-88-64298-pre.txt

BAL Benchmark (Burn, iterations=2):
  MultifrontalSolver: 1.89305 s
  Standard GTSAM:     5.65221 s
  Speedup:            2.98577x

BAL Benchmark (Metis, iterations=2):
  MultifrontalSolver: 1.65063 s
  Standard GTSAM:     5.44253 s
  Speedup:            3.29725x

BAL Benchmark (Schur, iterations=2):
  MultifrontalSolver: 1.67989 s
  Standard GTSAM:     5.6682 s
  Speedup:            3.37415x

BAL Benchmark (Colamd, iterations=2):
  MultifrontalSolver: 1.58932 s
  Standard GTSAM:     5.60834 s
  Speedup:            3.52875x

Processing BAL file: /Users/dellaert/git/github/examples/Data/dubrovnik-135-90642-pre.txt

BAL Benchmark (Burn, iterations=2):
  MultifrontalSolver: 2.51154 s
  Standard GTSAM:     10.4708 s
  Speedup:            4.16908x

BAL Benchmark (Metis, iterations=2):
  MultifrontalSolver: 2.56822 s
  Standard GTSAM:     9.12408 s
  Speedup:            3.55269x

BAL Benchmark (Schur, iterations=2):
  MultifrontalSolver: 2.42682 s
  Standard GTSAM:     8.59165 s
  Speedup:            3.54028x

BAL Benchmark (Colamd, iterations=2):
  MultifrontalSolver: 2.30755 s
  Standard GTSAM:     9.85178 s
  Speedup:            4.26938x

Benchmark (T=10, iterations=500):
Symbolic cluster structure
  cliques:    7
  frontals:   max=6 avg=2.85714
  separators: max=4 avg=2.28571
  total dim:  max=6 avg=5.14286
  children:   max=3 avg=0.857143
Clique structure after merge
  cliques:    2
  frontals:   max=10 avg=10
  separators: max=2 avg=1
  total dim:  max=12 avg=11
  children:   max=1 avg=0.5

Timing results:
  MultifrontalSolver: 0.00338821 s
  Standard GTSAM:     0.0199051 s
  Speedup:            5.87482x

Benchmark (T=50, iterations=500):
Symbolic cluster structure
  cliques:    42
  frontals:   max=6 avg=2.38095
  separators: max=4 avg=3.42857
  total dim:  max=6 avg=5.80952
  children:   max=4 avg=0.97619
Clique structure after merge
  cliques:    6
  frontals:   max=28 avg=16.6667
  separators: max=4 avg=2.33333
  total dim:  max=28 avg=19
  children:   max=4 avg=0.833333

Timing results:
  MultifrontalSolver: 0.0151301 s
  Standard GTSAM:     0.0838724 s
  Speedup:            5.54341x

Benchmark (T=100, iterations=500):
Symbolic cluster structure
  cliques:    89
  frontals:   max=6 avg=2.24719
  separators: max=4 avg=3.68539
  total dim:  max=6 avg=5.93258
  children:   max=4 avg=0.988764
Clique structure after merge
  cliques:    13
  frontals:   max=24 avg=15.3846
  separators: max=4 avg=3.07692
  total dim:  max=26 avg=18.4615
  children:   max=10 avg=0.923077

Timing results:
  MultifrontalSolver: 0.0195491 s
  Standard GTSAM:     0.155968 s
  Speedup:            7.97828x

Benchmark (T=500, iterations=500):
Symbolic cluster structure
  cliques:    484
  frontals:   max=6 avg=2.06612
  separators: max=4 avg=3.92149
  total dim:  max=6 avg=5.9876
  children:   max=4 avg=0.997934
Clique structure after merge
  cliques:    72
  frontals:   max=16 avg=13.8889
  separators: max=4 avg=3.69444
  total dim:  max=20 avg=17.5833
  children:   max=8 avg=0.986111

Timing results:
  MultifrontalSolver: 0.0598728 s
  Standard GTSAM:     0.700681 s
  Speedup:            11.7028x

Benchmark (T=1000, iterations=500):
Symbolic cluster structure
  cliques:    983
  frontals:   max=6 avg=2.03459
  separators: max=4 avg=3.95727
  total dim:  max=6 avg=5.99186
  children:   max=4 avg=0.998983
Clique structure after merge
  cliques:    143
  frontals:   max=22 avg=13.986
  separators: max=4 avg=3.86014
  total dim:  max=24 avg=17.8462
  children:   max=12 avg=0.993007

Timing results:
  MultifrontalSolver: 0.106045 s
  Standard GTSAM:     1.41029 s
  Speedup:            13.299x

Benchmark (T=5000, iterations=500):
Symbolic cluster structure
  cliques:    4978
  frontals:   max=6 avg=2.00884
  separators: max=4 avg=3.98955
  total dim:  max=6 avg=5.99839
  children:   max=4 avg=0.999799
Clique structure after merge
  cliques:    726
  frontals:   max=22 avg=13.7741
  separators: max=4 avg=3.96419
  total dim:  max=24 avg=17.7383
  children:   max=10 avg=0.998623

Timing results:
  MultifrontalSolver: 0.425536 s
  Standard GTSAM:     6.68846 s
  Speedup:            15.7177x

Before

Processing BAL file: /Users/dellaert/git/github/examples/Data/dubrovnik-16-22106-pre.txt

BAL Benchmark (Burn, iterations=2):
  MultifrontalSolver: 0.141235 s
  Standard GTSAM:     0.333536 s
  Speedup:            2.36156x

BAL Benchmark (Metis, iterations=2):
  MultifrontalSolver: 0.143599 s
  Standard GTSAM:     0.340962 s
  Speedup:            2.37441x

BAL Benchmark (Schur, iterations=2):
  MultifrontalSolver: 0.140152 s
  Standard GTSAM:     0.338447 s
  Speedup:            2.41486x

BAL Benchmark (Colamd, iterations=2):
  MultifrontalSolver: 0.149368 s
  Standard GTSAM:     0.361274 s
  Speedup:            2.41869x

Processing BAL file: /Users/dellaert/git/github/examples/Data/dubrovnik-88-64298-pre.txt

BAL Benchmark (Burn, iterations=2):
  MultifrontalSolver: 1.40903 s
  Standard GTSAM:     4.02093 s
  Speedup:            2.85369x

BAL Benchmark (Metis, iterations=2):
  MultifrontalSolver: 1.39488 s
  Standard GTSAM:     4.0557 s
  Speedup:            2.90757x

BAL Benchmark (Schur, iterations=2):
  MultifrontalSolver: 1.38117 s
  Standard GTSAM:     3.97215 s
  Speedup:            2.87594x

BAL Benchmark (Colamd, iterations=2):
  MultifrontalSolver: 1.35405 s
  Standard GTSAM:     4.02011 s
  Speedup:            2.96896x

Processing BAL file: /Users/dellaert/git/github/examples/Data/dubrovnik-135-90642-pre.txt

BAL Benchmark (Burn, iterations=2):
  MultifrontalSolver: 12.6782 s
  Standard GTSAM:     9.27636 s
  Speedup:            0.731678x

BAL Benchmark (Metis, iterations=2):
  MultifrontalSolver: 3.42751 s
  Standard GTSAM:     8.24493 s
  Speedup:            2.40552x

BAL Benchmark (Schur, iterations=2):
  MultifrontalSolver: 3.11985 s
  Standard GTSAM:     8.63704 s
  Speedup:            2.76842x

BAL Benchmark (Colamd, iterations=2):
  MultifrontalSolver: 3.90879 s
  Standard GTSAM:     8.87071 s
  Speedup:            2.26943x

Benchmark (T=10, iterations=500):
Symbolic cluster structure
  cliques:    7
  frontals:   max=6 avg=2.85714
  separators: max=4 avg=2.28571
  total dim:  max=6 avg=5.14286
  children:   max=3 avg=0.857143
Clique structure after merge
  cliques:    2
  frontals:   max=10 avg=10
  separators: max=2 avg=1
  total dim:  max=12 avg=11
  children:   max=1 avg=0.5

Timing results:
  MultifrontalSolver: 0.00269038 s
  Standard GTSAM:     0.0209533 s
  Speedup:            7.78826x

Benchmark (T=50, iterations=500):
Symbolic cluster structure
  cliques:    42
  frontals:   max=6 avg=2.38095
  separators: max=4 avg=3.42857
  total dim:  max=6 avg=5.80952
  children:   max=4 avg=0.97619
Clique structure after merge
  cliques:    6
  frontals:   max=28 avg=16.6667
  separators: max=4 avg=2.33333
  total dim:  max=28 avg=19
  children:   max=4 avg=0.833333

Timing results:
  MultifrontalSolver: 0.0141262 s
  Standard GTSAM:     0.0909991 s
  Speedup:            6.44188x

Benchmark (T=100, iterations=500):
Symbolic cluster structure
  cliques:    89
  frontals:   max=6 avg=2.24719
  separators: max=4 avg=3.68539
  total dim:  max=6 avg=5.93258
  children:   max=4 avg=0.988764
Clique structure after merge
  cliques:    13
  frontals:   max=24 avg=15.3846
  separators: max=4 avg=3.07692
  total dim:  max=26 avg=18.4615
  children:   max=10 avg=0.923077

Timing results:
  MultifrontalSolver: 0.0235363 s
  Standard GTSAM:     0.167806 s
  Speedup:            7.12967x

Benchmark (T=500, iterations=500):
Symbolic cluster structure
  cliques:    484
  frontals:   max=6 avg=2.06612
  separators: max=4 avg=3.92149
  total dim:  max=6 avg=5.9876
  children:   max=4 avg=0.997934
Clique structure after merge
  cliques:    72
  frontals:   max=16 avg=13.8889
  separators: max=4 avg=3.69444
  total dim:  max=20 avg=17.5833
  children:   max=8 avg=0.986111

Timing results:
  MultifrontalSolver: 0.0821268 s
  Standard GTSAM:     0.714632 s
  Speedup:            8.70157x

Benchmark (T=1000, iterations=500):
Symbolic cluster structure
  cliques:    983
  frontals:   max=6 avg=2.03459
  separators: max=4 avg=3.95727
  total dim:  max=6 avg=5.99186
  children:   max=4 avg=0.998983
Clique structure after merge
  cliques:    143
  frontals:   max=22 avg=13.986
  separators: max=4 avg=3.86014
  total dim:  max=24 avg=17.8462
  children:   max=12 avg=0.993007

Timing results:
  MultifrontalSolver: 0.142666 s
  Standard GTSAM:     1.36918 s
  Speedup:            9.59714x

Benchmark (T=5000, iterations=500):
Symbolic cluster structure
  cliques:    4978
  frontals:   max=6 avg=2.00884
  separators: max=4 avg=3.98955
  total dim:  max=6 avg=5.99839
  children:   max=4 avg=0.999799
Clique structure after merge
  cliques:    726
  frontals:   max=22 avg=13.7741
  separators: max=4 avg=3.96419
  total dim:  max=24 avg=17.7383
  children:   max=10 avg=0.998623

Timing results:
  MultifrontalSolver: 0.6308 s
  Standard GTSAM:     6.6965 s
  Speedup:            10.6159x

Copilot AI left a comment

Pull request overview

This PR introduces significant performance optimizations to the MultifrontalSolver, focusing on cache locality, memory efficiency, and parallel execution granularity. The changes implement QR factorization for leaf cliques, a fused load-and-eliminate path, cached factorization results, and improved memory management, delivering substantial speedups on larger datasets while introducing architectural improvements for precomputed symbolic data.

Key changes:

  • Implements QR elimination mode for high aspect-ratio leaf cliques without prior Hessian factors
  • Introduces fused eliminateInPlace(graph) that interleaves loading and elimination in a single traversal
  • Caches factorization results in separate RSd_ matrix for optimized back-substitution
  • Adds precomputed symbolic data support via PrecomputedData struct and Precompute() method
  • Increases parallel task threshold from 10 to 4096 to reduce scheduling overhead for small cliques
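
The threshold change in the last item can be pictured as a simple size gate. The sketch below is an assumption-laden illustration, not the code in updateSolution: the function name, the `<` comparison, and the meaning of "problem size" are all hypothetical, and plain callables stand in for TBB tasks.

```cpp
#include <cassert>
#include <cstddef>

// Value taken from the PR description (previously 10).
constexpr std::size_t kParallelThreshold = 4096;

// Hypothetical gate: run small problems sequentially, large ones in parallel.
// Returns true if the parallel path was taken.
template <typename Sequential, typename Parallel>
bool dispatchBySize(std::size_t problemSize, Sequential&& sequential,
                    Parallel&& parallel,
                    std::size_t threshold = kParallelThreshold) {
  assert(threshold > 0);
  if (problemSize < threshold) {
    sequential();  // avoids task-scheduling overhead for small cliques
    return false;
  }
  parallel();  // in the real solver this would hand work to TBB
  return true;
}
```

Raising the threshold moves more of the small cliques onto the sequential branch, which matches the stated goal of reducing scheduling overhead.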

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.

| File | Description |
|------|-------------|
| gtsam/linear/MultifrontalSolver.h | Adds PrecomputedData struct, new constructor overload, Precompute() static method, eliminateInPlace(graph) overload, and state tracking flags |
| gtsam/linear/MultifrontalSolver.cpp | Refactors constructor to use precomputed data, implements fused elimination path, adds row counting for VBM sizing, updates parallel thresholds |
| gtsam/linear/MultifrontalClique.h | Adds QR mode, damping methods, RSd_ caching, prepareForElimination/factorize separation, IndexedSymbolicFactor row tracking |
| gtsam/linear/MultifrontalClique.cpp | Implements lazy SBM allocation, QR leaf factorization, damping support, cached factorization, separator-only SBM for QR updates |
| gtsam/base/SymmetricBlockMatrix.h | Adds utility methods for diagonal damping (addToDiagonalBlock, addScaledIdentity) |
| gtsam/linear/tests/testMultifrontalSolver.cpp | Updates tests to explicitly call load(), adds test for new eliminateInPlace(graph) API |
| timing/timeSFMBAL.h | Adds createOrderings helper function, minor formatting fixes |
| timing/timeMultifrontalSolver.cpp | Refactors benchmarks into separate functions, uses fused elimination path, adds BAL135 test |
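
The difference between the two-pass and fused elimination paths mentioned above comes down to the order in which cliques are touched. The sketch below records that order on a toy tree. The `Clique` struct and all function names are hypothetical stand-ins, and the "load" and "eliminate" steps only log their names rather than doing any linear algebra.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a clique in the elimination tree.
struct Clique {
  std::string name;
  std::vector<Clique> children;
};

using Log = std::vector<std::string>;

// Post-order pass that applies one step ("load" or "eliminate") per clique.
void visitAll(const Clique& c, const std::string& step, Log& log) {
  for (const auto& child : c.children) visitAll(child, step, log);
  log.push_back(step + " " + c.name);
}

// Two-pass path: load the whole tree, then eliminate the whole tree.
Log twoPass(const Clique& root) {
  Log log;
  visitAll(root, "load", log);
  visitAll(root, "eliminate", log);
  assert(log.size() % 2 == 0);  // one load and one eliminate per clique
  return log;
}

// Fused path: load a clique and eliminate it right away, children first.
void fusedVisit(const Clique& c, Log& log) {
  for (const auto& child : c.children) fusedVisit(child, log);
  log.push_back("load " + c.name);
  log.push_back("eliminate " + c.name);
}

Log fused(const Clique& root) {
  Log log;
  fusedVisit(root, log);
  return log;
}
```

In the fused order each clique's data is used immediately after it is written, which is the cache-locality argument made in the PR description. Post-order is kept in both variants so that children are always eliminated before their parent.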

dellaert commented Jan 5, 2026

Linux Timing (TBB)

Similar conclusions. Chain improvement is massive, BAL less so, and speedups against GTSAM are less pronounced than on Mac, but this might just be because Linux is a 20 core machine with 32G RAM and a nice cache:

  L1d:                       768 KiB (20 instances)
  L1i:                       1 MiB (20 instances)
  L2:                        28 MiB (11 instances)
  L3:                        33 MiB (1 instance)

Analysis

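| T    | Before (speedup) | After (speedup) |
|------|------------------|-----------------|
| 10   | 11.19x           | 6.21x           |
| 50   | 4.63x            | 6.74x           |
| 100  | 8.02x            | 10.86x          |
| 500  | 9.31x            | 14.89x          |
| 1000 | 11.16x           | 24.06x          |
| 5000 | 11.63x           | 26.62x          |

| Dataset       | Ordering | Before (speedup) | After (speedup) |
|---------------|----------|------------------|-----------------|
| Dubrovnik-16  | Metis    | 1.57x            | 1.93x           |
| Dubrovnik-16  | Schur    | 1.65x            | 1.84x           |
| Dubrovnik-16  | ColAMD   | 1.58x            | 1.93x           |
| Dubrovnik-88  | Metis    | 1.24x            | 1.44x           |
| Dubrovnik-88  | Schur    | 1.24x            | 1.37x           |
| Dubrovnik-88  | ColAMD   | 1.15x            | 1.43x           |
| Dubrovnik-135 | Metis    | 1.08x            | 1.40x           |
| Dubrovnik-135 | Schur    | 1.08x            | 1.23x           |
| Dubrovnik-135 | ColAMD   | 1.08x            | 1.30x           |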

After

Processing BAL file: /home/dellaert/git/gtsam/examples/Data/dubrovnik-16-22106-pre.txt

BAL Benchmark (Burn, iterations=2):
  MultifrontalSolver: 0.114597 s
  Standard GTSAM:     0.231649 s
  Speedup:            2.02142x

BAL Benchmark (Metis, iterations=2):
  MultifrontalSolver: 0.115449 s
  Standard GTSAM:     0.223312 s
  Speedup:            1.93429x

BAL Benchmark (Schur, iterations=2):
  MultifrontalSolver: 0.11521 s
  Standard GTSAM:     0.212224 s
  Speedup:            1.84206x

BAL Benchmark (Colamd, iterations=2):
  MultifrontalSolver: 0.11649 s
  Standard GTSAM:     0.225308 s
  Speedup:            1.93415x

Processing BAL file: /home/dellaert/git/gtsam/examples/Data/dubrovnik-88-64298-pre.txt

BAL Benchmark (Burn, iterations=2):
  MultifrontalSolver: 0.93307 s
  Standard GTSAM:     1.32023 s
  Speedup:            1.41494x

BAL Benchmark (Metis, iterations=2):
  MultifrontalSolver: 0.915436 s
  Standard GTSAM:     1.32052 s
  Speedup:            1.4425x

BAL Benchmark (Schur, iterations=2):
  MultifrontalSolver: 0.929393 s
  Standard GTSAM:     1.27675 s
  Speedup:            1.37375x

BAL Benchmark (Colamd, iterations=2):
  MultifrontalSolver: 0.890635 s
  Standard GTSAM:     1.27797 s
  Speedup:            1.4349x

Processing BAL file: /home/dellaert/git/gtsam/examples/Data/dubrovnik-135-90642-pre.txt

BAL Benchmark (Burn, iterations=2):
  MultifrontalSolver: 1.59295 s
  Standard GTSAM:     2.05534 s
  Speedup:            1.29027x

BAL Benchmark (Metis, iterations=2):
  MultifrontalSolver: 1.45838 s
  Standard GTSAM:     2.04895 s
  Speedup:            1.40495x

BAL Benchmark (Schur, iterations=2):
  MultifrontalSolver: 1.67949 s
  Standard GTSAM:     2.06911 s
  Speedup:            1.23199x

BAL Benchmark (Colamd, iterations=2):
  MultifrontalSolver: 1.64349 s
  Standard GTSAM:     2.13348 s
  Speedup:            1.29814x

Benchmark (T=10, iterations=500):
Symbolic cluster structure
  cliques:    7
  frontals:   max=6 avg=2.85714
  separators: max=4 avg=2.28571
  total dim:  max=6 avg=5.14286
  children:   max=3 avg=0.857143
Clique structure after merge
  cliques:    2
  frontals:   max=10 avg=10
  separators: max=2 avg=1
  total dim:  max=12 avg=11
  children:   max=1 avg=0.5

Timing results:
  MultifrontalSolver: 0.00432555 s
  Standard GTSAM:     0.0268508 s
  Speedup:            6.2075x

Benchmark (T=50, iterations=500):
Symbolic cluster structure
  cliques:    41
  frontals:   max=6 avg=2.43902
  separators: max=4 avg=3.41463
  total dim:  max=6 avg=5.85366
  children:   max=4 avg=0.97561
Clique structure after merge
  cliques:    6
  frontals:   max=24 avg=16.6667
  separators: max=4 avg=2.33333
  total dim:  max=24 avg=19
  children:   max=4 avg=0.833333

Timing results:
  MultifrontalSolver: 0.0117721 s
  Standard GTSAM:     0.0793824 s
  Speedup:            6.74328x

Benchmark (T=100, iterations=500):
Symbolic cluster structure
  cliques:    90
  frontals:   max=6 avg=2.22222
  separators: max=4 avg=3.68889
  total dim:  max=6 avg=5.91111
  children:   max=4 avg=0.988889
Clique structure after merge
  cliques:    14
  frontals:   max=16 avg=14.2857
  separators: max=4 avg=3.14286
  total dim:  max=20 avg=17.4286
  children:   max=9 avg=0.928571

Timing results:
  MultifrontalSolver: 0.0147134 s
  Standard GTSAM:     0.159792 s
  Speedup:            10.8603x

Benchmark (T=500, iterations=500):
Symbolic cluster structure
  cliques:    485
  frontals:   max=6 avg=2.06186
  separators: max=4 avg=3.92165
  total dim:  max=6 avg=5.98351
  children:   max=4 avg=0.997938
Clique structure after merge
  cliques:    71
  frontals:   max=22 avg=14.0845
  separators: max=4 avg=3.71831
  total dim:  max=24 avg=17.8028
  children:   max=12 avg=0.985915

Timing results:
  MultifrontalSolver: 0.0388443 s
  Standard GTSAM:     0.578314 s
  Speedup:            14.888x

Benchmark (T=1000, iterations=500):
Symbolic cluster structure
  cliques:    982
  frontals:   max=6 avg=2.03666
  separators: max=4 avg=3.95723
  total dim:  max=6 avg=5.99389
  children:   max=4 avg=0.998982
Clique structure after merge
  cliques:    144
  frontals:   max=18 avg=13.8889
  separators: max=4 avg=3.83333
  total dim:  max=20 avg=17.7222
  children:   max=10 avg=0.993056

Timing results:
  MultifrontalSolver: 0.0462555 s
  Standard GTSAM:     1.11295 s
  Speedup:            24.0608x

Benchmark (T=5000, iterations=500):
Symbolic cluster structure
  cliques:    4978
  frontals:   max=6 avg=2.00884
  separators: max=4 avg=3.98955
  total dim:  max=6 avg=5.99839
  children:   max=4 avg=0.999799
Clique structure after merge
  cliques:    728
  frontals:   max=22 avg=13.7363
  separators: max=4 avg=3.96154
  total dim:  max=24 avg=17.6978
  children:   max=9 avg=0.998626

Timing results:
  MultifrontalSolver: 0.183805 s
  Standard GTSAM:     4.89218 s
  Speedup:            26.6161x

Before

Processing BAL file: /home/dellaert/git/gtsam/examples/Data/dubrovnik-16-22106-pre.txt

BAL Benchmark (Burn, iterations=2):
  MultifrontalSolver: 0.139262 s
  Standard GTSAM:     0.235353 s
  Speedup:            1.69x

BAL Benchmark (Metis, iterations=2):
  MultifrontalSolver: 0.139127 s
  Standard GTSAM:     0.218892 s
  Speedup:            1.57333x

BAL Benchmark (Schur, iterations=2):
  MultifrontalSolver: 0.141829 s
  Standard GTSAM:     0.234691 s
  Speedup:            1.65475x

BAL Benchmark (Colamd, iterations=2):
  MultifrontalSolver: 0.143529 s
  Standard GTSAM:     0.226813 s
  Speedup:            1.58026x

Processing BAL file: /home/dellaert/git/gtsam/examples/Data/dubrovnik-88-64298-pre.txt

BAL Benchmark (Burn, iterations=2):
  MultifrontalSolver: 1.03922 s
  Standard GTSAM:     1.27513 s
  Speedup:            1.22701x

BAL Benchmark (Metis, iterations=2):
  MultifrontalSolver: 1.10096 s
  Standard GTSAM:     1.36733 s
  Speedup:            1.24194x

BAL Benchmark (Schur, iterations=2):
  MultifrontalSolver: 1.05544 s
  Standard GTSAM:     1.30418 s
  Speedup:            1.23568x

BAL Benchmark (Colamd, iterations=2):
  MultifrontalSolver: 1.08904 s
  Standard GTSAM:     1.24707 s
  Speedup:            1.14511x

Processing BAL file: /home/dellaert/git/gtsam/examples/Data/dubrovnik-135-90642-pre.txt

BAL Benchmark (Burn, iterations=2):
  MultifrontalSolver: 1.77369 s
  Standard GTSAM:     2.11584 s
  Speedup:            1.1929x

BAL Benchmark (Metis, iterations=2):
  MultifrontalSolver: 1.89476 s
  Standard GTSAM:     2.05151 s
  Speedup:            1.08273x

BAL Benchmark (Schur, iterations=2):
  MultifrontalSolver: 1.85406 s
  Standard GTSAM:     2.00922 s
  Speedup:            1.08369x

BAL Benchmark (Colamd, iterations=2):
  MultifrontalSolver: 1.79971 s
  Standard GTSAM:     1.94567 s
  Speedup:            1.0811x

Benchmark (T=10, iterations=500):
Symbolic cluster structure
  cliques:    7
  frontals:   max=6 avg=2.85714
  separators: max=4 avg=2.28571
  total dim:  max=6 avg=5.14286
  children:   max=3 avg=0.857143
Clique structure after merge
  cliques:    2
  frontals:   max=10 avg=10
  separators: max=2 avg=1
  total dim:  max=12 avg=11
  children:   max=1 avg=0.5

Timing results:
  MultifrontalSolver: 0.00273653 s
  Standard GTSAM:     0.0306251 s
  Speedup:            11.1912x

Benchmark (T=50, iterations=500):
Symbolic cluster structure
  cliques:    41
  frontals:   max=6 avg=2.43902
  separators: max=4 avg=3.41463
  total dim:  max=6 avg=5.85366
  children:   max=4 avg=0.97561
Clique structure after merge
  cliques:    6
  frontals:   max=24 avg=16.6667
  separators: max=4 avg=2.33333
  total dim:  max=24 avg=19
  children:   max=4 avg=0.833333

Timing results:
  MultifrontalSolver: 0.0137741 s
  Standard GTSAM:     0.0637868 s
  Speedup:            4.63092x

Benchmark (T=100, iterations=500):
Symbolic cluster structure
  cliques:    90
  frontals:   max=6 avg=2.22222
  separators: max=4 avg=3.68889
  total dim:  max=6 avg=5.91111
  children:   max=4 avg=0.988889
Clique structure after merge
  cliques:    14
  frontals:   max=16 avg=14.2857
  separators: max=4 avg=3.14286
  total dim:  max=20 avg=17.4286
  children:   max=9 avg=0.928571

Timing results:
  MultifrontalSolver: 0.0155799 s
  Standard GTSAM:     0.124997 s
  Speedup:            8.02298x

Benchmark (T=500, iterations=500):
Symbolic cluster structure
  cliques:    485
  frontals:   max=6 avg=2.06186
  separators: max=4 avg=3.92165
  total dim:  max=6 avg=5.98351
  children:   max=4 avg=0.997938
Clique structure after merge
  cliques:    71
  frontals:   max=22 avg=14.0845
  separators: max=4 avg=3.71831
  total dim:  max=24 avg=17.8028
  children:   max=12 avg=0.985915

Timing results:
  MultifrontalSolver: 0.0528516 s
  Standard GTSAM:     0.492075 s
  Speedup:            9.3105x

Benchmark (T=1000, iterations=500):
Symbolic cluster structure
  cliques:    982
  frontals:   max=6 avg=2.03666
  separators: max=4 avg=3.95723
  total dim:  max=6 avg=5.99389
  children:   max=4 avg=0.998982
Clique structure after merge
  cliques:    144
  frontals:   max=18 avg=13.8889
  separators: max=4 avg=3.83333
  total dim:  max=20 avg=17.7222
  children:   max=10 avg=0.993056

Timing results:
  MultifrontalSolver: 0.0921179 s
  Standard GTSAM:     1.02779 s
  Speedup:            11.1573x

Benchmark (T=5000, iterations=500):
Symbolic cluster structure
  cliques:    4978
  frontals:   max=6 avg=2.00884
  separators: max=4 avg=3.98955
  total dim:  max=6 avg=5.99839
  children:   max=4 avg=0.999799
Clique structure after merge
  cliques:    728
  frontals:   max=22 avg=13.7363
  separators: max=4 avg=3.96154
  total dim:  max=24 avg=17.6978
  children:   max=9 avg=0.998626

Timing results:
  MultifrontalSolver: 0.421384 s
  Standard GTSAM:     4.9015 s
  Speedup:            11.6319x

dellaert commented Jan 5, 2026

Unfortunately, without TBB on Linux, for BAL datasets, we have worse than GTSAM results - which stumps me. The overall picture seems to be:

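Chains:

| Platform | Single-thread | TBB              |
|----------|---------------|------------------|
| Mac      | Faster        | Much Faster      |
| Linux    | Faster        | Much Much Faster |

BAL:

| Platform | Single-thread | TBB         |
|----------|---------------|-------------|
| Mac      | Faster        | Much Faster |
| Linux    | Slower        | Faster      |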

Some detailed profiling on Linux, single-thread, might at least bring that on par.

@dellaert dellaert requested a review from ProfFan January 5, 2026 05:40

dellaert commented Jan 5, 2026

@ProfFan the RSd_ refactor works, but on Linux still no improvement for single-threaded case (for large BAL): the hotspot just shifted to inPlaceQR. Switching to Cholesky did not help. Still stumped, but I think this PR can be merged.

dellaert commented Jan 5, 2026

PS for chains even single-threaded Linux is 4-6 times faster, it's really something about the massive fan-in for BAL cliques.

@ProfFan ProfFan left a comment

I'll do more profiling with this PR merged

@dellaert dellaert merged commit 3eebc5b into develop Jan 6, 2026
34 checks passed
@dellaert dellaert deleted the feature/fasterSolver branch January 6, 2026 03:34