Conversation

@lkdvos
Member

@lkdvos lkdvos commented Nov 3, 2025

This is a combination of setting up a more thorough benchmark suite to allow further testing in the future, and various changes for our AC and AC2 contractions.

As already pointed out in #270, with a follow-up in #330: when we are performing repeated applications of the effective local Hamiltonian, it can actually be beneficial to alter the contraction order and hoist certain contractions out of the loop.
This is especially important in the presence of symmetries, as it avoids repeating permutations inside the loop, which become increasingly costly there.

With this in mind, I first refactored the way we "prepare the operators" before `fixedpoint` is invoked. Basically, I added an optional hook there to do some pre- and post-processing on the operator and initial vector, precisely at the point where we know the operator will be applied repeatedly.
This also provides the entry point for the benchmark suite, so that preparation time can be gauged against application time.
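A minimal sketch of the idea (the function and hook names here are placeholders for illustration, not the actual MPSKit API):

```julia
# Hypothetical sketch: the hook wraps the solver call, so the (un)preparation
# happens exactly once per sequence of repeated operator applications.
function solve_with_preparation(H_eff, x₀, solver)
    H′, x₀′ = prepare(H_eff, x₀)   # pre-processing: hoist contractions out of the hot loop
    λ, x′ = solver(H′, x₀′)        # the solver applies H′ many times (e.g. a Krylov method)
    return λ, unprepare(H′, x′)    # post-processing: e.g. unfuse the legs of the eigenvector
end
```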

Then, I've added 3 updated contraction schemes, each with different optimizations.

0. Before the changes

The current state of affairs is that there are two different implementations, one for MPOs and one for Hamiltonians. The Hamiltonian implementation tries to make maximal use of the specific structure of the JordanMPOTensors, by more or less manually writing out the different contraction paths.

From the tests and benchmarks, this seems to really only matter for (next-)nearest-neighbor layouts, and the added code complexity should definitely be taken into account.
Additionally, this implementation silently assumes that the environments come from properly gauged MPS, as it replaces GL[1] and GR[end] with identities to avoid having to do those contractions.

I. Precomputed derivatives

This first implementation is just the simple contraction order change, from (((GL * x) * O) * O) * GR to ((GL * O) * x) * (O * GR).
This increases the preparation time, but the tests seem to show that this contraction order is beneficial over repeated applications, as long as the MPO virtual space is large compared to the MPS physical space.
An additional optimization here is the realization that GL * O and O * GR are now dense BlockTensorMaps, so it pays off to convert them to a TensorMap outside of the loop to further avoid some overhead.
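As a rough illustration of the two orders for AC2 (written with plain dense arrays and TensorOperations' `@tensor`; index conventions and function names are schematic, not MPSKit's actual implementation):

```julia
using TensorOperations

# Naive order: everything is contracted inside the hot loop.
function apply_AC2_naive(x, O1, O2, GL, GR)
    @tensor y[a, s1, s2, d] := GL[a, p, a′] * x[a′, t1, t2, d′] *
        O1[p, s1, q, t1] * O2[q, s2, r, t2] * GR[d′, r, d]
    return y
end

# Prepared order: GL * O1 and O2 * GR are computed once, outside of the loop.
function prepare_AC2(O1, O2, GL, GR)
    @tensor GLO[a, s1, q, a′, t1] := GL[a, p, a′] * O1[p, s1, q, t1]
    @tensor OGR[q, s2, d′, t2, d] := O2[q, s2, r, t2] * GR[d′, r, d]
    return GLO, OGR
end

function apply_AC2_prepared(x, GLO, OGR)
    # intended pairwise order: contract (GLO * x) first, then with OGR
    @tensor y[a, s1, s2, d] := GLO[a, s1, q, a′, t1] * x[a′, t1, t2, d′] * OGR[q, s2, d′, t2, d]
    return y
end
```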

II. Precomputed derivatives

The second implementation adds an optimization on top of this for all symmetry classes but the NoBraiding ones, by doing a non-planar change of the index order of some of the intermediate tensors.
The idea is that when permuting (GL * O) * x such that it can be BLAS-contracted with (O * GR), we can either do the planar repartition(GLOX, 2, 3) or the non-planar permute(GLOX, ((1, 2), (3, 4, 5))).
For non-symmetric tensors the second is clearly beneficial, since that permutation is trivial and avoids an intermediate permutation altogether; the benchmarks suggest that this advantage persists for other symmetries as well.
This is presumably due to better data locality in the permutation compared to the repartition.
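Schematically, for a 5-leg intermediate the two options look as follows (placeholder spaces; constructor syntax may differ slightly between TensorKit versions):

```julia
using TensorKit

V = ℂ^4
GLOX = randn(ComplexF64, V ⊗ V ⊗ V ⊗ V ← V)     # stand-in for the intermediate (GL * O) * x

planar    = repartition(GLOX, 2, 3)              # planar: only bends legs, no braiding
nonplanar = permute(GLOX, ((1, 2), (3, 4, 5)))   # non-planar permute: a trivial reshape for plain tensors
```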

III. Precomputed derivatives

The final change is even more radical: it additionally fuses indices that can be kept together during the entire contraction.
Basically, the contraction is modified to (Fl * GL * O * Fl') * (Fl * x * Fr') * (Fr * O * GR * Fr'), and the resulting eigenvector is unfused in a post-processing step.
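For reference, a minimal sketch of how such fusers could be constructed in TensorKit (the spaces here are placeholders, and this is not the PR's actual code):

```julia
using TensorKit

V_mps, V_mpo = ℂ^16, ℂ^5   # placeholder virtual spaces

# unitary fusers mapping a product of spaces onto a single fused space
Fl = isomorphism(fuse(V_mps ⊗ V_mpo), V_mps ⊗ V_mpo)   # fuses the left (MPS, MPO) virtual legs
Fr = isomorphism(fuse(V_mpo ⊗ V_mps), V_mpo ⊗ V_mps)   # fuses the right (MPO, MPS) virtual legs

# The contraction then runs entirely in the fused basis, and the eigenvector is
# unfused afterwards by applying the adjoint fusers.
```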

There are several benefits to this approach.
The first is that the number of subblocks, and therefore the overhead of dealing with symmetries, is reduced.
The operations can then act on larger contiguous memory regions, which further increases performance.
A secondary benefit is that this results in AC and AC2 contractions that are indistinguishable from the compiler's perspective, reducing the number of contractions that have to be written and compiled.
Somewhat surprisingly, this leads to code simplification rather than added complexity, which I definitely like.


My benchmark results are added below, where everything is plotted relative to the current status.
I'm not too sure about how to interpret the results.
As suspected, for large MPOs this clearly outperforms the previous implementation, with speedups ranging from 2x to 5x depending on the symmetry and the specific bond dimensions; for smaller MPOs, however, these "optimizations" are clearly slowing things down.
Theoretically, I think we should expect that as the bond dimension increases further, the effect of changing the contraction order should start to dominate and the "normal" contraction order should ultimately win, but it is currently not clear whether we can ever reach bond dimensions in that regime.

[Figures: bench_times_relative, bench_memory_relative]

I think I am slightly in favor of changing the implementations, mostly because this means that we rely as much as possible on TensorKit's contractions, so we can focus our multithreading and hardware-acceleration efforts in a single place.
Additionally, I obviously also like that this implementation is a lot easier to maintain, and no longer silently assumes that the MPS are properly gauged.

For further reference, here are the same plots, but now including the "preparation time" of the MPOs, by plotting prep_time + n_applications * contract_time + unprep_time with n = 3 and n = 10, which clearly shows that this is only something we want to do for repeated applications.
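Written out, the quantity being compared is

$$t_\mathrm{total}(n) = t_\mathrm{prep} + n \cdot t_\mathrm{apply} + t_\mathrm{unprep},$$

and the amortized per-application cost would then simply be $t_\mathrm{total}(n) / n$.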

[Figures: bench_prep_times_relative_n=3, bench_prep_times_relative_n=10]

@codecov

codecov bot commented Nov 3, 2025

Codecov Report

❌ Patch coverage is 35.10972% with 207 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../algorithms/derivatives/hamiltonian_derivatives.jl | 46.71% | 81 Missing ⚠️ |
| src/utility/utility.jl | 36.84% | 60 Missing ⚠️ |
| src/algorithms/derivatives/mpo_derivatives.jl | 0.00% | 35 Missing ⚠️ |
| src/utility/allocator.jl | 6.25% | 30 Missing ⚠️ |
| src/algorithms/derivatives/derivatives.jl | 0.00% | 1 Missing ⚠️ |
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/MPSKit.jl | 100.00% <ø> (ø) |
| src/algorithms/fixedpoint.jl | 69.23% <100.00%> (+9.23%) ⬆️ |
| src/utility/multiline.jl | 62.96% <100.00%> (-0.68%) ⬇️ |
| src/algorithms/derivatives/derivatives.jl | 91.30% <0.00%> (-2.03%) ⬇️ |
| src/utility/allocator.jl | 6.25% <6.25%> (ø) |
| src/algorithms/derivatives/mpo_derivatives.jl | 38.09% <0.00%> (-29.26%) ⬇️ |
| src/utility/utility.jl | 53.52% <36.84%> (-21.80%) ⬇️ |
| .../algorithms/derivatives/hamiltonian_derivatives.jl | 54.54% <46.71%> (-38.23%) ⬇️ |

... and 4 files with indirect coverage changes


@lkdvos lkdvos force-pushed the ld-benchmark branch 3 times, most recently from ec07c62 to efad51c Compare November 3, 2025 22:35
@lkdvos lkdvos marked this pull request as ready for review November 4, 2025 16:05
@lkdvos
Member Author

lkdvos commented Nov 4, 2025

@leburgel, @VictorVanthilt, @AFeuerpfeil, I would love to hear your opinions on this

@VictorVanthilt
Member

I went through this earlier today but was waiting for the results. Will look at this tomorrow!

@AFeuerpfeil
Contributor

Looks very fascinating, but it is not particularly clear to me whether one option is objectively better.
I will be back at the institute on Wednesday evening, so we can discuss Wednesday or Thursday.

@VictorVanthilt
Member

What I gather from this is that the changes are beneficial for large MPOs that are applied multiple times; I'm thinking cylinder and quantum chemistry Hamiltonians. I'm rather keen to find out what this means for things like `approximate`, where the AC projections are only applied once instead of multiple times as in Krylov methods. Would this speed things up for large high-accuracy time-evolution MPOs? What would the changes mean for MPOs like WII?

I'm most definitely against slowing down code for "non-exotic" Hamiltonians. The work you did differentiating between the different cases of the Hamiltonian terms was very good and sped things up a lot, especially for nearest-neighbour Hamiltonians, which, although not super cool, are used very often.
I have heard from some people that cylinder DMRG was too slow for them in MPSKit and that they ended up switching to alternative MPS packages, which hurts my feelings, but probably also yours :/.

Could it make sense to have a hybrid approach that still has the specialized Hamiltonian derivatives, but where the user can switch on one of the above implementations by supplying a kwarg? (I'm thinking something like `find_groundstate(...; cylinder_mode=true)` or `find_groundstate(...; preprocessing=true)`, or something catchier.)

Could you maybe run some tests for TDVP (which has multiple applications) and `approximate` with a `taylor_cluster` MPO of varying order/bond dimension?

@leburgel
Member

leburgel commented Nov 5, 2025

This is some very good work, with some very nice results. As you said and as I gather from the plots, there's no real clear winner though.

While the speedups for complicated Hamiltonians and symmetries are very exciting, I think there will always be a (large) user base that wants to run simulations for nearest-neighbor (or at least not too long-range) Hamiltonians with trivial symmetries and large bond dimensions. In general, making assumptions about the range of bond dimensions that will be used is not really something we can do realistically. People will always have the tendency to push in that direction, and losing performance when this is pushed very far is not a prospect I'm particularly fond of. In addition, as Victor said, there are settings where the effective Hamiltonians are not applied in a hot loop, and slowing these down also doesn't sound that good.

On the other hand, there will also be people who push in the massively-complex-interactions direction, in which case it would make no sense at all not to use the clearly superior implementation in that setting.

So my opinion is rather annoying, in the sense that my immediate thought would be to only change the implementations if we make them switchable between the current version and the above v3. That immediately destroys all the benefits in terms of maintainability and duplication, which defeats quite a bit of the purpose of what you're getting at. But the way I see it, while the new implementation can clearly be a massive win, it would be very hard to convince anyone that it's worth slowing down their specific simulations.

@lkdvos
Member Author

lkdvos commented Nov 5, 2025

Thanks everyone for the comments!
I'll go back to the drawing board, but let me outline my current thoughts:

  • Given what I know about my JordanMPO specializations and the theoretical results, I am quite sure that these improvements are only relevant for NN Hamiltonians with single-site updates, or NNN Hamiltonians with two-site updates. For all other terms, the JordanMPO implementation was calling the non-precontracted MPO code that is currently still being called. I will try to come up with some benchmarks to solidify this claim and show that for the generic case, this PR is not slowing us down.
  • I'm not willing to sacrifice performance for general MPOs based solely on NN and NNN models. I would argue that a lot of the interesting physics requires cylinders etc. these days, and I have clearly been noticing that we are falling behind and people are abandoning MPSKit because our performance is just not there (as is quite obvious from the benchmark results). I think that our MPO implementation should reflect a generic MPO and be optimized for that case, which I think the current implementation does achieve.
  • I do agree that NN and, by extension, NNN models have their place, and we should aim to optimize for them too. However, in these cases there are actually a lot more performance improvements we could get without the added complexity of the full generic MPO case, so another approach that could work is to simply add a NearestNeighbourHamiltonian and a NextNearestNeighbourHamiltonian, and really squeeze the most out of those. While at first glance this sounds like it might be a little more code, I think the reduced complexity of being able to focus on specific cases would actually reduce the maintenance burden in the long run.

Given this, I'll probably first try and see if I can get a NearestNeighbourHamiltonian implemented without having to add 10_000 new cases of everything, and then re-evaluate the performance of that with respect to the JordanMPO approach.

@AFeuerpfeil
Contributor

Sounds reasonable from my perspective.
I will take a look at your v3 code in detail, as I am a little confused by the results:
If we look at computation time, then if I recall correctly, changing the contraction order to the precomputed one should be a factor of d (or maybe d^2 for AC2) slower than the non-precomputed contraction.
However, your first data set for Heisenberg NN indicates that it is much worse than that factor of 2, and the ratio seems to increase further with bond dimension (this could just be a fluke, as there are not a lot of data points), as if it scaled like xi^3 log(xi).

@lkdvos lkdvos force-pushed the ld-benchmark branch 7 times, most recently from bd05a08 to cb8fefa Compare December 14, 2025 22:03
Revert "Precomputed derivatives III"

This reverts commit d704ccf.

Precomputed derivatives IV

Precomputed derivatives V

reduce test terminal clutter

update plots script

playing around with more kernels

remove piracy

rework jordanmpo

small fix
@lkdvos
Member Author

lkdvos commented Dec 17, 2025

I did a bit more work on this, trying to bring back the JordanMPO machinery while also getting the benefits of the precontracted systems. I ran some more benchmarks, for which the results are below.

The versions are:

  • v1: base case
  • v2: precontract MPO and environment, fuse legs that don't have to be separate, use a dedicated matrix contraction to avoid recoupling
  • v3: similar to v2, but with an explicit loop over matrix contractions to avoid having to copy due to mismatched strides in the dedicated matrix contraction routine (see the sketch below this list)
  • v4: using a buffer
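As a rough sketch of the explicit block loop in v3 (a hypothetical helper, not the PR's actual code; it assumes all tensors share the same block structure):

```julia
using TensorKit, LinearAlgebra

# Instead of a single contraction call that may have to copy data to obtain
# BLAS-compatible strides, loop over the symmetry sectors and multiply the
# dense blocks directly.
function blockwise_apply!(y::AbstractTensorMap, A::AbstractTensorMap, x::AbstractTensorMap)
    for (c, xc) in blocks(x)                # c: sector label, xc: dense matrix block of x
        mul!(block(y, c), block(A, c), xc)  # plain gemm per block, no intermediate copies
    end
    return y
end
```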

The main thing to notice is that the spread is a lot smaller now for the nearest-neighbour and next-nearest-neighbour cases, which is to be expected since I didn't really touch the JordanMPO specializations that are mostly useful there. Looking at the exact numbers, I would even dare say that some of the remaining spread is likely just noise in the data, so I would claim the performance in these cases is unchanged.
I do still get the massive speedup for the larger MPO cases, which is why I would argue that once I finish this PR it is actually worth merging in its current state.

[Figure: relative time of prepared application (prepared_times_rel)]
[Figure: relative memory of prepared application (prepared_memory_rel)]

A different thing to look at is the "unprepared application" time vs the "preparation + prepared application" time.
To gauge this, I also separated out part of the JordanMPO optimizations I did, to really show the costs associated with this even in the base case.
In the plots below, I'm showing only the unprepared contraction for v1 and v4, and then showing the time per application, including the preparation time amortized over either 1, 3 or 10 applications. It's a bit harder to see, but I've already spent too much time trying to visualize this properly, and the main point is that I think v4 is overall a good fit.
I'll go ahead and iron out the kinks that make the tests fail, but otherwise I would be happy to include these changes.

[Figure: times for n=1 application (amortized_times_1)]
[Figure: times for n=3 applications (amortized_times_3)]
[Figure: times for n=10 applications (amortized_times_10)]
