Conversation

@lkdvos
Member

@lkdvos lkdvos commented Nov 3, 2025

This is a combination of setting up a more thorough benchmark suite to allow further testing in the future, and various changes for our AC and AC2 contractions.

As already pointed out in #270, with a follow-up in #330: when we are performing repeated applications of the effective local Hamiltonian, it can actually be beneficial to alter the contraction order and hoist certain contractions out of the loop.
This is especially important in the presence of symmetries, as it avoids repeating permutations inside the loop, which become increasingly costly there.

With this in mind, I first refactored the way we "prepare the operators" before `fixedpoint` is invoked. Basically, I added an optional hook there to do some pre- and post-processing on the operator and initial vector, precisely at the point where we know the operator will be applied repeatedly.
This also provides the entry point for the benchmark suite, so that preparation time can be gauged against application time.
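A minimal sketch of the idea (the function and hook names here are placeholders for illustration, not the actual MPSKit API):

```julia
# Hypothetical sketch: the hook wraps the solver call, so the (un)preparation
# happens exactly once per sequence of repeated operator applications.
function solve_with_preparation(H_eff, x₀, solver)
    H′, x₀′ = prepare(H_eff, x₀)   # pre-processing: hoist contractions out of the hot loop
    λ, x′ = solver(H′, x₀′)        # the solver applies H′ many times (e.g. a Krylov method)
    return λ, unprepare(H′, x′)    # post-processing: e.g. unfuse the legs of the eigenvector
end
```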

Then, I've added 3 updated contraction schemes, each with different optimizations.

0. Before the changes

The current state of affairs is that there are two different implementations, one for MPOs and one for Hamiltonians. The Hamiltonian implementation tries to make maximal use of the specific structure of the JordanMPOTensors, by more or less manually writing out the different contraction paths.

From the tests and benchmarks, this seems to really only matter for (next-)nearest-neighbor layouts, and the added code complexity should definitely be taken into account.
Additionally, this implementation silently assumes that the environments come from properly gauged MPS, as it replaces GL[1] and GR[end] with identities to avoid having to do those contractions.

I. Precomputed derivatives

This first implementation is just the simple contraction order change, from (((GL * x) * O) * O) * GR to ((GL * O) * x) * (O * GR).
This increases the preparation time, but the tests seem to show that this contraction order is beneficial over repeated applications, as long as the MPO virtual space is large compared to the MPS physical space.
An additional optimization here is the realization that GL * O and O * GR are now dense BlockTensorMaps, so it pays off to convert them to a TensorMap outside of the loop to further avoid some overhead.
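As a rough illustration of the two orders for AC2 (written with plain dense arrays and TensorOperations' `@tensor`; index conventions and function names are schematic, not MPSKit's actual implementation):

```julia
using TensorOperations

# Naive order: everything is contracted inside the hot loop.
function apply_AC2_naive(x, O1, O2, GL, GR)
    @tensor y[a, s1, s2, d] := GL[a, p, a′] * x[a′, t1, t2, d′] *
        O1[p, s1, q, t1] * O2[q, s2, r, t2] * GR[d′, r, d]
    return y
end

# Prepared order: GL * O1 and O2 * GR are computed once, outside of the loop.
function prepare_AC2(O1, O2, GL, GR)
    @tensor GLO[a, s1, q, a′, t1] := GL[a, p, a′] * O1[p, s1, q, t1]
    @tensor OGR[q, s2, d′, t2, d] := O2[q, s2, r, t2] * GR[d′, r, d]
    return GLO, OGR
end

function apply_AC2_prepared(x, GLO, OGR)
    # intended pairwise order: contract (GLO * x) first, then with OGR
    @tensor y[a, s1, s2, d] := GLO[a, s1, q, a′, t1] * x[a′, t1, t2, d′] * OGR[q, s2, d′, t2, d]
    return y
end
```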

II. Precomputed derivatives

The second implementation adds an optimization on top of this for all symmetry classes but the NoBraiding ones, by doing a non-planar change of the index order of some of the intermediate tensors.
The idea is that when permuting (GL * O) * x such that it can be BLAS-contracted with (O * GR), we can either do the planar repartition(GLOX, 2, 3) or the non-planar permute(GLOX, ((1, 2), (3, 4, 5))).
For non-symmetric tensors the second is clearly beneficial, since that permutation is trivial and avoids an intermediate permutation altogether; the benchmarks suggest that this advantage persists for other symmetries as well.
This is presumably due to better data locality in the permutation compared to the repartition.
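Schematically, for a 5-leg intermediate the two options look as follows (placeholder spaces; constructor syntax may differ slightly between TensorKit versions):

```julia
using TensorKit

V = ℂ^4
GLOX = randn(ComplexF64, V ⊗ V ⊗ V ⊗ V ← V)     # stand-in for the intermediate (GL * O) * x

planar    = repartition(GLOX, 2, 3)              # planar: only bends legs, no braiding
nonplanar = permute(GLOX, ((1, 2), (3, 4, 5)))   # non-planar permute: a trivial reshape for plain tensors
```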

III. Precomputed derivatives

The final change is even more radical: it additionally fuses indices that can be kept together during the entire contraction.
Basically, the contraction is modified to (Fl * GL * O * Fl') * (Fl * x * Fr') * (Fr * O * GR * Fr'), and the resulting eigenvector is unfused in a post-processing step.
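For reference, a minimal sketch of how such fusers could be constructed in TensorKit (the spaces here are placeholders, and this is not the PR's actual code):

```julia
using TensorKit

V_mps, V_mpo = ℂ^16, ℂ^5   # placeholder virtual spaces

# unitary fusers mapping a product of spaces onto a single fused space
Fl = isomorphism(fuse(V_mps ⊗ V_mpo), V_mps ⊗ V_mpo)   # fuses the left (MPS, MPO) virtual legs
Fr = isomorphism(fuse(V_mpo ⊗ V_mps), V_mpo ⊗ V_mps)   # fuses the right (MPO, MPS) virtual legs

# The contraction then runs entirely in the fused basis, and the eigenvector is
# unfused afterwards by applying the adjoint fusers.
```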

There are several benefits to this approach.
The first is that the number of subblocks, and therefore the overhead of dealing with symmetries, is reduced.
The operations can then act on larger contiguous memory regions, which further increases performance.
A secondary benefit is that this results in AC and AC2 contractions that are indistinguishable from the compiler's perspective, reducing the number of contractions that have to be written and compiled.
Somewhat surprisingly, this leads to code simplification rather than added complexity, which I definitely like.


My benchmark results are added below, where everything is plotted relative to the current status.
I'm not too sure about how to interpret the results.
As suspected, for large MPOs this clearly outperforms the previous implementation, with speedups ranging from 2x to 5x depending on the symmetry and the specific bond dimensions; for smaller MPOs, however, these "optimizations" are clearly slowing things down.
Theoretically, I think we should expect that as the bond dimension increases further, the effect of changing the contraction order should start to dominate and the "normal" contraction order should ultimately win, but it is currently not clear whether we can ever reach bond dimensions in that regime.

[Figures: bench_times_relative, bench_memory_relative]

I think I am slightly in favor of changing the implementations, mostly because this means that we rely as much as possible on TensorKit's contractions, so we can focus our multithreading and hardware-acceleration efforts in a single place.
Additionally, I obviously also like that this implementation is a lot easier to maintain, and no longer silently assumes that the MPS are properly gauged.

For further reference, here are the same plots, but now including the "preparation time" of the MPOs, by plotting prep_time + n_applications * contract_time + unprep_time with n = 3 and n = 10, which clearly shows that this is only something we want to do for repeated applications.
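Written out, the quantity being compared is

$$t_\mathrm{total}(n) = t_\mathrm{prep} + n \cdot t_\mathrm{apply} + t_\mathrm{unprep},$$

and the amortized per-application cost would then simply be $t_\mathrm{total}(n) / n$.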

[Figures: bench_prep_times_relative_n=3, bench_prep_times_relative_n=10]

@codecov

codecov bot commented Nov 3, 2025

Codecov Report

❌ Patch coverage is 35.10972% with 207 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../algorithms/derivatives/hamiltonian_derivatives.jl | 46.71% | 81 Missing ⚠️ |
| src/utility/utility.jl | 36.84% | 60 Missing ⚠️ |
| src/algorithms/derivatives/mpo_derivatives.jl | 0.00% | 35 Missing ⚠️ |
| src/utility/allocator.jl | 6.25% | 30 Missing ⚠️ |
| src/algorithms/derivatives/derivatives.jl | 0.00% | 1 Missing ⚠️ |
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/MPSKit.jl | 100.00% <ø> (ø) |
| src/algorithms/fixedpoint.jl | 69.23% <100.00%> (+9.23%) ⬆️ |
| src/utility/multiline.jl | 62.96% <100.00%> (-0.68%) ⬇️ |
| src/algorithms/derivatives/derivatives.jl | 91.30% <0.00%> (-2.03%) ⬇️ |
| src/utility/allocator.jl | 6.25% <6.25%> (ø) |
| src/algorithms/derivatives/mpo_derivatives.jl | 38.09% <0.00%> (-29.26%) ⬇️ |
| src/utility/utility.jl | 53.52% <36.84%> (-21.80%) ⬇️ |
| .../algorithms/derivatives/hamiltonian_derivatives.jl | 54.54% <46.71%> (-38.23%) ⬇️ |

... and 4 files with indirect coverage changes


@lkdvos lkdvos force-pushed the ld-benchmark branch 3 times, most recently from ec07c62 to efad51c Compare November 3, 2025 22:35
@lkdvos lkdvos marked this pull request as ready for review November 4, 2025 16:05
@lkdvos
Member Author

lkdvos commented Nov 4, 2025

@leburgel, @VictorVanthilt, @AFeuerpfeil, I would love to hear your opinions on this

@VictorVanthilt
Member

I went through this earlier today but was waiting for the results. Will look at this tomorrow!

@AFeuerpfeil
Contributor

Looks very fascinating, but it is not particularly clear to me whether one option is objectively better.
I will be back at the institute on Wednesday evening, so we can discuss Wednesday or Thursday.

@VictorVanthilt
Member

What I gather from this is that the changes are beneficial for large MPOs that are applied multiple times; I'm thinking cylinder and quantum chemistry Hamiltonians. I'm rather keen to find out what this means for things like `approximate`, where the AC projections are only applied once instead of multiple times as in Krylov methods. Would this speed things up for large high-accuracy time-evolution MPOs? What would the changes mean for MPOs like WII?

I'm most definitely against slowing down code for "non-exotic" Hamiltonians. The work you did differentiating between the different cases of the Hamiltonian terms was very good and sped things up a lot, especially for nearest-neighbour Hamiltonians, which, although not super cool, are used very often.
I have heard from some people that cylinder DMRG was too slow for them in MPSKit and that they ended up switching to alternative MPS packages, which hurts my feelings, but probably also yours :/.

Could it make sense to have a hybrid approach that still has the specialized Hamiltonian derivatives, but where the user can switch on one of the above implementations by supplying a kwarg? (I'm thinking something like `find_groundstate(...; cylinder_mode=true)` or `find_groundstate(...; preprocessing=true)`, or something catchier.)

Could you maybe run some tests for TDVP (which has multiple applications) and `approximate` with a `taylor_cluster` MPO of varying order/bond dimension?

@leburgel
Member

leburgel commented Nov 5, 2025

This is some very good work, with some very nice results. As you said and as I gather from the plots, there's no real clear winner though.

While the speedups for complicated Hamiltonians and symmetries are very exciting, I think there will always be a (large) user base that wants to run simulations for nearest-neighbor (or at least not too long-range) Hamiltonians with trivial symmetries and large bond dimensions. In general, making assumptions about the range of bond dimensions that will be used is not really something we can do realistically. People will always have the tendency to push in that direction, and losing performance when this is pushed very far is not a prospect I'm particularly fond of. In addition, as Victor said, there are settings where the effective Hamiltonians are not applied in a hot loop, and slowing these down also doesn't sound that good.

On the other hand, there will also be people who push in the massively-complex-interactions direction, in which case it would make no sense at all not to use the clearly superior implementation in that setting.

So my opinion is rather annoying, in the sense that my immediate thought would be to only change the implementations if we make them switchable between the current version and the above v3. That immediately destroys all the benefits in terms of maintainability and duplication, which defeats quite a bit of the purpose of what you're getting at. But the way I see it, while the new implementation can clearly be a massive win, it would be very hard to convince anyone that it's worth slowing down their specific simulations.

@lkdvos
Member Author

lkdvos commented Nov 5, 2025

Thanks everyone for the comments!
I'll go back to the drawing board, but let me outline my current thoughts:

  • Given what I know about my JordanMPO specializations and the theoretical results, I am quite sure that these improvements are only relevant for NN Hamiltonians with single-site updates, or NNN Hamiltonians with two-site updates. For all other terms, the JordanMPO implementation was calling the non-precontracted MPO code that is currently still being called. I will try to come up with some benchmarks to solidify this claim and show that for the generic case, this PR is not slowing us down.
  • I'm not willing to sacrifice performance for general MPOs based solely on NN and NNN models. I would argue that a lot of the interesting physics requires cylinders etc. these days, and I have clearly been noticing that we are falling behind and people are abandoning MPSKit because our performance is just not there (as is quite obvious from the benchmark results). I think that our MPO implementation should reflect a generic MPO and be optimized for that case, which I think the current implementation does achieve.
  • I do agree that NN and, by extension, NNN models have their place, and we should aim to optimize for them too. However, in these cases there are actually a lot more performance improvements we could get without the added complexity of the full generic MPO case, so another approach that could work is to simply add a NearestNeighbourHamiltonian and a NextNearestNeighbourHamiltonian, and really squeeze the most out of those. While at first glance this sounds like it might be a little more code, I think the reduced complexity of being able to focus on specific cases would actually reduce the maintenance burden in the long run.

Given this, I'll probably first try and see if I can get a NearestNeighbourHamiltonian implemented without having to add 10_000 new cases of everything, and then re-evaluate the performance of that with respect to the JordanMPO approach.

@AFeuerpfeil
Contributor

Sounds reasonable from my perspective.
I will take a look at your v3 code in detail, as I am a little confused by the results:
If we look at computation time, then if I recall correctly, changing the contraction order to the precomputed one should be a factor of d (or maybe d^2 for AC2) slower than the non-precomputed contraction.
However, your first data set for Heisenberg NN indicates that it is much worse than that factor of 2, and the ratio seems to increase further with bond dimension (this could just be a fluke, as there are not a lot of data points), as if it scaled like xi^3 log(xi).

@lkdvos lkdvos force-pushed the ld-benchmark branch 7 times, most recently from bd05a08 to cb8fefa Compare December 14, 2025 22:03
Revert "Precomputed derivatives III"

This reverts commit d704ccf.

Precomputed derivatives IV

Precomputed derivatives V

reduce test terminal clutter

update plots script

playing around with more kernels

remove piracy

rework jordanmpo

small fix
@lkdvos
Member Author

lkdvos commented Dec 17, 2025

I did a bit more work on this, trying to bring back the JordanMPO machinery while also getting the benefits of the precontracted systems. I ran some more benchmarks, for which the results are below.

The versions are:

  • v1: base case
  • v2: precontract MPO and environment, fuse legs that don't have to be separate, use a dedicated matrix contraction to avoid recoupling
  • v3: similar to v2, but with an explicit loop over matrix contractions to avoid having to copy due to mismatched strides in the dedicated matrix contraction routine (see the sketch below this list)
  • v4: using a buffer
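As a rough sketch of the explicit block loop in v3 (a hypothetical helper, not the PR's actual code; it assumes all tensors share the same block structure):

```julia
using TensorKit, LinearAlgebra

# Instead of a single contraction call that may have to copy data to obtain
# BLAS-compatible strides, loop over the symmetry sectors and multiply the
# dense blocks directly.
function blockwise_apply!(y::AbstractTensorMap, A::AbstractTensorMap, x::AbstractTensorMap)
    for (c, xc) in blocks(x)                # c: sector label, xc: dense matrix block of x
        mul!(block(y, c), block(A, c), xc)  # plain gemm per block, no intermediate copies
    end
    return y
end
```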

The main thing to notice is that the spread is a lot smaller now for the nearest-neighbour and next-nearest-neighbour cases, which is to be expected since I didn't really touch the JordanMPO specializations that are mostly useful there. Looking at the exact numbers, I would even dare say that some of the remaining spread is likely just noise in the data, so I would claim the performance in these cases is unchanged.
I do still get the massive speedup for the larger MPO cases, which is why I would argue that once I finish this PR it is actually worth merging in its current state.

[Figure: relative time of prepared application (prepared_times_rel)]
[Figure: relative memory of prepared application (prepared_memory_rel)]

A different thing to look at is the "unprepared application" time vs the "preparation + prepared application" time.
To gauge this, I also separated out part of the JordanMPO optimizations I did, to really show the costs associated with this even in the base case.
In the plots below, I'm showing only the unprepared contraction for v1 and v4, and then showing the time per application, including the preparation time amortized over either 1, 3 or 10 applications. It's a bit harder to see, but I've already spent too much time trying to visualize this properly, and the main point is that I think v4 is overall a good fit.
I'll go ahead and iron out the kinks that make the tests fail, but otherwise I would be happy to include these changes.

[Figure: times for n=1 application (amortized_times_1)]
[Figure: times for n=3 applications (amortized_times_3)]
[Figure: times for n=10 applications (amortized_times_10)]
