[perf] Benchmarks and AC/AC2 contraction improvements #345
base: main
Conversation
Force-pushed from ec07c62 to efad51c.
@leburgel, @VictorVanthilt, @AFeuerpfeil, I would love to hear your opinions on this.
I went through this earlier today but was waiting for the results. Will look at this tomorrow!
Looks very fascinating, but it is not particularly clear to me whether one option is objectively better.
What I gather from this is that the changes are beneficial for large MPOs that are applied multiple times; I'm thinking cylinder and quantum chemistry Hamiltonians. I'm rather keen to find out what this means for things like [...].

I'm most definitely against slowing down code for "non-exotic" Hamiltonians. The work you did differentiating between the different cases of the Hamiltonian terms was very good and sped things up a lot, especially for nearest-neighbour Hamiltonians which, although not super cool, are used very often.

Could it make sense to have a hybrid approach that still has the specialized Hamiltonian derivatives, but where the user can switch on one of the above implementations by supplying a kwarg? (I'm thinking of something like `find_groundstate(...; cylinder_mode=true)`.)

Could you maybe run some tests for TDVP (which has multiple applications) and `approximate` with a taylor_cluster MPO of varying order/bond dimension?
This is some very good work, with some very nice results. As you said, and as I gather from the plots, there's no real clear winner though.

While the speedups for complicated Hamiltonians and symmetries are very exciting, I think there will always be a (large) user base that wants to run simulations for nearest-neighbor (or at least not too long-range) Hamiltonians with trivial symmetries and large bond dimensions. In general, making assumptions on the range of bond dimensions that will be accessed is not really something we can realistically do. People will always have the tendency to push in that direction, and losing performance when this is pushed very far is not a prospect I'm particularly fond of. In addition, as Victor said, there are settings where the effective Hamiltonians are not applied in a hot loop, and slowing this down also doesn't sound that good.

On the other hand, there will also be people who push in the massively-complex-interactions direction, in which case it would make no sense at all not to use the clearly superior implementation in that setting.

So my opinion is rather annoying, in the sense that my immediate thought would be to only change the implementations if we make them switchable between the current version and the above v3. This then immediately destroys all benefits in terms of maintainability and duplication, which defeats quite a bit of the purpose of what you're getting at. But the way I see it, while the new implementation can clearly be a massive win, it would be very hard to convince anyone that it's worth slowing down their specific simulations.
Thanks everyone for the comments!
Given this, I'll probably first try and see if I can get a [...]
Sounds reasonable from my perspective.
Force-pushed from bd05a08 to cb8fefa.
Revert "Precomputed derivatives III" This reverts commit d704ccf. Precomputed derivatives IV Precomputed derivatives V reduce test terminal clutter update plots script playing around with more kernels remove piracy rework jordanmpo small fix
|
I did a bit more work on this, trying to bring back the JordanMPO machinery while also getting the benefits of the precontracted systems. I ran some more benchmarks, for which the results are below. The versions are:
The main thing to notice is that the spread is a lot smaller now for the nearest-neighbour and next-nearest-neighbour cases, which is to be expected since I didn't really touch the JordanMPO specializations that are mostly useful for those. Looking at the exact numbers, I would even dare say that some of the remaining differences are likely just noise in the data, so I would claim the performance in these cases is unchanged. A different thing to look at is the "unprepared application" vs the "preparation + prepared application" times.
[benchmark result plots]
This PR is a combination of setting up a more thorough benchmark suite, to allow further testing in the future, and various changes to our AC and AC2 contractions.
As already pointed out in #270, and in a follow-up in #330, when we are performing repeated applications of the effective local Hamiltonian it can actually be beneficial to alter the contraction order and hoist certain contractions out of the loop.
This is especially relevant in the presence of symmetries, as it avoids performing additional permutations inside the loop, which become increasingly expensive.
With this in mind, I first refactored the way we are "preparing the operators" before `fixedpoint` is invoked. Basically, I added an optional hook there to do some pre- and post-processing on the operator and initial vector, precisely at the point where we know there will be repeated applications of the operator. This also provides the entry point for the benchmark suite, to be able to gauge the preparation time vs the application time.
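To make the idea concrete, here is a minimal sketch of what such a hook could look like; the names `prepare` and `unprepare`, and the exact `fixedpoint` signature and return values shown here, are illustrative assumptions rather than the interface actually added in this PR.

```julia
# Illustrative sketch only: `prepare`/`unprepare` and this `fixedpoint` signature
# are hypothetical stand-ins for the optional hook described above.
function fixedpoint_prepared(H_eff, x0, which, alg)
    H_prep, x_prep = prepare(H_eff, x0)                 # one-time pre-processing (hoisting, fusing, ...)
    val, vec = fixedpoint(H_prep, x_prep, which, alg)   # repeated applications of the operator happen here
    return val, unprepare(H_prep, vec)                  # post-processing, e.g. unfusing the eigenvector
end
```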
Then, I've added 3 updated contraction schemes, each with different optimizations.
0. Before the changes
The current state of affairs is that there are two different implementations, one for MPOs and one for Hamiltonians. The Hamiltonian implementation attempts to maximally exploit the specific structure of the `JordanMPOTensor`s, by more or less manually writing out the different contraction paths. From the tests and benchmarks, this seems to basically only matter for (next-)nearest-neighbor layouts, and the added code complexity should definitely be taken into account.
Additionally, this implementation silently assumes that the environments come from an MPS that is properly gauged, as it replaces `GL[1]` and `GR[end]` with identities to avoid having to do those contractions.
I. Precomputed derivatives
This first implementation is just the simple contraction order change, from `(((GL * x) * O) * O) * GR` to `((GL * O) * x) * (O * GR)`. This increases the preparation time, but tests seem to show that this contraction order is beneficial over repeated applications, as long as the MPO virtual space is large compared to the MPS physical space.
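To spell out the change in contraction order, here is a rough sketch of both variants for an AC2-style contraction using TensorKit's `@tensor` macro. The index labels, leg ordering, and the function names `apply_naive`, `prepare` and `apply_prepared` are illustrative and need not match MPSKit's internal conventions.

```julia
using TensorKit  # provides the @tensor macro via TensorOperations

# Assumed (illustrative) leg conventions:
#   GL[ao, m, a]      left environment      O1[m, t1, s1, n]   first MPO tensor
#   GR[b, k, bo]      right environment     O2[n, t2, s2, k]   second MPO tensor
#   x[a, s1, s2, b]   two-site AC2 tensor

# Naive order (((GL * x) * O) * O) * GR: everything is recomputed on every application.
function apply_naive(GL, O1, O2, GR, x)
    @tensor GLx[m, ao, s1, s2, b] := GL[ao, m, a] * x[a, s1, s2, b]
    @tensor GLxO[ao, t1, n, s2, b] := GLx[m, ao, s1, s2, b] * O1[m, t1, s1, n]
    @tensor GLxOO[ao, t1, t2, k, b] := GLxO[ao, t1, n, s2, b] * O2[n, t2, s2, k]
    @tensor y[ao, t1, t2, bo] := GLxOO[ao, t1, t2, k, b] * GR[b, k, bo]
    return y
end

# Hoisted order ((GL * O) * x) * (O * GR): the x-independent pieces are built once.
function prepare(GL, O1, O2, GR)
    @tensor GLO[ao, t1, s1, n, a] := GL[ao, m, a] * O1[m, t1, s1, n]
    @tensor OGR[n, t2, s2, b, bo] := O2[n, t2, s2, k] * GR[b, k, bo]
    return GLO, OGR
end

# Per-application work inside the Krylov loop: only two contractions touch x.
function apply_prepared(GLO, OGR, x)
    @tensor GLOx[ao, t1, n, s2, b] := GLO[ao, t1, s1, n, a] * x[a, s1, s2, b]
    @tensor y[ao, t1, t2, bo] := GLOx[ao, t1, n, s2, b] * OGR[n, t2, s2, b, bo]
    return y
end
```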
An additional optimization here is the realization that `GL * O` and `O * GR` are now dense `BlockTensorMap`s, so it pays off to convert them to `TensorMap` outside of the loop to further avoid some overhead.
II. Precomputed derivatives
The second implementation adds an optimization on top of this for all symmetry classes but the `NoBraiding` ones, by doing a non-planar change of the index order of some of the intermediate tensors. The idea is that when permuting `(GL * O) * x` such that it can be BLAS-contracted with `(O * GR)`, we can either do the planar `repartition(GLOX, 2, 3)`, or the non-planar `permute(GLOX, ((1, 2), (3, 4, 5)))`. For non-symmetric tensors, it should be obvious that the second is beneficial, since that permutation is trivial and an intermediate permutation is avoided altogether, while the benchmarks seem to show that this effect persists for other symmetries as well.
This is presumably caused by better data locality in the permutation compared to the repartition.
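Spelled out on the 5-leg intermediate from the sketch above (called `GLOx` there), the two options would look roughly as follows; the helper names `reshuffle_planar` and `reshuffle_nonplanar`, and the exact index ordering, are assumptions for illustration.

```julia
using TensorKit

# Planar option: only moves legs between codomain and domain, respecting the
# cyclic (planar) index order.
reshuffle_planar(GLOx) = repartition(GLOx, 2, 3)

# Non-planar option: an explicit permutation into a (2, 3)-partitioned tensor.
# For trivial symmetry this ordering is already the identity, so no intermediate
# permutation is needed at all.
reshuffle_nonplanar(GLOx) = permute(GLOx, ((1, 2), (3, 4, 5)))
```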
III. Precomputed derivatives
The final change is even more radical: it additionally fuses some of the indices that can be kept together during the entire contraction.
Basically, the contraction is modified to `(Fl * GL * O * Fl') * (Fl * x * Fr') * (Fr * O * GR * Fr')`, and the resulting eigenvector is unfused in a post-processing step. There are several benefits to this approach.
The first is that the number of subblocks, and therefore the overhead of dealing with symmetries, is reduced.
The operations can deal with larger contiguous memory regions, therefore further increasing the performance.
A secondary benefit, however, is that this actually results in `AC` and `AC2` contractions that are indistinguishable from the compiler's perspective, therefore reducing the number of contractions that have to be written and compiled. Somewhat surprisingly, this leads to code simplification rather than added code complexity, which I definitely like.
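As a rough illustration of the fusing idea, here is a sketch that builds the fusers `Fl` and `Fr` as plain TensorKit isomorphisms and fuses/unfuses an AC2-like tensor with the same illustrative leg ordering as in the earlier sketch; the names `fuse_ac2` and `unfuse_ac2` are hypothetical and the actual implementation may differ.

```julia
using TensorKit

# Fuse an AC2-like tensor x with legs (a, s1, s2, b) into a two-leg vector,
# which has far fewer symmetry sub-blocks and larger contiguous blocks.
function fuse_ac2(x)
    Fl = isomorphism(fuse(space(x, 1) ⊗ space(x, 2)), space(x, 1) ⊗ space(x, 2))
    Fr = isomorphism(fuse(space(x, 3) ⊗ space(x, 4)), space(x, 3) ⊗ space(x, 4))
    @tensor xf[p, q] := Fl[p, a, s1] * x[a, s1, s2, b] * Fr[q, s2, b]
    return xf, Fl, Fr
end

# After the Krylov solve on the fused problem, unfuse the resulting eigenvector.
function unfuse_ac2(yf, Fl, Fr)
    Fl_dag, Fr_dag = Fl', Fr'   # adjoints map the fused spaces back to the products
    @tensor y[a, s1, s2, b] := Fl_dag[a, s1, p] * yf[p, q] * Fr_dag[s2, b, q]
    return y
end
```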
My benchmark results are added below, where everything is plotted relative to the current status.
I'm not too sure about how to interpret the results.
As suspected, for large MPOs this clearly outperforms the previous implementation, with speedups ranging from 2x to 5x depending on the symmetry and specific bond dimensions. For smaller MPOs, however, we can clearly see that these "optimizations" actually slow things down.
Theoretically, I think we should expect that as the bond dimension increases further, the effects of changing the contraction order should start to dominate and the "normal contraction order" should ultimately win, but it is currently not clear whether we can ever reach bond dimensions in that regime.
I think I am slightly in favor of changing the implementations, mostly because this means that we are relying as much as possible on TensorKit's contractions, so we can focus our efforts concerning multithreading and hardware acceleration in a single place.
Additionally, I obviously also like that this implementation is a lot easier to maintain, and no longer silently assumes that the MPS are properly gauged.
For further reference, here are the same plots, but now including the "preparation time" of the MPOs, by plotting `prep_time + n_applications * contract_time + unprep_time` with n=3 and n=10. This clearly shows that this is only something we want to do for repeated applications.
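As a rough rule of thumb, the break-even point follows directly from that cost model; the helper below is just a hypothetical illustration with made-up numbers, not something taken from the PR or the benchmarks.

```julia
# Number of applications beyond which preparing pays off, given per-call timings:
#   prep + n * t_prepared + unprep < n * t_unprepared
#   ⇔ n > (prep + unprep) / (t_unprepared - t_prepared)
breakeven(prep, unprep, t_unprepared, t_prepared) = (prep + unprep) / (t_unprepared - t_prepared)

# e.g. breakeven(2.0, 0.1, 1.0, 0.4) ≈ 3.5, so from 4 applications onward the
# prepared contraction would win (illustrative numbers only).
```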