Zassenhaus Formula and Error Generator Propagation Performance Improvements#687
coreyostrove wants to merge 17 commits into develop from
Conversation
Add an implementation of the Zassenhaus formula up to third order. Refactor some of the Magnus expansion code to enable reuse of the second-order correction computation. Add a corresponding numerical implementation for use in testing infrastructure.
In my testing the expression for the third-order correction did not appear to give the correct answer (the approximation error from the third-order Zassenhaus was the same as, or sometimes worse than, second order). I was basing this off of the formulas for the generalized Zassenhaus in https://arxiv.org/pdf/1903.03140. I thought I had figured out what the error might have been (a missing less-than on a sum index in the formula for W3 on page 5), but that didn't seem to do the trick either. If we want the higher-order corrections we'll probably need to work out the correct formulas ourselves.
Add simple unit tests verifying that the analytic and numerical implementations give the same result for a random circuit, and that the second-order term is better than the first-order term.
This commit is a checkpoint for a number of in-progress updates to the computation for the alphas which significantly improve performance. Changes include:
- New `bulk_alpha` function which adds plumbing for reusing stabilizer simulators.
- New `bulk_phi` method for reusing intermediate values and identifying/reducing duplicate computations.
- Updated `pauli_phase_update` computations and a specialized version for the all-zeros string.
- Fast optimized cython implementation of the specialized all-zeros `pauli_phase_update` computation.
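The reuse-and-dedupe pattern this commit describes can be sketched generically. This is an illustrative toy, not the pyGSTi code: `ToySimulator` is a hypothetical stand-in for `stim.TableauSimulator`, and `phi_value` is a placeholder computation.

```python
# Illustrative sketch of the bulk pattern: one simulator instantiated for
# the whole batch, reset between items, with duplicate pauli pairs
# computed only once.  ToySimulator is a hypothetical stand-in for
# stim.TableauSimulator; this bulk_phi is NOT the pyGSTi implementation.
class ToySimulator:
    def reset(self):
        pass  # a real simulator would restore the all-zeros state here

    def phi_value(self, pauli_pair):
        # placeholder for the real phi computation
        return sum(map(ord, pauli_pair[0] + pauli_pair[1]))


def bulk_phi(pauli_pairs):
    sim = ToySimulator()   # single instantiation for the whole batch
    cache = {}             # dedupe repeated effective pauli pairs
    results = []
    for pair in pauli_pairs:
        if pair not in cache:
            sim.reset()    # reuse the simulator instead of copying it
            cache[pair] = sim.phi_value(pair)
        results.append(cache[pair])
    return results
```

The two wins this pattern captures are the ones named in the commit: amortizing simulator setup across the batch, and skipping recomputation for repeated pairs.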
This checkpoint includes:
- Updates to `amplitude_of_state` which reduce the amount of tableau simulator copy overhead.
- New cythonized implementations for `pauli_phase_update` and `pauli_phase_update_all_zeros`.
- An initial draft of a cythonized version of `bulk_phi`.
This checkpoints progress on continuing the cythonization of the error generator alpha computations. There are now:
- A cython implementation of `amplitude_of_state`.
- A cython implementation of `bulk_alpha`.
I've also made some relatively minor API changes to `bulk_phi` and `bulk_alpha` so that they now return numpy arrays rather than lists or nested lists.
Update the implementation of the `alpha_pauli` function to include a new bulk version which reuses the stim simulator instantiation and makes a few other changes to improve performance. Additionally adds a cythonized version of this new `bulk_alpha_pauli` function.
Reworks the implementations of `stabilizer_probability_correction` and `stabilizer_pauli_expectation_value_correction` to use the new `bulk_alpha` and `bulk_alpha_pauli` methods.
Adds unit tests for new bulk computation methods.
Also add unit tests for new pauli phase update function, and patch a couple changes in existing unit tests.
Minor fix for a unit test
Patch an issue waiting to happen with underflowing values when dealing with very small state amplitudes. Also fixes a related issue with the computation of phi values: this previously used the amplitudes directly, which could be mishandled by the sign check routine I had. The patch adds a specific mode to `amplitude_of_state` which computes just the phase of the amplitude, which is all that is needed for phi computations.
Update typing and function signatures in unit tests for amplitude_of_state.
pygsti/tools/fasterrgencalc.pyx
Outdated
```
        'Z' -> 1
    For any unrecognized operator, defaults to 1.
    """
    if op == 95:  # '_' ASCII 95
```
Should we use the short circuiting approach here? Are we expecting _ or X to be more common than Y for the value of op? If not we could refactor this to just check for Y first and then return 1 if it is not Y.
This will be workload dependent, but there are some workloads where I think we might periodically see Ys. This would benefit from some statistics gathered on user workloads. Regardless, this is a good suggestion and would bring the average number of checks to a bit over 1 (right now it would probably be a bit over 2), so we should do it.
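The suggested reordering can be sketched in plain Python. The return values here are hypothetical: the diff only shows that 'Z' and unrecognized operators map to 1, so this sketch assumes 'Y' is the lone special case (mapping to -1 purely for illustration).

```python
# Hedged sketch of the "check Y first" refactor discussed above.  The
# actual mapping in fasterrgencalc.pyx may differ; assumed here: only
# 'Y' (ASCII 89) is special, everything else ('_', 'X', 'Z', other)
# returns 1.
def pauli_flip_chained(op: int) -> int:
    # original style: one comparison per recognized opcode,
    # averaging a bit over 2 checks per call
    if op == 95:      # '_'
        return 1
    elif op == 88:    # 'X'
        return 1
    elif op == 89:    # 'Y'
        return -1
    elif op == 90:    # 'Z'
        return 1
    return 1


def pauli_flip_y_first(op: int) -> int:
    # suggested refactor: test the single special case first,
    # bringing the average comparison count to just over 1
    return -1 if op == 89 else 1
```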
pygsti/tools/fasterrgencalc.pyx
Outdated
```
    if not out_buffer:
        raise MemoryError("Failed to allocate memory for output buffer")
    # Initialize the buffer with the ASCII code for '0' (48).
    for i in range(n):
```
We do not need to initialize the output buffer; we can just set the value to 48 if it fails the `pauli_flip_ascii(op)` check. Yes, the compiler may be able to vectorize the first loop more easily, but loop fusion may still result in a slight speedup.
Good suggestion, I'll give this a try and report back.
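The loop-fusion idea sketches like this in plain Python. The `pauli_flip_ascii` helper here is a hypothetical stand-in for the one in `fasterrgencalc.pyx` (assumed: 'X' and 'Y' flip the output bit); the point is only the single-pass vs. two-pass structure.

```python
# Sketch of the loop-fusion suggestion: instead of a first pass filling
# the buffer with ASCII '0' (48) and a second pass overwriting some
# entries, one fused loop writes either the flipped value or the default.
# pauli_flip_ascii is a hypothetical stand-in for the real helper.
def pauli_flip_ascii(op: int) -> bool:
    # assumed for illustration: 'X' (88) and 'Y' (89) flip the bit
    return op in (88, 89)


def build_two_pass(ops: bytes) -> bytearray:
    out = bytearray(len(ops))
    for i in range(len(ops)):       # initialization pass
        out[i] = 48
    for i, op in enumerate(ops):    # overwrite pass
        if pauli_flip_ascii(op):
            out[i] = 49
    return out


def build_fused(ops: bytes) -> bytearray:
    out = bytearray(len(ops))
    for i, op in enumerate(ops):
        # fused loop: no separate initialization pass over the buffer
        out[i] = 49 if pauli_flip_ascii(op) else 48
    return out
```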
pygsti/tools/fasterrgencalc.pyx
Outdated
```
    op = p[prefix_len + i]
    # Get the corresponding input bit from bit_str.
    # (Assume bit_str characters are '0' or '1')
    if b[i] == 49:  # ASCII '1'
```
If `b[i]` is 49 or 48, then we could just use `bit_val = 49 - b[i]`, since subtraction is typically cheaper than a jump.
Very clever! Will report back!
@nkoskelo: Thanks for the suggestions! A question related to a couple of them: how much experience do you have with profiling tools/workflows for C/C++ code? For example, in "If b[i] == 49 or 48, then we could just use bit_val = 49 - b[i] since subtraction is typically cheaper than a jump", the suggestion is to make this branchless, which is good for reducing potential overheads from branch mispredictions, but I don't currently have a handle on how often branch misses are happening. I know tools exist for profiling branch misses and other low-level execution information, but I've never used them. Do you have any experience with these? I suspect going further with optimization would benefit heavily from access to lower-level profiling. There were a few places where I tried replacing some of the nested if-else table lookups with switch statements with this in mind, for example, but that didn't make much difference in my testing (though I didn't have lower-level profiling for those changes either).
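The branchless trick being discussed can be demonstrated directly. One caveat worth noting: `b - 48` maps '0'→0 and '1'→1, while the `49 - b` form in the quote maps '0'→1 and '1'→0; which polarity is wanted depends on how `bit_val` is used in the surrounding code, so this sketch shows the `b - 48` variant.

```python
# Sketch of replacing a conditional jump with arithmetic when a byte is
# known to be ASCII '0' (48) or '1' (49).  Polarity note: b - 48 gives
# the bit itself; 49 - b gives its complement.
def bits_branchy(bs: bytes) -> list[int]:
    out = []
    for b in bs:
        if b == 49:        # ASCII '1' -> comparison plus branch
            out.append(1)
        else:
            out.append(0)
    return out


def bits_branchless(bs: bytes) -> list[int]:
    # pure subtraction, no data-dependent branch
    return [b - 48 for b in bs]
```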
pygsti/tools/fasterrgencalc.pyx
Outdated
```
    dptr = PyUnicode_AsUTF8(desired_state)
    for q in range(n):
        # ASCII '1' = 49
        bs[q] = 1 if dptr[q] == 49 else 0
```
If we assume `dptr[q]` is only 48 or 49, we can use the same trick again as in `pauli_phase_update`.
Commit c10583e resulted in a small slowdown compared to 68828d3 for the
Consider trying `-O2` instead of `-O3`. There are situations where `-O3` leads to worse performance than `-O2`.
What is the practical effect of including the `-fopenmp` flag in this context?
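For reference, here is a hedged sketch of where the compiler flags discussed above could be switched, using a bare setuptools `Extension`; pyGSTi's real build configuration may set these flags elsewhere. The module and source names are taken from the diff paths in this PR.

```python
# Hedged sketch: trying -O2 instead of -O3 for a Cython extension via
# setuptools.  Not the actual pyGSTi build configuration.
from setuptools import Extension

ext = Extension(
    "pygsti.tools.fasterrgencalc",
    sources=["pygsti/tools/fasterrgencalc.pyx"],
    extra_compile_args=["-O2"],  # -O3 can occasionally pessimize hot loops
)
```

On the `-fopenmp` question: for Cython code that flag only changes the generated code when OpenMP constructs (e.g. `cython.parallel.prange`) are present; without them it mainly links in the OpenMP runtime and has essentially no effect on single-threaded kernels.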
This PR contains two mostly disjoint sets of updates for error generator propagation: an implementation of the Zassenhaus formula (up to second order), and performance improvements for linear sensitivities for probabilities and expectation values (and related functions).
Zassenhaus Formula:
The Zassenhaus formula is something like a dual to the well-known BCH approximation. In a nutshell, it allows one to take an exponentiated sum of operators and write it as a product of single exponentials with some multiplicative corrections:

$$e^{A_1 + A_2 + \cdots + A_n} = e^{A_1} e^{A_2} \cdots e^{A_n} \prod_{j \geq 2} e^{C_j}$$

where the $C_j$ terms are Lie polynomials of order $j$ in the $A_i$s. At present I have only implemented the second-order correction term ($C_2$). I found a paper that purported to derive the higher-order terms of the generalized Zassenhaus formula (99.9% of the time online only the special case for a sum of exactly two operators is given), but when I implemented and tested the third-order term I concluded the published result wasn't correct, and the mistake in it wasn't immediately obvious. So if we ever want the third-order or higher corrections we'll probably need to sit down and do the algebra ourselves. The basic idea is straightforward and I can describe how one would do so if anyone is in the mood for some tedious algebra. Conveniently, the second-order generalized Zassenhaus correction is exactly the same as the second-order Magnus expansion correction, so I've refactored that code to enable reuse.
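The second-order claim is easy to check numerically for the two-operator case, where the standard Zassenhaus correction is $C_2 = -\tfrac{1}{2}[A, B]$. This is an illustrative sketch using numpy/scipy, not the pyGSTi test code:

```python
# Numerical sanity check of the two-operator Zassenhaus formula
#   exp(A + B) = exp(A) exp(B) exp(C2) ...,   C2 = -[A, B] / 2.
# Including C2 should strictly reduce the approximation error.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((4, 4))
B = 0.1 * rng.standard_normal((4, 4))

exact = expm(A + B)
first = expm(A) @ expm(B)              # plain product, error O(||[A,B]||)
C2 = -0.5 * (A @ B - B @ A)
second = first @ expm(C2)              # with the second-order correction

err1 = np.linalg.norm(exact - first)
err2 = np.linalg.norm(exact - second)
assert err2 < err1   # the C2 factor improves the approximation
```

The analogous test for the analytic implementation (second order beating first order on a random circuit) is what the added unit tests check.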
The function for computing this is in
`errgenproptools.py`, and some basic correctness tests have been added.

Performance Improvements for Linear Sensitivities:
A few weeks ago during a meeting we were talking about a computation that relied on the linear sensitivity (alpha) code, and @jordanh6 proclaimed, and I quote, "the alpha code is obnoxiously slow!" Taking this as the assault on my pride it was, I've been plotting my revenge ever since. Since I had this code fresh in my mind for the Zassenhaus work, I decided to piggyback off of that and enact my scheme. Changes include:
- New bulk methods, `bulk_alpha` and `bulk_alpha_pauli`, for computing linear sensitivities for a number of error generators and bitstrings/paulis all at once. This reduces overhead associated with instantiating `stim.TableauSimulator` instances and enables other economies of scale (see next bullet).
- New `bulk_phi` function for computing a number of phi values all at once. A big part of the performance gains here comes from additional preprocessing, caching intermediate results, and identifying unique effective pauli pairs to reduce redundancy.
- Faster `amplitude_of_state`, by resetting the `TableauSimulator` after use rather than working on copies (much faster, it turns out).
- `bulk_alpha` and `bulk_alpha_pauli` now have cythonized counterparts for even more efficiency. These are aliased in automatically when the cython extensions have been built, and we still have fallback pure python implementations. For some of the subroutines, e.g. the `pauli_phase_update` function, the mostly-C cython implementations are a couple orders of magnitude faster. For others (`bulk_alpha_pauli`, for example) the difference is meaningful but less stark (anywhere from a 20% improvement to a factor of 2).

Only semi-related note: I also ran into an edge case with underflowing values in certain
`phi` computations. I've included a patch for this by adding a new option to `amplitude_of_state` which returns just the phase of a complex amplitude for a given state (which is all that is actually needed for `phi` computations).

A couple other things: I've reworked the stabilizer probability and expectation value correction functions to use these new bulk methods, so those should benefit from this change. The bulk methods also support computing multiple bitstrings/paulis at once. There are a couple of functions which could benefit from this but which I haven't reworked yet (so there are a few mostly low-effort performance wins to be had if anyone is bored).
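The underflow hazard behind that patch is easy to illustrate (this is a toy example, not the pyGSTi routine): squared magnitudes of tiny amplitudes underflow to exactly zero, so sign/phase logic built on products of tiny numbers silently loses information, while atan2-based phase extraction does not.

```python
# Illustration of why a phase-only mode is safer for tiny amplitudes:
# |amp|^2 underflows to 0.0, but cmath.phase (atan2-based) still
# recovers the phase exactly, since it depends only on the ratio of the
# real and imaginary parts.
import cmath

amp = complex(-1e-180, 1e-180)     # tiny amplitude

assert abs(amp) ** 2 == 0.0        # squared magnitude underflows to zero
phase = cmath.phase(amp)           # atan2(1e-180, -1e-180) = 3*pi/4
assert abs(phase - 3 * cmath.pi / 4) < 1e-12
```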
Benchmarks
Here is a table of benchmarks comparing this branch to the current tip of develop. The benchmarking notebook (and helper module) are attached for those curious about the details.
2025-11-17_Profile_Alpha_Comps.ipynb
model.py
Note that for the "bulk" methods, since these don't directly exist on develop, the comparison being made is to computing these values for a full list of error generators using a simple for loop (as one would have done in practice presently). For both the 4- and 100-qubit models the noise model is a mostly arbitrarily selected H+S error model. The 4-qubit circuits are depth-4 random circuits; the 100-qubit circuit is depth 100 and also randomly sampled. The 4-qubit results are the average over 10 random circuits, while the 100-qubit result is for a single random instantiation (I don't expect this to make a big difference, but I didn't feel like waiting around for statistics). The error generator lists for the alpha computations in both cases come from the first-order Magnus approximation of the propagated generators for that circuit.
[Benchmark table comparing this branch to develop for `bulk_alpha`, `bulk_alpha_pauli`, `amplitude_of_state`, and `pauli_phase_update` on the 4- and 100-qubit models.]

\* When I first saw this value it was very unexpected (all of my detailed profiling was done in the 4-qubit case). But I double checked and the answer is the same in both versions of the code. I'd have to do some more profiling to see what exactly made such a big difference between the two. After I did confirm it was correct, the fact that I was so tantalizingly close to @rileyjmurray's namesake wasn't lost on me.
^ Most of the change to this function was removing the simulator copying overhead, which I guess isn't a big deal for the many-qubit case.
As for my revenge, that was secured when I got @jordanh6 to agree to review this PR last week before he knew these changes (and their many associated lines of code) were included 😈.