Zassenhaus Formula and Error Generator Propagation Performance Improvements#687
coreyostrove wants to merge 17 commits into develop from
Conversation
Add an implementation of the Zassenhaus formula up to third order. Refactor some of the Magnus expansion code to enable reuse of the second-order correction computation. Add a corresponding numerical implementation for use in testing infrastructure.
In my testing the expression for the third-order correction did not appear to give the correct answer (the approximation error from the third-order Zassenhaus was the same as, or sometimes worse than, second order). I was basing this off of the formulas for the generalized Zassenhaus in https://arxiv.org/pdf/1903.03140. I thought I had figured out what the error might have been (a missing less-than on a sum index in the formula for W3 on page 5), but that didn't seem to do the trick either. If we want the higher-order corrections we'll probably need to work out the correct formulas ourselves.
Add simple unit tests verifying that the analytic and numerical implementations give the same result for a random circuit, and that the second-order term is better than the first-order term.
This commit is a checkpoint for a number of in-progress updates to the computation for the alphas which significantly improve performance. Changes include:
- New `bulk_alpha` function which adds plumbing for reusing stabilizer simulators.
- New `bulk_phi` method for reusing intermediate values and identifying/reducing duplicate computations.
- Updated `pauli_phase_update` computations and a specialized version for the all-zeros string.
- Fast optimized cython implementation of the specialized all-zeros `pauli_phase_update` computation.
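The reuse-and-dedupe pattern this commit describes can be sketched generically. This is an illustrative toy, not the pyGSTi code: `ToySimulator` is a hypothetical stand-in for `stim.TableauSimulator`, and `phi_value` is a placeholder computation.

```python
# Illustrative sketch of the bulk pattern: one simulator instantiated for
# the whole batch, reset between items, with duplicate pauli pairs
# computed only once.  ToySimulator is a hypothetical stand-in for
# stim.TableauSimulator; this bulk_phi is NOT the pyGSTi implementation.
class ToySimulator:
    def reset(self):
        pass  # a real simulator would restore the all-zeros state here

    def phi_value(self, pauli_pair):
        # placeholder for the real phi computation
        return sum(map(ord, pauli_pair[0] + pauli_pair[1]))


def bulk_phi(pauli_pairs):
    sim = ToySimulator()   # single instantiation for the whole batch
    cache = {}             # dedupe repeated effective pauli pairs
    results = []
    for pair in pauli_pairs:
        if pair not in cache:
            sim.reset()    # reuse the simulator instead of copying it
            cache[pair] = sim.phi_value(pair)
        results.append(cache[pair])
    return results
```

The two wins this pattern captures are the ones named in the commit: amortizing simulator setup across the batch, and skipping recomputation for repeated pairs.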
This checkpoint includes:
- Updates to `amplitude_of_state` which reduce the amount of tableau simulator copy overhead.
- New cythonized implementations for `pauli_phase_update` and `pauli_phase_update_all_zeros`.
- An initial draft of a cythonized version of `bulk_phi`.
This checkpoints progress on continuing the cythonization of the error generator alpha computations. There are now:
- A cython implementation of `amplitude_of_state`.
- A cython implementation of `bulk_alpha`.
I've also made some relatively minor API changes to `bulk_phi` and `bulk_alpha` so that they now return numpy arrays rather than lists or nested lists.
Update the implementation of the `alpha_pauli` function to include a new bulk version which reuses the stim simulator instantiation and makes a few other changes to improve performance. Additionally adds a cythonized version of this new `bulk_alpha_pauli` function.
Reworks the implementations of `stabilizer_probability_correction` and `stabilizer_pauli_expectation_value_correction` to use the new `bulk_alpha` and `bulk_alpha_pauli` methods.
Adds unit tests for new bulk computation methods.
Also add unit tests for new pauli phase update function, and patch a couple changes in existing unit tests.
Minor fix for a unit test
Patch an issue waiting to happen with underflowing values when dealing with very small state amplitudes. Also fixes a related issue with the computation of phi values: this previously used the amplitudes directly, which could be mishandled by the sign check routine I had. The patch adds a specific mode to `amplitude_of_state` which computes just the phase of the amplitude, which is all that is needed for phi computations.
Update typing and function signatures in unit tests for amplitude_of_state.
pygsti/tools/fasterrgencalc.pyx
Outdated
```
        'Z' -> 1
    For any unrecognized operator, defaults to 1.
    """
    if op == 95:  # '_' ASCII 95
```
Should we use the short circuiting approach here? Are we expecting _ or X to be more common than Y for the value of op? If not we could refactor this to just check for Y first and then return 1 if it is not Y.
This will be workload dependent, but there are some workloads where I think we might periodically see Ys. This would benefit from some statistics gathered on user workloads. Regardless, this is a good suggestion and would bring the average number of checks to a bit over 1 (right now it would probably be a bit over 2), so we should do it.
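The suggested reordering can be sketched in plain Python. The return values here are hypothetical: the diff only shows that 'Z' and unrecognized operators map to 1, so this sketch assumes 'Y' is the lone special case (mapping to -1 purely for illustration).

```python
# Hedged sketch of the "check Y first" refactor discussed above.  The
# actual mapping in fasterrgencalc.pyx may differ; assumed here: only
# 'Y' (ASCII 89) is special, everything else ('_', 'X', 'Z', other)
# returns 1.
def pauli_flip_chained(op: int) -> int:
    # original style: one comparison per recognized opcode,
    # averaging a bit over 2 checks per call
    if op == 95:      # '_'
        return 1
    elif op == 88:    # 'X'
        return 1
    elif op == 89:    # 'Y'
        return -1
    elif op == 90:    # 'Z'
        return 1
    return 1


def pauli_flip_y_first(op: int) -> int:
    # suggested refactor: test the single special case first,
    # bringing the average comparison count to just over 1
    return -1 if op == 89 else 1
```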
pygsti/tools/fasterrgencalc.pyx
Outdated
```
    if not out_buffer:
        raise MemoryError("Failed to allocate memory for output buffer")
    # Initialize the buffer with the ASCII code for '0' (48).
    for i in range(n):
```
We do not need to initialize the output buffer; we can just set the value to 48 if it fails the `pauli_flip_ascii(op)` check. Yes, the compiler may be able to vectorize the first loop more easily, but loop fusion may still result in a slight speedup.
Good suggestion, I'll give this a try and report back.
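The loop-fusion idea sketches like this in plain Python. The `pauli_flip_ascii` helper here is a hypothetical stand-in for the one in `fasterrgencalc.pyx` (assumed: 'X' and 'Y' flip the output bit); the point is only the single-pass vs. two-pass structure.

```python
# Sketch of the loop-fusion suggestion: instead of a first pass filling
# the buffer with ASCII '0' (48) and a second pass overwriting some
# entries, one fused loop writes either the flipped value or the default.
# pauli_flip_ascii is a hypothetical stand-in for the real helper.
def pauli_flip_ascii(op: int) -> bool:
    # assumed for illustration: 'X' (88) and 'Y' (89) flip the bit
    return op in (88, 89)


def build_two_pass(ops: bytes) -> bytearray:
    out = bytearray(len(ops))
    for i in range(len(ops)):       # initialization pass
        out[i] = 48
    for i, op in enumerate(ops):    # overwrite pass
        if pauli_flip_ascii(op):
            out[i] = 49
    return out


def build_fused(ops: bytes) -> bytearray:
    out = bytearray(len(ops))
    for i, op in enumerate(ops):
        # fused loop: no separate initialization pass over the buffer
        out[i] = 49 if pauli_flip_ascii(op) else 48
    return out
```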
pygsti/tools/fasterrgencalc.pyx
Outdated
```
    op = p[prefix_len + i]
    # Get the corresponding input bit from bit_str.
    # (Assume bit_str characters are '0' or '1')
    if b[i] == 49:  # ASCII '1'
```
If `b[i]` is 49 or 48, then we could just use `bit_val = 49 - b[i]`, since subtraction is typically cheaper than a jump.
Very clever! Will report back!
@nkoskelo: Thanks for the suggestions! A question related to a couple of them: how much experience do you have with profiling tools/workflows for C/C++ code? For example, in "If b[i] == 49 or 48, then we could just use bit_val = 49 - b[i] since subtraction is typically cheaper than a jump", the suggestion is to make this branchless, which is good for reducing potential overheads from branch mispredictions, but I don't currently have a handle on how often branch misses are happening. I know tools exist for profiling branch misses and other low-level execution information, but I've never used them. Do you have any experience with these? I suspect going further with optimization would benefit heavily from access to lower-level profiling. There were a few places where I tried replacing some of the nested if-else table lookups with switch statements with this in mind, for example, but that didn't make much difference in my testing (though I didn't have lower-level profiling for those changes either).
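The branchless trick being discussed can be demonstrated directly. One caveat worth noting: `b - 48` maps '0'→0 and '1'→1, while the `49 - b` form in the quote maps '0'→1 and '1'→0; which polarity is wanted depends on how `bit_val` is used in the surrounding code, so this sketch shows the `b - 48` variant.

```python
# Sketch of replacing a conditional jump with arithmetic when a byte is
# known to be ASCII '0' (48) or '1' (49).  Polarity note: b - 48 gives
# the bit itself; 49 - b gives its complement.
def bits_branchy(bs: bytes) -> list[int]:
    out = []
    for b in bs:
        if b == 49:        # ASCII '1' -> comparison plus branch
            out.append(1)
        else:
            out.append(0)
    return out


def bits_branchless(bs: bytes) -> list[int]:
    # pure subtraction, no data-dependent branch
    return [b - 48 for b in bs]
```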
pygsti/tools/fasterrgencalc.pyx
Outdated
```
    dptr = PyUnicode_AsUTF8(desired_state)
    for q in range(n):
        # ASCII '1' = 49
        bs[q] = 1 if dptr[q] == 49 else 0
```
If we assume `dptr[q]` is only 48 or 49, we can use the same trick again as in `pauli_phase_update`.
Commit c10583e resulted in a small slowdown compared to 68828d3 for the
Consider trying `-O2` instead of `-O3`. There are situations where `-O3` leads to worse performance than `-O2`.
What is the practical effect of including the `-fopenmp` flag in this context?
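For reference, here is a hedged sketch of where the compiler flags discussed above could be switched, using a bare setuptools `Extension`; pyGSTi's real build configuration may set these flags elsewhere. The module and source names are taken from the diff paths in this PR.

```python
# Hedged sketch: trying -O2 instead of -O3 for a Cython extension via
# setuptools.  Not the actual pyGSTi build configuration.
from setuptools import Extension

ext = Extension(
    "pygsti.tools.fasterrgencalc",
    sources=["pygsti/tools/fasterrgencalc.pyx"],
    extra_compile_args=["-O2"],  # -O3 can occasionally pessimize hot loops
)
```

On the `-fopenmp` question: for Cython code that flag only changes the generated code when OpenMP constructs (e.g. `cython.parallel.prange`) are present; without them it mainly links in the OpenMP runtime and has essentially no effect on single-threaded kernels.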
This PR contains two mostly disjoint sets of updates for error generator propagation: an implementation of the Zassenhaus formula (up to second order), and performance improvements for linear sensitivities for probabilities and expectation values (and related functions).
Zassenhaus Formula:
The Zassenhaus formula is something like a dual to the well-known BCH approximation. In a nutshell, it allows one to take an exponentiated sum of operators and write it as a product of single exponentials with some multiplicative corrections:

$$e^{A_1 + A_2 + \cdots + A_n} = e^{A_1} e^{A_2} \cdots e^{A_n} \prod_{j \geq 2} e^{C_j}$$

where the $C_j$ terms are Lie polynomials of order $j$ in the $A_i$s. At present I have only implemented the second-order correction term ($C_2$). I found a paper that purported to derive the higher-order terms of the generalized Zassenhaus formula (99.9% of the time online only the special case for a sum of exactly two operators is given), but when I implemented and tested the third-order term I concluded the published result wasn't correct, and the mistake in it wasn't immediately obvious. So if we ever want the third-order or higher corrections we'll probably need to sit down and do the algebra ourselves. The basic idea is straightforward and I can describe how one would do so if anyone is in the mood for some tedious algebra. Conveniently, the second-order generalized Zassenhaus correction is exactly the same as the second-order Magnus expansion correction, so I've refactored that code to enable reuse.
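The second-order claim is easy to check numerically for the two-operator case, where the standard Zassenhaus correction is $C_2 = -\tfrac{1}{2}[A, B]$. This is an illustrative sketch using numpy/scipy, not the pyGSTi test code:

```python
# Numerical sanity check of the two-operator Zassenhaus formula
#   exp(A + B) = exp(A) exp(B) exp(C2) ...,   C2 = -[A, B] / 2.
# Including C2 should strictly reduce the approximation error.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((4, 4))
B = 0.1 * rng.standard_normal((4, 4))

exact = expm(A + B)
first = expm(A) @ expm(B)              # plain product, error O(||[A,B]||)
C2 = -0.5 * (A @ B - B @ A)
second = first @ expm(C2)              # with the second-order correction

err1 = np.linalg.norm(exact - first)
err2 = np.linalg.norm(exact - second)
assert err2 < err1   # the C2 factor improves the approximation
```

The analogous test for the analytic implementation (second order beating first order on a random circuit) is what the added unit tests check.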
The function for computing this is in
`errgenproptools.py`, and some basic correctness tests have been added.

Performance Improvements for Linear Sensitivities:
A few weeks ago during a meeting we were talking about a computation that relied on the linear sensitivity (alpha) code, and @jordanh6 proclaimed, and I quote, "the alpha code is obnoxiously slow!" Taking this as the assault on my pride it was, I've been plotting my revenge ever since. Since I had this code fresh in my mind for the Zassenhaus work, I decided to piggyback off of that and enact my scheme. Changes include:
- New bulk methods, `bulk_alpha` and `bulk_alpha_pauli`, for computing linear sensitivities for a number of error generators and bitstrings/paulis all at once. This reduces overhead associated with instantiating `stim.TableauSimulator` instances and enables other economies of scale (see next bullet).
- New `bulk_phi` function for computing a number of phi values all at once. A big part of the performance gains here comes from additional preprocessing, caching intermediate results, and identifying unique effective pauli pairs to reduce redundancy.
- Faster `amplitude_of_state`, by resetting the `TableauSimulator` after use rather than working on copies (much faster, it turns out).
- `bulk_alpha` and `bulk_alpha_pauli` now have cythonized counterparts for even more efficiency. These are aliased in automatically when the cython extensions have been built, and we still have fallback pure python implementations. For some of the subroutines, e.g. the `pauli_phase_update` function, the mostly-C cython implementations are a couple orders of magnitude faster. For others (`bulk_alpha_pauli`, for example) the difference is meaningful but less stark (anywhere from a 20% improvement to a factor of 2).

Only semi-related note: I also ran into an edge case with underflowing values in certain
`phi` computations. I've included a patch for this by adding a new option to `amplitude_of_state` which returns just the phase of a complex amplitude for a given state (which is all that is actually needed for `phi` computations).

A couple other things: I've reworked the stabilizer probability and expectation value correction functions to use these new bulk methods, so those should benefit from this change. The bulk methods also support computing multiple bitstrings/paulis at once. There are a couple of functions which could benefit from this but which I haven't reworked yet (so there are a few mostly low-effort performance wins to be had if anyone is bored).
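The underflow hazard behind that patch is easy to illustrate (this is a toy example, not the pyGSTi routine): squared magnitudes of tiny amplitudes underflow to exactly zero, so sign/phase logic built on products of tiny numbers silently loses information, while atan2-based phase extraction does not.

```python
# Illustration of why a phase-only mode is safer for tiny amplitudes:
# |amp|^2 underflows to 0.0, but cmath.phase (atan2-based) still
# recovers the phase exactly, since it depends only on the ratio of the
# real and imaginary parts.
import cmath

amp = complex(-1e-180, 1e-180)     # tiny amplitude

assert abs(amp) ** 2 == 0.0        # squared magnitude underflows to zero
phase = cmath.phase(amp)           # atan2(1e-180, -1e-180) = 3*pi/4
assert abs(phase - 3 * cmath.pi / 4) < 1e-12
```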
Benchmarks
Here is a table of benchmarks comparing this branch to the current tip of develop. The benchmarking notebook (and helper module) are attached for those curious about the details.
2025-11-17_Profile_Alpha_Comps.ipynb
model.py
Note that for the "bulk" methods, since these don't directly exist on develop, the comparison being made is to computing these values for a full list of error generators using a simple for loop (as one would have done in practice presently). For both the 4- and 100-qubit models the noise model is a mostly arbitrarily selected H+S error model. The 4-qubit circuits are depth-4 random circuits; the 100-qubit circuit is depth 100 and also randomly sampled. The 4-qubit results are the average over 10 random circuits, while the 100-qubit result is for a single random instantiation (I don't expect this to make a big difference, but I didn't feel like waiting around for statistics). The error generator lists for the alpha computations in both cases come from the first-order Magnus approximation of the propagated generators for that circuit.
[Benchmark table comparing this branch to develop for `bulk_alpha`, `bulk_alpha_pauli`, `amplitude_of_state`, and `pauli_phase_update` on the 4- and 100-qubit models.]

\* When I first saw this value it was very unexpected (all of my detailed profiling was done in the 4-qubit case). But I double checked and the answer is the same in both versions of the code. I'd have to do some more profiling to see what exactly made such a big difference between the two. After I did confirm it was correct, the fact that I was so tantalizingly close to @rileyjmurray's namesake wasn't lost on me.
^ Most of the change to this function was removing the simulator copying overhead, which I guess isn't a big deal for the many-qubit case.
As for my revenge, that was secured when I got @jordanh6 to agree to review this PR last week before he knew these changes (and their many associated lines of code) were included 😈.