QuantumSavory · arnavk23 · Nov 2, 2025 · Nov 2, 2025 · Nov 2, 2025 · Nov 3, 2025
diff --git a/docs/src/datastructures.md b/docs/src/datastructures.md
@@ -71,7 +71,36 @@ Notice the results when the projection operator commutes with the state but is n
 
 We do not use boolean arrays to store information about the qubits as this would be wasteful (7 out of 8 bits in the boolean would be unused). Instead, we use all 8 bits in a byte and perform bitwise logical operations as necessary. Implementation details of the object in RAM can matter for performance. The library permits any of the standard `UInt` types to be used for packing the bits, and larger `UInt` types (like `UInt64`) are usually faster as they permit working on 64 qubits at a time (instead of 1 if we used a boolean, or 8 if we used a byte).
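+
+As a rough illustration of the packing idea, here is a minimal sketch in plain Julia. The helpers `pack_bits` and `get_bit` are hypothetical names made up for this example; they are not part of the library's API and are not meant to reflect its internal layout.
+
+```julia
+# Pack a Bool vector into UInt64 chunks, 64 bits per chunk, lowest bit first.
+# (Hypothetical helper for illustration only.)
+function pack_bits(bits::AbstractVector{Bool})
+    chunks = zeros(UInt64, cld(length(bits), 64))
+    for (i, b) in enumerate(bits)
+        if b
+            chunks[div(i - 1, 64) + 1] |= UInt64(1) << rem(i - 1, 64)
+        end
+    end
+    return chunks
+end
+
+# Read bit `i` (1-based) back out of the packed chunks.
+# (Hypothetical helper for illustration only.)
+get_bit(chunks, i) = isodd(chunks[div(i - 1, 64) + 1] >> rem(i - 1, 64))
+
+x = rand(Bool, 100)
+y = rand(Bool, 100)
+
+# 100 booleans fit in two UInt64 chunks instead of 100 separate bytes.
+px, py = pack_bits(x), pack_bits(y)
+
+# One xor per chunk combines up to 64 bits at a time.
+pxy = xor.(px, py)
+```
+
+A single `xor` on a `UInt64` chunk handles 64 bits at once, which is the reason wider integer types tend to be faster here.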
 
-Moreover, how a tableau is stored in memory can affect performance, as a row-major storage
-usually permits more efficient use of the CPU cache (for the particular algorithms we use).
+### Memory Layout: Row-Major vs Column-Major
 
-Both of these parameters are [benchmarked](bench_intsize.png) (testing the application of a Pauli operator, which is an $\mathcal{O}(n^2)$ operation; and testing the canonicalization of a Stabilizer, which is an $\mathcal{O}(n^3)$ operation). Row-major UInt64 is the best performing and it is  used by default in this library.
+How a tableau is stored in memory significantly affects performance, as different memory layouts provide better cache locality for different operations.
+
+The library uses **row-major (fastrow) layout by default**, where each Pauli string (row of the tableau) is stored contiguously in memory. This layout is optimized for:
+- **Canonicalization operations** (`canonicalize!`) - $\mathcal{O}(n^3)$ operations
+- **Projective measurements** (`project!`) - which frequently iterate over rows
+
+The alternative **column-major (fastcolumn) layout** stores tableau columns (mostly) contiguously in memory. This layout is optimized for:
+- **Applying sparse gates** like `apply!(s, sCNOT(i,j))`, which update only a few qubit columns in every row
+- **Pauli multiplications** (left or right)
+
+#### Converting Between Layouts
+
+The functions [`fastrow`](@ref) and [`fastcolumn`](@ref) can be used to convert between memory layouts without changing the logical content of the tableau:
+
+```julia
+s = random_stabilizer(1000)          # Uses default fastrow layout
+s_col = fastcolumn(copy(s))          # Convert to column-major layout
+s_row = fastrow(copy(s_col))         # Convert back to row-major layout
+```
+
+These functions work on all stabilizer data structures: [`Stabilizer`](@ref), [`Destabilizer`](@ref), [`MixedStabilizer`](@ref), and [`MixedDestabilizer`](@ref).
+
+#### Performance Implications
+
+The default row-major (`fastrow`) layout is generally the best choice for typical operations on the CPU. However, if your code mostly applies sparse gates that act on a few qubits at a time, converting to the column-major (`fastcolumn`) layout may be beneficial.
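+
+To check which layout suits a given workload, a quick timing comparison along the following lines may help. This is only a sketch: the tableau size of 500 is an arbitrary choice for illustration, and a single `@elapsed` measurement is noisy, so a dedicated tool such as BenchmarkTools.jl would give more reliable numbers.
+
+```julia
+using QuantumClifford
+
+s = random_stabilizer(500)            # size chosen arbitrarily for this sketch
+
+# Warm up both code paths so that compilation time is not measured.
+canonicalize!(fastrow(copy(s)))
+canonicalize!(fastcolumn(copy(s)))
+
+# Build one tableau per layout from the same logical content.
+s_row = fastrow(copy(s))
+s_col = fastcolumn(copy(s))
+
+# canonicalize! works in place, so each tableau is timed exactly once.
+t_row = @elapsed canonicalize!(s_row)
+t_col = @elapsed canonicalize!(s_col)
+println("canonicalize!: fastrow = ", t_row, " s, fastcolumn = ", t_col, " s")
+
+# Whatever the timings, both layouts should hold the same canonical form.
+@assert stab_to_gf2(s_row) == stab_to_gf2(s_col)
+```
+
+The same pattern applies to other operations: replace `canonicalize!` with, for example, a loop of `apply!` calls using the sparse gates from your own workload.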
+
+**Note:** The performance claims above are based on CPU benchmarks. On GPU, the optimal memory layout may differ due to differences in memory access patterns and hardware architecture. Users interested in GPU performance are encouraged to benchmark both layouts for their workloads and to contribute results or suggestions.
+
+The test suite (see e.g. `test/test_bitpack.jl`) only verifies that both memory layouts produce identical results for all operations; it does **not** compare their performance. Actual performance comparisons are performed using scripts in the `benchmark/` directory, which are designed to generate benchmark results suitable for automatic inclusion in the documentation. If you wish to contribute new benchmarks or update performance data, please refer to the scripts in `benchmark/`.
+
+Both the integer type used for bit-packing and the memory layout are [benchmarked](bench_intsize.png) on CPU, by timing the application of a Pauli operator (an $\mathcal{O}(n^2)$ operation) and the canonicalization of a Stabilizer (an $\mathcal{O}(n^3)$ operation). Row-major `UInt64` performs best in these CPU benchmarks and is used by default in this library.
diff --git a/test/test_bitpack.jl b/test/test_bitpack.jl
@@ -80,4 +80,35 @@
             @test stab_to_gf2(s) == stab_to_gf2(sr) == stab_to_gf2(sc) == stab_to_gf2(s8) == stab_to_gf2(s8r) == stab_to_gf2(s8c)
         end
     end
+
+    @testset "memory layout equivalence" begin
+        # fastrow and fastcolumn must give identical results; timing comparisons live in benchmark/
+        s_row = fastrow(random_stabilizer(100, 128))
+        s_col = fastcolumn(copy(s_row))
+
+        # Both layouts should produce identical results
+        result_row = canonicalize!(copy(s_row); phases=true)
+        result_col = canonicalize!(copy(s_col); phases=true)
+        @test stab_to_gf2(result_row) == stab_to_gf2(result_col)
+
+        # Test sparse gate application
+        s_row_gates = fastrow(random_stabilizer(50, 64))
+        s_col_gates = fastcolumn(copy(s_row_gates))
+        gate = sCNOT(1, 2)
+
+        # Apply the sparse gate and check that both layouts agree
+        s_row_after = apply!(copy(s_row_gates), gate)
+        s_col_after = apply!(copy(s_col_gates), gate)
+        @test stab_to_gf2(s_row_after) == stab_to_gf2(s_col_after)
+
+        # Test dense clifford operator application
+        s_row_clif = fastrow(random_stabilizer(50, 64))
+        s_col_clif = fastcolumn(copy(s_row_clif))
+        c = CliffordOperator(random_destabilizer(64; phases=false))
+
+        # Apply the dense Clifford operator and check that both layouts agree
+        s_row_clif_after = apply!(copy(s_row_clif), c)
+        s_col_clif_after = apply!(copy(s_col_clif), c)
+        @test stab_to_gf2(s_row_clif_after) == stab_to_gf2(s_col_clif_after)
+    end
 end