
ADR-QE-009: Tensor Network Evaluation Mode

Status: Proposed
Date: 2026-02-06
Authors: ruv.io, RuVector Team
Deciders: Architecture Review Board


Context

Full state-vector simulation stores all 2^n complex amplitudes explicitly, yielding O(2^n) memory and O(G * 2^n) time for G gates. At n=30 this is 16 GiB; at n=40 it reaches 16 TiB. Many practically interesting circuits, however, contain limited entanglement:

| Circuit family | Entanglement structure | Treewidth |
|---|---|---|
| Shallow QAOA on sparse graphs | Bounded by graph degree | Low (often < 20) |
| Separate-register circuits | Disjoint qubit subsets | Sum of sub-widths |
| Near-Clifford circuits | Stabilizer + few T gates | Depends on T count |
| 1D brickwork (finite depth) | Area-law entanglement | O(depth) |
| Random deep circuits (all-to-all) | Volume-law entanglement | O(n) -- no gain |

For the first four families, tensor network (TN) methods can trade increased computation for drastically reduced memory by representing each gate as a tensor and contracting the resulting network in an optimized order. The contraction cost scales exponentially in the treewidth of the circuit's line graph rather than in the total qubit count.
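To make the scaling concrete, here is a back-of-envelope cost model; the helper names are illustrative sketches, not part of any crate:

// Back-of-envelope cost comparison (illustrative helpers, not engine code).

/// State-vector time cost: O(G * 2^n) amplitude updates for G gates.
fn state_vector_ops(num_qubits: u32, num_gates: u64) -> u128 {
    num_gates as u128 * (1u128 << num_qubits)
}

/// Dominant tensor-network contraction cost: exponential in treewidth,
/// not in total qubit count.
fn tn_contraction_ops(treewidth: u32) -> u128 {
    1u128 << treewidth
}

fn main() {
    // 50 qubits, 175 gates, treewidth ~15 (cf. the QAOA example in Section 7):
    assert_eq!(tn_contraction_ops(15), 32_768);
    println!("SV: {} ops", state_vector_ops(50, 175)); // ~2 * 10^17
}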

QuantRS2 (the Rust quantum simulation reference) demonstrated tensor network contraction for circuits up to 60 qubits on commodity hardware when treewidth remained below ~25. ruVector's existing ruvector-mincut crate already solves graph partitioning problems that are structurally identical to contraction-order optimization, providing a natural integration point.

The ruQu engine needs this capability to support:

  1. Surface code simulations at distance d >= 7 (49+ data qubits) for decoder validation, where the syndrome extraction circuit is shallow and geometrically local.
  2. Variational algorithm prototyping (VQE, QAOA) on graphs larger than 30 nodes.
  3. Hybrid workflows where part of the circuit is simulated via state vector and part via tensor contraction.

Decision

1. Feature-Gated Backend

Tensor network evaluation is implemented as an optional backend behind the tensor-network feature flag in ruqu-core:

# ruqu-core/Cargo.toml
[features]
default = ["state-vector"]
state-vector = []
tensor-network = ["dep:ndarray", "dep:petgraph"]
all-backends = ["state-vector", "tensor-network"]

When both backends are compiled in, the engine selects the backend at runtime based on circuit analysis (see Section 5 below).
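At the source level, the gate might look like this minimal sketch (module and type names are illustrative, not the actual ruqu-core layout):

// Conditional compilation keeps the TN code out of default builds.
#[cfg(feature = "tensor-network")]
pub mod tensor_network {
    /// Only compiled when the tensor-network feature is enabled.
    pub struct TensorNetworkBackend;
}

#[cfg(feature = "state-vector")]
pub mod state_vector {
    pub struct StateVectorBackend;
}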

2. Tensor Representation

Every gate becomes a tensor connecting the qubit wire indices it acts on:

| Gate type | Tensor rank | Shape | Example |
|---|---|---|---|
| Single-qubit (H, X, Rz, ...) | 2 | [2, 2] | Input wire -> output wire |
| Two-qubit (CNOT, CZ, ...) | 4 | [2, 2, 2, 2] | Two input wires -> two output wires |
| Three-qubit (Toffoli) | 6 | [2, 2, 2, 2, 2, 2] | Three input -> three output |
| Measurement projector | 2 | [2, 2] | Diagonal in computational basis |
| Initial state \|0> | 1 | [2] | |

The circuit is converted into a tensor network graph where:

  • Each tensor is a node.
  • Each shared index (qubit wire between consecutive gates) is an edge.
  • Open indices represent initial states and final measurement outcomes.

For example, the two-qubit circuit

  |0>---[H]---[CNOT_ctrl]---[Rz]---<meas>
                  |
  |0>-----------[CNOT_tgt]---------<meas>

Becomes:

  Node: init_0 (rank 1)
    |
  Node: H_0 (rank 2)
    |
  Node: CNOT_01 (rank 4)
   / \
  |   Node: Rz_0 (rank 2)
  |     |
  |   Node: meas_0 (rank 2)
  |
  Node: init_1 (rank 1)
    ... (connected via CNOT shared index)
  Node: meas_1 (rank 2)
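A sketch of this conversion for the two-qubit circuit above (Rz omitted), using the petgraph dependency declared in Section 1. TensorKind and the graph layout are illustrative, not the engine's actual representation:

use petgraph::graph::UnGraph;

/// Node payload: what kind of tensor sits at this vertex (illustrative).
#[derive(Debug)]
enum TensorKind {
    Init,                // rank-1 |0> tensor
    Gate(&'static str),  // gate tensor, e.g. "H" or "CNOT"
    Meas,                // rank-2 measurement projector
}

/// Edge payload: dimension of the shared index (2 for a qubit wire).
type BondDim = usize;

fn bell_pair_network() -> UnGraph<TensorKind, BondDim> {
    let mut g = UnGraph::new_undirected();
    let init0 = g.add_node(TensorKind::Init);
    let init1 = g.add_node(TensorKind::Init);
    let h0 = g.add_node(TensorKind::Gate("H"));
    let cnot = g.add_node(TensorKind::Gate("CNOT"));
    let meas0 = g.add_node(TensorKind::Meas);
    let meas1 = g.add_node(TensorKind::Meas);
    // Every qubit wire between consecutive gates is a dimension-2 edge.
    g.add_edge(init0, h0, 2);
    g.add_edge(h0, cnot, 2);
    g.add_edge(init1, cnot, 2);
    g.add_edge(cnot, meas0, 2);
    g.add_edge(cnot, meas1, 2);
    g
}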

3. Contraction Strategy

Contraction order determines whether the computation is tractable. The cost of contracting two tensors is the product of the dimensions of all indices involved. Finding the optimal contraction order is NP-hard (equivalent to finding minimum treewidth), so we use heuristics.
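The pairwise cost function that the heuristics below rely on can be sketched as follows, with an index set represented as a label-to-dimension map (illustrative, not the engine's internal form):

use std::collections::HashMap;

/// Index set of one tensor: label -> dimension.
type Indices = HashMap<String, usize>;

/// Cost of contracting two tensors: product of the dimensions of the
/// union of their indices (shared indices counted once).
fn contraction_cost(a: &Indices, b: &Indices) -> u128 {
    let mut union = a.clone();
    for (label, dim) in b {
        union.insert(label.clone(), *dim);
    }
    union.values().map(|&d| d as u128).product()
}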

Contraction Path Optimization Pseudocode

function find_contraction_path(tensor_network: TN) -> ContractionPath:
    // Phase 1: Simplify the network
    apply_trivial_contractions(tensor_network)  // rank-1 tensors, diagonal pairs

    // Phase 2: Detect community structure
    communities = detect_communities(tensor_network.graph)

    // Phase 3: Contract within communities first (small subproblems)
    intra_paths = []
    for community in communities:
        subgraph = tensor_network.subgraph(community)
        if subgraph.num_tensors <= 20:
            // Exact dynamic programming for small subgraphs
            path = optimal_einsum_dp(subgraph)
        else:
            // Greedy with lookahead for larger subgraphs
            path = greedy_with_lookahead(subgraph, lookahead=2)
        intra_paths.append(path)

    // Phase 4: Contract inter-community edges
    // Each community is now a single large tensor
    meta_graph = contract_communities(tensor_network, intra_paths)
    inter_path = greedy_with_lookahead(meta_graph, lookahead=3)

    // Phase 5: Compose the full path
    return compose_paths(intra_paths, inter_path)


function greedy_with_lookahead(tn: TN, lookahead: int) -> Path:
    path = []
    remaining = tn.clone()

    while remaining.num_tensors > 1:
        best_cost = INFINITY
        best_pair = None

        // Evaluate all candidate contractions
        for (i, j) in remaining.candidate_pairs():
            cost = contraction_cost(remaining, i, j)

            // Lookahead: estimate the cost of subsequent contractions.
            // DISCOUNT_FACTOR in (0, 1) down-weights the uncertain estimate.
            if lookahead > 0:
                simulated = remaining.simulate_contraction(i, j)
                future_cost = estimate_future_cost(simulated, lookahead - 1)
                cost += future_cost * DISCOUNT_FACTOR

            if cost < best_cost:
                best_cost = cost
                best_pair = (i, j)

        path.append(best_pair)
        remaining.contract(best_pair)

    return path

Community Detection via ruvector-mincut

The ruvector-mincut crate provides graph partitioning that is directly applicable to contraction ordering:

use ruvector_mincut::{partition, Objective, PartitionConfig};

fn partition_tensor_network(tn: &TensorNetwork) -> Vec<Vec<TensorId>> {
    let graph = tn.to_adjacency_graph();
    let config = PartitionConfig {
        num_partitions: estimate_optimal_partitions(tn),
        balance_factor: 1.1,  // Allow 10% imbalance
        minimize: Objective::EdgeCut,  // Minimize inter-partition wires
    };
    partition(&graph, &config)
}

Each cut qubit wire contributes a factor of 2 to the bond dimension of the inter-community contraction, so a cut of k wires yields an effective bond dimension of 2^k. Minimizing the edge cut therefore minimizes the most expensive contraction step.

4. MPS (Matrix Product State) Mode

For circuits with 1D-like connectivity (nearest-neighbor gates on a line), a Matrix Product State representation is more efficient than general tensor contraction.

    A[1] -- A[2] -- A[3] -- ... -- A[n]
     |       |       |               |
   phys_1  phys_2  phys_3         phys_n

Each site tensor A[i] has shape [bond_left, physical, bond_right] where:

  • physical = 2 (qubit dimension)
  • bond_left, bond_right = bond dimension chi

| Bond dimension (chi) | Memory per site | Total memory (n qubits) | Approximation quality |
|---|---|---|---|
| 1 | 16 bytes | 16n bytes | Product state only |
| 16 | 4 KiB | 4n KiB | Low entanglement |
| 64 | 64 KiB | 64n KiB | Moderate entanglement |
| 256 | 1 MiB | n MiB | High entanglement |
| 1024 | 16 MiB | 16n MiB | Near exact for many circuits |

Memory figures assume 8-byte complex64 entries in a [chi, 2, chi] site tensor; double them for complex128.
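Encoded as an illustrative helper, the per-site figure is just element count times element size:

/// Bytes for one [chi, 2, chi] site tensor with 8-byte complex64 entries
/// (illustrative; double the element size for complex128).
fn mps_site_bytes(chi: usize) -> usize {
    chi * 2 * chi * 8
}
// mps_site_bytes(256) == 1_048_576, i.e. the 1 MiB row above.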

Truncation policy: After each two-qubit gate, perform SVD on the updated bond. If the bond dimension exceeds chi_max, truncate the smallest singular values. Track the total discarded weight (sum of squared discarded singular values) as a fidelity estimate:

pub struct MpsConfig {
    /// Maximum bond dimension. Truncation occurs above this.
    pub chi_max: usize,
    /// Minimum singular value to retain (relative to largest).
    pub svd_cutoff: f64,
    /// Accumulated truncation error (updated during simulation).
    pub fidelity_estimate: f64,
}

impl Default for MpsConfig {
    fn default() -> Self {
        Self {
            chi_max: 256,
            svd_cutoff: 1e-12,
            fidelity_estimate: 1.0,
        }
    }
}
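A minimal sketch of that truncation rule, applied to a descending-sorted singular-value spectrum. The SVD itself is assumed to come from a linear-algebra backend; truncate_spectrum is illustrative, not an existing ruqu-core function:

/// Decide how many singular values to keep under an MpsConfig, and report
/// the discarded weight (sum of squared truncated values).
/// Assumes `spectrum` is sorted in descending order.
fn truncate_spectrum(spectrum: &[f64], cfg: &MpsConfig) -> (usize, f64) {
    let largest = spectrum.first().copied().unwrap_or(0.0);
    let keep = spectrum
        .iter()
        .take(cfg.chi_max)
        .take_while(|&&s| s >= cfg.svd_cutoff * largest)
        .count();
    let discarded: f64 = spectrum[keep..].iter().map(|s| s * s).sum();
    (keep, discarded)
}

The caller would then renormalize the kept spectrum and multiply fidelity_estimate by 1.0 - discarded (for a normalized state).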

5. Automatic Mode Selection

The engine analyzes the circuit before execution to recommend a backend:

pub enum RecommendedBackend {
    StateVector { reason: &'static str },
    TensorNetwork { estimated_treewidth: usize, reason: &'static str },
    Mps { estimated_max_bond: usize, reason: &'static str },
}

pub fn recommend_backend(circuit: &QuantumCircuit) -> RecommendedBackend {
    let n = circuit.num_qubits();
    let depth = circuit.depth();
    let connectivity = circuit.connectivity_graph();

    // Rule 1: Small circuits always use state vector
    if n <= 20 {
        return RecommendedBackend::StateVector {
            reason: "Small circuit; state vector is fastest below 20 qubits",
        };
    }

    // Rule 2: Check for 1D connectivity (MPS candidate)
    if connectivity.max_degree() <= 2 && connectivity.is_path_graph() {
        let estimated_bond = 2_usize.pow(depth.min(20) as u32);
        return RecommendedBackend::Mps {
            estimated_max_bond: estimated_bond,
            reason: "1D nearest-neighbor connectivity detected",
        };
    }

    // Rule 3: Estimate treewidth for general TN
    let estimated_tw = estimate_treewidth(&connectivity, depth);
    if estimated_tw < 25 && n > 25 {
        return RecommendedBackend::TensorNetwork {
            estimated_treewidth: estimated_tw,
            reason: "Low treewidth relative to qubit count",
        };
    }

    // Rule 4: Check memory feasibility for state vector
    // State-vector memory: 16 bytes per complex128 amplitude.
    // Computed in u128 so the shift cannot overflow for large n.
    let sv_memory: u128 = 16u128 << n.min(120);
    let available = estimate_available_memory() as u128;
    if sv_memory > available {
        // Force TN even if treewidth is high -- at least it has a chance
        return RecommendedBackend::TensorNetwork {
            estimated_treewidth: estimated_tw,
            reason: "State vector exceeds available memory; TN is only option",
        };
    }

    RecommendedBackend::StateVector {
        reason: "High treewidth circuit; state vector is more efficient",
    }
}
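Hypothetical caller-side dispatch on the recommendation (a sketch; the builder calls mirror Section 8):

let circuit = QuantumCircuit::new(50).h_all();
match recommend_backend(&circuit) {
    RecommendedBackend::StateVector { reason } => {
        println!("state vector: {reason}");
    }
    RecommendedBackend::TensorNetwork { estimated_treewidth, reason } => {
        println!("tensor network (tw ~ {estimated_treewidth}): {reason}");
    }
    RecommendedBackend::Mps { estimated_max_bond, reason } => {
        println!("MPS (chi ~ {estimated_max_bond}): {reason}");
    }
}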

6. When Tensor Networks Win vs Lose

Tensor networks win when:

| Scenario | Why TN wins | Example |
|---|---|---|
| Shallow circuits on many qubits | Treewidth ~ depth, not n | 50-qubit depth-4 QAOA |
| Sparse graph connectivity | Low treewidth from graph structure | MaxCut on a 3-regular graph |
| Separate registers | Independent contractions | n/2 Bell pairs |
| Near-Clifford | Stabilizer + few non-Clifford gates | Clifford + 5 T gates |
| Amplitude computation | Contract to a single output, not the full state | Sampling one bitstring |

Tensor networks lose when:

| Scenario | Why TN loses | Fallback |
|---|---|---|
| Deep random circuits | Treewidth ~ n | State vector (if n <= 30) |
| All-to-all connectivity | No structure to exploit | State vector |
| Full state tomography needed | Must contract once per amplitude | State vector |
| Very small circuits (n < 20) | Overhead exceeds state vector | State vector |
| High-fidelity MPS needed | Bond dimension grows exponentially | State vector or exact TN |

7. Example: 50-Qubit Shallow QAOA

Consider QAOA depth p=1 on a 50-node 3-regular graph:

Circuit structure:
  - 50 qubits, initialized to |+>
  - 75 ZZ gates (one per edge), parameterized by gamma
  - 50 Rx gates, parameterized by beta
  - Total: 50 + 75 + 50 = 175 gates
  - Circuit depth: 4 logical layers (H, ZZ (3-colorable edge set), Rx, measure)

Graph treewidth of 3-regular graph: typically 8-15

Tensor network contraction:
  - Community detection finds ~5-8 communities of 6-10 nodes
  - Intra-community contraction: O(2^10) ~ 1024 per community
  - Inter-community bonds: ~15 edges cut
  - Effective contraction complexity: O(2^15) = 32768
  - Compare to state vector: O(2^50) = 1.1 * 10^15

Memory comparison:
  - State vector: 2^50 * 16 bytes = 16 PiB (impossible)
  - Tensor network: ~100 MiB working memory
  - Speedup factor: practically infinite (feasible vs infeasible)

Contraction Diagram (simplified):

  Community A        Community B        Community C
  [q0-q9]           [q10-q19]          [q20-q29]
     |                  |                   |
     +--- bond=2^3 ----+---- bond=2^4 -----+
                        |
  Community D        Community E
  [q30-q39]          [q40-q49]
     |                  |
     +--- bond=2^3 ----+

  Peak intermediate tensor: 2^15 elements = 512 KiB

8. Integration with State Vector Backend

Both backends implement the same trait:

pub trait SimulationBackend {
    /// Execute the circuit and return measurement results.
    fn execute(
        &self,
        circuit: &QuantumCircuit,
        shots: usize,
        config: &SimulationConfig,
    ) -> Result<SimulationResult, SimulationError>;

    /// Compute expectation value of an observable.
    fn expectation_value(
        &self,
        circuit: &QuantumCircuit,
        observable: &Observable,
        config: &SimulationConfig,
    ) -> Result<f64, SimulationError>;

    /// Return the backend name for logging.
    fn name(&self) -> &'static str;
}
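A skeleton of how the tensor network backend might satisfy this trait; the struct and its internals are illustrative, not the actual implementation:

pub struct TensorNetworkBackend {
    config: TnConfig,
}

impl SimulationBackend for TensorNetworkBackend {
    fn execute(
        &self,
        circuit: &QuantumCircuit,
        shots: usize,
        config: &SimulationConfig,
    ) -> Result<SimulationResult, SimulationError> {
        // 1. Convert the circuit to a tensor network (Section 2).
        // 2. Optimize the contraction path (Section 3).
        // 3. Contract and sample `shots` measurement outcomes.
        todo!("contraction-based execution")
    }

    fn expectation_value(
        &self,
        circuit: &QuantumCircuit,
        observable: &Observable,
        config: &SimulationConfig,
    ) -> Result<f64, SimulationError> {
        // Contract <psi|O|psi> as a closed network; no full state is built.
        todo!("closed-network contraction")
    }

    fn name(&self) -> &'static str {
        "tensor-network"
    }
}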

Users interact through QuantumCircuit and never need to know which backend is active:

let circuit = QuantumCircuit::new(50)
    .h_all()
    .append_qaoa_layer(graph, gamma, beta)
    .measure_all();

// Automatic backend selection
let result = ruqu::execute(&circuit, 1000)?;
// -> Internally selects TensorNetwork backend due to n=50, low treewidth

// Or explicit backend override
let result = ruqu::execute_with_backend(
    &circuit,
    1000,
    Backend::TensorNetwork(TnConfig::default()),
)?;

9. Future: ruvector-mincut Integration for Contraction Ordering

The ruvector-mincut crate currently solves balanced graph partitioning for vector index sharding. The same algorithm directly applies to tensor network contraction ordering via the following correspondence:

| Graph partitioning concept | TN contraction concept |
|---|---|
| Vertex | Tensor |
| Edge weight | Bond dimension (log2) |
| Partition | Contraction subtree |
| Edge cut | Inter-partition bond cost |
| Balanced partition | Balanced contraction tree |

Phase 1 (this ADR): Use ruvector-mincut for community detection in contraction path optimization.

Phase 2 (future): Extend ruvector-mincut with hypergraph partitioning for multi-index tensor contractions, enabling handling of higher-order tensor networks (e.g., PEPS for 2D circuits).

Consequences

Positive

  1. Dramatically expanded qubit range: Shallow circuits on 40-60 qubits become tractable on commodity hardware.
  2. Surface code simulation: Distance-7 surface codes (49 data + 48 ancilla = 97 qubits) can be simulated for decoder validation using MPS (the circuit is geometrically local).
  3. Unified interface: Users write circuits once; backend selection is automatic.
  4. Synergy with ruvector-mincut: Leverages existing graph partitioning investment.
  5. Complementary to state vector: Each backend covers the other's weakness.

Negative

  1. Implementation complexity: Tensor contraction, SVD truncation, and path optimization are non-trivial to implement correctly and efficiently.
  2. Approximation risk: MPS truncation introduces controlled but nonzero error. Users must understand fidelity estimates.
  3. Compilation time: The ndarray and petgraph dependencies add to compile time when the feature is enabled.
  4. Testing surface: Supporting two backends doubles the testing matrix for correctness validation.
  5. Performance unpredictability: Contraction cost depends on circuit structure in ways that are hard to predict without running the path optimizer.

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Path optimizer finds poor ordering | Medium | High cost | Multiple heuristics + timeout fallback to greedy |
| MPS fidelity silently degrades | Medium | Incorrect results | Track discarded weight; warn if fidelity < 0.99 |
| Feature interaction bugs | Low | Incorrect results | Shared test suite: both backends must agree on small circuits |
| Memory spike during contraction | Medium | OOM | Pre-estimate peak intermediate tensor size; abort if too large |

References

  • QuantRS2 tensor network implementation: internal reference
  • Markov & Shi, "Simulating Quantum Computation by Contracting Tensor Networks" (2008)
  • Gray & Kourtis, "Hyper-optimized tensor network contraction" (2021) -- cotengra
  • Schollwöck, "The density-matrix renormalization group in the age of matrix product states" (2011)
  • ADR-QE-001: Core Engine Architecture (state vector backend)
  • ADR-QE-005: WASM Compilation Target
  • ruvector-mincut crate documentation
  • ADR-014: Coherence Engine (graph partitioning reuse)