ADR-QE-011: Memory Gating & Power Management

Status: Proposed
Date: 2026-02-06
Authors: ruv.io, RuVector Team
Deciders: Architecture Review Board


Context

ruVector is designed to operate within the Cognitum computing paradigm: a tile-based architecture with 256 low-power processor cores, event-driven activation, and aggressive power gating. Agents (software components) remain fully dormant until an event triggers their activation. Once their work completes, they release all resources and return to dormancy.

The quantum simulation engine must adhere to this model:

  1. Zero idle footprint: When no simulation is running, the engine consumes zero CPU cycles and zero heap memory beyond its compiled code and static data.
  2. Rapid activation: The engine must be ready to execute a simulation within microseconds of receiving a request.
  3. Prompt resource release: Upon simulation completion (or failure), all allocated memory is immediately freed.
  4. Predictable memory: Callers must be able to determine exact memory requirements before committing to a simulation.

Memory Scale

The state vector for n qubits requires 2^n complex amplitudes, each consuming 16 bytes (two f64 values):

| Qubits | Amplitudes | Memory | Notes |
|--------|------------|--------|-------|
| 10 | 1,024 | 16 KiB | Trivial |
| 15 | 32,768 | 512 KiB | Small |
| 20 | 1,048,576 | 16 MiB | Moderate |
| 25 | 33,554,432 | 512 MiB | Large |
| 28 | 268,435,456 | 4 GiB | Needs dedicated memory |
| 30 | 1,073,741,824 | 16 GiB | Workstation-class |
| 32 | 4,294,967,296 | 64 GiB | Server-class |
| 35 | 34,359,738,368 | 512 GiB | HPC |
| 40 | 1,099,511,627,776 | 16 TiB | Infeasible (state vector) |

Each additional qubit doubles memory. This exponential scaling makes memory the primary resource constraint and the most important resource to manage.
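The table rows follow directly from 2^n amplitudes at 16 bytes each. A minimal sketch of that arithmetic is shown below. The helper name `state_vector_bytes` is hypothetical (it is not part of the engine API), and `u64` is used so the result is the same on 32-bit and 64-bit targets.

```rust
/// Bytes per amplitude: two f64 values (real and imaginary parts).
const BYTES_PER_AMPLITUDE: u64 = 16;

/// Hypothetical helper: raw state-vector size in bytes for `n_qubits`,
/// or `None` if the size does not fit in a u64.
fn state_vector_bytes(n_qubits: u32) -> Option<u64> {
    // 2^n amplitudes; checked_shl returns None once n_qubits >= 64.
    let amplitudes = 1_u64.checked_shl(n_qubits)?;
    // 16 bytes each; checked_mul returns None once the product exceeds u64.
    amplitudes.checked_mul(BYTES_PER_AMPLITUDE)
}
```

For example, `state_vector_bytes(30)` yields 16 GiB, and each increment of `n_qubits` doubles the result until the checked arithmetic reports overflow.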

Edge and Embedded Constraints

On edge devices (embedded ruVector nodes, IoT gateways, mobile processors), memory is severely limited:

| Platform | Typical RAM | Max qubits (state vector) |
|----------|-------------|---------------------------|
| Cognitum tile (single) | 256 MiB | 23 |
| Cognitum tile cluster (4) | 1 GiB | 25 |
| Raspberry Pi 4 | 8 GiB | 28 |
| Mobile device | 4-6 GiB | 27-28 (with other apps) |
| Laptop | 16-64 GiB | 29-31 |
| Server | 256-512 GiB | 33-34 |
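The qubit limits in this table are consistent with budgeting roughly 20 bytes per amplitude (16 bytes of state plus 25% working memory), the same rule that `max_qubits_for` uses in the Decision section. The sketch below reproduces the table under that assumption; the helper name and the `MIB`/`GIB` constants are illustrative only.

```rust
/// Assumed budget per amplitude: 16 bytes of state plus 25% working memory.
const BUDGET_BYTES_PER_AMPLITUDE: u64 = 20;

const MIB: u64 = 1 << 20;
const GIB: u64 = 1 << 30;

/// Hypothetical helper: largest n with 20 * 2^n <= ram_bytes,
/// or `None` when even a 0-qubit state does not fit.
fn max_qubits_with_headroom(ram_bytes: u64) -> Option<u32> {
    let amplitudes = ram_bytes / BUDGET_BYTES_PER_AMPLITUDE;
    // checked_ilog2 returns None for 0, i.e. when ram_bytes < 20.
    amplitudes.checked_ilog2()
}
```

Under this rule a 256 MiB tile tops out at 23 qubits even though a raw 24-qubit state vector is exactly 256 MiB, because the working-memory headroom no longer fits.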

WASM Memory Model

WebAssembly uses a linear memory that can grow but cannot shrink. Once a large simulation allocates pages, those pages remain mapped until the WASM instance is destroyed. This is a fundamental platform limitation that must be documented and accounted for.

Decision

1. Zero-Idle Footprint Architecture

The quantum engine is implemented as a pure library with no runtime overhead:

// The engine is a collection of functions and types.
// No background threads, no event loops, no persistent state.
// When not called, it consumes exactly zero CPU and zero heap.

pub struct QuantumEngine;  // Zero-sized type; purely a namespace

impl QuantumEngine {
    /// Execute a simulation. All resources are allocated on entry
    /// and freed on exit (or on error).
    pub fn execute(
        circuit: &QuantumCircuit,
        shots: usize,
        config: &SimulationConfig,
    ) -> Result<SimulationResult, SimulationError> {
        // 1. Estimate and validate memory
        let required = Self::estimate_memory(circuit.num_qubits())?;
        Self::validate_memory_available(required)?;

        // 2. Allocate state vector (the big allocation)
        let mut state = Self::allocate_state(circuit.num_qubits())?;

        // 3. Execute gates (all computation happens here)
        Self::apply_gates(circuit, &mut state, config)?;

        // 4. Measure (if requested)
        let measurements = Self::measure(&state, shots)?;

        // 5. Build result (copies out what we need)
        let result = SimulationResult::from_state_and_measurements(
            &state, measurements, circuit,
        );

        // 6. Return the result. `state` is dropped when this function
        //    returns, on the success path and on every early `?` return,
        //    which deallocates the Vec<Complex<f64>>. No finalizers needed.

        Ok(result)
    }
}

Key properties:

  • No new() or init() methods that create persistent state.
  • No Drop impl with complex cleanup logic.
  • No Arc, Mutex, or shared state between calls.
  • Each call is fully independent and self-contained.
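To illustrate the "freed on exit (or on error)" property, here is a small std-only sketch that counts live state vectors. Everything in it is hypothetical: `TrackedState`, `run_once`, and the `LIVE_STATES` counter are not engine types, `[f64; 2]` stands in for `Complex<f64>`, and the `Drop` impl exists purely as instrumentation (the engine itself needs none).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of state vectors currently alive (instrumentation only).
static LIVE_STATES: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the engine's state vector; each amplitude is [re, im].
struct TrackedState {
    amplitudes: Vec<[f64; 2]>,
}

impl TrackedState {
    fn new(n_qubits: u32) -> Self {
        LIVE_STATES.fetch_add(1, Ordering::SeqCst);
        let mut amplitudes = vec![[0.0, 0.0]; 1usize << n_qubits];
        amplitudes[0] = [1.0, 0.0]; // |00...0>
        TrackedState { amplitudes }
    }
}

impl Drop for TrackedState {
    fn drop(&mut self) {
        LIVE_STATES.fetch_sub(1, Ordering::SeqCst);
    }
}

/// One self-contained "simulation": allocate, compute, return.
/// Returns the squared norm of the state, or an error if `fail` is set.
fn run_once(n_qubits: u32, fail: bool) -> Result<f64, String> {
    let state = TrackedState::new(n_qubits);
    if fail {
        // Early return: `state` is dropped here as well.
        return Err("simulated gate failure".to_string());
    }
    let norm = state
        .amplitudes
        .iter()
        .map(|a| a[0] * a[0] + a[1] * a[1])
        .sum();
    Ok(norm)
}

/// Reads the instrumentation counter.
fn live_states() -> usize {
    LIVE_STATES.load(Ordering::SeqCst)
}
```

After either `run_once(3, false)` or `run_once(3, true)` returns, `live_states()` reads 0 again, which is the behaviour the key properties above rely on.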

2. On-Demand Allocation Strategy

State vectors are allocated at simulation start and freed at simulation end:

fn allocate_state(n_qubits: u32) -> Result<StateVector, SimulationError> {
    let num_amplitudes = 1_usize.checked_shl(n_qubits)
        .ok_or(SimulationError::QubitLimitExceeded {
            requested: n_qubits,
            maximum: (usize::BITS - 1) as u32,
            estimated_memory_bytes: u64::MAX,
            available_memory_bytes: estimate_available_memory() as u64,
        })?;

    let required_bytes = num_amplitudes
        .checked_mul(std::mem::size_of::<Complex<f64>>())
        .ok_or(SimulationError::MemoryAllocationFailed {
            requested_bytes: u64::MAX,
            qubit_count: n_qubits,
            suggestion: "Qubit count exceeds addressable memory",
        })?;

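    // Attempt the allocation with try_reserve_exact, which returns Err
    // instead of panicking or aborting when the allocator reports failure.
    // On systems that overcommit memory (e.g. Linux by default) the
    // reservation can succeed and the process may still be OOM-killed when
    // the pages are first touched, so the earlier availability check matters.
    let mut amplitudes: Vec<Complex<f64>> = Vec::new();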
    amplitudes.try_reserve_exact(num_amplitudes)
        .map_err(|_| SimulationError::MemoryAllocationFailed {
            requested_bytes: required_bytes as u64,
            qubit_count: n_qubits,
            suggestion: "Reduce qubit count or use tensor-network backend",
        })?;

    // Initialize to |00...0> state
    amplitudes.resize(num_amplitudes, Complex::new(0.0, 0.0));
    amplitudes[0] = Complex::new(1.0, 0.0);

    Ok(StateVector { amplitudes, n_qubits })
}

The allocation sequence:

  IDLE (zero memory)
    |
    v
  estimate_memory(n) --> returns bytes needed
    |
    v
  validate_memory_available(bytes) --> checks against OS/platform limits
    |                                   returns Err if insufficient
    v
  Vec::try_reserve_exact(2^n) --> attempts allocation
    |                              returns Err on failure (no panic)
    v
  ALLOCATED (2^n * 16 bytes on heap)
    |
    v
  [... simulation runs ...]
    |
    v
  Vec::drop() --> automatic deallocation
    |
    v
  IDLE (zero memory)
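The sequence above can be exercised without the rest of the engine. The following std-only sketch makes two assumptions: `[f64; 2]` stands in for `Complex<f64>`, and `AllocError` is a simplified, hypothetical replacement for `SimulationError`.

```rust
use std::collections::TryReserveError;

/// Stand-in for Complex<f64>: [re, im], 16 bytes.
type Amplitude = [f64; 2];

#[derive(Debug)]
enum AllocError {
    /// 2^n does not fit in usize on this target.
    TooManyQubits,
    /// The capacity check or the allocator refused the request.
    Reserve(TryReserveError),
}

/// Allocates the |00...0> state for `n_qubits` without panicking on failure.
fn try_allocate_zero_state(n_qubits: u32) -> Result<Vec<Amplitude>, AllocError> {
    let len = 1_usize
        .checked_shl(n_qubits)
        .ok_or(AllocError::TooManyQubits)?;

    let mut amplitudes: Vec<Amplitude> = Vec::new();
    // Returns Err instead of panicking or aborting on failure.
    amplitudes
        .try_reserve_exact(len)
        .map_err(AllocError::Reserve)?;

    // Capacity is already reserved, so resize does not reallocate.
    amplitudes.resize(len, [0.0, 0.0]);
    amplitudes[0] = [1.0, 0.0];
    Ok(amplitudes)
}
```

A request such as `try_allocate_zero_state(62)` fails with an error value because the byte size cannot be represented, and the returned `Vec` from a successful call is freed as soon as the caller drops it.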

3. Memory Estimation API

Callers can query exact memory requirements before committing:

/// Returns the number of bytes required to simulate n_qubits.
/// This accounts for the state vector plus working memory for
/// gate application (temporary buffers, measurement arrays, etc.).
///
/// # Returns
/// - `Ok(MemoryEstimate)` if every byte count is representable
/// - `Err(...)` if 2^n_qubits or any derived byte count overflows usize
pub fn estimate_memory(n_qubits: u32) -> Result<MemoryEstimate, SimulationError> {
    // Shared error for every overflow path below.
    let too_large = || SimulationError::QubitLimitExceeded {
        requested: n_qubits,
        maximum: usize::BITS - 1,
        estimated_memory_bytes: u64::MAX,
        available_memory_bytes: 0,
    };

    let num_amplitudes = 1_usize.checked_shl(n_qubits).ok_or_else(too_large)?;

    // Checked multiplication: for n_qubits >= 60 on a 64-bit target the
    // product overflows, which would panic in debug builds and wrap in release.
    let state_vector_bytes = num_amplitudes
        .checked_mul(std::mem::size_of::<Complex<f64>>())
        .ok_or_else(too_large)?;

    // Working memory: temporary buffers for gate application plus
    // measurement result storage, budgeted at 25% of the state vector.
    let working_bytes = state_vector_bytes / 4;

    // Thread-local scratch space (per Rayon thread)
    let thread_count = rayon::current_num_threads();
    let scratch_per_thread = 64 * 1024; // 64 KiB per thread for local buffers
    let thread_scratch = thread_count * scratch_per_thread;

    let total_bytes = state_vector_bytes
        .checked_add(working_bytes)
        .and_then(|bytes| bytes.checked_add(thread_scratch))
        .ok_or_else(too_large)?;

    Ok(MemoryEstimate {
        state_vector_bytes: state_vector_bytes as u64,
        working_bytes: working_bytes as u64,
        thread_scratch_bytes: thread_scratch as u64,
        total_bytes: total_bytes as u64,
        num_amplitudes: num_amplitudes as u64,
    })
}

#[derive(Debug, Clone)]
pub struct MemoryEstimate {
    /// Bytes for the state vector (dominant cost).
    pub state_vector_bytes: u64,
    /// Bytes for gate-application working memory.
    pub working_bytes: u64,
    /// Bytes for thread-local scratch space.
    pub thread_scratch_bytes: u64,
    /// Total estimated bytes.
    pub total_bytes: u64,
    /// Number of complex amplitudes.
    pub num_amplitudes: u64,
}

impl MemoryEstimate {
    /// Returns true if the estimate fits within the given byte budget.
    pub fn fits_in(&self, available_bytes: u64) -> bool {
        self.total_bytes <= available_bytes
    }

    /// Suggest the maximum qubits for a given memory budget.
    /// Ignores per-thread scratch space, so confirm the result with
    /// estimate_memory() when the budget is tight.
    pub fn max_qubits_for(available_bytes: u64) -> u32 {
        // Each qubit doubles memory; find largest n where 20 * 2^n <= available
        // Factor of 20 accounts for 16-byte amplitudes + 25% working memory
        let effective = available_bytes / 20;
        if effective == 0 { return 0; }
        effective.ilog2()
    }
}
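A caller-side pre-flight check might look like the sketch below. It is a simplified, std-only mirror of the estimate above, not the engine API: the `Estimate` struct, `estimate`, and `fits` are hypothetical names, and the thread count is passed in explicitly instead of being read from Rayon.

```rust
/// Simplified mirror of MemoryEstimate (hypothetical, std-only).
#[derive(Debug, Clone, PartialEq)]
struct Estimate {
    state_vector_bytes: u64,
    working_bytes: u64,
    thread_scratch_bytes: u64,
    total_bytes: u64,
}

/// Estimates memory for `n_qubits` with `threads` worker threads.
/// Returns None if any intermediate value overflows u64.
fn estimate(n_qubits: u32, threads: u64) -> Option<Estimate> {
    // 2^n amplitudes at 16 bytes each.
    let state_vector_bytes = 1_u64.checked_shl(n_qubits)?.checked_mul(16)?;
    // 25% of the state vector for gate-application working memory.
    let working_bytes = state_vector_bytes / 4;
    // 64 KiB of scratch per thread.
    let thread_scratch_bytes = threads.checked_mul(64 * 1024)?;
    let total_bytes = state_vector_bytes
        .checked_add(working_bytes)?
        .checked_add(thread_scratch_bytes)?;
    Some(Estimate {
        state_vector_bytes,
        working_bytes,
        thread_scratch_bytes,
        total_bytes,
    })
}

/// Pre-flight check: does an n-qubit run fit in `budget_bytes`?
fn fits(n_qubits: u32, threads: u64, budget_bytes: u64) -> bool {
    estimate(n_qubits, threads).map_or(false, |e| e.total_bytes <= budget_bytes)
}
```

One consequence worth noting: because the 20-bytes-per-amplitude shortcut ignores thread scratch, a budget of exactly 20 * 2^n bytes does not pass `fits` once at least one thread is counted. For small circuits the scratch space can even dominate, e.g. a 10-qubit run with one thread needs 84 KiB in this model, not 20 KiB.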

4. Allocation Failure Handling

The engine never panics on allocation failure. All paths return structured errors:

// Pattern: every allocation is fallible and returns a descriptive error.

// State vector allocation failure:
SimulationError::MemoryAllocationFailed {
    requested_bytes: 17_179_869_184,  // 16 GiB
    qubit_count: 30,
    suggestion: "Reduce qubit count by 2 (to 28, ~4 GiB) or enable tensor-network backend",
}

// Integer overflow (qubit count too large):
SimulationError::QubitLimitExceeded {
    requested: 64,
    maximum: 31,  // based on available memory: floor(log2(64 GiB / 20))
    estimated_memory_bytes: u64::MAX,
    available_memory_bytes: 68_719_476_736,  // 64 GiB
}

Decision tree on allocation failure:

  Memory allocation failed
    |
    +-- Is tensor-network feature enabled?
    |     |
    |     +-- YES: Suggest tensor-network backend
    |     |         (may work if circuit has low treewidth)
    |     |
    |     +-- NO: Suggest reducing qubit count
    |             Calculate: max_qubits = floor(log2(available / 20))
    |             Suggest: "Reduce to {max_qubits} qubits ({memory} bytes)"
    |
    +-- Is the request wildly over budget (>100x)?
    |     |
    |     +-- YES: "Circuit requires {X} GiB but only {Y} MiB available"
    |     |
    |     +-- NO: "Circuit requires {X} GiB, {Y} GiB available.
    |              Reducing by {delta} qubits would fit."
    |
    +-- Return SimulationError (no panic, no abort)
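One possible reading of this decision tree, as a sketch: the two questions are evaluated independently and combined into a single report. The `Advice` and `FailureReport` types and the `diagnose` function are hypothetical, and the 20-bytes-per-amplitude factor is taken from the formula in the tree.

```rust
#[derive(Debug, PartialEq)]
enum Advice {
    /// Tensor-network feature is enabled: suggest that backend.
    UseTensorNetwork,
    /// Otherwise: suggest shrinking the circuit to `max_qubits`.
    ReduceQubits { max_qubits: u32 },
}

#[derive(Debug, PartialEq)]
struct FailureReport {
    advice: Advice,
    /// True when the request exceeds the budget by more than 100x.
    wildly_over_budget: bool,
    /// How many qubits must be removed for the circuit to fit.
    qubits_to_drop: u32,
}

fn diagnose(requested_qubits: u32, available_bytes: u64, tensor_network: bool) -> FailureReport {
    // max_qubits = floor(log2(available / 20)); 0 if nothing fits.
    let max_qubits = (available_bytes / 20).checked_ilog2().unwrap_or(0);

    // Required bytes in u128 so that large qubit counts cannot overflow.
    // The shift is clamped at 100, far beyond any feasible request.
    let required: u128 = 20_u128 << requested_qubits.min(100);
    let wildly_over_budget = required > 100 * available_bytes as u128;

    let advice = if tensor_network {
        Advice::UseTensorNetwork
    } else {
        Advice::ReduceQubits { max_qubits }
    };

    FailureReport {
        advice,
        wildly_over_budget,
        qubits_to_drop: requested_qubits.saturating_sub(max_qubits),
    }
}
```

For a 30-qubit request against 8 GiB this reports a reduction to 28 qubits (drop 2) and does not flag the request as wildly over budget; the caller would then format the corresponding message and return it inside a `SimulationError`.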

5. CPU Yielding for Long Simulations

For simulations estimated to exceed 100 ms, the engine can optionally yield between gate batches so that the OS scheduler can run other work and manage power states:

pub struct YieldConfig {
    /// Enable cooperative yielding between gate batches.
    /// Default: false (maximum throughput).
    pub enabled: bool,

    /// Number of gates to apply before yielding.
    /// Default: 1000.
    pub gates_per_slice: usize,

    /// Yield mechanism.
    /// Default: ThreadYield (std::thread::yield_now).
    pub yield_strategy: YieldStrategy,
}

pub enum YieldStrategy {
    /// Call std::thread::yield_now() between slices.
    ThreadYield,
    /// Sleep for specified duration between slices.
    Sleep(Duration),
    /// Call a user-provided callback between slices.
    Callback(Box<dyn Fn(SliceProgress) + Send>),
}

pub struct SliceProgress {
    pub gates_completed: u64,
    pub gates_remaining: u64,
    pub elapsed: Duration,
    pub estimated_remaining: Duration,
}

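// Usage in gate application loop (needs `use std::time::Instant;`):
fn apply_gates_with_yield(
    circuit: &QuantumCircuit,
    state: &mut StateVector,
    yield_config: &YieldConfig,
) -> Result<(), SimulationError> {
    let gates = circuit.gates();
    // Reference point for the progress values reported to callbacks.
    let start = Instant::now();
    // Treat gates_per_slice == 0 as 1 so the modulo below cannot panic.
    let slice_len = yield_config.gates_per_slice.max(1);

    for (i, gate) in gates.iter().enumerate() {
        apply_single_gate(gate, state)?;

        if yield_config.enabled && (i + 1) % slice_len == 0 {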
            match &yield_config.yield_strategy {
                YieldStrategy::ThreadYield => std::thread::yield_now(),
                YieldStrategy::Sleep(d) => std::thread::sleep(*d),
                YieldStrategy::Callback(cb) => cb(SliceProgress {
                    gates_completed: (i + 1) as u64,
                    gates_remaining: (gates.len() - i - 1) as u64,
                    elapsed: start.elapsed(),
                    estimated_remaining: estimate_remaining(i, gates.len(), start),
                }),
            }
        }
    }

    Ok(())
}

Yield is disabled by default to maximize throughput. It is primarily intended for:

  • Edge devices where power management is critical.
  • Interactive applications where UI responsiveness matters.
  • Long-running simulations (>1 second) where progress reporting is needed.
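The slicing logic can be tried in isolation. The sketch below strips out the engine types: `run_in_slices` and `yield_points` are hypothetical helpers, and the gate application is left as a placeholder comment.

```rust
/// Runs `total_gates` units of work and invokes `on_slice(gates_completed)`
/// after every full slice of `gates_per_slice` gates.
fn run_in_slices<F: FnMut(usize)>(total_gates: usize, gates_per_slice: usize, mut on_slice: F) {
    // Treat a slice length of 0 as 1 so the modulo cannot panic.
    let slice_len = gates_per_slice.max(1);
    for i in 0..total_gates {
        // ... apply gate i here ...
        if (i + 1) % slice_len == 0 {
            on_slice(i + 1);
        }
    }
}

/// Illustration helper: yields the thread at each slice boundary and
/// records how many gates had completed at that point.
fn yield_points(total_gates: usize, gates_per_slice: usize) -> Vec<usize> {
    let mut points = Vec::new();
    run_in_slices(total_gates, gates_per_slice, |done| {
        // Same mechanism as YieldStrategy::ThreadYield.
        std::thread::yield_now();
        points.push(done);
    });
    points
}
```

With 2,500 gates and the default slice of 1,000, the loop yields twice (after gates 1,000 and 2,000); the trailing partial slice of 500 gates does not trigger a yield.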

6. Thread Management

The quantum engine does not create or manage its own threads:

  +-----------------------------------------------+
  |              Global Rayon Thread Pool          |
  |  (shared by all ruVector subsystems)          |
  |                                                |
  |  [Thread 0] [Thread 1] ... [Thread N-1]       |
  |     ^           ^              ^               |
  |     |           |              |               |
  |  +--+---+   +--+---+      +---+--+            |
  |  | ruQu |   | ruQu |      | idle |            |
  |  | gate  |   | gate |      |      |            |
  |  | apply |   | apply|      |      |            |
  |  +-------+   +------+      +------+            |
  |                                                |
  |  During simulation: threads work on gates      |
  |  After simulation: threads return to pool      |
  |  Pool idle: OS can power-gate cores            |
  +-----------------------------------------------+

Key properties:

  • Rayon's global thread pool is initialized once by ruvector-core at startup.
  • The quantum engine calls par_iter() (from rayon::prelude) and related APIs, borrowing threads temporarily.
  • When simulation completes, all threads are returned to the global pool.
  • If no ruVector work is pending, Rayon threads park (blocking on a condvar), consuming zero CPU. The OS can then power-gate the underlying cores.
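The "borrow workers, then hand them back" pattern can be sketched with the standard library alone. Note the substitution: this example uses `std::thread::scope` as a stand-in for Rayon, so it spawns short-lived threads instead of borrowing from a shared pool, and the function names are hypothetical. What it does share with the design above is that the engine holds no threads once the call returns. The gate is a Pauli-Z on one qubit, which negates every amplitude whose index has the target bit set.

```rust
use std::thread;

/// Applies Pauli-Z on qubit `target`, splitting the amplitudes ([re, im])
/// into contiguous chunks that are processed by scoped worker threads.
fn apply_z_parallel(amplitudes: &mut [[f64; 2]], target: u32, workers: usize) {
    let workers = workers.max(1);
    // Ceiling division, with a minimum chunk length of 1.
    let chunk_len = ((amplitudes.len() + workers - 1) / workers).max(1);

    // All spawned threads are joined before `scope` returns.
    thread::scope(|scope| {
        for (chunk_idx, chunk) in amplitudes.chunks_mut(chunk_len).enumerate() {
            // Global index of the first amplitude in this chunk.
            let base = chunk_idx * chunk_len;
            scope.spawn(move || {
                for (offset, amp) in chunk.iter_mut().enumerate() {
                    if ((base + offset) >> target) & 1 == 1 {
                        amp[0] = -amp[0];
                        amp[1] = -amp[1];
                    }
                }
            });
        }
    });
}

/// Illustration helper: applies Z to an (unnormalised) all-ones vector
/// and returns the real parts so the sign pattern is visible.
fn z_real_parts(n_qubits: u32, target: u32, workers: usize) -> Vec<f64> {
    let mut state = vec![[1.0, 0.0]; 1usize << n_qubits];
    apply_z_parallel(&mut state, target, workers);
    state.iter().map(|amp| amp[0]).collect()
}
```

With Rayon the chunk loop would typically become a parallel iterator over the same chunks, with the pool's threads parking again once the iterator is exhausted.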

7. WASM Memory Considerations

WebAssembly linear memory has a specific behavior that affects resource management:

  WASM Memory Layout
  +------------------+------------------+
  |  Initial pages   |  Grown pages     |
  |  (compiled size) |  (runtime alloc) |
  +------------------+------------------+
  0                  initial_size       current_size

  Growth: memory.grow(delta_pages) -> adds pages to the end
  Shrink: NOT SUPPORTED in WASM spec

  After 25-qubit simulation:
  +------------------+----------------------------------+
  |  Initial (1 MiB) |  Grown for state vec (512 MiB)  |  <- HIGH WATER MARK
  +------------------+----------------------------------+

  After simulation completes:
  +------------------+----------------------------------+
  |  Initial (1 MiB) |  FREED internally but pages      |
  |                   |  still mapped (512 MiB virtual)  |
  +------------------+----------------------------------+
  The Rust allocator returns memory to its free list,
  but WASM pages are not returned to the host.

Implications and mitigations:

  1. Document the behavior: Users must understand that WASM memory is a high-water mark. A 25-qubit simulation permanently increases the WASM instance's memory footprint to ~512 MiB.

  2. Instance recycling: For applications that run multiple simulations, create a new WASM instance periodically to reset the memory high-water mark.

  3. Memory budget enforcement: The WASM host can set WebAssembly.Memory with a maximum parameter to cap growth:

const memory = new WebAssembly.Memory({
    initial: 16,      // 1 MiB
    maximum: 8192,     // 512 MiB cap
});
  4. Pre-check in WASM: The engine's estimate_memory() function works in WASM and should be called before simulation to verify the allocation will succeed.
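When sizing the `maximum` parameter, it helps to convert byte counts into 64 KiB WASM pages. The sketch below does that conversion; `wasm_pages_for` and `fits_under_cap` are hypothetical helpers, and the check deliberately counts only the raw state vector on top of the initial pages (working memory and allocator overhead would add to it).

```rust
/// WebAssembly page size in bytes (64 KiB).
const WASM_PAGE_BYTES: u64 = 64 * 1024;

/// Pages needed to hold `bytes`, rounded up.
fn wasm_pages_for(bytes: u64) -> u64 {
    bytes / WASM_PAGE_BYTES + u64::from(bytes % WASM_PAGE_BYTES != 0)
}

/// Hypothetical pre-check: would growing from `initial_pages` by an
/// n-qubit state vector stay within `max_pages`?
fn fits_under_cap(n_qubits: u32, initial_pages: u64, max_pages: u64) -> bool {
    // 2^n amplitudes at 16 bytes each, with overflow treated as "does not fit".
    let state_bytes = match 1_u64.checked_shl(n_qubits).and_then(|a| a.checked_mul(16)) {
        Some(bytes) => bytes,
        None => return false,
    };
    initial_pages.saturating_add(wasm_pages_for(state_bytes)) <= max_pages
}
```

Applied to the configuration shown above (16 initial pages, 8192 maximum), this suggests that a 24-qubit run fits but a 25-qubit run does not: its 512 MiB state vector alone needs 8192 pages, and the initial pages push the total past the cap.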

8. Cognitum Tile Integration

On Cognitum's tile-based architecture, the quantum engine maps to tiles as follows:

  Cognitum Processor (256 tiles)
  +--------+--------+--------+--------+
  | Tile 0 | Tile 1 | Tile 2 | Tile 3 |  <- Assigned to quantum sim
  | ACTIVE | ACTIVE | ACTIVE | ACTIVE |
  +--------+--------+--------+--------+
  | Tile 4 | Tile 5 | Tile 6 | Tile 7 |  <- Other ruVector work (or sleeping)
  | sleep  | vecDB  | sleep  | sleep  |
  +--------+--------+--------+--------+
  |  ...   |  ...   |  ...   |  ...   |
  | sleep  | sleep  | sleep  | sleep  |  <- Power gated (zero consumption)
  +--------+--------+--------+--------+

Power state diagram for a quantum simulation lifecycle:

  State: ALL_TILES_IDLE
    |
    | Simulation request arrives
    v
  State: ALLOCATING
    Action: Wake tiles 0-3 (or however many are needed)
    Action: Allocate state vector across tile-local memory
    Power: Tiles 0-3 ACTIVE, rest SLEEP
    |
    v
  State: SIMULATING
    Action: Apply gates in parallel across active tiles
    Power: Tiles 0-3 at full clock rate
    Duration: microseconds to seconds depending on circuit
    |
    v
  State: MEASURING
    Action: Sample measurement outcomes
    Power: Tile 0 only (measurement is sequential)
    |
    v
  State: DEALLOCATING
    Action: Free state vector
    Action: Return tiles to idle pool
    |
    v
  State: ALL_TILES_IDLE
    Power: Tiles 0-3 back to SLEEP
    Memory: Zero heap allocation

Tile assignment policy:

  • Small simulations (n <= 20): 1 tile sufficient.
  • Medium simulations (20 < n <= 25): 2-4 tiles for parallel gate application.
  • Large simulations (25 < n <= 30): All available tiles.
  • The tile scheduler (part of Cognitum runtime) handles assignment. The quantum engine simply uses Rayon parallelism; the runtime maps Rayon threads to tiles.
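The policy above can be written down as a simple lookup. This is only a sketch of the stated ranges: the actual assignment is made by the Cognitum tile scheduler, and `tiles_for` is a hypothetical name. Because the policy gives a range (2-4 tiles) for medium simulations without saying how to choose within it, the function returns the range instead of a single number.

```rust
use std::ops::RangeInclusive;

/// Returns the range of tile counts the policy allows for `n_qubits`,
/// or None for sizes the policy does not cover (n > 30).
fn tiles_for(n_qubits: u32, available_tiles: usize) -> Option<RangeInclusive<usize>> {
    match n_qubits {
        // Small simulations: one tile is sufficient.
        0..=20 => Some(1..=1),
        // Medium simulations: 2-4 tiles for parallel gate application.
        21..=25 => Some(2..=4),
        // Large simulations: all available tiles.
        26..=30 => Some(available_tiles..=available_tiles),
        // Beyond the sizes covered by the policy.
        _ => None,
    }
}
```

The lookup does not clamp the small and medium ranges to `available_tiles`; a real scheduler would also have to handle the case where fewer tiles are free than the policy asks for.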

9. Memory Budget Table

Quick reference for capacity planning:

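| Qubits | State Vector | Working Memory | Total | Platform Fit |
|--------|--------------|----------------|-------|--------------|
| 10 | 16 KiB | 4 KiB | 20 KiB | Any |
| 12 | 64 KiB | 16 KiB | 80 KiB | Any |
| 14 | 256 KiB | 64 KiB | 320 KiB | Any |
| 16 | 1 MiB | 256 KiB | 1.25 MiB | Any |
| 18 | 4 MiB | 1 MiB | 5 MiB | Any |
| 20 | 16 MiB | 4 MiB | 20 MiB | Any |
| 22 | 64 MiB | 16 MiB | 80 MiB | Cognitum single tile |
| 24 | 256 MiB | 64 MiB | 320 MiB | Cognitum 2+ tiles |
| 26 | 1 GiB | 256 MiB | 1.25 GiB | Cognitum cluster (more than 4 tiles) |
| 28 | 4 GiB | 1 GiB | 5 GiB | Laptop / RPi 8 GB |
| 30 | 16 GiB | 4 GiB | 20 GiB | Workstation |
| 32 | 64 GiB | 16 GiB | 80 GiB | Server |
| 34 | 256 GiB | 64 GiB | 320 GiB | Large server |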

10. Allocation and Deallocation Sequence Diagram

  Caller                Engine                  OS/Allocator
    |                     |                         |
    |  execute(circuit)   |                         |
    |-------------------->|                         |
    |                     |                         |
    |                     |  estimate_memory(n)     |
    |                     |  validate_available()   |
    |                     |                         |
    |                     |  try_reserve_exact(2^n) |
    |                     |------------------------>|
    |                     |                         |
    |                     |     Ok(ptr) or Err      |
    |                     |<------------------------|
    |                     |                         |
    |                     |  [if Err: return        |
    |                     |   SimulationError]      |
    |                     |                         |
    |                     |  initialize |00...0>    |
    |                     |  apply gates            |
    |                     |  measure                |
    |                     |                         |
    |                     |  build result           |
    |                     |  (copies measurements,  |
    |                     |   expectation values)   |
    |                     |                         |
    |                     |  drop(state_vector)     |
    |                     |------------------------>|
    |                     |                         |  free(ptr, 2^n * 16)
    |                     |                         |
    |  Ok(result)         |                         |
    |<--------------------|                         |
    |                     |                         |
    |  [Engine holds ZERO |                         |
    |   heap memory now]  |                         |

Consequences

Positive

  1. True zero-idle cost: No background resource consumption. Perfectly aligned with Cognitum's event-driven architecture and power gating.
  2. Predictable memory: estimate_memory() gives exact requirements before committing, preventing OOM surprises.
  3. Graceful degradation: Allocation failures return structured errors with actionable suggestions, never panics.
  4. Platform portable: The same allocation strategy works on native (Linux, macOS, Windows), WASM, and embedded (Cognitum tiles).
  5. No resource leaks: Rust's ownership system guarantees deallocation on all exit paths (success, error, and unwinding panic).

Negative

  1. No state caching: Each simulation allocates and deallocates independently. Repeated simulations on the same qubit count pay allocation cost each time. Mitigation: allocation is O(2^n) but fast compared to O(G * 2^n) simulation.
  2. WASM memory high-water mark: Cannot reclaim WASM linear memory pages. Documented as a platform limitation with instance-recycling workaround.
  3. No memory pooling: Could theoretically amortize allocation across simulations, but this conflicts with the zero-idle-footprint requirement.
  4. Yield overhead: When enabled, cooperative yielding adds per-slice overhead. Mitigated by making it opt-in and configurable.

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| OOM despite estimate_memory check | Low | Crash | Check returns conservative estimate including working memory |
| WASM instance runs out of address space | Medium | Failure | Set WebAssembly.Memory maximum; document limitation |
| Allocation latency spike (OS page faults) | Medium | Slow start | Consider madvise / mlock hints for large allocations |
| Rayon thread pool contention | Medium | Degraded perf | Quantum engine yields between slices; Rayon work-stealing handles contention |

References