Algorithmic Optimization Analysis: Mincut-Gated Transformer

Analysis Date: 2025-12-26
Crate: /home/user/ruvector/crates/ruvector-mincut-gated-transformer
Focus Files: spectral.rs, sparse_attention.rs, early_exit.rs, mod_routing.rs


Executive Summary

Found 11 high-impact optimization opportunities with potential for:

  • 90% reduction in eigenvector computation time (sparse matrices)
  • 50% reduction in sparse attention mask building (hash-based deduplication)
  • 60% reduction in top-k computation (heap-based selection)
  • Elimination of redundant lambda stability calculations

1. src/spectral.rs - Eigenvector Computation

CRITICAL: Sparse Matrix Representation (O(n²) → O(E))

File: src/spectral.rs Lines: 318-326, 350-356

Issue: Graph Laplacian is treated as dense matrix (n×n), but it's inherently sparse (only edges have non-zero values).

// CURRENT: O(n²) per iteration
for i in 0..n {
    let mut sum = 0.0f32;
    for j in 0..n {
        sum += matrix[i * n + j] * v[j];  // ← Iterates all n² entries
    }
    v_new[i] = sum;
}

Expected Complexity:

  • Current: O(k × iters × n²) for k eigenvectors
  • Optimized: O(k × iters × E) where E = number of edges

Optimization:

// OPTIMIZED: CSR (Compressed Sparse Row) format
struct SparseMatrix {
    row_ptr: Vec<usize>,    // Size: n+1
    col_idx: Vec<usize>,    // Size: nnz (non-zeros)
    values: Vec<f32>,       // Size: nnz
}

// O(E) matrix-vector multiplication
fn sparse_matvec(matrix: &SparseMatrix, v: &[f32], result: &mut [f32]) {
    for i in 0..matrix.row_ptr.len() - 1 {
        let mut sum = 0.0;
        for j in matrix.row_ptr[i]..matrix.row_ptr[i + 1] {
            sum += matrix.values[j] * v[matrix.col_idx[j]];
        }
        result[i] = sum;
    }
}

Impact: For typical graphs with E << n², this is 10-100x faster.

Example: For n=1000 tokens, E=5000 edges:

  • Dense: 1M operations per iteration
  • Sparse: 5K operations per iteration (200x speedup)
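To make the sketch above concrete, here is a hedged example of how the CSR structure might be populated from an edge list. The `laplacian_csr` helper name and the unweighted-Laplacian (L = D - A) assumption are mine, not the crate's; the struct is repeated so the sketch is self-contained.

```rust
// CSR struct repeated from above so this sketch compiles on its own.
struct SparseMatrix {
    row_ptr: Vec<usize>, // Size: n+1
    col_idx: Vec<usize>, // Size: nnz
    values: Vec<f32>,    // Size: nnz
}

// Hypothetical builder: CSR form of an unweighted graph Laplacian
// L = D - A from an undirected edge list.
fn laplacian_csr(edges: &[(u16, u16)], n: usize) -> SparseMatrix {
    // Adjacency lists double as per-row column sets
    let mut adj: Vec<Vec<usize>> = vec![Vec::new(); n];
    for &(u, v) in edges {
        adj[u as usize].push(v as usize);
        adj[v as usize].push(u as usize);
    }

    let mut row_ptr = Vec::with_capacity(n + 1);
    let mut col_idx = Vec::new();
    let mut values = Vec::new();
    row_ptr.push(0);
    for i in 0..n {
        // Diagonal entry: node degree
        col_idx.push(i);
        values.push(adj[i].len() as f32);
        // Off-diagonal entries: -1 per neighbor
        for &j in &adj[i] {
            col_idx.push(j);
            values.push(-1.0);
        }
        row_ptr.push(col_idx.len());
    }

    SparseMatrix { row_ptr, col_idx, values }
}
```

The builder costs O(n + E) and pairs directly with the O(E) `sparse_matvec` above.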

HIGH: Deflation Algorithm Inefficiency (O(k×n²) → O(k×n×iters))

File: src/spectral.rs Lines: 176-184

Issue: Computing k eigenvectors using deflation requires k separate power iterations with matrix updates.

// CURRENT: Deflate after each eigenvector
for _ in 0..k {
    let evec = power_iteration(&shifted, n, 100);
    let eigenvalue = rayleigh_quotient(&shifted, n, &evec);

    // O(n²) deflation: A := A - λ * v * v^T
    for i in 0..n {
        for j in 0..n {
            shifted[i * n + j] -= eigenvalue * evec[i] * evec[j];  // ← Full matrix update
        }
    }
}

Optimization: Use Lanczos algorithm instead of deflated power iteration.

Algorithm:

// Lanczos tridiagonalization: O(m × E) where m = Lanczos steps
// Produces tridiagonal matrix T that captures dominant eigenspace
// Then solve T's eigenvalues/eigenvectors (O(m³) but m << n)

fn lanczos_eigenvectors(laplacian_edges: &[(u16, u16)], n: usize, k: usize) -> Vec<Vec<f32>> {
    const M: usize = 50; // Lanczos iterations (tune based on k)
    let m = (k * 3).min(M);

    // Build tridiagonal matrix T via Lanczos (a full implementation must
    // also retain the Lanczos basis vectors Q for the projection below)
    let (alpha, beta) = lanczos_tridiagonalize(laplacian_edges, n, m);

    // Solve small tridiagonal eigenvalue problem: O(m³)
    let (evals, evecs_small) = tridiag_eigen(&alpha, &beta, k);

    // Project back to full space via V = Q × evecs_small: O(m × n × k)
    project_eigenvectors(&evecs_small, n, k)
}

Expected Complexity:

  • Current: O(k × iters × n²) = O(k × 100 × n²)
  • Lanczos: O(m × E + m³) ≈ O(50 × E + 50³) where m ≈ 3k

Impact: For n=500, k=8, E=2500:

  • Current: 8 × 100 × 250K = 200M operations
  • Lanczos: 50 × 2.5K + 125K = 250K operations (800x speedup)

Mathematical Foundation: Lanczos method from Golub & Van Loan "Matrix Computations" (3rd ed, §9.3).
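As a minimal sketch of the `lanczos_tridiagonalize` step referenced above, assuming the Laplacian is given as an unweighted undirected edge list; the start vector and break tolerance are illustrative choices, and a production version would also return the basis vectors Q needed to project eigenvectors back.

```rust
// Hypothetical sketch of lanczos_tridiagonalize for an unweighted graph
// Laplacian L = D - A given as an edge list. Returns the diagonal (alpha)
// and off-diagonal (beta) of the tridiagonal matrix T.
fn lanczos_tridiagonalize(edges: &[(u16, u16)], n: usize, m: usize) -> (Vec<f32>, Vec<f32>) {
    // Node degrees, so the matvec is (L v)_i = deg_i * v_i - sum_{j ~ i} v_j
    let mut deg = vec![0.0f32; n];
    for &(u, v) in edges {
        deg[u as usize] += 1.0;
        deg[v as usize] += 1.0;
    }
    let matvec = |v: &[f32]| -> Vec<f32> {
        let mut out: Vec<f32> = v.iter().zip(&deg).map(|(x, d)| d * x).collect();
        for &(u, w) in edges {
            out[u as usize] -= v[w as usize];
            out[w as usize] -= v[u as usize];
        }
        out
    };

    // Deterministic non-constant start vector (the constant vector lies in
    // the Laplacian's kernel and would stall the iteration immediately)
    let mut q: Vec<f32> = (0..n).map(|i| (i as f32 * 0.7543 + 0.1).sin()).collect();
    let norm = q.iter().map(|x| x * x).sum::<f32>().sqrt();
    for x in q.iter_mut() { *x /= norm; }

    let mut q_prev = vec![0.0f32; n];
    let mut b_prev = 0.0f32;
    let mut alpha = Vec::with_capacity(m);
    let mut beta = Vec::with_capacity(m.saturating_sub(1));

    for j in 0..m {
        let mut w = matvec(&q);
        let a: f32 = w.iter().zip(&q).map(|(x, y)| x * y).sum();
        alpha.push(a);
        // Three-term recurrence: w := L q_j - a_j q_j - b_{j-1} q_{j-1}
        for i in 0..n {
            w[i] -= a * q[i] + b_prev * q_prev[i];
        }
        let b = w.iter().map(|x| x * x).sum::<f32>().sqrt();
        if j + 1 < m {
            if b < 1e-8 { break; } // invariant subspace found: stop early
            beta.push(b);
            q_prev = std::mem::replace(&mut q, w.iter().map(|x| x / b).collect());
            b_prev = b;
        }
    }
    (alpha, beta)
}
```

Each step costs one O(E) matvec plus O(n) vector work, matching the O(m × E) bound cited above; long Lanczos runs additionally need re-orthogonalization against earlier basis vectors to stay numerically stable.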


MEDIUM: Redundant Matrix-Vector Product

File: src/spectral.rs Lines: 173, 177, 350-356

Issue: rayleigh_quotient recomputes A×v even though it was just computed in the final power iteration.

// Line 173: Last iteration computes A×v
let evec = power_iteration(&shifted, n, 100);  // ← Computes A×v internally

// Line 177: Immediately recomputes A×v
let eigenvalue = rayleigh_quotient(&shifted, n, &evec);  // ← Redundant A×v

Optimization: Return both eigenvector and A×v from power iteration.

fn power_iteration_with_av(matrix: &[f32], n: usize, num_iters: u16)
    -> (Vec<f32>, Vec<f32>) // Returns (v, A×v)
{
    // ... iterations keep v unit-norm ...

    // Last iteration: compute and save A×v
    let mut av = vec![0.0f32; n];
    for i in 0..n {
        let mut sum = 0.0;
        for j in 0..n {
            sum += matrix[i * n + j] * v[j];
        }
        av[i] = sum;
    }

    // Deliberately do NOT normalize av: normalizing it in place would
    // destroy the product the Rayleigh quotient reuses. v is already
    // unit-norm from the preceding iterations.
    (v, av)
}

// Rayleigh quotient without recomputation
fn rayleigh_quotient_cached(v: &[f32], av: &[f32]) -> f32 {
    let numerator: f32 = v.iter().zip(av.iter()).map(|(vi, avi)| vi * avi).sum();
    let denominator: f32 = v.iter().map(|vi| vi * vi).sum();
    numerator / denominator
}

Impact: Saves one full O(n²) matrix-vector product per eigenvector.


LOW: Normalized Laplacian Computation

File: src/spectral.rs Lines: 122-128

Issue: Iterates over all n² matrix entries when most are zero.

// CURRENT: O(n²)
for i in 0..n {
    for j in 0..n {
        laplacian[i * n + j] *= degree_sqrt_inv[i] * degree_sqrt_inv[j];
    }
}

Optimization: Only normalize non-zero entries (edges + diagonal).

// OPTIMIZED: O(E)
for &(u, v) in boundary_edges {
    let u = u as usize;
    let v = v as usize;
    if u < n && v < n {
        laplacian[u * n + v] *= degree_sqrt_inv[u] * degree_sqrt_inv[v];
        laplacian[v * n + u] *= degree_sqrt_inv[v] * degree_sqrt_inv[u];
    }
}
for i in 0..n {
    laplacian[i * n + i] *= degree_sqrt_inv[i] * degree_sqrt_inv[i];
}

Impact: O(n²) → O(E), typically 10-50x faster.


2. src/sparse_attention.rs - Sparse Attention Patterns

HIGH: O(n) Lookup in can_attend

File: src/sparse_attention.rs Line: 128

Issue: Linear search in positions vector.

pub fn can_attend(&self, query_pos: u16, key_pos: u16) -> bool {
    self.positions.contains(&(query_pos, key_pos))  // ← O(n) linear search
}

Optimization: Use HashSet or sorted positions with binary search.

use std::collections::HashSet;

pub struct SparseMask {
    pub positions: Vec<(u16, u16)>,
    position_set: HashSet<(u16, u16)>,  // ← Add HashSet for O(1) lookup
    // ... rest of fields
}

#[inline]
pub fn can_attend(&self, query_pos: u16, key_pos: u16) -> bool {
    self.position_set.contains(&(query_pos, key_pos))  // ← O(1) lookup
}

Alternative (allocation-free): Keep positions sorted and use binary search.

#[inline]
pub fn can_attend(&self, query_pos: u16, key_pos: u16) -> bool {
    self.positions.binary_search(&(query_pos, key_pos)).is_ok()  // O(log n)
}

Impact: O(n) → O(1) or O(log n), critical if can_attend is called frequently.


CRITICAL: O(n²) Duplicate Detection in build_sparse_positions

File: src/sparse_attention.rs Lines: 397-424

Issue: Using contains in nested loops creates O(n²) complexity.

// Lines 401-404
let pos = (boundary_token, prev_boundary);
if !positions.contains(&pos) {  // ← O(n) search
    positions.push(pos);         // ← Inside loop
}

// Lines 415-419 (similar pattern)
if !positions.contains(&pos) {  // ← O(n) search in nested loop
    positions.push(pos);
}

Expected Complexity: O(boundary_tokens² × positions.len()) ≈ O(n²) worst case

Optimization: Use HashSet for deduplication, then convert to Vec.

fn build_sparse_positions(
    &self,
    seq_len: usize,
    boundaries: &[u16],
    boundary_tokens: &[u16],
    _target_density: f32,
    _gate: &GatePacket,
) -> Vec<(u16, u16)> {
    use std::collections::HashSet;
    let mut position_set = HashSet::new();  // ← O(1) insert/lookup

    // 1. Intra-partition attention
    if self.config.intra_partition_attention {
        for (partition_idx, &start) in boundaries.iter().enumerate() {
            let end = if partition_idx + 1 < boundaries.len() {
                boundaries[partition_idx + 1] as usize
            } else {
                seq_len
            };

            for i in start as usize..end {
                for j in start as usize..=i {
                    position_set.insert((i as u16, j as u16));  // ← O(1) average
                }
            }
        }
    }

    // 2. Boundary cross-partition attention
    if self.config.boundary_cross_attention {
        for &boundary_token in boundary_tokens {
            for &prev_boundary in boundary_tokens {
                if prev_boundary <= boundary_token {
                    position_set.insert((boundary_token, prev_boundary));
                }
            }

            let window = 4;
            for offset in 0..window {
                let token_pos = boundary_token + offset;
                if (token_pos as usize) < seq_len {
                    for &prev_boundary in boundary_tokens {
                        if prev_boundary <= token_pos {
                            position_set.insert((token_pos, prev_boundary));
                        }
                    }
                }
            }
        }
    }

    // HashSet order is nondeterministic: sort once for reproducible output
    // (and to enable binary-search lookups over the result)
    let mut positions: Vec<(u16, u16)> = position_set.into_iter().collect();
    positions.sort_unstable();
    positions
}

Expected Complexity: O(P + B²) where P = partition positions, B = boundary tokens
Previous Complexity: O(P + B² × n) where n = average positions.len()

Impact: For seq_len=512, boundary_tokens=20:

  • Current: ~20K contains checks ≈ 10M comparisons worst case
  • Optimized: ~20K inserts ≈ 20K operations (500x speedup)

MEDIUM: Inefficient Query Grouping

File: src/sparse_attention.rs Lines: 235-238

Issue: Creates separate Vec for each query position.

// Group positions by query
let mut positions_by_query: Vec<Vec<u16>> = vec![Vec::new(); seq_len];
for &(query_pos, key_pos) in &mask.positions {
    positions_by_query[query_pos as usize].push(key_pos);
}

Optimization: Sort positions once, use slice ranges.

// Sort positions by query: O(m log m) where m = positions.len()
let mut sorted_positions = mask.positions.clone();
sorted_positions.sort_unstable_by_key(|&(q, _)| q);

// Walk the sorted list once: each query's positions form a contiguous run
let mut pos_idx = 0;
for query_pos in 0..seq_len {
    // Advance through this query's contiguous run: O(m) total over all queries
    let start = pos_idx;
    while pos_idx < sorted_positions.len() && sorted_positions[pos_idx].0 == query_pos as u16 {
        pos_idx += 1;
    }
    let key_positions = &sorted_positions[start..pos_idx];

    if key_positions.is_empty() {
        continue;
    }

    // ... rest of attention computation
}

Impact:

  • Memory: seq_len allocations eliminated
  • Time: O(m log m) sort once vs O(seq_len) allocations + O(m) inserts

3. src/early_exit.rs - Early Exit Decision Logic

MEDIUM: Redundant Lambda Stability Calculation

File: src/early_exit.rs Lines: 305-310, 341-347

Issue: Same calculation performed in two places.

// Line 305-310: In calculate_adaptive_exit_layer
let lambda_delta_abs = gate.lambda_delta().abs() as u32;
let stability = if gate.lambda_prev > 0 {
    let ratio = (lambda_delta_abs * 32768) / gate.lambda_prev.max(1);
    32768u32.saturating_sub(ratio).min(32767) as u16
} else { 0 };

// Line 341-347: In evaluate_exit_conditions (EXACT SAME CODE)
let lambda_delta_abs = gate.lambda_delta().abs() as u32;
let stability = if gate.lambda_prev > 0 {
    let ratio = (lambda_delta_abs * 32768) / gate.lambda_prev.max(1);
    32768u32.saturating_sub(ratio).min(32767) as u16
} else { 0 };

Optimization: Extract to method, compute once.

impl GatePacket {
    /// Calculate lambda stability in Q15 format (0-32767)
    /// Higher values = more stable
    #[inline]
    pub fn lambda_stability_q15(&self) -> u16 {
        let lambda_delta_abs = self.lambda_delta().abs() as u32;
        if self.lambda_prev > 0 {
            let ratio = (lambda_delta_abs * 32768) / self.lambda_prev.max(1);
            32768u32.saturating_sub(ratio).min(32767) as u16
        } else {
            0
        }
    }
}

// Usage:
let stability = gate.lambda_stability_q15();

Impact: Eliminates redundant computation, improves maintainability.


HIGH: O(n log n) Top-K using Full Sort

File: src/early_exit.rs Lines: 420-428

Issue: Sorts entire logits array to find top-k elements.

fn topk(&self, logits: &[i32], k: usize) -> Vec<usize> {
    if logits.is_empty() || k == 0 {
        return Vec::new();
    }

    let mut indexed: Vec<(usize, i32)> = logits.iter().copied().enumerate().collect();
    indexed.sort_by(|a, b| b.1.cmp(&a.1));  // ← O(n log n) for top k elements

    indexed.iter().take(k).map(|(idx, _)| *idx).collect()
}

Expected Complexity: O(n log n)
Optimal Complexity: O(n + k log k)

Optimization: Use heap-based selection or partial quickselect.

use std::collections::BinaryHeap;
use std::cmp::Reverse;

fn topk(&self, logits: &[i32], k: usize) -> Vec<usize> {
    if logits.is_empty() || k == 0 {
        return Vec::new();
    }

    if k >= logits.len() {
        // All elements: O(n log n)
        let mut indexed: Vec<_> = logits.iter().copied().enumerate().collect();
        indexed.sort_unstable_by(|a, b| b.1.cmp(&a.1));
        return indexed.into_iter().map(|(idx, _)| idx).collect();
    }

    // Min-heap of size k: O(n log k)
    let mut heap = BinaryHeap::with_capacity(k);

    for (idx, &val) in logits.iter().enumerate() {
        if heap.len() < k {
            heap.push(Reverse((val, idx)));
        } else if let Some(&Reverse((min_val, _))) = heap.peek() {
            if val > min_val {
                heap.pop();
                heap.push(Reverse((val, idx)));
            }
        }
    }

    // BinaryHeap drains in arbitrary order: sort descending by value so the
    // result matches the ordering the full sort produced
    let mut top: Vec<(i32, usize)> = heap.into_iter().map(|Reverse(p)| p).collect();
    top.sort_unstable_by(|a, b| b.0.cmp(&a.0));
    top.into_iter().map(|(_, idx)| idx).collect()
}

Expected Complexity: O(n log k) vs O(n log n)

Impact: For n=50K vocabulary, k=5:

  • Current: O(50K × log(50K)) ≈ 800K operations
  • Optimized: O(50K × log(5)) ≈ 116K operations (7x speedup)

Alternative (allocation-free): select_nth_unstable_by for O(n) average case:

fn topk(&self, logits: &[i32], k: usize) -> Vec<usize> {
    let mut indexed: Vec<_> = logits.iter().copied().enumerate().collect();

    if k >= indexed.len() {
        indexed.sort_unstable_by(|a, b| b.1.cmp(&a.1));
    } else {
        // Partition to find k-th largest: O(n) average
        indexed.select_nth_unstable_by(k, |a, b| b.1.cmp(&a.1));
        // Sort only the top k: O(k log k)
        indexed[..k].sort_unstable_by(|a, b| b.1.cmp(&a.1));
    }

    indexed.iter().take(k).map(|(idx, _)| *idx).collect()
}

Complexity: O(n + k log k) average case.


4. src/mod_routing.rs - Mixture-of-Depths Routing

LOW: Mark Boundary Tokens - Minor Optimization

File: src/mod_routing.rs Lines: 279-287

Issue: step_by with stride.max(1) when stride could be 0.

let stride = routes.len() / boundary_count.max(1);
for i in (0..routes.len()).step_by(stride.max(1)) {  // ← Redundant max(1)

Optimization: Guard earlier.

let stride = (routes.len() / boundary_count.max(1)).max(1);
for i in (0..routes.len()).step_by(stride) {
    // ...
}

Impact: Micro-optimization, eliminates one comparison per iteration.


Summary of Optimizations

| File | Lines | Issue | Current | Optimized | Speedup |
|---|---|---|---|---|---|
| spectral.rs | 318-326 | Dense matrix-vector | O(n²) | O(E) | 10-200x |
| spectral.rs | 176-184 | Deflation | O(k×100×n²) | O(50×E) | 100-800x |
| spectral.rs | 173,177 | Redundant A×v | 2×O(n²) | O(n²) | 2x |
| spectral.rs | 122-128 | Dense normalization | O(n²) | O(E) | 10-50x |
| sparse_attention.rs | 128 | Linear lookup | O(n) | O(1) or O(log n) | n or log n |
| sparse_attention.rs | 397-424 | Duplicate check | O(n²) | O(n) | 500x |
| sparse_attention.rs | 235-238 | Query grouping | O(m) allocs | O(m log m) | Memory + cache |
| early_exit.rs | 305,341 | Redundant calc | 2× compute | 1× compute | 2x |
| early_exit.rs | 420-428 | Full sort for top-k | O(n log n) | O(n log k) | 7x |

Implementation Priority

Phase 1: Critical Path (High Impact, Low Risk)

  1. Sparse matrix representation (spectral.rs) - Highest impact
  2. HashSet deduplication (sparse_attention.rs:397-424)
  3. Heap-based top-k (early_exit.rs:420-428)

Phase 2: Performance Enhancements

  1. Cache A×v in power iteration (spectral.rs:173,177)
  2. HashSet for can_attend (sparse_attention.rs:128)
  3. Lambda stability method (early_exit.rs:305,341)

Phase 3: Advanced Optimizations

  1. Lanczos algorithm (spectral.rs:176-184) - Requires more testing
  2. Sparse normalization (spectral.rs:122-128)
  3. Sorted query grouping (sparse_attention.rs:235-238)

Branch Prediction Analysis

Good Patterns (Minimal Mispredictions)

  1. early_exit.rs:330-337 - Sequential threshold checks (likely same path)
  2. mod_routing.rs:304-312 - Loop with consistent route type
  3. sparse_attention.rs:243-244 - Early continue on empty (predictable)

Bad Patterns (High Misprediction Risk)

  1. spectral.rs:85-87 - Random edge bounds check in tight loop

    if u >= n || v >= n {  // ← Unpredictable based on data
        continue;
    }

    Fix: Pre-filter edges or use saturating operations.

  2. sparse_attention.rs:415-419 - contains in nested loop

    if !positions.contains(&pos) {  // ← Data-dependent branch
        positions.push(pos);
    }

    Fix: Already addressed by HashSet optimization.
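For the first pattern, the pre-filter fix could look like this minimal sketch (the `prefilter_edges` helper name is mine, not the crate's):

```rust
// Hypothetical sketch: drop out-of-range edges once, up front, so the hot
// accumulation loop over edges can run without a data-dependent branch.
fn prefilter_edges(edges: &[(u16, u16)], n: usize) -> Vec<(u16, u16)> {
    edges
        .iter()
        .copied()
        .filter(|&(u, v)| (u as usize) < n && (v as usize) < n)
        .collect()
}
```

The one-time O(E) filter pays for itself whenever the edge list is iterated more than once, as in repeated power/Lanczos iterations.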


Lookup Table Opportunities

MEDIUM: Softmax Exp Approximation

File: src/sparse_attention.rs:430-449

Current: Uses f32::exp() which is ~100 cycles.

Optimization: Lookup table with linear interpolation for exp(-x) in attention range.

const EXP_TABLE_SIZE: usize = 1024;

// Table of exp(-x) samples for x in [0, 10), built on first use
// (a const table cannot call f32::exp at compile time)
static EXP_TABLE: std::sync::LazyLock<[f32; EXP_TABLE_SIZE]> =
    std::sync::LazyLock::new(|| {
        std::array::from_fn(|i| (-(i as f32) * 10.0 / EXP_TABLE_SIZE as f32).exp())
    });

#[inline]
fn fast_exp(x: f32) -> f32 {
    if x < -10.0 { return 0.0; }
    if x > 0.0 { return x.exp(); }  // Positive values rare in attention

    let t = -x * EXP_TABLE_SIZE as f32 / 10.0;
    let idx = t as usize;
    if idx >= EXP_TABLE_SIZE - 1 {
        return EXP_TABLE[EXP_TABLE_SIZE - 1];
    }

    // Linear interpolation between adjacent table entries
    let frac = t - idx as f32;
    EXP_TABLE[idx] * (1.0 - frac) + EXP_TABLE[idx + 1] * frac
}

Impact: 5-10x faster exp, <1% error for attention scores.


Mathematical Simplifications

spectral.rs: Symmetric Eigenvalue Property

The Laplacian is symmetric positive semi-definite, which enables:

  1. Power iteration convergence: Guaranteed convergence to dominant eigenvector
  2. Real eigenvalues: No complex arithmetic needed
  3. Orthogonal eigenvectors: Can use Gram-Schmidt for orthogonalization

Current code correctly exploits (1) and (2), but could use (3) for better numerical stability in deflation.
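Property (3) could be exploited with a modified Gram-Schmidt re-orthogonalization step during deflation; a minimal sketch, with an illustrative function name and tolerance:

```rust
// Hypothetical sketch: project v onto the orthogonal complement of the
// previously found eigenvectors (assumed unit-norm), then renormalize.
// Keeping iterates orthogonal to earlier eigenvectors prevents them from
// re-converging to already-extracted directions during deflation.
fn orthogonalize_against(v: &mut [f32], basis: &[Vec<f32>]) {
    for q in basis {
        // Modified Gram-Schmidt: subtract the component along q
        let dot: f32 = v.iter().zip(q.iter()).map(|(a, b)| a * b).sum();
        for (vi, qi) in v.iter_mut().zip(q.iter()) {
            *vi -= dot * qi;
        }
    }
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 1e-12 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}
```

Calling this once per power-iteration step (O(k × n) per call) is typically cheaper and numerically safer than the O(n²) explicit matrix deflation shown earlier.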


Recommended Next Steps

  1. Implement Phase 1 optimizations (sparse matrices, HashSet, heap-based top-k)
  2. Benchmark on realistic workloads (n=512-2048 tokens, k=8-16 eigenvectors)
  3. Profile with perf/flamegraph to validate bottlenecks
  4. Consider SIMD for matrix operations (future work)
  5. Add algorithmic complexity tests to prevent regressions

Analysis Completed: 11 optimization opportunities identified
Estimated Overall Speedup: 10-50x for eigenvector computation, 5-10x for sparse attention
Files Analyzed: 4 core algorithm files, 2,166 lines of code