Status: Proposed
Date: 2026-01-22
Authors: ruv.io, RuVector Team
Deciders: Architecture Review Board
Target Crate: ruvector-attention
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-01-22 | ruv.io | Initial proposal for coherence-gated attention |
Standard transformers have fundamental efficiency issues:
- Quadratic attention: O(N²) for sequence length N
- Fixed computation: Every token gets the same compute regardless of difficulty
- Dense by default: All attention weights computed even when most are near-zero
- Confidence-based exits: Early exit uses unreliable confidence scores
| Approach | Method | Limitation |
|---|---|---|
| Flash Attention | Memory-efficient matmul | Still O(N²) compute |
| Sparse Attention | Fixed patterns (local, strided) | Patterns don't adapt to content |
| Linear Attention | Kernel approximation | Quality degradation |
| Early Exit | Confidence threshold | Confidence ≠ correctness |
| MoE | Expert routing | Routing is learned, not principled |
Prime-Radiant's coherence engine provides a mathematically grounded measure of consistency. This can be applied to attention:
Core idea: Tokens that are already coherent with context don't need expensive attention. Route computation based on coherence energy, not learned confidence.
A novel attention mechanism that uses sheaf coherence to:
- Route tokens to different compute depths
- Sparsify attention based on residual energy
- Exit early when energy converges
- Replace QKV projections with restriction maps
┌──────────────────────────────────────────────────────────────────┐
│                COHERENCE-GATED TRANSFORMER (CGT)                 │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ INPUT PROCESSING                                           │  │
│  │   Tokens ──► Embedding ──► Initial Coherence Graph         │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                │                                 │
│                                ▼                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ COHERENCE ROUTER                                           │  │
│  │                                                            │  │
│  │   For each token t:                                        │  │
│  │     E(t) = Σ w_e ||ρ_t(x_t) - ρ_ctx(x_ctx)||²              │  │
│  │                                                            │  │
│  │   Route based on energy:                                   │  │
│  │   ┌──────────────┬──────────────┬──────────────┐           │  │
│  │   │ E < θ_reflex │ E < θ_std    │ E ≥ θ_std    │           │  │
│  │   │       │      │       │      │       │      │           │  │
│  │   │       ▼      │       ▼      │       ▼      │           │  │
│  │   │    LANE 0    │    LANE 1    │    LANE 2    │           │  │
│  │   │    Reflex    │   Standard   │     Deep     │           │  │
│  │   └──────────────┴──────────────┴──────────────┘           │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                │                                 │
│          ┌─────────────────────┼─────────────────────┐           │
│          │                     │                     │           │
│          ▼                     ▼                     ▼           │
│ ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐ │
│ │ LANE 0: REFLEX   │  │ LANE 1: STANDARD │  │ LANE 2: DEEP     │ │
│ │                  │  │                  │  │                  │ │
│ │ • 1-2 layers     │  │ • 6 layers       │  │ • 12+ layers     │ │
│ │ • Local attention│  │ • Sparse sheaf   │  │ • Full + MoE     │ │
│ │   (window=64)    │  │   attention      │  │ • All experts    │ │
│ │ • No FFN         │  │ • Standard FFN   │  │ • Spectral       │ │
│ │ • <0.1ms         │  │ • ~1ms           │  │   analysis       │ │
│ │                  │  │                  │  │ • ~5ms           │ │
│ └──────────────────┘  └──────────────────┘  └──────────────────┘ │
│          │                     │                     │           │
│          └─────────────────────┼─────────────────────┘           │
│                                ▼                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ COHERENCE VERIFICATION                                     │  │
│  │                                                            │  │
│  │   E_final = compute_energy(output_graph)                   │  │
│  │                                                            │  │
│  │   if E_final > θ_max:                                      │  │
│  │       → Escalate to Lane 2 OR refuse generation            │  │
│  │   else:                                                    │  │
│  │       → Output with witness                                │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                │                                 │
│                                ▼                                 │
│                         Output + Witness                         │
└──────────────────────────────────────────────────────────────────┘
Replace standard scaled dot-product attention with coherence-based attention:
Standard Attention:
Attention(Q, K, V) = softmax(QK^T / √d) V
Sheaf Attention:
R_ij = ||ρ_i(x_i) - ρ_j(x_j)||² # Residual energy
A_ij = exp(-β × R_ij) / Σ_k exp(-β × R_ik) # Coherence-based weight
Output = A × V
Key difference: Attention weight is inversely proportional to residual energy.
- High residual (incoherent) → Low attention (don't propagate inconsistency)
- Low residual (coherent) → High attention (reinforce consistency)
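For illustration, sheaf attention can be sketched in a few lines of NumPy. This is a minimal sketch, not the ruvector-attention API: restriction maps are assumed to be plain linear projections (matrices), and `sheaf_attention` is a hypothetical name.

```python
import numpy as np

def sheaf_attention(X, rho_q, rho_k, V, beta=1.0):
    """Dense sheaf attention: weight decays with residual energy.

    X: (N, d) tokens; rho_q, rho_k: (m, d) linear restriction maps
    (an assumption of this sketch); V: (N, d_v) values; beta: temperature.
    """
    Q = X @ rho_q.T                        # rho_q(x_i) for each token
    K = X @ rho_k.T                        # rho_k(x_j) for each token
    # R[i, j] = ||rho_q(x_i) - rho_k(x_j)||^2   (residual energy)
    diff = Q[:, None, :] - K[None, :, :]
    R = np.sum(diff ** 2, axis=-1)
    # A[i, j] = exp(-beta * R[i, j]) / sum_k exp(-beta * R[i, k])
    W = np.exp(-beta * R)
    A = W / W.sum(axis=1, keepdims=True)
    return A @ V, A
```

With identity restriction maps, near-duplicate tokens attend strongly to each other while an outlier token receives almost no weight — exactly the inversion of residual energy described above.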
Replace learned W_q, W_k, W_v with restriction maps:
Standard:
Q = W_q × x (learned projection)
K = W_k × x
V = W_v × x
Sheaf:
Q = ρ_q(x) (restriction map to query manifold)
K = ρ_k(x) (restriction map to key manifold)
V = ρ_v(x) (restriction map to value manifold)
Benefits:
- Restriction maps have geometric meaning (project to shared space)
- Can be initialized from domain knowledge
- Residuals are interpretable
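The interpretability claim can be seen directly in a toy sketch: a restriction map is just a projection into a shared comparison space, and the edge residual is zero exactly when the two projections agree. The shapes and the random initialization below are illustrative assumptions, not the crate's initialization scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
d_token, d_shared = 4, 2

# A restriction map as a plain linear projection into a shared space.
# Unlike learned W_q/W_k/W_v it could be initialized from domain
# knowledge; a random matrix is used here purely for the demo.
rho = rng.standard_normal((d_shared, d_token))

x_i = rng.standard_normal(d_token)
x_j = x_i.copy()                      # identical tokens on this edge

r_ij = rho @ x_i - rho @ x_j          # residual in the shared space
energy = float(r_ij @ r_ij)           # 0 exactly when projections agree
```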
from enum import Enum

Lane = Enum("Lane", ["REFLEX", "STANDARD", "DEEP"])
THETA_REFLEX, THETA_STANDARD = 0.01, 0.1  # defaults from the table below

def route_token(token_embedding, context_graph):
    # Coherence energy of the token against the context graph
    # (compute_token_energy is provided by the coherence engine)
    energy = compute_token_energy(token_embedding, context_graph)
    if energy < THETA_REFLEX:
        return Lane.REFLEX      # Minimal compute
    elif energy < THETA_STANDARD:
        return Lane.STANDARD    # Normal compute
    else:
        return Lane.DEEP        # Maximum compute

Routing thresholds (tunable via SONA):
| Threshold | Default | Meaning |
|---|---|---|
| θ_reflex | 0.01 | Token is highly coherent with context |
| θ_standard | 0.1 | Token has minor inconsistencies |
| θ_deep | 1.0 | Token has major inconsistencies |
Only compute attention for token pairs with high residual:
def sparse_sheaf_attention(X, threshold):
    N = len(X)
    attention_mask = zeros((N, N))
    for i in range(N):
        for j in range(N):
            residual = compute_residual(X[i], X[j])
            if residual > threshold:
                # These tokens are incoherent - they need attention
                attention_mask[i, j] = 1
            # else: skip attention (already coherent)
    # Compute attention only for non-zero mask entries
    return masked_attention(X, attention_mask)

Sparsity pattern: Adapts to content, not fixed like local/strided attention.
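The nested loop above is illustrative but O(N²) in interpreted Python. The same mask can be built with NumPy broadcasting; this sketch assumes restriction maps have already been applied to give projected tokens `P`, and `residual_sparse_mask` is a hypothetical helper, not part of the crate.

```python
import numpy as np

def residual_sparse_mask(P, threshold):
    """Residual-based attention mask without Python loops.

    P: (N, m) projected tokens (restriction maps already applied, an
    assumption of this sketch). Returns a boolean (N, N) mask that is
    True exactly where the pairwise residual exceeds `threshold`.
    """
    diff = P[:, None, :] - P[None, :, :]   # pairwise differences
    residual = np.sum(diff ** 2, axis=-1)  # ||p_i - p_j||^2
    return residual > threshold            # attend only where incoherent
```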
def forward_with_early_exit(x, layers, epsilon=0.001):
    prev_energy = float('inf')
    for layer in layers:
        x = layer(x)
        curr_energy = compute_energy(x)
        delta = abs(curr_energy - prev_energy)
        if delta < epsilon:
            # Energy converged - no need for more layers
            return x
        prev_energy = curr_energy
    return x

Exit criterion: Energy convergence, not a confidence threshold.
Layers: 1-2
Attention: Local only (window=64)
FFN: Skip or minimal
Use case: Common tokens, clear context
Example: "the", "is", "and" in well-formed sentences
Layers: 6
Attention: Sparse sheaf (residual > 0.05)
FFN: Standard
Use case: Normal tokens requiring context integration
Example: Most content words
Layers: 12+
Attention: Full sheaf + MoE routing
FFN: Expert mixture
Spectral: Eigenvalue analysis for structural issues
Use case: Ambiguous, contradictory, or complex tokens
Example: "bank" (river or financial?), negations, rare words
Action: Return uncertainty, request clarification
Use case: Irreconcilable incoherence
Example: "The cat is not a cat" - logical contradiction
Given tokens X = {x_1, ..., x_N} and restriction maps ρ_i, ρ_j:
Residual:
r_ij = ρ_i(x_i) - ρ_j(x_j)
Edge energy:
E_ij = w_ij × ||r_ij||²
Token energy:
E_i = Σ_j E_ij (sum over edges incident to i)
Attention weight (coherence-based):
A_ij = exp(-β × E_ij) / Σ_k exp(-β × E_ik)
Output:
y_i = Σ_j A_ij × V_j
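As a sanity check, the formulas can be evaluated by hand on a three-token toy graph, assuming identity restriction maps and unit edge weights w_ij = 1 (both simplifying assumptions for this example):

```python
import numpy as np

# Toy graph: identity restriction maps, unit edge weights (w_ij = 1)
x = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
beta = 1.0

# r_ij = rho_i(x_i) - rho_j(x_j)  ->  here simply x_i - x_j
r = x[:, None, :] - x[None, :, :]
E = np.sum(r ** 2, axis=-1)          # E_ij = w_ij * ||r_ij||^2
E_tok = E.sum(axis=1)                # E_i = sum_j E_ij

# A_ij = exp(-beta * E_ij) / sum_k exp(-beta * E_ik)
W = np.exp(-beta * E)
A = W / W.sum(axis=1, keepdims=True)
```

Tokens 0 and 1 coincide, so E_01 = 0 and A_01 is large; token 2 contributes edge energy ||(1, -1)||² = 2 to token 0 and is nearly ignored.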
| Operation | Standard | Sheaf (Dense) | Sheaf (Sparse, s% non-zero) |
|---|---|---|---|
| Attention | O(N²d) | O(N²d) | O(s×N²d) |
| Routing | - | O(Nd) | O(Nd) |
| Early exit | - | O(Ld) per check | O(Ld) per check |
| Total | O(N²Ld) | O(N²Ld) | O(s×N²Ld + routing) |
With typical s=10-20% sparsity and 50% early exit: 5-10x speedup.
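That estimate can be sanity-checked with a back-of-the-envelope cost model. The flat 5% routing overhead below is an illustrative assumption, not a measurement:

```python
def cgt_speedup(s, exit_fraction, routing_overhead=0.05):
    """Idealized relative cost: sparse attention keeps a fraction s of
    the attention work, early exit keeps exit_fraction of the layers,
    and routing adds a flat overhead (assumed value, illustration only)."""
    standard_cost = 1.0
    cgt_cost = s * exit_fraction + routing_overhead
    return standard_cost / cgt_cost

lo = cgt_speedup(s=0.20, exit_fraction=0.5)  # pessimistic: ~6.7x
hi = cgt_speedup(s=0.10, exit_fraction=0.5)  # optimistic: ~10x
```

Both endpoints land inside the 5-10x range claimed above.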
ruvector-attention/
├── src/
│ ├── sheaf/ # NEW: Sheaf attention
│ │ ├── mod.rs
│ │ ├── attention.rs # SheafAttention layer
│ │ ├── restriction.rs # Restriction map projections
│ │ ├── router.rs # Token-level routing
│ │ ├── sparse.rs # Residual-sparse attention
│ │ └── early_exit.rs # Energy-based early exit
│ │
│ ├── coherence_gated/ # NEW: Full CGT implementation
│ │ ├── mod.rs
│ │ ├── transformer.rs # CoherenceGatedTransformer
│ │ ├── lane.rs # ComputeLane enum + configs
│ │ ├── config.rs # CGTConfig
│ │ └── benchmark.rs # Latency/quality benchmarks
│ │
│ └── ... (existing modules)
/// Sheaf-based attention layer
pub struct SheafAttention {
    /// Restriction map for queries
    pub rho_query: RestrictionMap,
    /// Restriction map for keys
    pub rho_key: RestrictionMap,
    /// Restriction map for values
    pub rho_value: RestrictionMap,
    /// Temperature for attention softmax
    pub beta: f32,
    /// Sparsity threshold
    pub sparsity_threshold: f32,
}

/// Compute lane for token routing
#[derive(Debug, Clone, Copy)]
pub enum ComputeLane {
    /// Minimal compute (<0.1ms)
    Reflex,
    /// Standard compute (~1ms)
    Standard,
    /// Deep compute (~5ms)
    Deep,
    /// Escalate to caller
    Escalate,
}

/// Coherence-Gated Transformer configuration
pub struct CGTConfig {
    /// Embedding dimension
    pub d_model: usize,
    /// Layers per lane
    pub layers_per_lane: [usize; 3], // [reflex, standard, deep]
    /// Routing thresholds
    pub thresholds: CoherenceThresholds,
    /// Sparsity settings
    pub sparsity: SparsityConfig,
    /// Early exit settings
    pub early_exit: EarlyExitConfig,
}

/// Token routing decision
pub struct RoutingDecision {
    pub token_id: usize,
    pub energy: f32,
    pub lane: ComputeLane,
    pub attention_mask: Option<SparseMask>,
}

[features]
# Sheaf attention (requires prime-radiant)
sheaf = ["dep:prime-radiant"]
# Full CGT implementation
coherence-gated = ["sheaf", "sparse", "moe"]
# Benchmarking utilities
cgt-bench = ["coherence-gated", "criterion"]

| Metric | Standard Transformer | CGT Target | Improvement |
|---|---|---|---|
| Average latency (128 tokens) | 10ms | 1-2ms | 5-10x |
| P99 latency (128 tokens) | 15ms | 8ms | 2x |
| Memory (batch=32) | 2GB | 800MB | 2.5x |
| Quality (perplexity) | Baseline | <5% degradation | Acceptable |
Standard (10ms total):
Attention: 6ms (60%)
FFN: 3ms (30%)
Other: 1ms (10%)
CGT Target (2ms total):
Routing: 0.1ms (5%)
Attention (sparse): 1ms (50%)
FFN (conditional): 0.7ms (35%)
Other: 0.2ms (10%)
Every output is guaranteed to have coherence energy below threshold:
E(output) < θ_max OR escalate/refuse
This is stronger than confidence-based systems which can be confidently wrong.
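The guarantee can be sketched as a gate at the output boundary. `Witness`, `verify_output`, and the θ_max default below are hypothetical names and values for illustration, not the crate's API:

```python
from dataclasses import dataclass

THETA_MAX = 1.0  # assumed bound for this sketch; really a config value

@dataclass
class Witness:
    energy: float
    escalated: bool

def verify_output(output, energy, theta_max=THETA_MAX):
    """Emit the output with a witness only if energy is below the bound;
    otherwise hand back an escalation witness (rerun Lane 2 or refuse)."""
    if energy > theta_max:
        return None, Witness(energy, escalated=True)
    return output, Witness(energy, escalated=False)
```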
Under compute pressure:
- Raise θ_reflex → more tokens to Lane 0
- Increase sparsity threshold → fewer attention computations
- Quality degrades predictably (energy increases)
For any output:
- Which tokens went to which lane?
- Which token pairs had high residuals?
- Where did the model "struggle"?
| Feature | Flash Attention | Sparse Transformers | MoE | CGT (Ours) |
|---|---|---|---|---|
| Adaptive compute | No | No | Yes | Yes |
| Content-based sparsity | No | No | Partial | Yes |
| Mathematical grounding | No | No | No | Yes (sheaf) |
| Quality guarantee | No | No | No | Yes (energy bound) |
| Interpretable routing | N/A | N/A | Partial | Yes |
| Early exit criterion | N/A | N/A | Confidence | Energy convergence |
- Restriction map initialization: Random vs. pre-trained vs. analytical?
- Threshold tuning: Can SONA auto-tune θ values during inference?
- Multi-head sheaf attention: One graph per head, or shared graph?
- Training objective: Standard cross-entropy + energy regularization?
- Hardware optimization: Can residual computation be fused with attention kernels?
- `SheafAttention` layer with restriction maps
- Basic residual computation
- Unit tests for mathematical correctness
- `ComputeLane` enum and routing logic
- Token-level energy computation
- Lane-specific layer configurations
- Residual-sparse attention mask generation
- Efficient sparse attention kernel
- Sparsity pattern analysis tools
- `CoherenceGatedTransformer` full implementation
- Early exit with energy convergence
- Benchmarking suite
- SIMD optimization for residual computation
- Kernel fusion opportunities
- SONA integration for threshold tuning
- `prime-radiant` (coherence computation)
- `ruvector-core` (vector operations)
- `ndarray` (matrix operations)
- `rayon` (parallel routing)
- `criterion` (benchmarking)
- Hansen, J., & Ghrist, R. (2019). "Toward a spectral theory of cellular sheaves."
- Vaswani, A., et al. (2017). "Attention Is All You Need."
- Kitaev, N., et al. (2020). "Reformer: The Efficient Transformer."
- Fedus, W., et al. (2022). "Switch Transformers: Scaling to Trillion Parameter Models."
- ADR-014: Coherence Engine Architecture (Prime-Radiant)
- ADR-003: SIMD Optimization Strategy
- ADR-006: Memory Management
| Name | Rationale |
|---|---|
| Coherence-Gated Transformer (CGT) | Descriptive, clear function |
| Sheaf Attention | Mathematical foundation |
| Residual-Routed Transformer | Emphasizes routing mechanism |
| Energy-Adaptive Transformer | Emphasizes efficiency |
| Prime Transformer | Connection to Prime-Radiant |
Recommended: "Coherence-Gated Transformer (CGT)" for the architecture, "Sheaf Attention" for the attention mechanism.