| 1 | +# Spectral Sentinel
| 2 | + |
| 3 | +Hierarchical online subspace anomaly spectrometer for positionally structured observation streams, generic over coordinate type, accumulator type, and domain bit-width.
| 4 | + |
| 5 | +The sentinel maintains low-rank linear subspace models of "normal" traffic per analysis cell and scores each incoming batch against those models using streaming thin SVD with exponential forgetting. It produces raw statistical measurements — never opinions, threat levels, or recommended actions. |
| 6 | + |
| 7 | +**The sentinel measures; the host decides.** |
| 8 | + |
| 9 | +## Use Case |
| 10 | + |
| 11 | +The sentinel analyses **hierarchical positional structure** in coordinate values — leading bits define coarse groupings and successive bits refine them. The default instantiation (`Sentinel128`) targets 128-bit domains such as IPv6 addresses; `Sentinel64` covers 64-bit domains. |
| 12 | + |
| 13 | +Pseudo-random values (cryptographic hashes, UUIDs, nonces) have no exploitable bit-positional structure and will not produce meaningful results. |
| 14 | + |
| 15 | +## Quick Start |
| 16 | + |
| 17 | +```rust |
| 18 | +use torrust_sentinel::Sentinel128; |
| 19 | +use torrust_sentinel::config::SentinelConfig; |
| 20 | + |
| 21 | +// Configure — small analysis budget for a demo |
| 22 | +let cfg = SentinelConfig::<u64> { |
| 23 | + analysis_k: 4, |
| 24 | + ..SentinelConfig::default() |
| 25 | +}; |
| 26 | + |
| 27 | +// Create — the root tracker is automatically warmed with synthetic noise |
| 28 | +let mut sentinel = Sentinel128::new(cfg).unwrap(); |
| 29 | + |
| 30 | +// Ingest observations |
| 31 | +let values: Vec<u128> = vec![ |
| 32 | + 0xF000_0000_0000_0000_0000_0000_0000_0001, |
| 33 | + 0xF000_0000_0000_0000_0000_0000_0000_0002, |
| 34 | + 0x1000_0000_0000_0000_0000_0000_0000_0003, |
| 35 | +]; |
| 36 | +let report = sentinel.ingest(&values); |
| 37 | + |
| 38 | +// Inspect results — root cell is always an ancestor |
| 39 | +for cell in report.cell_reports.iter().chain(report.ancestor_reports.iter()) { |
| 40 | + let s = &cell.scores; |
| 41 | + println!( |
| 42 | + "cell depth {} [{:#034x}, {:#034x}) — novelty z={:.2}, displacement z={:.2}", |
| 43 | + cell.depth, cell.start, cell.end, |
| 44 | + s.novelty.max_z_score, s.displacement.max_z_score, |
| 45 | + ); |
| 46 | +} |
| 47 | +``` |
| 48 | + |
| 49 | +## Architecture |
| 50 | + |
| 51 | +The sentinel implements a three-layer adaptive architecture backed by the G-V Graph spatial substrate (see [algorithm.md](docs/algorithm.md) for the full specification and [implementation.md](docs/implementation.md) for implementation status). |
| 52 | + |
| 53 | +### Three-Layer Design (§ALGO S-1.1) |
| 54 | + |
| 55 | +``` |
| 56 | +Layer 1: G-V Graph ── pure volume tracking, Δ = 1 always |
| 57 | + Adaptive spatial partitioning of [0, 2^N) |
| 58 | + Competitive ranking by traffic volume |
| 59 | + │ |
| 60 | + │ V-Tree depth ≤ cutoff → top-K selection |
| 61 | + ▼ |
| 62 | +Layer 2: Analysis Selector ── picks competitive cells, closes under G-ancestry |
| 63 | + │ |
| 64 | + │ suffix bit vectors at every ancestor depth |
| 65 | + ▼ |
| 66 | +Layer 3: Analysis Engine ── SubspaceTrackers at competitive + ancestor cells |
| 67 | + │ Hierarchical coordination (G-tree bottom-up) |
| 68 | + ▼ |
| 69 | + BatchReport<C> → host |
| 70 | +``` |
| 71 | + |
| 72 | +**G-V Graph spatial substrate.** The sentinel owns a `GvGraph<C, V, N>` that adaptively partitions the full `[0, 2^N)` domain. Each observation feeds the graph with Δ = 1 (pure volume counting — no feedback from anomaly scores). The graph splits, evicts, and rebalances cells autonomously. The default instantiation (`Sentinel128`) uses `GvGraph<u128, u64, 128>`; see [ADR-S-018](adr/018-generic-domain-parameters.md) for the generic parameter design. |
| 73 | + |
| 74 | +**Analysis selector (Layer 2).** After each observation pass, the analysis set is recomputed: the top `analysis_k` V-Tree entries by importance (with V-depth ≤ `analysis_depth_cutoff`) are selected as competitive cells, then closed under G-tree ancestry. Each cell in the full analysis set gets a `SubspaceTracker` at suffix width `w = N − depth`. |
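
To make the suffix width `w = N − depth` concrete, here is a minimal sketch for the 128-bit case. The helper name `suffix_bits` and the MSB-first 0.0/1.0 encoding are assumptions made for illustration; the crate's actual bit-vector representation is defined in the algorithm specification and may differ.

```rust
/// Hypothetical helper (not part of the crate's API): returns the
/// `128 - depth` low-order bits of `value`, most significant first,
/// encoded as 0.0 / 1.0.
fn suffix_bits(value: u128, depth: u32) -> Vec<f64> {
    assert!(depth <= 128, "depth cannot exceed the domain bit-width");
    let width = 128 - depth; // w = N - depth with N = 128
    (0..width)
        .rev() // walk from the highest suffix bit down to bit 0
        .map(|bit| ((value >> bit) & 1) as f64)
        .collect()
}
```

A cell at depth 124 therefore sees a 4-dimensional vector, while the root (depth 0) sees all 128 bits.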
| 75 | + |
| 76 | +**Coordination.** After per-cell scoring, hierarchical coordination contexts at G-tree internal nodes analyse cross-cell score patterns bottom-up to detect spatially coordinated anomalies that no single cell would flag. Coordination fires at a G-node when both its left and right subtrees contribute competitive cell scores. |
| 77 | + |
| 78 | +### Core Loop (Per Tracker) |
| 79 | + |
| 80 | +Each `SubspaceTracker` processes batches in five strictly ordered phases — **scoring precedes evolution** so the batch is always measured against the prior model: |
| 81 | + |
| 82 | +1. **Score** — project onto learned subspace, compute four anomaly axes |
| 83 | +2. **Evolve Subspace** — streaming thin SVD with exponential forgetting (λ) |
| 84 | +3. **Evolve Latent Distribution** — EWMA mean, variance, second-moment matrix |
| 85 | +4. **Update Baselines & CUSUM** — fast/slow EWMA per axis, one-sided Page's test |
| 86 | +5. **Adapt Rank** — energy-threshold rank selection, ±1 step per evaluation |
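
The ordering rule above (score first, evolve afterwards) can be illustrated with a deliberately tiny stand-in. `ToyTracker` is hypothetical: its "model" is a single EWMA mean rather than a subspace, and phases 2 to 5 are collapsed into one update. It shows only the ordering, not the real `SubspaceTracker`.

```rust
/// Toy stand-in for a tracker (hypothetical, for illustration only).
struct ToyTracker {
    mean: f64,   // the entire "model"
    lambda: f64, // forgetting factor
}

impl ToyTracker {
    /// Scores the batch against the prior model, then evolves the model.
    fn process(&mut self, batch: &[f64]) -> f64 {
        if batch.is_empty() {
            return 0.0;
        }
        let batch_mean = batch.iter().sum::<f64>() / batch.len() as f64;

        // Phase 1: score against the model as it was before this batch.
        let score = (batch_mean - self.mean).abs();

        // Phases 2-5 (collapsed): only now does the model absorb the batch.
        self.mean = self.lambda * self.mean + (1.0 - self.lambda) * batch_mean;

        score
    }
}
```

Because the score is computed before the update, a batch can never lower its own score by pulling the model toward itself.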
| 87 | + |
| 88 | +## The Four Scoring Axes |
| 89 | + |
| 90 | +All axes satisfy a **polarity invariant**: higher values = more anomalous. |
| 91 | + |
| 92 | +| Axis | Formula | Measures | Range | |
| 93 | +|------|---------|----------|-------| |
| 94 | +| **Novelty** | ‖residual‖² / (w − k) | Unexplained structure outside the learned subspace | [0, ∞) | |
| 95 | +| **Displacement** | ‖z‖² / (k + ‖z‖²) | Distance from the subspace centroid | [0, 1) | |
| 96 | +| **Surprise** | mean diagonal Mahalanobis | Per-dimension magnitude deviation | [0, ∞) | |
| 97 | +| **Coherence** | mean squared cross-product deviation | Unusual pairwise co-activation patterns | [0, ∞) | |
| 98 | + |
| 99 | +Together they decompose the full covariance structure (total energy, diagonal, off-diagonal) without assembling or inverting a dense matrix. |
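
As a rough sketch, the two closed-form axes from the table can be written directly from their formulas. It assumes `residual` is the part of the observation left over after projection and `z` is the k-dimensional latent coordinate vector; how the crate centres or scales these is not shown here. Surprise and Coherence depend on the latent distribution state and are omitted.

```rust
/// Novelty: squared residual norm divided by the w - k unexplained dimensions.
fn novelty(residual: &[f64], w: usize, k: usize) -> f64 {
    assert!(w > k, "suffix width must exceed the rank");
    let r2: f64 = residual.iter().map(|r| r * r).sum();
    r2 / (w - k) as f64
}

/// Displacement: ||z||^2 / (k + ||z||^2), which stays inside [0, 1).
fn displacement(z: &[f64]) -> f64 {
    if z.is_empty() {
        return 0.0; // assumption: a rank-0 model has no displacement
    }
    let k = z.len() as f64;
    let z2: f64 = z.iter().map(|c| c * c).sum();
    z2 / (k + z2)
}
```

Both functions grow with the anomaly, in line with the polarity invariant: a larger residual raises novelty without bound, while displacement saturates toward 1.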
| 100 | + |
| 101 | +## Baseline Tracking & Drift Detection |
| 102 | + |
| 103 | +Each scoring axis maintains three components: |
| 104 | + |
| 105 | +- **Fast EWMA** (decay λ) — running mean and variance for instantaneous z-scores, with upper-tail outlier filtering to resist baseline poisoning |
| 106 | +- **Slow EWMA** (decay λ_s > λ) — long-memory reference for CUSUM |
| 107 | +- **CUSUM accumulator** — one-sided Page's test detecting sustained upward drift of batch means from the slow baseline |
| 108 | + |
| 109 | +The dual-EWMA design avoids frozen checkpoints that would require manual resets after legitimate regime changes. The slow baseline adapts automatically, just slowly enough to catch attacks before absorption. |
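
The three components can be sketched for a single axis as follows. This is an illustration under stated assumptions, not the crate's implementation: the struct and field names are hypothetical, the CUSUM allowance is taken as `allowance_sigmas` times the fast baseline's standard deviation, and the upper-tail outlier filtering is left out.

```rust
/// Illustrative per-axis baseline state (names are hypothetical).
struct AxisBaseline {
    fast_mean: f64,
    fast_var: f64,
    slow_mean: f64,
    cusum: f64,
}

impl AxisBaseline {
    /// Feeds one batch mean `x`; returns its z-score against the prior fast baseline.
    fn update(&mut self, x: f64, lambda: f64, lambda_s: f64, allowance_sigmas: f64) -> f64 {
        let sigma = self.fast_var.sqrt();

        // Instantaneous z-score, measured before the baselines move.
        let z = if sigma > 0.0 { (x - self.fast_mean) / sigma } else { 0.0 };

        // One-sided Page's test: accumulate only upward drift from the slow
        // baseline beyond the allowance, and never drop below zero.
        self.cusum = (self.cusum + (x - self.slow_mean) - allowance_sigmas * sigma).max(0.0);

        // Fast EWMA of mean and variance (standard incremental form).
        let delta = x - self.fast_mean;
        self.fast_mean += (1.0 - lambda) * delta;
        self.fast_var = lambda * (self.fast_var + (1.0 - lambda) * delta * delta);

        // Slow EWMA: same recursion with a decay closer to 1.
        self.slow_mean = lambda_s * self.slow_mean + (1.0 - lambda_s) * x;

        z
    }
}
```

With `lambda_s` closer to 1 than `lambda`, the slow mean lags the fast one, so a sustained shift keeps feeding the accumulator for many batches before the reference catches up.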
| 110 | + |
| 111 | +## Configuration |
| 112 | + |
| 113 | +```rust |
| 114 | +use torrust_sentinel::config::NoiseSchedule; |
| 115 | + |
| 116 | +SentinelConfig { |
| 117 | + max_rank: 16, // rank ceiling per tracker |
| 118 | + forgetting_factor: 0.99, // EWMA λ — half-life ~69 batches |
| 119 | + rank_update_interval: 100, // batches between rank adaptation
| 120 | + analysis_k: 1024, // max competitive analysis cells (§ALGO S-13.2) |
| 121 | + analysis_depth_cutoff: 6, // V-Tree depth cutoff for eligibility (§ALGO S-13.2) |
| 122 | + energy_threshold: 0.90, // cumulative variance target |
| 123 | + eps: 1e-6, // numerical stability |
| 124 | + cusum_slow_decay: 0.999, // slow EWMA λ_s — half-life ~693 batches |
| 125 | + cusum_coord_slow_decay: 0.999, |
| 126 | + cusum_allowance_sigmas: 0.5, // CUSUM noise tolerance (κ_σ) |
| 127 | + clip_sigmas: 3.0, // outlier clip width in σ units |
| 128 | + clip_pressure_decay: 0.95, // clip-pressure EWMA decay (λ_ρ, §ALGO S-6.4) |
| 129 | + per_sample_scores: false, // per-observation detail (expensive) |
| 130 | + split_threshold: 100, // G-V Graph: min observations before cell splits |
| 131 | + d_create: 3, // G-V Graph: max V-Tree depth for new splits |
| 132 | + d_evict: 6, // G-V Graph: min V-Tree depth for eviction |
| 133 | + budget: 100_000, // G-V Graph: hard ceiling on live G-node count |
| 134 | + noise_schedule: NoiseSchedule::default(), // depth-tiered noise rounds (ADR-S-015) |
| 135 | + noise_batch_size: 16, // samples per synthetic noise batch |
| 136 | + noise_seed: Some(42), // deterministic RNG seed (None = system entropy) |
| 137 | + background_warming: false, // warm cells on a background thread (ADR-S-017) |
| 138 | + svd_strategy: Default::default(), // Brand's incremental SVD (ADR-S-016) |
| 139 | +} |
| 140 | +``` |
| 141 | + |
| 142 | +Key tuning knobs: |
| 143 | + |
| 144 | +| Parameter | Effect | |
| 145 | +|-----------|--------| |
| 146 | +| `analysis_k` | Resource ceiling for analysis tier. Total trackers bounded by `2 × analysis_k` (Steiner bound). | |
| 147 | +| `analysis_depth_cutoff` | Only V-entries with V-Tree depth ≤ this cutoff are eligible. Prevents ephemeral cells from entering the analysis set. |
| 148 | +| `forgetting_factor` | Memory length. Lower = faster adaptation, shorter memory. | |
| 149 | +| `max_rank` | Model expressiveness ceiling. Higher = richer model, more memory. | |
| 150 | +| `energy_threshold` | How much variance the rank must capture. Higher → rank grows. | |
| 151 | +| `split_threshold` | G-V Graph cell split sensitivity. Lower → finer spatial resolution faster. | |
| 152 | +| `d_create` / `d_evict` | G-V Graph depth gates. Control tree growth and eviction eligibility. | |
| 153 | +| `budget` | G-V Graph hard ceiling on live nodes. Prevents unbounded spatial growth. | |
| 154 | +| `noise_schedule` | Depth-tiered noise injection schedule. `Geometric { root, decay, min }` or `Explicit(vec![...])`. Replaces the former flat `noise_rounds` (ADR-S-015). | |
| 155 | +| `noise_batch_size` | Samples per synthetic noise batch. | |
| 156 | +| `noise_seed` | Deterministic RNG seed for reproducibility (`None` = system entropy). | |
| 157 | +| `clip_pressure_decay` | Clip-pressure EWMA decay factor (λ_ρ). Controls how quickly per-axis clip-pressure adapts. Default 0.95 (§ALGO S-6.4). | |
| 158 | +| `background_warming` | Warm new cells on a background thread (`true`) or synchronously during `ingest()` (`false`, default). Production deployments should enable it. |
| 159 | +| `svd_strategy` | `Brand` (default, ~2–3× faster) or `Naive` (dense thin SVD). In debug builds both run as an oracle test (ADR-S-016). | |
| 160 | + |
| 161 | +## Automatic Noise Warm-Up |
| 162 | + |
| 163 | +Every newly created tracker is **automatically warmed** with synthetic noise before receiving real observations (§ALGO S-11.2, [ADR-S-007](adr/007-automatic-noise-injection.md)). No manual injection API exists — the sentinel owns the injection lifecycle entirely. |
| 164 | + |
| 165 | +- **Root tracker** — warmed at construction. |
| 166 | +- **New analysis cells** — enqueued into a staging area and warmed via a deferred pipeline ([ADR-S-017](adr/017-deferred-cell-warm-up.md)). Warm-up runs synchronously by default or on a background thread when `background_warming` is enabled. |
| 167 | +- **Coordination contexts** — warmed with Gamma-sampled synthetic score vectors (§ALGO S-9.8) when first activated. |
| 168 | + |
| 169 | +Noise parameters are configured via `SentinelConfig`: |
| 170 | + |
| 171 | +```rust |
| 172 | +use torrust_sentinel::config::NoiseSchedule; |
| 173 | + |
| 174 | +SentinelConfig { |
| 175 | + // Depth-tiered: root gets 450 rounds, deeper cells taper via 0.5× decay, min 50. |
| 176 | + noise_schedule: NoiseSchedule::default(), |
| 177 | + noise_batch_size: 16, // samples per synthetic batch (default) |
| 178 | + noise_seed: Some(42), // deterministic RNG seed (default) |
| 179 | + .. |
| 180 | +} |
| 181 | +``` |
| 182 | + |
| 183 | +The `NoiseSchedule` enum supports two variants: |
| 184 | +- `Geometric { root, decay, min }` — `rounds(d) = max(min, root × decay^d)`. Default: `{ root: 450, decay: 0.5, min: 50 }`. |
| 185 | +- `Explicit(Vec<u32>)` — per-depth round counts; last entry repeats for deeper cells. |
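
The two variants can be mirrored in a small standalone sketch. `Schedule` below is a local stand-in, not the crate's `NoiseSchedule`, and two details are assumptions: fractional round counts are truncated, and an empty explicit list yields zero rounds.

```rust
/// Local mirror of the two variants described above (not the crate's type).
enum Schedule {
    Geometric { root: u32, decay: f64, min: u32 },
    Explicit(Vec<u32>),
}

impl Schedule {
    /// Noise rounds for a cell at the given depth.
    fn rounds(&self, depth: u32) -> u32 {
        match self {
            Schedule::Geometric { root, decay, min } => {
                // rounds(d) = max(min, root * decay^d)
                let raw = f64::from(*root) * decay.powi(depth as i32);
                (raw as u32).max(*min) // assumption: truncate the fraction
            }
            Schedule::Explicit(per_depth) => {
                // The last entry repeats for every deeper cell.
                let idx = (depth as usize).min(per_depth.len().saturating_sub(1));
                per_depth.get(idx).copied().unwrap_or(0)
            }
        }
    }
}
```

With the default parameters `{ root: 450, decay: 0.5, min: 50 }`, this gives 450 rounds at the root and 225 at depth 1, and the floor of 50 takes over from depth 4 onward.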
| 186 | + |
| 187 | +Each tracker reports a `noise_influence` (η) value that decays exponentially with real observations: η = λⁿ after n real batches. The host should treat scores from trackers with η > 0.5 as preliminary. |
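
The decay η = λⁿ makes it easy to estimate how long scores stay preliminary. The sketch below uses hypothetical helper names and works only from the formula above, not from the crate's API.

```rust
/// eta = lambda^n after n real batches.
fn noise_influence(lambda: f64, real_batches: u32) -> f64 {
    lambda.powi(real_batches as i32)
}

/// Smallest n for which lambda^n has fallen to `threshold` or below.
fn batches_until(lambda: f64, threshold: f64) -> u32 {
    assert!(lambda > 0.0 && lambda < 1.0);
    assert!(threshold > 0.0 && threshold < 1.0);
    (threshold.ln() / lambda.ln()).ceil() as u32
}
```

For the default `forgetting_factor` of 0.99, `batches_until(0.99, 0.5)` is 69, matching the half-life quoted in the configuration comments: a tracker needs about 69 real batches before η drops to 0.5.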
| 188 | + |
| 189 | +## Report Structure |
| 190 | + |
| 191 | +`ingest()` returns a `BatchReport`: |
| 192 | + |
| 193 | +``` |
| 194 | +BatchReport |
| 195 | +├── cell_reports: [CellReport] // competitive cells only |
| 196 | +│ ├── gnode_id, start, end, depth, analysis_width |
| 197 | +│ ├── is_competitive (true), sample_count |
| 198 | +│ ├── rank, energy_ratio, top_singular_value |
| 199 | +│ ├── scores: AnomalyScores // four axes with z-scores, baselines, CUSUM |
| 200 | +│ ├── maturity: TrackerMaturity // η, observation counts |
| 201 | +│ ├── geometry: ScoringGeometry // which axes are structurally active |
| 202 | +│ └── per_sample: Option<[SampleScore]> // if per_sample_scores enabled |
| 203 | +│ |
| 204 | +├── ancestor_reports: [CellReport] // ancestor-only cells (incl. root) |
| 205 | +│ |
| 206 | +├── coordination_reports: [CoordinationReport] // hierarchical cross-cell analysis |
| 207 | +│ ├── gnode_id, start, end, depth, cells_reporting |
| 208 | +│ ├── rank, energy_ratio, top_singular_value |
| 209 | +│ ├── scores: AnomalyScores |
| 210 | +│ ├── maturity: TrackerMaturity |
| 211 | +│ ├── geometry: ScoringGeometry |
| 212 | +│ └── per_member: Option<[MemberScore]> // per-cell identity + scores (if per_sample_scores) |
| 213 | +│ |
| 214 | +├── contour: ContourSnapshot // spatial structure summary |
| 215 | +│ ├── plateau_count, cell_count |
| 216 | +│ └── total_importance |
| 217 | +│ |
| 218 | +├── health: HealthReport // inline health snapshot |
| 219 | +│ ├── total_g_nodes, semi_internal_count, active_trackers |
| 220 | +│ ├── active_competitive_trackers, active_ancestor_trackers |
| 221 | +│ ├── active_coordination_contexts |
| 222 | +│ ├── investment_set_size, warming_trackers, warming_competitive_targets |
| 223 | +│ ├── lifetime_observations, cells_tracked |
| 224 | +│ ├── rank_distribution, maturity_distribution |
| 225 | +│ ├── geometry_distribution, coordination_health |
| 226 | +│ └── clip_pressure_distribution |
| 227 | +│ |
| 228 | +└── analysis_set_summary: AnalysisSetSummary // analysis set overview |
| 229 | + ├── competitive_size, full_size, investment_set_size |
| 230 | + ├── depth_range, importance_range, v_depth_range |
| 231 | + └── degenerate_cells_skipped |
| 232 | +``` |
| 233 | + |
| 234 | +Each `AnomalyScores` contains per-axis `ScoreDistribution` with min/max/mean raw scores, z-scores against the fast baseline, baseline snapshots, and CUSUM accumulator state. |
| 235 | + |
| 236 | +## Public API |
| 237 | + |
| 238 | +| Method | Description | |
| 239 | +|--------|-------------| |
| 240 | +| `SpectralSentinel::new(config)` | Create and validate a new sentinel | |
| 241 | +| `.ingest(&[u128])` | Process a batch, return `BatchReport` | |
| 242 | +| `.health()` | Snapshot of tracker counts, rank distributions, maturity | |
| 243 | +| `.inspect_cell(gnode)` | Deep inspection of a single cell's tracker | |
| 244 | +| `.cell_gnodes()` | List all cell `GNodeId`s in the full analysis set | |
| 245 | +| `.cells_tracked()` | Number of cells in the full analysis set | |
| 246 | +| `.lifetime_observations()` | Total real observations processed across the sentinel's lifetime | |
| 247 | +| `.degenerate_cells_skipped()` | Number of cells excluded for suffix width below `MIN_TRACKER_DIM` (ADR-S-011) | |
| 248 | +| `.config()` | Read-only access to the configuration | |
| 249 | +| `.analysis_set()` | Read-only access to the current analysis set | |
| 250 | +| `.graph()` | Read-only access to the G-V Graph spatial substrate | |
| 251 | +| `.decay(attenuation, q)` | Apply temporal decay to the entire G-V Graph | |
| 252 | +| `.decay_subtree(gnode, att, q)` | Apply temporal decay to a subtree | |
| 253 | +| `.reset()` | Destroy all state, return to fresh | |
| 254 | + |
| 255 | +## Features |
| 256 | + |
| 257 | +| Feature | Effect | |
| 258 | +|---------|--------| |
| 259 | +| `serde` | Enables `Serialize`/`Deserialize` on all config and report types | |
| 260 | + |
| 261 | +## Resource Considerations |
| 262 | + |
| 263 | +The total tracker count is bounded by `2 × analysis_k` (competitive + ancestor cells). At the default `analysis_k = 1024`, this is at most 2,048 trackers. Each tracker uses ~14–24 KB depending on suffix width. |
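
For capacity planning, the bound turns into a quick back-of-the-envelope estimate. The helpers below are hypothetical and use only the figures quoted above (a `2 × analysis_k` ceiling and roughly 14 to 24 KB per tracker); they ignore the G-V Graph and coordination contexts.

```rust
/// Upper bound on tracker count: competitive plus ancestor cells.
fn max_trackers(analysis_k: usize) -> usize {
    2 * analysis_k
}

/// (low, high) tracker memory estimate in KB at ~14-24 KB per tracker.
fn tracker_memory_kb(analysis_k: usize) -> (usize, usize) {
    let n = max_trackers(analysis_k);
    (n * 14, n * 24)
}
```

At the default `analysis_k = 1024` this comes to 28,672 to 49,152 KB, so the tracker tier tops out at roughly 28 to 48 MB.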
| 264 | + |
| 265 | +The G-V Graph's `budget` parameter caps the total number of live G-nodes. The `analysis_k` and `analysis_depth_cutoff` parameters control how many of those nodes get analysis trackers. |
| 266 | + |
| 267 | +## Documentation |
| 268 | + |
| 269 | +- [docs/algorithm.md](docs/algorithm.md) — full algorithm specification |
| 270 | +- [docs/implementation.md](docs/implementation.md) — implementation guide (source layout, architecture, design decisions) |
| 271 | + |
| 272 | +## Known Divergences from Spec |
| 273 | + |
| 274 | +The implementation conforms to the full specification. All architectural |
| 275 | +phases and the Chapter 14 report structure are complete. |
| 276 | + |
| 277 | +The deferred cell warm-up (ADR-S-017) is implemented: the staging area
| 278 | +infrastructure and background warming thread are complete, and the
| 279 | +depth-tiered `NoiseSchedule` (ADR-S-015) is in use. Timing
| 280 | +protection (S2 adaptive pad, S3 equalization) is out of scope
| 281 | +(§ALGO S-18.5).