
Commit 84defab

feat: introduce sentinel package
1 parent a401c0c commit 84defab

88 files changed

Lines changed: 32421 additions & 36 deletions


Cargo.lock

Lines changed: 672 additions & 35 deletions

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [workspace]
-members = [".", "packages/render-text-as-image", "packages/mudlark"]
+members = [".", "packages/render-text-as-image", "packages/mudlark", "packages/sentinel"]
 
 [package]
 default-run = "torrust-index"

packages/sentinel/Cargo.toml

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
[package]
2+
categories = ["algorithms", "network-programming"]
3+
description = "Hierarchical online subspace anomaly detection for positionally structured observation streams."
4+
keywords = ["anomaly-detection", "online-learning", "spectral", "streaming", "subspace"]
5+
name = "torrust-sentinel"
6+
readme = "README.md"
7+
version = "0.1.0"
8+
9+
authors.workspace = true
10+
documentation.workspace = true
11+
edition.workspace = true
12+
homepage.workspace = true
13+
license.workspace = true
14+
publish.workspace = true
15+
repository.workspace = true
16+
rust-version.workspace = true
17+
18+
[lints]
19+
workspace = true
20+
21+
[features]
22+
serde = ["dep:serde", "torrust-mudlark/serde"]
23+
24+
[dependencies]
25+
faer = "0"
26+
rand = "0.10"
27+
rand_distr = "0.6"
28+
serde = { version = "1", features = ["derive"], optional = true }
29+
torrust-mudlark = { path = "../mudlark", default-features = false, features = ["dynamic-contour-tracking"] }
30+
tracing = "0"
31+
32+
[dev-dependencies]
33+
criterion = { version = "0", features = ["html_reports"] }
34+
serde_json = "1"
35+
tracing-subscriber = { version = "0.3", features = ["registry", "env-filter"] }
36+
37+
[[bench]]
38+
harness = false
39+
name = "sentinel"

packages/sentinel/README.md

Lines changed: 281 additions & 0 deletions
@@ -0,0 +1,281 @@
1+
# Spectral Sentinel
2+
3+
Hierarchical online subspace anomaly spectrometer for positionally structured observation streams, generic over coordinate type, accumulator type, and domain bit-width.
4+
5+
The sentinel maintains low-rank linear subspace models of "normal" traffic per analysis cell and scores each incoming batch against those models using streaming thin SVD with exponential forgetting. It produces raw statistical measurements — never opinions, threat levels, or recommended actions.
6+
7+
**The sentinel measures; the host decides.**
8+
9+
## Use Case
10+
11+
The sentinel analyses **hierarchical positional structure** in coordinate values — leading bits define coarse groupings and successive bits refine them. The default instantiation (`Sentinel128`) targets 128-bit domains such as IPv6 addresses; `Sentinel64` covers 64-bit domains.
12+
13+
Pseudo-random values (cryptographic hashes, UUIDs, nonces) have no exploitable bit-positional structure and will not produce meaningful results.
14+
15+
## Quick Start
16+
17+
```rust
18+
use torrust_sentinel::Sentinel128;
19+
use torrust_sentinel::config::SentinelConfig;
20+
21+
// Configure — small analysis budget for a demo
22+
let cfg = SentinelConfig::<u64> {
23+
analysis_k: 4,
24+
..SentinelConfig::default()
25+
};
26+
27+
// Create — the root tracker is automatically warmed with synthetic noise
28+
let mut sentinel = Sentinel128::new(cfg).unwrap();
29+
30+
// Ingest observations
31+
let values: Vec<u128> = vec![
32+
0xF000_0000_0000_0000_0000_0000_0000_0001,
33+
0xF000_0000_0000_0000_0000_0000_0000_0002,
34+
0x1000_0000_0000_0000_0000_0000_0000_0003,
35+
];
36+
let report = sentinel.ingest(&values);
37+
38+
// Inspect results — root cell is always an ancestor
39+
for cell in report.cell_reports.iter().chain(report.ancestor_reports.iter()) {
40+
let s = &cell.scores;
41+
println!(
42+
"cell depth {} [{:#034x}, {:#034x}) — novelty z={:.2}, displacement z={:.2}",
43+
cell.depth, cell.start, cell.end,
44+
s.novelty.max_z_score, s.displacement.max_z_score,
45+
);
46+
}
47+
```
48+
49+
## Architecture
50+
51+
The sentinel implements a three-layer adaptive architecture backed by the G-V Graph spatial substrate (see [algorithm.md](docs/algorithm.md) for the full specification and [implementation.md](docs/implementation.md) for implementation status).
52+
53+
### Three-Layer Design (§ALGO S-1.1)
54+
55+
```
56+
Layer 1: G-V Graph ── pure volume tracking, Δ = 1 always
57+
Adaptive spatial partitioning of [0, 2^N)
58+
Competitive ranking by traffic volume
59+
60+
│ V-Tree depth ≤ cutoff → top-K selection
61+
62+
Layer 2: Analysis Selector ── picks competitive cells, closes under G-ancestry
63+
64+
│ suffix bit vectors at every ancestor depth
65+
66+
Layer 3: Analysis Engine ── SubspaceTrackers at competitive + ancestor cells
67+
│ Hierarchical coordination (G-tree bottom-up)
68+
69+
BatchReport<C> → host
70+
```
71+
72+
**G-V Graph spatial substrate.** The sentinel owns a `GvGraph<C, V, N>` that adaptively partitions the full `[0, 2^N)` domain. Each observation feeds the graph with Δ = 1 (pure volume counting — no feedback from anomaly scores). The graph splits, evicts, and rebalances cells autonomously. The default instantiation (`Sentinel128`) uses `GvGraph<u128, u64, 128>`; see [ADR-S-018](adr/018-generic-domain-parameters.md) for the generic parameter design.
73+
74+
**Analysis selector (Layer 2).** After each observation pass, the analysis set is recomputed: the top `analysis_k` V-Tree entries by importance (with V-depth ≤ `analysis_depth_cutoff`) are selected as competitive cells, then closed under G-tree ancestry. Each cell in the full analysis set gets a `SubspaceTracker` at suffix width `w = N − depth`.
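
To make the suffix width `w = N − depth` concrete, here is a minimal standalone sketch for the 128-bit case. The function names and the 0.0/1.0 bit encoding are assumptions for illustration only; the crate's actual suffix representation is defined in the algorithm specification and may differ.

```rust
/// Suffix width for a cell at `depth` in a domain of `domain_bits` bits.
/// (Hypothetical helper; saturates instead of underflowing.)
fn suffix_width(domain_bits: u32, depth: u32) -> u32 {
    domain_bits.saturating_sub(depth)
}

/// Expand the low `w = 128 - depth` bits of `value` into a vector of
/// 0.0 / 1.0 entries, most significant suffix bit first.
/// The 0.0/1.0 encoding is an assumption, not the crate's documented format.
fn suffix_bits(value: u128, depth: u32) -> Vec<f64> {
    let w = suffix_width(128, depth);
    (0..w)
        .rev()
        .map(|bit| ((value >> bit) & 1) as f64)
        .collect()
}
```

Under these assumptions a cell at depth 124 sees only the last four bits of each observation, while the root cell (depth 0) sees all 128.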
75+
76+
**Coordination.** After per-cell scoring, hierarchical coordination contexts at G-tree internal nodes analyse cross-cell score patterns bottom-up to detect spatially coordinated anomalies that no single cell would flag. Coordination fires at a G-node when both its left and right subtrees contribute competitive cell scores.
77+
78+
### Core Loop (Per Tracker)
79+
80+
Each `SubspaceTracker` processes batches in five strictly ordered phases — **scoring precedes evolution** so the batch is always measured against the prior model:
81+
82+
1. **Score** — project onto learned subspace, compute four anomaly axes
83+
2. **Evolve Subspace** — streaming thin SVD with exponential forgetting (λ)
84+
3. **Evolve Latent Distribution** — EWMA mean, variance, second-moment matrix
85+
4. **Update Baselines & CUSUM** — fast/slow EWMA per axis, one-sided Page's test
86+
5. **Adapt Rank** — energy-threshold rank selection, ±1 step per evaluation
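
Phase 5 can be illustrated with a small standalone sketch. It assumes that "energy" means the sum of squared singular values and that the rank moves one step toward the smallest rank meeting `energy_threshold`; both function names are hypothetical and the crate's exact rule may differ.

```rust
/// Smallest rank whose leading singular values capture at least `threshold`
/// of the total energy (assumed here to be the sum of squared singular values).
fn target_rank(singular_values: &[f64], threshold: f64) -> usize {
    let total: f64 = singular_values.iter().map(|s| s * s).sum();
    if total <= 0.0 {
        return 1;
    }
    let mut cumulative = 0.0;
    for (i, s) in singular_values.iter().enumerate() {
        cumulative += s * s;
        if cumulative / total >= threshold {
            return i + 1;
        }
    }
    singular_values.len().max(1)
}

/// Move `current` at most one step toward the target rank, staying in [1, max_rank].
fn adapt_rank(current: usize, singular_values: &[f64], threshold: f64, max_rank: usize) -> usize {
    let max_rank = max_rank.max(1);
    let target = target_rank(singular_values, threshold).min(max_rank);
    let next = if target > current {
        current + 1
    } else if target < current {
        current - 1
    } else {
        current
    };
    next.clamp(1, max_rank)
}
```

With singular values `[3.0, 2.0, 1.0, 0.5]` and a threshold of 0.90, the first two values carry about 91% of the energy, so the target rank is 2 and a tracker currently at rank 4 would step down to 3.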
87+
88+
## The Four Scoring Axes
89+
90+
All axes satisfy a **polarity invariant**: higher values = more anomalous.
91+
92+
| Axis | Formula | Measures | Range |
93+
|------|---------|----------|-------|
94+
| **Novelty** | ‖residual‖² / (w − k) | Unexplained structure outside the learned subspace | [0, ∞) |
95+
| **Displacement** | ‖z‖² / (k + ‖z‖²) | Distance from the subspace centroid | [0, 1) |
96+
| **Surprise** | mean diagonal Mahalanobis | Per-dimension magnitude deviation | [0, ∞) |
97+
| **Coherence** | mean squared cross-product deviation | Unusual pairwise co-activation patterns | [0, ∞) |
98+
99+
Together they decompose the full covariance structure (total energy, diagonal, off-diagonal) without assembling or inverting a dense matrix.
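
The first two rows of the table can be checked by hand with a short sketch. It assumes an orthonormal basis stored as one `Vec` per basis vector, no centring of the input, and `k < w`; the function name is hypothetical and this is not the crate's scoring code.

```rust
/// Compute (novelty, displacement) for one sample `x` of width `w`
/// against `k` orthonormal basis vectors, using the table's formulas.
/// Assumes k < w and an already orthonormal basis.
fn novelty_and_displacement(x: &[f64], basis: &[Vec<f64>]) -> (f64, f64) {
    let w = x.len();
    let k = basis.len();

    // Latent coordinates: z_i = <u_i, x>.
    let z: Vec<f64> = basis
        .iter()
        .map(|u| u.iter().zip(x).map(|(a, b)| a * b).sum::<f64>())
        .collect();

    // Residual: x minus its projection onto the subspace.
    let mut residual = x.to_vec();
    for (u, zi) in basis.iter().zip(&z) {
        for (r, ui) in residual.iter_mut().zip(u) {
            *r -= zi * ui;
        }
    }

    let residual_sq: f64 = residual.iter().map(|r| r * r).sum();
    let z_sq: f64 = z.iter().map(|v| v * v).sum();

    let novelty = residual_sq / (w - k) as f64; // range [0, ∞)
    let displacement = z_sq / (k as f64 + z_sq); // range [0, 1)
    (novelty, displacement)
}
```

For `x = [3, 4, 0]` and the single basis vector `[1, 0, 0]`, the residual is `[0, 4, 0]`, giving novelty 16 / 2 = 8 and displacement 9 / (1 + 9) = 0.9.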
100+
101+
## Baseline Tracking & Drift Detection
102+
103+
Each scoring axis maintains three components:
104+
105+
- **Fast EWMA** (decay λ) — running mean and variance for instantaneous z-scores, with upper-tail outlier filtering to resist baseline poisoning
106+
- **Slow EWMA** (decay λ_s > λ) — long-memory reference for CUSUM
107+
- **CUSUM accumulator** — one-sided Page's test detecting sustained upward drift of batch means from the slow baseline
108+
109+
The dual-EWMA design avoids frozen checkpoints that would require manual resets after legitimate regime changes. The slow baseline adapts automatically, just slowly enough to catch attacks before absorption.
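
A minimal sketch of one axis's dual-EWMA and CUSUM state follows. It omits the upper-tail outlier filtering, takes the CUSUM allowance as `allowance_sigmas` times the fast-baseline σ, and uses hypothetical names throughout; these are assumptions for illustration, not the crate's implementation.

```rust
/// Per-axis baseline state: fast EWMA for z-scores, slow EWMA as CUSUM reference.
struct AxisBaseline {
    fast_mean: f64,
    fast_var: f64,
    slow_mean: f64,
    cusum: f64,
    fast_decay: f64,       // λ
    slow_decay: f64,       // λ_s > λ
    allowance_sigmas: f64, // κ_σ
}

impl AxisBaseline {
    fn new(mean: f64, var: f64, fast_decay: f64, slow_decay: f64, allowance_sigmas: f64) -> Self {
        Self {
            fast_mean: mean,
            fast_var: var,
            slow_mean: mean,
            cusum: 0.0,
            fast_decay,
            slow_decay,
            allowance_sigmas,
        }
    }

    /// Score `batch_mean` against the prior baselines, then update them.
    /// Returns the z-score against the fast baseline.
    fn update(&mut self, batch_mean: f64) -> f64 {
        let sigma = self.fast_var.max(1e-12).sqrt();
        let z = (batch_mean - self.fast_mean) / sigma;

        // One-sided Page's test: accumulate only upward drift beyond the allowance.
        let drift = batch_mean - self.slow_mean - self.allowance_sigmas * sigma;
        self.cusum = (self.cusum + drift).max(0.0);

        // Baselines evolve only after scoring.
        let delta = batch_mean - self.fast_mean;
        self.fast_mean += (1.0 - self.fast_decay) * delta;
        self.fast_var =
            self.fast_decay * self.fast_var + (1.0 - self.fast_decay) * delta * delta;
        self.slow_mean += (1.0 - self.slow_decay) * (batch_mean - self.slow_mean);
        z
    }
}
```

In this sketch a stream that stays at the baseline mean keeps the accumulator at zero, while a sustained upward shift makes it grow on every batch, even as the fast baseline starts absorbing the shift.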
110+
111+
## Configuration
112+
113+
```rust
114+
use torrust_sentinel::config::{NoiseSchedule, SentinelConfig};
115+
116+
SentinelConfig {
117+
max_rank: 16, // rank ceiling per tracker
118+
forgetting_factor: 0.99, // EWMA λ — half-life ~69 batches
119+
rank_update_interval: 100,// batches between rank adaptation
120+
analysis_k: 1024, // max competitive analysis cells (§ALGO S-13.2)
121+
analysis_depth_cutoff: 6, // V-Tree depth cutoff for eligibility (§ALGO S-13.2)
122+
energy_threshold: 0.90, // cumulative variance target
123+
eps: 1e-6, // numerical stability
124+
cusum_slow_decay: 0.999, // slow EWMA λ_s — half-life ~693 batches
125+
cusum_coord_slow_decay: 0.999,
126+
cusum_allowance_sigmas: 0.5, // CUSUM noise tolerance (κ_σ)
127+
clip_sigmas: 3.0, // outlier clip width in σ units
128+
clip_pressure_decay: 0.95, // clip-pressure EWMA decay (λ_ρ, §ALGO S-6.4)
129+
per_sample_scores: false, // per-observation detail (expensive)
130+
split_threshold: 100, // G-V Graph: min observations before cell splits
131+
d_create: 3, // G-V Graph: max V-Tree depth for new splits
132+
d_evict: 6, // G-V Graph: min V-Tree depth for eviction
133+
budget: 100_000, // G-V Graph: hard ceiling on live G-node count
134+
noise_schedule: NoiseSchedule::default(), // depth-tiered noise rounds (ADR-S-015)
135+
noise_batch_size: 16, // samples per synthetic noise batch
136+
noise_seed: Some(42), // deterministic RNG seed (None = system entropy)
137+
background_warming: false, // warm cells on a background thread (ADR-S-017)
138+
svd_strategy: Default::default(), // Brand's incremental SVD (ADR-S-016)
139+
}
140+
```
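
The half-life figures quoted in the comments above follow from the standard EWMA relation: a weight of λⁿ falls to one half at n = ln 2 / (−ln λ). The helper below is a hypothetical standalone check, not part of the crate.

```rust
/// Number of batches after which an EWMA weight with the given decay
/// factor has fallen to one half: n = ln 2 / (-ln decay).
fn half_life_batches(decay: f64) -> f64 {
    std::f64::consts::LN_2 / -decay.ln()
}
```

Evaluating it at 0.99 and 0.999 gives roughly 69 and 693 batches, matching the `forgetting_factor` and `cusum_slow_decay` comments.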
141+
142+
Key tuning knobs:
143+
144+
| Parameter | Effect |
145+
|-----------|--------|
146+
| `analysis_k` | Resource ceiling for analysis tier. Total trackers bounded by `2 × analysis_k` (Steiner bound). |
147+
| `analysis_depth_cutoff` | Only V-Tree entries with V-depth ≤ this cutoff are eligible. Prevents ephemeral cells from entering the analysis set. |
148+
| `forgetting_factor` | Memory length. Lower = faster adaptation, shorter memory. |
149+
| `max_rank` | Model expressiveness ceiling. Higher = richer model, more memory. |
150+
| `energy_threshold` | How much variance the rank must capture. Higher → rank grows. |
151+
| `split_threshold` | G-V Graph cell split sensitivity. Lower → finer spatial resolution faster. |
152+
| `d_create` / `d_evict` | G-V Graph depth gates. Control tree growth and eviction eligibility. |
153+
| `budget` | G-V Graph hard ceiling on live nodes. Prevents unbounded spatial growth. |
154+
| `noise_schedule` | Depth-tiered noise injection schedule. `Geometric { root, decay, min }` or `Explicit(vec![...])`. Replaces the former flat `noise_rounds` (ADR-S-015). |
155+
| `noise_batch_size` | Samples per synthetic noise batch. |
156+
| `noise_seed` | Deterministic RNG seed for reproducibility (`None` = system entropy). |
157+
| `clip_pressure_decay` | Clip-pressure EWMA decay factor (λ_ρ). Controls how quickly per-axis clip-pressure adapts. Default 0.95 (§ALGO S-6.4). |
158+
| `background_warming` | Warm new cells on a background thread (`true`) or synchronously during `ingest()` (`false`, default). Production deployments should enable it. |
159+
| `svd_strategy` | `Brand` (default, ~2–3× faster) or `Naive` (dense thin SVD). In debug builds both run as an oracle test (ADR-S-016). |
160+
161+
## Automatic Noise Warm-Up
162+
163+
Every newly created tracker is **automatically warmed** with synthetic noise before receiving real observations (§ALGO S-11.2, [ADR-S-007](adr/007-automatic-noise-injection.md)). No manual injection API exists — the sentinel owns the injection lifecycle entirely.
164+
165+
- **Root tracker** — warmed at construction.
166+
- **New analysis cells** — enqueued into a staging area and warmed via a deferred pipeline ([ADR-S-017](adr/017-deferred-cell-warm-up.md)). Warm-up runs synchronously by default or on a background thread when `background_warming` is enabled.
167+
- **Coordination contexts** — warmed with Gamma-sampled synthetic score vectors (§ALGO S-9.8) when first activated.
168+
169+
Noise parameters are configured via `SentinelConfig`:
170+
171+
```rust
172+
use torrust_sentinel::config::{NoiseSchedule, SentinelConfig};
173+
174+
SentinelConfig {
175+
// Depth-tiered: root gets 450 rounds, deeper cells taper via 0.5× decay, min 50.
176+
noise_schedule: NoiseSchedule::default(),
177+
noise_batch_size: 16, // samples per synthetic batch (default)
178+
noise_seed: Some(42), // deterministic RNG seed (default)
179+
..SentinelConfig::default()
180+
}
181+
```
182+
183+
The `NoiseSchedule` enum supports two variants:
184+
- `Geometric { root, decay, min }`: `rounds(d) = max(min, root × decay^d)`. Default: `{ root: 450, decay: 0.5, min: 50 }`.
185+
- `Explicit(Vec<u32>)` — per-depth round counts; last entry repeats for deeper cells.
186+
187+
Each tracker reports a `noise_influence` (η) value that decays exponentially with real observations: η = λⁿ after n real batches. The host should treat scores from trackers with η > 0.5 as preliminary.
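
The schedule and the η decay can be worked through with a standalone sketch. `NoiseScheduleSketch` is a hypothetical mirror of the enum described above, not the crate's type. Rounding fractional round counts down and returning 0 for an empty explicit list are assumptions.

```rust
/// Hypothetical mirror of the two schedule variants described above.
enum NoiseScheduleSketch {
    Geometric { root: u32, decay: f64, min: u32 },
    Explicit(Vec<u32>),
}

impl NoiseScheduleSketch {
    /// Noise rounds for a cell at `depth`.
    fn rounds(&self, depth: u32) -> u32 {
        match self {
            // rounds(d) = max(min, root × decay^d), rounded down (assumption).
            Self::Geometric { root, decay, min } => {
                let tapered = f64::from(*root) * decay.powi(depth as i32);
                (tapered.floor() as u32).max(*min)
            }
            // Per-depth counts; the last entry repeats for deeper cells.
            Self::Explicit(per_depth) => per_depth
                .get(depth as usize)
                .or(per_depth.last())
                .copied()
                .unwrap_or(0),
        }
    }
}

/// η = λⁿ after `real_batches` real batches.
fn noise_influence(forgetting_factor: f64, real_batches: u32) -> f64 {
    forgetting_factor.powi(real_batches as i32)
}
```

With the default geometric parameters this gives 450, 225, 112, 56 rounds at depths 0 to 3 and the floor of 50 from depth 4 onward. At λ = 0.99, η first drops below 0.5 after 69 real batches.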
188+
189+
## Report Structure
190+
191+
`ingest()` returns a `BatchReport`:
192+
193+
```
194+
BatchReport
195+
├── cell_reports: [CellReport] // competitive cells only
196+
│ ├── gnode_id, start, end, depth, analysis_width
197+
│ ├── is_competitive (true), sample_count
198+
│ ├── rank, energy_ratio, top_singular_value
199+
│ ├── scores: AnomalyScores // four axes with z-scores, baselines, CUSUM
200+
│ ├── maturity: TrackerMaturity // η, observation counts
201+
│ ├── geometry: ScoringGeometry // which axes are structurally active
202+
│ └── per_sample: Option<[SampleScore]> // if per_sample_scores enabled
203+
204+
├── ancestor_reports: [CellReport] // ancestor-only cells (incl. root)
205+
206+
├── coordination_reports: [CoordinationReport] // hierarchical cross-cell analysis
207+
│ ├── gnode_id, start, end, depth, cells_reporting
208+
│ ├── rank, energy_ratio, top_singular_value
209+
│ ├── scores: AnomalyScores
210+
│ ├── maturity: TrackerMaturity
211+
│ ├── geometry: ScoringGeometry
212+
│ └── per_member: Option<[MemberScore]> // per-cell identity + scores (if per_sample_scores)
213+
214+
├── contour: ContourSnapshot // spatial structure summary
215+
│ ├── plateau_count, cell_count
216+
│ └── total_importance
217+
218+
├── health: HealthReport // inline health snapshot
219+
│ ├── total_g_nodes, semi_internal_count, active_trackers
220+
│ ├── active_competitive_trackers, active_ancestor_trackers
221+
│ ├── active_coordination_contexts
222+
│ ├── investment_set_size, warming_trackers, warming_competitive_targets
223+
│ ├── lifetime_observations, cells_tracked
224+
│ ├── rank_distribution, maturity_distribution
225+
│ ├── geometry_distribution, coordination_health
226+
│ └── clip_pressure_distribution
227+
228+
└── analysis_set_summary: AnalysisSetSummary // analysis set overview
229+
├── competitive_size, full_size, investment_set_size
230+
├── depth_range, importance_range, v_depth_range
231+
└── degenerate_cells_skipped
232+
```
233+
234+
Each `AnomalyScores` contains per-axis `ScoreDistribution` with min/max/mean raw scores, z-scores against the fast baseline, baseline snapshots, and CUSUM accumulator state.
235+
236+
## Public API
237+
238+
| Method | Description |
239+
|--------|-------------|
240+
| `SpectralSentinel::new(config)` | Create and validate a new sentinel |
241+
| `.ingest(&[u128])` | Process a batch, return `BatchReport` |
242+
| `.health()` | Snapshot of tracker counts, rank distributions, maturity |
243+
| `.inspect_cell(gnode)` | Deep inspection of a single cell's tracker |
244+
| `.cell_gnodes()` | List all cell `GNodeId`s in the full analysis set |
245+
| `.cells_tracked()` | Number of cells in the full analysis set |
246+
| `.lifetime_observations()` | Total real observations processed across the sentinel's lifetime |
247+
| `.degenerate_cells_skipped()` | Number of cells excluded for suffix width below `MIN_TRACKER_DIM` (ADR-S-011) |
248+
| `.config()` | Read-only access to the configuration |
249+
| `.analysis_set()` | Read-only access to the current analysis set |
250+
| `.graph()` | Read-only access to the G-V Graph spatial substrate |
251+
| `.decay(attenuation, q)` | Apply temporal decay to the entire G-V Graph |
252+
| `.decay_subtree(gnode, att, q)` | Apply temporal decay to a subtree |
253+
| `.reset()` | Destroy all state, return to fresh |
254+
255+
## Features
256+
257+
| Feature | Effect |
258+
|---------|--------|
259+
| `serde` | Enables `Serialize`/`Deserialize` on all config and report types |
260+
261+
## Resource Considerations
262+
263+
The total tracker count is bounded by `2 × analysis_k` (competitive + ancestor cells). At the default `analysis_k = 1024`, this is at most 2,048 trackers. Each tracker uses ~14–24 KB depending on suffix width.
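
As a rough worked example of that bound, the hypothetical helper below multiplies the tracker ceiling by the quoted per-tracker range. It covers trackers only and ignores the G-V Graph and coordination contexts, so treat the result as an order-of-magnitude estimate.

```rust
/// Lower and upper tracker-memory estimate in KB for a given `analysis_k`,
/// using the 2 × analysis_k ceiling and the ~14–24 KB per-tracker range.
fn tracker_memory_bounds_kb(analysis_k: usize) -> (usize, usize) {
    let max_trackers = 2 * analysis_k;
    (max_trackers * 14, max_trackers * 24)
}
```

At the default `analysis_k = 1024` this yields 28,672 KB to 49,152 KB, roughly 28 to 48 MB when every tracker slot is in use.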
264+
265+
The G-V Graph's `budget` parameter caps the total number of live G-nodes. The `analysis_k` and `analysis_depth_cutoff` parameters control how many of those nodes get analysis trackers.
266+
267+
## Documentation
268+
269+
- [docs/algorithm.md](docs/algorithm.md) — full algorithm specification
270+
- [docs/implementation.md](docs/implementation.md) — implementation guide (source layout, architecture, design decisions)
271+
272+
## Known Divergences from Spec
273+
274+
The implementation conforms to the full specification. All architectural
275+
phases and the Chapter 14 report structure are complete.
276+
277+
The deferred cell warm-up (ADR-S-017) is implemented: the staging area
278+
infrastructure and background warming thread are complete, and
279+
the `NoiseSchedule` depth-tiered schedule (ADR-S-015) is in use. Timing
280+
protection (S2 adaptive pad, S3 equalization) is out of scope
281+
(§ALGO S-18.5).
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
1+
# ADR-S-001: Measures Not Opinions
2+
3+
**Status:** Implemented
4+
**Date:** 2026-03-08
5+
**Spec:** §ALGO S-1.2 (layer responsibilities — "sentinel measures; host decides")
6+
7+
## Context
8+
9+
An anomaly detector can either:
10+
11+
- **A)** Output raw statistical measurements and let the consumer
12+
decide what they mean (library approach).
13+
- **B)** Output verdicts — threat levels, recommended actions,
14+
block/allow decisions (appliance approach).
15+
16+
The sentinel is a library embedded inside the Torrust Index. Different
17+
hosts have different risk tolerances, different action vocabularies
18+
(ban, throttle, flag, ignore), and different false-positive
19+
consequences. Baking policy into the sentinel would force every host
20+
into one policy model.
21+
22+
## Decision
23+
24+
**The sentinel outputs only raw statistical measurements. It never
25+
outputs opinions, threat levels, or recommended actions.**
26+
27+
Concretely:
28+
29+
- `BatchReport` contains `AnomalyScores` (per-axis mean, max, z-score,
30+
CUSUM), `ScoringGeometry` (ADR-S-008), `TrackerMaturity`, rank,
31+
energy ratios.
32+
- No field is named "threat", "risk", "anomaly_level", or "action".
33+
- No method returns a boolean "is anomalous" verdict.
34+
- No internal threshold triggers automatic remediation.
35+
36+
The host reads the report and applies its own policy:
37+
38+
```rust
39+
// Host policy, not sentinel code; `cell` is one `CellReport` from the batch:
40+
if cell.scores.novelty.max_z_score > 4.0 && cell.maturity.noise_influence < 0.1 {
41+
    throttle(cell.gnode_id);
42+
}
43+
```
44+
45+
## Consequences
46+
47+
- The sentinel has no policy parameters (no "alert threshold", no
48+
"sensitivity level").
49+
- Report types carry more fields than an appliance would expose, but
50+
each field has a precise statistical definition.
51+
- Integration tests assert statistical properties, not verdicts.
52+
- The polarity convention (higher = more anomalous) is uniform across
53+
all four scoring axes (§ALGO S-6), ensuring the host can apply a
54+
single threshold logic to any axis.
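
The uniform polarity convention means a host can run the same comparison over every axis. The sketch below shows one possible host-side policy using hypothetical stand-in types (`CellReading`, `HostAction`); none of these names come from the sentinel crate, and the thresholds are arbitrary illustrations of a policy the host would own.

```rust
/// Hypothetical host-side action vocabulary.
#[derive(Debug, PartialEq)]
enum HostAction {
    Ignore,
    Flag,
    Throttle,
}

/// Hypothetical stand-in for the fields a host might copy out of a cell report.
struct CellReading {
    /// z-scores for novelty, displacement, surprise, coherence (in that order).
    axis_z_scores: [f64; 4],
    /// η: remaining influence of synthetic warm-up noise.
    noise_influence: f64,
}

/// Apply one threshold to all four axes; higher always means more anomalous.
fn decide(cell: &CellReading, z_threshold: f64, max_noise_influence: f64) -> HostAction {
    // Scores from immature trackers are treated as preliminary.
    if cell.noise_influence >= max_noise_influence {
        return HostAction::Ignore;
    }
    let exceeded = cell
        .axis_z_scores
        .iter()
        .filter(|z| **z > z_threshold)
        .count();
    match exceeded {
        0 => HostAction::Ignore,
        1 => HostAction::Flag,
        _ => HostAction::Throttle,
    }
}
```

A different host could map the same readings to ban, log, or nothing at all. That flexibility is the point of keeping policy outside the sentinel.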
