Project: Domain-Specific RAG Evaluation & MLOps Platform
Version: 1.1.0
Last updated: 2026-03-27
- Overview: Why Drift Detection for RAG?
- System Architecture — Four Layers
- Layer 1: DriftStore — KPI Persistence Reader
- Layer 2: DriftDetector — Welch Z-Test Algorithm
- Layer 3: Scheduler — Background APScheduler
- Layer 4: Notifier — Slack Webhook Integration
- Frontend: DriftMonitorBanner Component
- CI/CD: 733-Test Parallel Suite
- Configuring Drift Detection
- Interpreting Alerts
- Operational Runbook
A RAG system's answer quality depends entirely on the Knowledge Graph (KG) it retrieves from. When the KG degrades — through stale documents, incomplete ingestion, or domain shift — the retrieval topology changes before users notice hallucinated answers.
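The GCR (Graph Context Relevance) metric family (entity overlap, structural connectivity, and hub noise penalty) is what the drift detector monitors for these changes.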
Key properties of the detection system:
- No LLM calls — purely numeric/statistical
- Configurable sensitivity (Z-threshold, window sizes)
- Two-stage severity (WARNING → DRIFTING) to suppress false-positive pages
- Fully integrated into the Webhook Daemon; zero additional services required
┌──────────────────────────────────────────────────────────────────────────┐
│ Webhook Daemon (:8008) │
│ │
│ ┌──────────────┐ ┌──────────────────┐ │
│ │ DriftStore │ → │ DriftDetector │ (Welch Z-test per metric) │
│ │ (reads KPIs │ │ (evaluates │ │
│ │ from outputs/│ │ time windows) │ │
│ │ **/kpis.json)│ └──────────────────┘ │
│ └──────────────┘ │ │
│ ↓ │
│ ┌──────────────────┐ ┌─────────────────────┐ │
│ │ APScheduler │ → │ Slack Notifier │ │
│ │ (every 6 hours) │ │ (POST to webhook) │ │
│ └──────────────────┘ └─────────────────────┘ │
│ │ │
│ GET /api/v1/drift-status │
│ ↓ │
│ ┌──────────────────┐ │
│ │ DriftMonitorBanner│ (React component, polls 5min) │
│ │ in Insights Portal│ │
│ └──────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
File: services/eval/drift/store.py
The DriftStore does not write any data — it is a read-only scanner that discovers
kpis.json files in the outputs tree.
for kpis_file in sorted(self._root.rglob("kpis.json")):
    ...
Files are scanned via rglob("kpis.json") and sorted lexicographically. Because outputs are
named outputs/run_YYYYMMDD_HHMMSS_*/, lexicographic order is identical to chronological
order: oldest runs come first, which the detector requires.
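The ordering guarantee can be sketched in a few lines. This is a minimal illustration, assuming the run-directory naming described above; `discover_kpi_files` is a hypothetical helper name, not the DriftStore API.

```python
from pathlib import Path
from typing import List


def discover_kpi_files(root: Path) -> List[Path]:
    """Return every kpis.json under root, oldest run first.

    Assumes run directories are named run_YYYYMMDD_HHMMSS_*, so that
    sorting the paths also sorts the runs by timestamp.
    """
    return sorted(root.rglob("kpis.json"))
```

The zero-padded timestamp in the directory name is what makes this safe. A naming scheme without padding (for example `run_9` and `run_10`) would sort incorrectly and would need an explicit sort key.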
{
  "run_id": "run_20260327_120000_abc",
  "metrics": {
    "entity_overlap": { "average": 0.452 },
    "structural_connectivity": { "average": 0.831 },
    "hub_noise_penalty": { "average": 0.041 }
  }
}
The RunKPIRecord dataclass holds the three GCR sub-scores for a single run. Records with
missing or NaN values for individual fields are included with None — the detector handles
them gracefully.
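A rough sketch of how such a record could be built from the JSON shape shown above. The field names, the `parse_record` helper, and the exact cleaning rules are assumptions for illustration; the real RunKPIRecord in store.py may differ.

```python
import math
from dataclasses import dataclass
from typing import Any, Dict, Optional

METRIC_KEYS = ("entity_overlap", "structural_connectivity", "hub_noise_penalty")


@dataclass(frozen=True)
class RunKPIRecord:
    # Hypothetical field layout: one optional score per GCR sub-metric.
    run_id: str
    entity_overlap: Optional[float]
    structural_connectivity: Optional[float]
    hub_noise_penalty: Optional[float]


def _clean(value: Any) -> Optional[float]:
    """Map missing, non-numeric, or NaN values to None."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return None if math.isnan(value) else float(value)


def _average(metrics: Dict[str, Any], key: str) -> Any:
    """Read metrics[key]["average"], tolerating absent or malformed entries."""
    entry = metrics.get(key)
    return entry.get("average") if isinstance(entry, dict) else None


def parse_record(payload: Dict[str, Any]) -> RunKPIRecord:
    """Build a record from one parsed kpis.json payload."""
    metrics = payload.get("metrics")
    if not isinstance(metrics, dict):
        metrics = {}
    scores = {key: _clean(_average(metrics, key)) for key in METRIC_KEYS}
    return RunKPIRecord(run_id=str(payload.get("run_id", "")), **scores)
```

Keeping incomplete records, rather than dropping them, means one missing sub-score does not hide the other two from the detector.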
File: services/eval/drift/detector.py
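For each of the three metrics, the detector runs the following steps.
Step 1: Partition the sorted run history into two non-overlapping windows:
- Baseline window $B$: up to N runs immediately preceding the recent window (default: up to 100 runs). When fewer than N + K runs exist, these are simply the oldest runs.
- Recent window $R$: the most recent K runs (default: last 50 runs)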
The partition index guarantees no overlap:
split = max(self.min_baseline, len(records) - self.recent_k)
baseline_records = records[:split][-self.baseline_n:]
recent_records = records[split:][-self.recent_k:]
Step 2: Compute the baseline mean and standard deviation, using the Bessel-corrected (n − 1) sample variance.
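For reference, these are the standard definitions of those moments, writing $x_i$ for the metric value of the $i$-th run in the baseline window $B$. The symbols $\mu_B$ and $\sigma_B$ are notation chosen here for illustration.

```latex
\mu_B = \frac{1}{|B|} \sum_{i \in B} x_i,
\qquad
\sigma_B^2 = \frac{1}{|B| - 1} \sum_{i \in B} \left( x_i - \mu_B \right)^2
```

Dividing by $|B| - 1$ instead of $|B|$ is the Bessel correction. It requires at least two baseline runs.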
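Step 3: Compute the Welch one-sample Z-score.
This measures how many standard errors the recent cohort mean is from the baseline mean.
Dividing by a standard error, rather than by the raw baseline standard deviation, means the same shift in the mean scores higher when more recent runs support it.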
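One formula consistent with that description is given below. It writes $\mu_B$ and $\sigma_B$ for the baseline mean and standard deviation, and $\mu_R$ for the mean over the recent window $R$. The choice of $\sigma_B / \sqrt{|R|}$ as the standard error is an assumption and should be checked against detector.py.

```latex
Z = \frac{\mu_R - \mu_B}{\sigma_B / \sqrt{|R|}},
\qquad
\mu_R = \frac{1}{|R|} \sum_{i \in R} x_i
```

Under this form, a negative $Z$ means the recent mean sits below the baseline mean and a positive $Z$ means it sits above.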
Step 4: Direction-aware flagging:
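| Metric | Direction | Flag condition |
|---|---|---|
| entity_overlap | Higher-is-better | Z < -z_threshold |
| structural_connectivity | Higher-is-better | Z < -z_threshold |
| hub_noise_penalty | Lower-is-better | Z > +z_threshold |
Default threshold: z_threshold = 2.0.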
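Direction-aware flagging can be sketched as follows. The metric keys come from kpis.json. The `is_flagged` helper, the lookup table, and the use of strict inequalities at the boundary are assumptions made for illustration.

```python
# True means a higher score is better, so only a drop is harmful.
HIGHER_IS_BETTER = {
    "entity_overlap": True,
    "structural_connectivity": True,
    "hub_noise_penalty": False,
}


def is_flagged(metric: str, z: float, z_threshold: float = 2.0) -> bool:
    """Flag a metric only when it moved in its harmful direction."""
    if HIGHER_IS_BETTER[metric]:
        return z < -z_threshold  # a score that should stay high has fallen
    return z > z_threshold  # a penalty that should stay low has risen
```

A large improvement, such as entity overlap rising sharply, produces a large positive Z but is not flagged. This keeps the detector quiet when the Knowledge Graph gets better.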
Step 5: Severity roll-up:
| Flagged metrics | Status | Meaning |
|---|---|---|
| 0 | HEALTHY | All metrics within baseline norms |
| 1 | WARNING | One metric degraded; monitor closely |
| ≥ 2 | DRIFTING | Multiple metrics degraded; immediate investigation required |
| < min_baseline records | INSUFFICIENT_DATA | Too few runs to form a baseline; safe no-alert state |
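Under the null hypothesis (no real drift), each Z-test fires a false positive roughly 5% of the time at the default threshold.
Because DRIFTING requires two flagged metrics, the joint false-positive rate for any given pair of metrics
drops to ~0.25% (0.05 × 0.05, treating the tests as independent). WARNING on a single flag is informational; it does
not trigger a Slack alert by default.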
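The roll-up and the false-positive arithmetic can be sketched together. The `roll_up` helper and its parameters are hypothetical, and checking the total record count against `min_baseline` is an assumption about where that guard sits.

```python
from typing import Dict


def roll_up(flags: Dict[str, bool], n_records: int, min_baseline: int = 5) -> str:
    """Collapse per-metric flags into a single status string."""
    if n_records < min_baseline:
        return "INSUFFICIENT_DATA"
    flagged = sum(flags.values())
    if flagged == 0:
        return "HEALTHY"
    return "WARNING" if flagged == 1 else "DRIFTING"


# Joint false-positive rate for one pair of metrics, assuming the two tests
# are independent and each fires 5% of the time under no drift.
per_test_rate = 0.05
pair_rate = per_test_rate * per_test_rate  # 0.0025, i.e. 0.25%
```

The independence assumption is optimistic. The three GCR sub-scores are computed from the same retrieval results, so correlated noise can make the real joint rate higher than 0.25%.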
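if b_std == 0.0:
    # Zero-variance baseline (e.g. a regular graph): use epsilon to avoid division by zero
    b_std = 1e-9
When all baseline values are identical (zero sample variance), the standard deviation is replaced with a small epsilon so the Z-score stays finite.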
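The effect of the guard is easiest to see in a self-contained Z-score function. This sketch assumes the standard error is the baseline standard deviation divided by the square root of the recent window size; `z_score` is a hypothetical helper, not the detector's actual method.

```python
import math
import statistics
from typing import Sequence

EPSILON = 1e-9  # stand-in standard deviation for a zero-variance baseline


def z_score(baseline: Sequence[float], recent: Sequence[float]) -> float:
    """Standard errors between the recent mean and the baseline mean."""
    b_mean = statistics.fmean(baseline)
    b_std = statistics.stdev(baseline)  # Bessel-corrected; needs >= 2 values
    if b_std == 0.0:
        b_std = EPSILON
    standard_error = b_std / math.sqrt(len(recent))
    return (statistics.fmean(recent) - b_mean) / standard_error
```

With a perfectly flat baseline, an unchanged recent window gives Z = 0, while any real change gives a very large Z. That is a sensible outcome: when the baseline never varied, any movement is notable.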
File: services/eval/drift/scheduler.py
The scheduler is integrated into the Webhook Daemon's FastAPI lifespan:
# On startup: run immediate check + start background scheduler
run_check_now(outputs_root) # answers /drift-status before first scheduled job
scheduler = create_scheduler(outputs_root)
scheduler.start()
# On shutdown:
scheduler.shutdown()
| Environment variable | Default | Effect |
|---|---|---|
| DRIFT_CHECK_INTERVAL_HOURS | 6 | Hours between scheduled checks (minimum 1) |
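Reading that variable with its floor of one hour might look like the sketch below. The `check_interval_hours` helper is hypothetical, and falling back to the default on an unparseable value is an assumption.

```python
import os


def check_interval_hours(default: int = 6) -> int:
    """Read DRIFT_CHECK_INTERVAL_HOURS, clamped to a minimum of 1."""
    raw = os.environ.get("DRIFT_CHECK_INTERVAL_HOURS", "")
    try:
        hours = int(raw)
    except ValueError:
        # Unset, empty, or non-integer: fall back to the default (assumed behaviour).
        return default
    return max(1, hours)
```

Clamping instead of rejecting keeps a misconfigured value such as `0` from turning the scheduler into a busy loop.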
run_check_now() is called once during the lifespan startup event, before the first scheduled
job fires. This ensures GET /api/v1/drift-status returns a meaningful result immediately rather
than PENDING for the first 6 hours.
The _last_result module-level singleton is written by the scheduler thread and read by
FastAPI request handlers. Python's GIL makes the assignment atomic for the purposes of
this best-effort status endpoint — stale-but-safe reads are acceptable here.
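The pattern works when the writer builds the whole result first and publishes it with one name rebinding, and the reader takes one snapshot. A minimal sketch follows; the function names are hypothetical and only `_last_result` and the PENDING status come from the text above.

```python
from typing import Any, Dict, Optional

# Written by the scheduler thread, read by request handlers.
_last_result: Optional[Dict[str, Any]] = None


def publish_result(result: Dict[str, Any]) -> None:
    """Publish a finished result by rebinding the module-level name once."""
    global _last_result
    _last_result = dict(result)  # copy so later caller mutations are not visible


def current_status() -> Dict[str, Any]:
    """Take one snapshot of the last result, or report PENDING if none exists."""
    snapshot = _last_result
    return snapshot if snapshot is not None else {"status": "PENDING"}
```

Mutating a shared dictionary in place would break this guarantee, because a reader could then see some fields from the old check and some from the new one.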
File: services/eval/drift/notifier.py
# Set in .env.compose or as a container environment variable
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T.../B.../...
When SLACK_WEBHOOK_URL is unset or empty, fire_slack_alert() is a no-op; the system
never raises on missing configuration.
Slack messages use mrkdwn formatting and include:
🚨 Data Drift Alert — `DRIFTING`
• Entity Overlap (Sₑ): recent 0.312 vs baseline 0.452 (-31.0%, z=-4.21)
• Structural Connectivity (Sᶜ): recent 0.601 vs baseline 0.831 (-27.7%, z=-3.88)
⚠️ Action Required: Analyze failing queries and inject new domain documents
into the Knowledge Graph.
_Checked at: 2026-03-27T14:00:00+00:00_
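Each metric line in the message above can be produced by a small formatter. This is a sketch; `format_metric_line` is a hypothetical name and the zero-baseline fallback is an assumption.

```python
def format_metric_line(label: str, recent: float, baseline: float, z: float) -> str:
    """Render one bullet: recent vs baseline, relative change, and Z-score."""
    if baseline == 0:
        delta = "n/a"  # relative change is undefined against a zero baseline
    else:
        delta = f"{(recent - baseline) / baseline * 100.0:.1f}%"
    return (
        f"• {label}: recent {recent:.3f} vs baseline {baseline:.3f} "
        f"({delta}, z={z:.2f})"
    )
```

The percentage is the change relative to the baseline mean, so 0.312 against 0.452 is (0.312 − 0.452) / 0.452, which rounds to -31.0%.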
Alerts fire for WARNING and DRIFTING status. HEALTHY and INSUFFICIENT_DATA never
trigger alerts — the system is silent by default and only speaks when action is needed.
- Uses httpx.post with a 10-second timeout
- resp.raise_for_status(): HTTP errors are caught and logged at ERROR level
- Returns True on successful delivery, False on any failure
- Never raises; alert delivery failure does not affect the drift check result
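The delivery contract listed above can be sketched as follows. The signature, the injectable `post` parameter (added here so the function can be exercised without network access), the logger name, and returning False for the unconfigured no-op case are all assumptions; the real fire_slack_alert() may differ.

```python
import logging
import os
from typing import Any, Callable, Optional

logger = logging.getLogger("drift.notifier")  # hypothetical logger name


def fire_slack_alert(text: str, post: Optional[Callable[..., Any]] = None) -> bool:
    """Send text to the Slack webhook. Returns True on success and never raises."""
    url = os.environ.get("SLACK_WEBHOOK_URL", "").strip()
    if not url:
        return False  # not configured: silently do nothing
    try:
        if post is None:
            import httpx  # imported lazily so the no-op path has no dependency

            post = httpx.post
        resp = post(url, json={"text": text}, timeout=10.0)
        resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return True
    except Exception:
        # logger.exception records at ERROR level and includes the traceback.
        logger.exception("Slack alert delivery failed")
        return False
```

Catching every exception is deliberate here: a Slack outage should cost an alert, not the drift check that produced it.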
File: insights-portal/src/components/DriftMonitorBanner.tsx
The banner polls GET /api/v1/drift-status every 5 minutes and renders a status indicator
above the Executive Overview panel.
| Status | Appearance | Default state |
|---|---|---|
| HEALTHY | Green border, ✅ icon | Collapsed |
| WARNING | Amber border | Expanded |
| DRIFTING | Red border, 🚨 icon, metric detail + action CTA | Expanded |
| INSUFFICIENT_DATA | Grey border, 📊 icon | Collapsed |
| PENDING | Grey border, ⏳ icon | Collapsed |
| UNAVAILABLE | Grey border, — icon | Collapsed |
When WARNING or DRIFTING, the banner renders a breakdown row per flagged metric showing:
- Metric label (e.g., "Structural Connectivity (Sᶜ)")
- Recent mean vs baseline mean
- Delta percentage (colour-coded: red for degradation)
- Z-score
- A flagged: true/false badge
The webhook base URL is resolved from the Vite environment variable:
const WEBHOOK_BASE = import.meta.env.VITE_WEBHOOK_BASE_URL ?? 'http://localhost:8008'
Override for production deployments by setting VITE_WEBHOOK_BASE_URL at build time.
| Scope | Tests | Key test files |
|---|---|---|
| eval-pipeline/tests/ | 369 | test_graph_context_relevance.py, test_graph_store.py, test_drift_*, test_v{7-13}_components.py |
| services/tests/ | 364 | test_common_*.py, eval/, kg/, ws/, test_validate_dev_parity.py |
| Total | 733 | Collected in 4.10s |
| Mechanism | Guarantee |
|---|---|
| pytest-xdist (-n auto) | Each worker process gets a unique temp directory |
| tmp_path fixture (function-scoped) | Every test's SQLite DB is in an isolated temp path |
| asyncio.new_event_loop() per async test | No shared event loop state between tests |
| HF_HUB_OFFLINE=1 + TRANSFORMERS_OFFLINE=1 | Network calls fail loudly rather than race on shared downloads |
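The tmp_path guarantee is what lets tests create the same table name in parallel without colliding. The sketch below is a hypothetical test, not one from the suite, showing the shape such a test might take.

```python
import sqlite3
from pathlib import Path


def test_db_is_isolated(tmp_path: Path) -> None:
    """Each invocation gets a fresh directory, so CREATE TABLE never collides."""
    db_path = tmp_path / "runs.sqlite"
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE runs (run_id TEXT PRIMARY KEY)")
        conn.execute("INSERT INTO runs VALUES (?)", ("run_20260327_120000_abc",))
        conn.commit()
        count = conn.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
    finally:
        conn.close()  # release the file handle before pytest cleans up
    assert count == 1
```

If the database lived at a fixed path instead, a second xdist worker running the same test would fail on the CREATE TABLE or the primary-key insert.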
# Graph Context Relevance tests only
docker compose -f docker-compose.test.yml run --rm test \
pytest eval-pipeline/tests/test_graph_context_relevance.py -v
# Drift detection tests only
docker compose -f docker-compose.test.yml run --rm test \
pytest eval-pipeline/tests/ -k "drift" -v
# Services tests only
docker compose -f docker-compose.test.yml run --rm test \
pytest services/tests/ -v
# With coverage
docker compose -f docker-compose.test.yml run --rm test \
pytest -n auto --cov=eval-pipeline/src --cov=services \
--cov-report=term-missing --cov-report=html
The DriftDetector parameters are set at construction time; the check interval and Slack webhook are set via
environment variables on the webhook service in .env.compose:
| Parameter | Env var | Default | Effect |
|---|---|---|---|
| baseline_n | — | 100 | Max runs forming the baseline window |
| recent_k | — | 50 | Runs in the recent comparison window |
| z_threshold | — | 2.0 | Standard errors required to flag a metric |
| min_baseline | — | 5 | Minimum runs before checking; returns INSUFFICIENT_DATA otherwise |
| Check interval | DRIFT_CHECK_INTERVAL_HOURS | 6 | Hours between scheduled checks |
| Slack alerts | SLACK_WEBHOOK_URL | (unset) | Slack incoming webhook URL; leave unset to disable |
| Scenario | Recommendation |
|---|---|
| Very noisy corpus (frequent small fluctuations) | Increase z_threshold to 2.5–3.0; increase recent_k to 100 |
| High-stakes production (alert early) | Decrease z_threshold to 1.8; decrease recent_k to 20 |
| Small corpus (< 20 runs) | Decrease min_baseline to 3; set baseline_n to all available runs |
| On-call fatigue risk | Keep z_threshold ≥ 2.5 for DRIFTING; accept WARNING as info-only |
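The defaults and tuning scenarios above can be captured as named presets. This sketch uses a hypothetical `DriftConfig` dataclass; the real DriftDetector constructor may take these values directly as keyword arguments.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DriftConfig:
    """Hypothetical container for the four detector parameters."""

    baseline_n: int = 100  # max runs forming the baseline window
    recent_k: int = 50  # runs in the recent comparison window
    z_threshold: float = 2.0  # standard errors required to flag a metric
    min_baseline: int = 5  # below this, report INSUFFICIENT_DATA


DEFAULTS = DriftConfig()
# Very noisy corpus: demand stronger evidence over a longer recent window.
NOISY_CORPUS = replace(DEFAULTS, z_threshold=2.5, recent_k=100)
# High-stakes production: alert earlier on a shorter recent window.
EARLY_ALERT = replace(DEFAULTS, z_threshold=1.8, recent_k=20)
# Small corpus: allow a baseline to form from fewer runs.
SMALL_CORPUS = replace(DEFAULTS, min_baseline=3)
```

Freezing the dataclass and deriving presets with `replace` keeps each scenario a small, reviewable difference from the defaults.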
| Metric | Change | Likely cause |
|---|---|---|
| entity_overlap | ↓ | New documents use different vocabulary; query expansion no longer matches content |
| structural_connectivity | ↓ | KG ingestion failed partially; new chunks not linked to existing nodes |
| hub_noise_penalty | ↑ | New high-connectivity documents were ingested, creating hub nodes that dominate retrieval |
- WARNING (single metric): Review the flagged metric's trend chart in the Insights Portal Analytics view. Check the most recent 10 run outputs for anomalies.
- DRIFTING (2+ metrics):
  - Inspect outputs/run_*/kpis.json for the flagged run IDs shown in the alert.
  - Run the QA Debugger on the most recent evaluation run to identify failing retrieval patterns.
  - Re-ingest affected document sections via the ingestion service (POST http://localhost:8001/documents).
  - After re-ingestion, force an immediate drift check:
    curl -X POST http://localhost:8008/api/v1/drift-check-now
# Check the current drift status
curl -s http://localhost:8008/api/v1/drift-status | python3 -m json.tool
# Show drift-related webhook logs from the last hour
docker compose -f docker-compose.services.yml logs webhook --since 1h | grep -i drift
# Force an immediate drift check
curl -X POST http://localhost:8008/api/v1/drift-check-now
# Print every KPI file, oldest run first
find outputs/ -name "kpis.json" | sort | while read f; do
  echo "=== $f ==="; python3 -m json.tool "$f"; echo
done
The baseline is drawn from the runs in outputs/ that precede the recent window. To reset:
# Archive old runs
mkdir -p outputs/archive
mv outputs/run_2025* outputs/archive/ # adjust glob as needed
# Force a new baseline check
curl -X POST http://localhost:8008/api/v1/drift-check-now