Commit 48d10b4

feat: Connection Monitor - always-on latency monitoring (#195)
* docs: add Connection Monitor design spec and implementation plan
  Design spec for issue #194 - native always-on latency monitor. MVP scope agreed with external review: 5s default polling, no aggregation tables in V1, minimal event set, TCP-first UX framing. Implementation plan covers 14 tasks across 5 chunks: scaffold, storage, probe engine, events, collector, routes, Docker CAP_NET_RAW, i18n, settings, card, detail view, testing.
* feat(connection-monitor): add module scaffold with manifest
* feat(connection-monitor): add probe engine with ICMP/TCP auto-detection
* feat(connection-monitor): add storage layer for targets and samples
* feat(connection-monitor): add event rules for outage detection and recovery
* feat(connection-monitor): add collector with parallel probing and event integration
* feat(connection-monitor): add API routes for targets, samples, outages, export, capability
* feat(connection-monitor): add NET_RAW capability for ICMP probing
* feat(connection-monitor): add i18n translations (EN/DE/FR/ES)
* feat(connection-monitor): add settings page template
* feat(connection-monitor): add dashboard summary card
* feat(connection-monitor): add detail view with latency, loss, availability charts
* fix: move Connection Monitor from Extensions to core analysis nav
* fix: use clean view ID 'connection-monitor' instead of module prefix
* fix(connection-monitor): allow creating targets without host for add-then-edit UX
* feat(connection-monitor): add demo data with evening congestion + outage scenario
* feat(connection-monitor): rewrite charts to PingPlotter style
  - Combined all targets into a single uPlot chart with overlaid series
  - Added threshold zones (green <30ms, yellow 30-100ms, red >100ms)
  - Added red vertical loss markers via custom uPlot draw hook plugin
  - Removed target tab switching, fetch all targets in parallel
  - Combined outage log across all targets with target name column
  - Added per-target CSV export links
  - Extended chart-engine.js renderChart() to accept custom plugins via opts
  - Added cm_outage_target i18n key to all 4 languages
* fix(connection-monitor): use correct duration_seconds field in outage log
  The API returns duration_seconds but the JS code referenced duration_s, causing all outage durations to show as dashes instead of actual values.
* feat(connection-monitor): group overlapping outages across targets
  Outages with start times within 60s are merged into a single row showing all affected targets (e.g. "Cloudflare DNS, Google DNS"). Reduces log spam when an ISP outage affects multiple targets.
* feat(connection-monitor): add stats cards, zoom, improved CSV export, UX fixes
  - Stats cards: Avg/Min/Max/P95 latency, packet loss %, sample count
  - Chart zoom: drag-to-select zoom via uPlot plugin, double-click reset
  - X-axis: show date+time for ranges >24h instead of just time
  - CSV export: human-readable timestamps, target label in filename
  - TCP mode badge with glossary popover explaining ICMP vs TCP
  - Outage table: fixed column widths, duration right-aligned
  - Export: use <a download> for reliable file download
* feat(connection-monitor): add per-target stats, fix chart zoom, fault diagnosis
  - Per-target stats table comparing Gateway vs external targets
  - Automatic fault diagnosis (external vs internal/ISP issue)
  - Fix chart zoom: zoomable option in chart-engine with auto:false prevents uPlot's auto-ranging from overriding setScale
  - Zoom plugin: drag-to-select + double-click reset via _zoomRange
  - i18n: add per-target and diagnosis keys (EN/DE/FR/ES)
* fix(connection-monitor): add Cache-Control no-store to API responses
  Prevents the browser from caching API responses with stale target IDs across container rebuilds.
* fix(connection-monitor): move dashboard card above donut charts
  Place the Connection Monitor card between the BNetzA and channel health donuts instead of below them in the generic module cards section.
* fix: unify dashboard card grid - BNetzA and CM same size as metric cards
  Move the BNetzA and Connection Monitor cards into the main metrics-grid so all cards have consistent sizing. Previously BNetzA was full-width and CM was in a separate section below the donut charts.
* fix(connection-monitor): address code review findings
  - Add @require_auth to all read/export endpoints (consistent with other modules)
  - Fix capability endpoint to detect the actual probe method instead of hardcoding TCP
  - Disable new targets with an empty host to prevent false timeouts
  - Remove the 10k sample limit in the detail view to avoid truncating 24h/7d data
  - Include connection_monitor.db in backup/restore with VACUUM INTO consistency
* fix(connection-monitor): auto-enable target on host update, use configured probe method
  - Auto-set enabled=True when a non-empty host is provided via PUT (fixes the blank-host regression where new targets stayed disabled after host entry)
  - Read the configured probe method from config for the capability endpoint instead of always using auto-detection (the badge now reflects the user's explicit setting)
* fix(connection-monitor): wire up packet loss warnings, use configured outage threshold
  - Call check_window_stats() in the collector after probe checks to emit cm_packet_loss_warning events when loss exceeds the configured threshold
  - Read the outage threshold from config in the outages API instead of hardcoding 5, so the outage table and collector events agree on the threshold value
* test(connection-monitor): fix CI failure, add auth and add-target flow tests
  - Set DATA_DIR to tmp_path in collector tests to avoid a /data permission error on CI runners
  - Add auth enforcement tests verifying all endpoints return 401 when admin_password is configured but no session is provided
  - Add add-target flow tests: disabled without host, enabled with host, auto-enable on host update via PUT
* fix: change BQM icon to chart-spline, reorder sidebar navigation
  - Change the BQM Graphs icon from 'activity' to 'chart-spline' to distinguish it from Connection Monitor, which also uses 'activity'
  - Reorder the monitoring sidebar: Home, Event Log, Correlation, Signal Trends, Channels, Connection Monitor, Segment Util, Before/After, Gaming Quality Index, Modulation

---------

Co-authored-by: itsDNNS <itsDNNS@users.noreply.github.com>
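The outage-grouping commit above describes merging outages whose start times fall within 60 seconds into a single log row. The real implementation is in the detail-view JavaScript, which is not part of this diff; the following is a hypothetical standalone Python sketch of the same idea (`group_outages` and the dict shape are assumptions for illustration):

```python
def group_outages(outages, merge_window_s=60):
    """Merge outages whose start times lie within merge_window_s of the
    group's first start, collecting all affected target names into one row.
    Hypothetical sketch; the shipped logic lives in the detail-view JS."""
    grouped = []
    for o in sorted(outages, key=lambda o: o["start"]):
        if grouped and o["start"] - grouped[-1]["start"] <= merge_window_s:
            grouped[-1]["targets"].append(o["target"])
        else:
            grouped.append({"start": o["start"], "targets": [o["target"]]})
    return grouped

rows = group_outages([
    {"start": 1000.0, "target": "Cloudflare DNS"},
    {"start": 1030.0, "target": "Google DNS"},
    {"start": 5000.0, "target": "Cloudflare DNS"},
])
print(len(rows))                       # 2
print(", ".join(rows[0]["targets"]))   # Cloudflare DNS, Google DNS
```

An ISP-wide outage that trips both external targets within a minute thus collapses into one row labeled "Cloudflare DNS, Google DNS".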
1 parent 5900fac commit 48d10b4

33 files changed (+5693 −53 lines)

app/collectors/demo.py

Lines changed: 102 additions & 0 deletions
@@ -206,6 +206,7 @@ def _seed_demo_data(self):
         self._seed_incident_containers(now)
         self._seed_bnetz_measurements(now)
         self._seed_weather_data(now)
+        self._seed_connection_monitor_data(now)

     def _seed_history(self, now):
         """Generate 9 months of historical snapshots (every 15 min)."""
@@ -888,6 +889,107 @@ def _seed_weather_data(self, now):
         )
         log.info("Demo: seeded %d weather records (%d days)", len(records), days)

+    def _seed_connection_monitor_data(self, now):
+        """Seed 7 days of Connection Monitor data showing a typical cable troubleshooting scenario.
+
+        Story: User notices evening lag and dropped video calls. Enables Connection Monitor.
+        Gateway is always fine (proves home network OK), but external targets show:
+        - Evening congestion (19-23h): latency spikes, occasional packet loss
+        - Two short outages (~1-3 min) on different days
+        - One longer outage (~8 min) that prompted the investigation
+        """
+        try:
+            from app.modules.connection_monitor.storage import ConnectionMonitorStorage
+        except ImportError:
+            log.debug("Demo: connection_monitor module not available, skipping")
+            return
+
+        data_dir = os.path.dirname(self._storage.db_path)
+        cm_db = os.path.join(data_dir, "connection_monitor.db")
+        cm = ConnectionMonitorStorage(cm_db)
+
+        # Purge existing demo targets/samples
+        with cm._connect() as conn:
+            conn.execute("DELETE FROM connection_samples")
+            conn.execute("DELETE FROM connection_targets")
+
+        # Create targets
+        gw_id = cm.create_target("Gateway", "192.168.178.1")
+        cf_id = cm.create_target("Cloudflare DNS", "1.1.1.1")
+        gg_id = cm.create_target("Google DNS", "8.8.8.8")
+
+        rng = random.Random(2026)
+        days = 7
+        interval_s = 10  # 10s between samples
+        samples_per_day = 86400 // interval_s  # 8640
+
+        # Outage windows (day_offset, hour_start, duration_minutes)
+        outages = [
+            (2, 20.5, 2),   # Day 3: short 2-min outage during evening
+            (4, 21.0, 3),   # Day 5: 3-min outage
+            (5, 19.75, 8),  # Day 6: the big one - 8 min outage that triggered investigation
+        ]
+
+        rows = []
+        for d in range(days):
+            day_start = now - timedelta(days=days - d)
+            for s in range(samples_per_day):
+                ts = day_start.timestamp() + s * interval_s
+                hour = (s * interval_s / 3600) % 24
+
+                # Check if we're in an outage window (external targets only)
+                in_outage = False
+                for o_day, o_hour, o_dur in outages:
+                    if d == o_day and o_hour <= hour < o_hour + o_dur / 60:
+                        in_outage = True
+                        break
+
+                # Evening congestion window
+                evening = 19 <= hour < 23
+                late_evening = 20 <= hour < 22  # worst window
+
+                # --- Gateway: always fast, 1-3ms ---
+                gw_lat = round(rng.uniform(0.8, 3.0), 2)
+                rows.append((gw_id, ts, gw_lat, False, "tcp"))
+
+                # --- External targets ---
+                for tid in (cf_id, gg_id):
+                    base = 11.0 if tid == cf_id else 14.0
+
+                    if in_outage:
+                        # Full timeout
+                        rows.append((tid, ts, None, True, "tcp"))
+                    elif late_evening:
+                        # Heavy congestion: spikes + occasional loss
+                        if rng.random() < 0.04:
+                            rows.append((tid, ts, None, True, "tcp"))
+                        else:
+                            spike = rng.uniform(30, 250) if rng.random() < 0.3 else rng.uniform(0, 20)
+                            lat = round(base + spike, 2)
+                            rows.append((tid, ts, lat, False, "tcp"))
+                    elif evening:
+                        # Moderate congestion: elevated latency, rare loss
+                        if rng.random() < 0.008:
+                            rows.append((tid, ts, None, True, "tcp"))
+                        else:
+                            spike = rng.uniform(5, 60) if rng.random() < 0.15 else rng.uniform(0, 8)
+                            lat = round(base + spike, 2)
+                            rows.append((tid, ts, lat, False, "tcp"))
+                    else:
+                        # Normal: stable low latency
+                        lat = round(base + rng.uniform(-2, 3), 2)
+                        rows.append((tid, ts, lat, False, "tcp"))
+
+        # Bulk insert
+        with cm._connect() as conn:
+            conn.executemany(
+                "INSERT INTO connection_samples (target_id, timestamp, latency_ms, timeout, probe_method) "
+                "VALUES (?, ?, ?, ?, ?)",
+                rows,
+            )
+
+        log.info("Demo: seeded %d connection monitor samples (%d days, 3 targets)", len(rows), days)
+
     @staticmethod
     def _generate_bqm_png(width=800, height=200, seed=0):
         """Generate a simple BQM-style quality graph as PNG bytes."""
app/modules/backup/backup.py

Lines changed: 35 additions & 24 deletions
@@ -18,7 +18,10 @@
 log = logging.getLogger("docsis.backup")

 # Files under data_dir to include in backups
-DATA_FILES = ["docsis_history.db", "config.json", ".config_key", ".session_key"]
+DATA_FILES = [
+    "docsis_history.db", "connection_monitor.db",
+    "config.json", ".config_key", ".session_key",
+]

 BACKUP_META_FILE = "backup_meta.json"
 FORMAT_VERSION = 1
@@ -54,32 +57,33 @@ def _get_table_counts(db_path):
     return counts


-def _vacuum_db(data_dir, dest_path):
-    """Create a consistent copy of the database using VACUUM INTO.
+def _vacuum_db(data_dir, db_name, dest_path):
+    """Create a consistent copy of a database using VACUUM INTO.

-    Also removes demo data (is_demo=1) from the copy.
+    Also removes demo data (is_demo=1) from the copy when applicable.
     """
-    src = os.path.join(data_dir, "docsis_history.db")
+    src = os.path.join(data_dir, db_name)
     if not os.path.exists(src):
         return False

     conn = sqlite3.connect(src)
     conn.execute(f"VACUUM INTO '{dest_path}'")
     conn.close()

-    # Remove demo data from copy
-    copy_conn = sqlite3.connect(dest_path)
-    demo_tables = [
-        "snapshots", "events", "journal_entries", "incidents",
-        "speedtest_results", "bqm_graphs", "bnetz_measurements",
-    ]
-    for table in demo_tables:
-        try:
-            copy_conn.execute(f"DELETE FROM [{table}] WHERE is_demo = 1")  # noqa: S608
-        except sqlite3.OperationalError:
-            pass  # table may not exist or lack is_demo column
-    copy_conn.commit()
-    copy_conn.close()
+    # Remove demo data from copy (only relevant for main DB)
+    if db_name == "docsis_history.db":
+        copy_conn = sqlite3.connect(dest_path)
+        demo_tables = [
+            "snapshots", "events", "journal_entries", "incidents",
+            "speedtest_results", "bqm_graphs", "bnetz_measurements",
+        ]
+        for table in demo_tables:
+            try:
+                copy_conn.execute(f"DELETE FROM [{table}] WHERE is_demo = 1")  # noqa: S608
+            except sqlite3.OperationalError:
+                pass  # table may not exist or lack is_demo column
+        copy_conn.commit()
+        copy_conn.close()
     return True


@@ -89,28 +93,35 @@ def create_backup(data_dir):
     Returns:
         BytesIO containing the .tar.gz archive.
     """
+    db_files = {"docsis_history.db", "connection_monitor.db"}
+
     buf = BytesIO()
     with tempfile.TemporaryDirectory() as tmp:
-        db_copy = os.path.join(tmp, "docsis_history.db")
-        has_db = _vacuum_db(data_dir, db_copy)
+        # Vacuum all databases for consistent copies
+        vacuumed = {}
+        for db_name in db_files:
+            db_copy = os.path.join(tmp, db_name)
+            vacuumed[db_name] = _vacuum_db(data_dir, db_name, db_copy)

+        main_copy = os.path.join(tmp, "docsis_history.db")
         meta = {
             "magic": MAGIC,
             "format_version": FORMAT_VERSION,
             "timestamp": datetime.now(timezone.utc).isoformat(),
             "app_version": _get_app_version(),
-            "tables": _get_table_counts(db_copy) if has_db else {},
+            "tables": _get_table_counts(main_copy) if vacuumed.get("docsis_history.db") else {},
         }
         meta_path = os.path.join(tmp, BACKUP_META_FILE)
         with open(meta_path, "w") as f:
             json.dump(meta, f, indent=2)

         with tarfile.open(fileobj=buf, mode="w:gz") as tar:
             tar.add(meta_path, arcname=BACKUP_META_FILE)
-            if has_db:
-                tar.add(db_copy, arcname="docsis_history.db")
+            for db_name, has_db in vacuumed.items():
+                if has_db:
+                    tar.add(os.path.join(tmp, db_name), arcname=db_name)
             for fname in DATA_FILES:
-                if fname == "docsis_history.db":
+                if fname in db_files:
                     continue  # already added via vacuum copy
                 fpath = os.path.join(data_dir, fname)
                 if os.path.exists(fpath):
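Per the diff above, a backup archive is a `.tar.gz` containing `backup_meta.json` plus the vacuumed database copies at the archive root. A minimal sketch of reading that metadata back, assuming only the layout visible in this diff (the in-memory archive here is a stand-in built just for demonstration, not the real backup format):

```python
import io
import json
import tarfile

def read_backup_meta(fileobj):
    """Return the parsed backup_meta.json from a backup archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        return json.load(tar.extractfile("backup_meta.json"))

# Build a minimal in-memory archive to demonstrate (metadata file only).
payload = json.dumps({"format_version": 1, "tables": {}}).encode()
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo("backup_meta.json")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

meta = read_backup_meta(buf)
print(meta["format_version"])  # 1
```

A restore path would likewise check the `magic` and `format_version` fields before extracting the database files.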

app/modules/bqm/manifest.json

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@
   },
   "menu": {
     "label_key": "docsight.bqm.bqm_title",
-    "icon": "activity",
+    "icon": "chart-spline",
     "order": 22
   }
 }

app/modules/connection_monitor/__init__.py

Whitespace-only changes.
Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@
+"""Collector for Connection Monitor - orchestrates probing, storage, and events."""
+
+import logging
+import os
+import time
+from concurrent.futures import ThreadPoolExecutor, as_completed
+
+from app.collectors.base import Collector, CollectorResult
+from app.modules.connection_monitor.event_rules import ConnectionEventRules
+from app.modules.connection_monitor.probe import ProbeEngine
+from app.modules.connection_monitor.storage import ConnectionMonitorStorage
+
+logger = logging.getLogger(__name__)
+
+# Run retention cleanup every 15 minutes, not every collect cycle
+_CLEANUP_INTERVAL_S = 900
+
+
+class ConnectionMonitorCollector(Collector):
+    """Always-on latency collector with per-target timing."""
+
+    name = "connection_monitor"
+
+    def __init__(self, config_mgr, storage, web, **kwargs):
+        super().__init__(poll_interval_seconds=1)
+        self._config_mgr = config_mgr
+        self._core_storage = storage
+        self._web = web
+
+        method = config_mgr.get("connection_monitor_probe_method", "auto")
+        self._probe = ProbeEngine(method=method)
+        self._last_probe: dict[int, float] = {}
+        self._last_cleanup = 0.0
+        self._event_rules = ConnectionEventRules(
+            outage_threshold=int(config_mgr.get("connection_monitor_outage_threshold", 5)),
+            loss_warning_pct=float(config_mgr.get("connection_monitor_loss_warning_pct", 2.0)),
+        )
+
+        data_dir = os.environ.get("DATA_DIR", "/data")
+        db_path = os.path.join(data_dir, "connection_monitor.db")
+        self._cm_storage = ConnectionMonitorStorage(db_path)
+
+        self._seeded = False
+
+    def is_enabled(self) -> bool:
+        return bool(self._config_mgr.get("connection_monitor_enabled", False))
+
+    def should_poll(self) -> bool:
+        """Always return True - per-target timing is managed internally."""
+        return True
+
+    def collect(self) -> CollectorResult:
+        try:
+            self._ensure_default_targets()
+            targets = [
+                t for t in self._cm_storage.get_targets() if t["enabled"]
+            ]
+            if not targets:
+                return CollectorResult.ok(self.name, None)
+
+            # Determine which targets are due
+            now = time.time()
+            due = []
+            for t in targets:
+                interval_s = t["poll_interval_ms"] / 1000.0
+                last = self._last_probe.get(t["id"], 0)
+                if now - last >= interval_s:
+                    due.append(t)
+
+            if not due:
+                return CollectorResult.ok(self.name, None)
+
+            # Probe all due targets in parallel
+            samples = self._probe_targets(due, now)
+
+            # Save samples
+            if samples:
+                self._cm_storage.save_samples(samples)
+
+            # Check events
+            self._check_events(samples)
+
+            # Periodic retention cleanup
+            if now - self._last_cleanup >= _CLEANUP_INTERVAL_S:
+                retention = int(
+                    self._config_mgr.get("connection_monitor_retention_days", 0)
+                )
+                self._cm_storage.cleanup(retention)
+                self._last_cleanup = now
+
+            return CollectorResult.ok(self.name, {"probed": len(due)})
+        except Exception as exc:
+            logger.exception("Connection Monitor collect error")
+            return CollectorResult.failure(self.name, str(exc))
+
+    def _probe_targets(self, targets: list[dict], now: float) -> list[dict]:
+        """Probe targets in parallel and return sample dicts."""
+        samples = []
+        tcp_port = int(self._config_mgr.get("connection_monitor_tcp_port", 443))
+
+        with ThreadPoolExecutor(
+            max_workers=max(len(targets), 1),
+            thread_name_prefix="cm-probe",
+        ) as pool:
+            futures = {
+                pool.submit(self._probe.probe, t["host"], t.get("tcp_port", tcp_port)): t
+                for t in targets
+            }
+            for future in as_completed(futures, timeout=5):
+                target = futures[future]
+                try:
+                    result = future.result()
+                except Exception:
+                    result = type("R", (), {"latency_ms": None, "timeout": True, "method": "error"})()
+
+                self._last_probe[target["id"]] = now
+                samples.append({
+                    "target_id": target["id"],
+                    "timestamp": now,
+                    "latency_ms": result.latency_ms,
+                    "timeout": result.timeout,
+                    "probe_method": result.method,
+                })
+        return samples
+
+    def _check_events(self, samples: list[dict]):
+        """Run event rules and save any emitted events."""
+        all_events = []
+        for s in samples:
+            events = self._event_rules.check_probe_result(
+                target_id=s["target_id"], timeout=s["timeout"]
+            )
+            all_events.extend(events)
+
+        # Check windowed packet loss stats per probed target
+        window_seconds = 60
+        checked_targets = set()
+        for s in samples:
+            tid = s["target_id"]
+            if tid in checked_targets:
+                continue
+            checked_targets.add(tid)
+            summary = self._cm_storage.get_summary(tid, window_seconds=window_seconds)
+            loss_pct = summary.get("packet_loss_pct") or 0.0
+            events = self._event_rules.check_window_stats(
+                target_id=tid, packet_loss_pct=loss_pct, window_seconds=window_seconds,
+            )
+            all_events.extend(events)
+
+        if all_events and hasattr(self._core_storage, "save_events"):
+            self._core_storage.save_events(all_events)
+
+    def _ensure_default_targets(self):
+        """Seed default targets on first enable."""
+        if self._seeded:
+            return
+        self._seeded = True
+        if not self._cm_storage.get_targets():
+            self._cm_storage.create_target("Cloudflare DNS", "1.1.1.1")
+            self._cm_storage.create_target("Google DNS", "8.8.8.8")
+            logger.info("Connection Monitor: seeded default targets")
+
+    def get_storage(self) -> ConnectionMonitorStorage:
+        """Expose storage for routes."""
+        return self._cm_storage
+
+    def get_probe(self) -> ProbeEngine:
+        """Expose probe engine for capability endpoint."""
+        return self._probe
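The collector's `should_poll()` always returns True and `collect()` decides per target whether its own `poll_interval_ms` has elapsed, which is how one 1-second base loop serves targets with different intervals. A standalone sketch of that scheduling decision (`due_targets` is a hypothetical helper name, not a function in the module):

```python
import time

def due_targets(targets, last_probe, now=None):
    """Return targets whose per-target poll interval has elapsed
    (mirrors the scheduling loop inside collect())."""
    now = time.time() if now is None else now
    due = []
    for t in targets:
        interval_s = t["poll_interval_ms"] / 1000.0
        if now - last_probe.get(t["id"], 0) >= interval_s:
            due.append(t)
    return due

targets = [
    {"id": 1, "poll_interval_ms": 5000},   # probe every 5 s
    {"id": 2, "poll_interval_ms": 10000},  # probe every 10 s
]
last_probe = {1: 100.0, 2: 100.0}
due = due_targets(targets, last_probe, now=106.0)
print([t["id"] for t in due])  # [1]
```

Six seconds after the last probe, only the 5-second target is due; the 10-second target waits for the next base-loop tick that crosses its interval.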
