The Observability Hub currently uses a fragmented approach to host-level telemetry:
- **Grafana Alloy:** A standalone agent scraping Tailscale logs via `journalctl`. Despite its minimal workload, its actual idle consumption was approximately 10m CPU and 50Mi RAM, yet it required significant reserved resources (requests: 20m CPU / 114Mi RAM).
- **Existing `system-metrics`:** A Go service collecting host stats via `gopsutil` every minute, leading to constant database writes and unnecessary resource overhead.
- **`systemd` units:** Managing legacy collection scripts on the host adds operational complexity.
### Key Architectural Shifts
- **Thanos-Centric Metrics:** Host metric collection (CPU, RAM, Disk, Network, Temperature) is now retrieved from **Prometheus** (exposed via Thanos Query). This leverages the unified API for both real-time and long-term storage (MinIO).
- **Batch Processing Model:** Move from 1-minute continuous polling to a **15-minute batch interval** (as a starting point). The service wakes up every 15 minutes, performs a range query with `step=1m` to maintain granularity, and batch-inserts the results into PostgreSQL.
- **Unified Tailscale Collection:** Incorporate Tailscale status and log collection (via `exec.Command`) directly into the Go service, exposing the results via OpenTelemetry and PostgreSQL.
- **Resource Optimization:** Configure the new service with tight resource requests (10m CPU / 40Mi RAM), releasing significant guaranteed memory back to the cluster.
### Rationale
- **Efficiency of Batch Processing:** Moving from a continuous 1-minute polling cycle to a 15-minute batch interval significantly reduces the average CPU duty cycle. The service transitions from a constant baseline draw to a "wake-perform-sleep" model, making it virtually invisible to the CPU scheduler for 99% of its operational life.
- **Optimization of Reserved Resources:** Empirical observation via `kubectl top` shows a significant reduction in both idle and reserved allocation. The specialized Collectors Go service now consumes ~2m CPU / 10Mi RAM when idle, with requests set to 10m CPU / 40Mi RAM, returning significant guaranteed CPU and memory to the cluster nodes.
  | Component | Idle CPU / RAM | Reserved CPU / RAM |
  | --- | --- | --- |
  | Grafana Alloy (before) | ~10m / ~50Mi | 20m / 114Mi |
  | Collectors (after) | ~2m / ~10Mi | 10m / 40Mi |
- **Data Parity & Schema Consistency:** By utilizing PromQL `query_range` with `step=1m`, we maintain the high-resolution data (1-minute granularity) required for accurate FinOps analysis while gaining the operational benefits of batch processing.
- **Surgical Consolidation:** This approach integrates specialized collection (Tailscale, hardware temperatures) into a single path, eliminating the need for three separate management domains (Alloy, `systemd`, and legacy Go services).
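The `exec.Command` approach to Tailscale collection can be sketched as follows. The struct models only one field as an illustration; the real `tailscale status --json` payload is far richer, and the metric mapping at the end is an assumption, not the service's actual export logic.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// statusJSON models only the single field this sketch needs; the CLI's
// JSON output contains many more (peers, self node, tailnet info, ...).
type statusJSON struct {
	BackendState string `json:"BackendState"`
}

// parseBackendState extracts the backend state ("Running", "Stopped", ...)
// from the raw JSON emitted by the Tailscale CLI.
func parseBackendState(raw []byte) (string, error) {
	var st statusJSON
	if err := json.Unmarshal(raw, &st); err != nil {
		return "", err
	}
	return st.BackendState, nil
}

func main() {
	// Shell out to the CLI, exactly the exec.Command pattern named above.
	out, err := exec.Command("tailscale", "status", "--json").Output()
	if err != nil {
		fmt.Println("tailscale CLI unavailable:", err)
		return
	}
	state, err := parseBackendState(out)
	if err != nil {
		fmt.Println("unexpected output:", err)
		return
	}
	// A service could derive an "active" gauge from this, e.g. state == "Running".
	fmt.Println("backend state:", state)
}
```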
## Consequences
### Positive
- **Significant Resource Savings**: Frees up approximately 10m CPU and 74Mi RAM of reserved resources per node, based on the difference between Alloy's prior requests (20m CPU / 114Mi RAM) and the Collectors service's new requests (10m CPU / 40Mi RAM), with even larger savings in actual idle usage.
- **Operational Simplicity**: Replaces three legacy components (Alloy, the old `system-metrics`, and `systemd` units) with one unified Go binary.
- **FinOps Readiness**: Provides a curated, efficient historical data source in PostgreSQL for electricity cost analysis.
- **Architectural Alignment**: Standardizes on Go and the "library-first" pattern.
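In Kubernetes terms, the request figures above correspond to a container `resources` stanza roughly like the following. This is an illustrative fragment, not the actual manifest; the 80Mi memory limit is taken from the Verification section, and leaving CPU unlimited is an assumption.

```yaml
# Hypothetical resources stanza for the collectors container.
resources:
  requests:
    cpu: 10m        # tight request, sized against observed ~2m idle usage
    memory: 40Mi
  limits:
    memory: 80Mi    # 2x the request; no CPU limit, avoiding throttling
```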
## Verification
- [x] **Resource Usage:** Monitor the `collectors` pod via `kubectl top` and ensure it operates within the new 40Mi/80Mi RAM limits.
- [x] **Data Parity:** Confirm the `system_metrics` table in PostgreSQL receives 1-minute interval data for all four metric types plus hardware temperature.
- [x] **Tailscale Flow:** Verify `tailscale_*` logs appear in Grafana (via Loki) and `collectors.tailscale.active` metrics appear in Grafana (via Prometheus/OTel).
- [x] **Decommissioning:** Confirm `alloy` and the legacy `system-metrics` units are stopped and removed.
---

**`page/content/evolution.yaml`** (9 additions, 8 deletions):
    description: |
      - Completed high-fidelity OpenTelemetry instrumentation for the 'proxy' service, transitioning to dynamic span naming and deep-dive diagnostics.
      - Engineered a synthetic validation suite that simulates global traffic (Region, Timezone, Device) to stress-test Grafana Tempo storage and Grafana visualization.
      - Resolved a critical 'silent' failure in the service graph pipeline by enabling Prometheus Remote Write and missing Tempo processors.
  - date: "2026-02-13"
    title: "Storage Scalability & Security Hardening"
    description: |
      - Scaled telemetry persistence by migrating from restricted local disks to professional S3-compatible object storage (MinIO), ensuring long-term data reliability.
      - Established a 'Safe-by-Default' infrastructure baseline, achieving 100% automated security compliance across the platform's core services.
- title: "Platform Maturity & Reusability"
  intro: "Evolving the Hub into a modular platform by organizing core logic into reusable 'building blocks' and executable services, culminating in a unified, OpenTelemetry-native architecture."
  timeline:
  - date: "2026-02-16"
    title: "Proposal: Modular Library Architecture"
    artifacts:
    - name: "ADR 014: Library-First Service Architecture"
    description: |
      - A strategic proposal to organize the platform's core features into reusable 'building blocks.' This ensures the system remains reliable and consistent as it grows, allowing new tools to be added easily without duplicating effort.
  - date: "2026-02-18"
    title: "Library-First Implementation"
    artifacts:
      url: "docs/notes/opentelemetry.md"
    description: |
      - Transitioned the Go fleet to a standardized 'Pure Wrapper' architecture, enabling native OpenTelemetry support and advanced operational visibility across all services.
      - Proposed consolidating host-level telemetry (CPU, RAM, Disk, Network, Temperature, and Tailscale) into a single, efficient Go service.
      - Deployed a resource-efficient Collectors service, centralizing host-level data collection and optimizing it through batch processing.
      - Achieved significant cost savings and improved system performance by retiring Alloy, resulting in an 80% reduction in idle resource consumption and freeing up ~50% of reserved CPU and ~65% of reserved memory.
      - Established a modern, OpenTelemetry-native platform, delivering a robust and standardized solution for comprehensive system observability (logs, metrics, traces).