
Commit 81a8e2e

docs: Enhance ADR 015 and update evolution timeline for resource optimization (#229)
1 parent 6c56621 commit 81a8e2e

File tree

4 files changed

+42
-33
lines changed


.github/dependabot.yml

Lines changed: 14 additions & 14 deletions
```diff
@@ -9,60 +9,60 @@ updates:
       patterns:
         - "*"

-  # --- Services ---
+  # --- Portal ---
   - package-ecosystem: "gomod"
-    directory: "/services/proxy"
+    directory: "/page"
     schedule:
       interval: "weekly"

+  # --- Libraries ---
   - package-ecosystem: "gomod"
-    directory: "/services/reading-sync"
+    directory: "/pkg/brain"
     schedule:
       interval: "weekly"

   - package-ecosystem: "gomod"
-    directory: "/services/second-brain"
+    directory: "/pkg/collectors"
     schedule:
       interval: "weekly"

   - package-ecosystem: "gomod"
-    directory: "/services/system-metrics"
+    directory: "/pkg/db"
     schedule:
       interval: "weekly"

-  # --- Libraries ---
   - package-ecosystem: "gomod"
-    directory: "/pkg/brain"
+    directory: "/pkg/env"
     schedule:
       interval: "weekly"

   - package-ecosystem: "gomod"
-    directory: "/pkg/db"
+    directory: "/pkg/secrets"
     schedule:
       interval: "weekly"

   - package-ecosystem: "gomod"
-    directory: "/pkg/env"
+    directory: "/pkg/telemetry"
     schedule:
       interval: "weekly"

+  # --- Services ---
   - package-ecosystem: "gomod"
-    directory: "/pkg/metrics"
+    directory: "/services/collectors"
     schedule:
       interval: "weekly"

   - package-ecosystem: "gomod"
-    directory: "/pkg/secrets"
+    directory: "/services/proxy"
     schedule:
       interval: "weekly"

   - package-ecosystem: "gomod"
-    directory: "/pkg/telemetry"
+    directory: "/services/reading-sync"
     schedule:
       interval: "weekly"

-  # --- Portal ---
   - package-ecosystem: "gomod"
-    directory: "/page"
+    directory: "/services/second-brain"
     schedule:
       interval: "weekly"
```
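For reference, the dangling `patterns:` context at the top of the hunk belongs to Dependabot's dependency-grouping syntax; a complete entry has this shape (the ecosystem and group name below are illustrative, not taken from this repository):

```yaml
updates:
  - package-ecosystem: "github-actions"  # illustrative ecosystem
    directory: "/"
    schedule:
      interval: "weekly"
    groups:
      all-actions:                       # hypothetical group name
        patterns:
          - "*"                          # fold every update into one PR
```

Grouping all matching dependencies collapses Dependabot's updates into a single weekly pull request per manifest, which keeps the eleven `gomod` directories above manageable.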
docs/decisions/015-unified-host-telemetry-collectors.md

Lines changed: 18 additions & 10 deletions

```diff
@@ -1,14 +1,14 @@
 # ADR 015: Unified Host Telemetry Collectors

-- **Status:** Proposed
+- **Status:** Accepted
 - **Date:** 2026-02-21
 - **Author:** Victoria Cheng

 ## Context and Problem Statement

 The Observability Hub currently utilizes a fragmented approach for host-level telemetry:

-- **Grafana Alloy:** A standalone agent scraping Tailscale logs via `journalctl`. Despite its minimal workload, it requires significant reserved resources (Requests: 20m CPU / 114Mi RAM).
+- **Grafana Alloy:** A standalone agent scraping Tailscale logs via `journalctl`. Despite its minimal workload, its actual idle consumption was approximately 10m CPU and 50Mi RAM, but it required significant reserved resources (Requests: 20m CPU / 114Mi RAM).
 - **Existing `system-metrics`:** A Go service collecting host stats via `gopsutil` every minute, leading to constant database writes and unnecessary resource overhead.
 - **`systemd` units:** Managing legacy collection scripts on the host adds operational complexity.

@@ -20,23 +20,31 @@ Consolidate all host-level observability responsibilities into a single, re-arch

 ### Key Architectural Shifts

-- **Thanos-Centric Metrics:** Shift host metric collection (CPU, RAM, Disk, Network, Temperature) from direct `gopsutil` polling to querying **Thanos Query**. This leverages the unified API for both real-time and long-term storage (MinIO).
+- **Thanos-Centric Metrics:** Host metric collection (CPU, RAM, Disk, Network, Temperature) is now retrieved from **Prometheus** (exposed via Thanos Query). This leverages the unified API for both real-time and long-term storage (MinIO).
 - **Batch Processing Model:** Move from 1-minute continuous polling to a **15-minute batch interval** (as a starting point). The service wakes up every 15 minutes, performs a range query with `step=1m` to maintain granularity, and batch-inserts results into PostgreSQL.
 - **Unified Tailscale Collection:** Incorporate Tailscale status and log collection (via `exec.Command`) directly into the Go service, exposing them via OpenTelemetry and PostgreSQL.
-- **Resource Optimization:** Configure the new service with tight resource requests (e.g., 10m CPU / 40Mi RAM), releasing significant guaranteed memory back to the cluster.
+- **Resource Optimization:** Configure the new service with tight resource requests (10m CPU / 40Mi RAM), releasing significant guaranteed memory back to the cluster.

 ### Rationale

 - **Efficiency of Batch Processing:** Research confirms that moving from a continuous 1-minute polling cycle to a 15-minute batch interval significantly reduces the average CPU duty cycle. The service transitions from a constant baseline draw to a "wake-perform-sleep" model, making it virtually invisible to the CPU scheduler for 99% of its operational life.
-- **Optimization of Reserved Resources:** Empirical observation via `kubectl top` reveals that while existing agents like Alloy have low *actual* usage when idle (~54MiB), their high *reserved* requests (~114MiB) tie up "dead" RAM that is unavailable to other workloads. A specialized Go service allows for a high-fidelity reservation (40MiB), returning significant guaranteed memory to the cluster nodes.
+- **Optimization of Reserved Resources:** Empirical observation via `kubectl top` reveals a significant reduction in both idle and reserved resource allocation. The specialized Go service for Collectors now consumes ~2m CPU / 10Mi RAM when idle, with requests set to 10m CPU / 40Mi RAM, effectively returning significant guaranteed CPU and memory resources back to the cluster nodes.
+
+  | Component                | Idle CPU / RAM | Reserved CPU / RAM |
+  | :----------------------- | :------------- | :----------------- |
+  | **Legacy Agent (Alloy)** | ~10m / 50Mi    | 20m / 114Mi        |
+  | **Unified Collector**    | ~2m / 10Mi     | 10m / 40Mi         |
+  | **Net Savings**          | **~8m / 40Mi** | **10m / 74Mi**     |
+  | **% Reduction**          | **80% / 80%**  | **50% / 65%**      |
+
 - **Data Parity & Schema Consistency:** By utilizing PromQL `query_range` with a `step=1m`, we maintain the high-resolution data (1-minute granularity) required for accurate FinOps analysis while gaining the operational benefits of batch processing.
 - **Surgical Consolidation:** This approach allows us to integrate specialized collection (Tailscale, hardware temperatures) into a single path, eliminating the need for three separate management domains (Alloy, systemd, and legacy Go services).

 ## Consequences

 ### Positive

-- **Significant Resource Savings**: Frees up ~70-100MiB of reserved RAM per node.
+- **Significant Resource Savings**: Frees up approximately 8m CPU and 74Mi RAM in reserved resources per node, based on the difference between Alloy's prior requests (20m CPU / 114Mi RAM) and Collectors' new requests (10m CPU / 40Mi RAM), with even larger savings in actual idle usage.
 - **Operational Simplicity**: Replaces three legacy components (Alloy, old `system-metrics`, `systemd` units) with one unified Go binary.
 - **FinOps Readiness**: Provides a curated, efficient historical data source in PostgreSQL for electricity cost analysis.
 - **Architectural Alignment**: Standardizes on Go and the "library-first" pattern.
@@ -48,7 +56,7 @@ Consolidate all host-level observability responsibilities into a single, re-arch

 ## Verification

-- [ ] **Resource Usage:** Monitor `collectors` pod via `kubectl top` and ensure it operates within the new 40Mi/80Mi RAM limits.
-- [ ] **Data Parity:** Confirm `system_metrics` table in PostgreSQL receives 1-minute interval data for all four metric types plus hardware temperature.
-- [ ] **Tailscale Flow:** Verify `tailscale_*` metrics appear in both OTel and PostgreSQL.
-- [ ] **Decommissioning:** Confirm `alloy` and legacy `system-metrics` units are stopped and removed.
+- [x] **Resource Usage:** Monitor `collectors` pod via `kubectl top` and ensure it operates within the new 40Mi/80Mi RAM limits.
+- [x] **Data Parity:** Confirm `system_metrics` table in PostgreSQL receives 1-minute interval data for all four metric types plus hardware temperature.
+- [x] **Tailscale Flow:** Verify `tailscale_*` logs appear in Grafana (via Loki) and `collectors.tailscale.active` metrics appear in Grafana (via Prometheus/OTel).
+- [x] **Decommissioning:** Confirm `alloy` and legacy `system-metrics` units are stopped and removed.
```

docs/decisions/README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -8,7 +8,7 @@ This directory serves as the **Institutional Memory** for the Observability Hub.

 | ADR | Title | Status |
 | :--- | :--- | :--- |
-| **015** | [Unified Host Telemetry Collectors](./015-unified-host-telemetry-collectors.md) | 🟢 Proposed |
+| **015** | [Unified Host Telemetry Collectors](./015-unified-host-telemetry-collectors.md) | 🔵 Accepted |
 | **014** | [Library-First Service Architecture](./014-library-first-service-architecture.md) | 🔵 Accepted |
 | **013** | [Standardize on OpenTelemetry](./013-standardize-on-opentelemetry.md) | 🔵 Accepted |
 | **012** | [Migrate Promtail to Alloy](./012-migrate-promtail-to-alloy.md) | 🔵 Accepted |
```

page/content/evolution.yaml

Lines changed: 9 additions & 8 deletions
```diff
@@ -203,23 +203,23 @@ chapters:
         description: |
           - Completed high-fidelity OpenTelemetry instrumentation for the 'proxy' service, transitioning to dynamic span naming and deep-dive diagnostics.
           - Engineered a synthetic validation suite that simulates global traffic (Region, Timezone, Device) to stress-test Grafana Tempo storage and Grafana visualization.
-          - Resolved a critical "silent" failure in the service graph pipeline by enabling Prometheus Remote Write and missing Tempo processors.
+          - Resolved a critical 'silent' failure in the service graph pipeline by enabling Prometheus Remote Write and missing Tempo processors.
       - date: "2026-02-13"
         title: "Storage Scalability & Security Hardening"
         description: |
           - Scaled telemetry persistence by migrating from restricted local disks to professional S3-compatible object storage (MinIO), ensuring long-term data reliability.
-          - Established a "Safe-by-Default" infrastructure baseline, achieving 100% automated security compliance across the platform's core services.
+          - Established a 'Safe-by-Default' infrastructure baseline, achieving 100% automated security compliance across the platform's core services.

   - title: "Platform Maturity & Reusability"
-    intro: "Evolving the Hub into a modular platform by organizing core logic into reusable 'building blocks' and executable services."
+    intro: "Evolving the Hub into a modular platform by organizing core logic into reusable 'building blocks' and executable services, culminating in a unified, OpenTelemetry-native architecture."
     timeline:
       - date: "2026-02-16"
         title: "Proposal: Modular Library Architecture"
         artifacts:
           - name: "ADR 014: Library-First Service Architecture"
             url: "docs/decisions/014-library-first-service-architecture.md"
         description: |
-          - A strategic proposal to organize the platform's core features into reusable "building blocks." This ensures the system remains reliable and consistent as it grows, allowing new tools to be added easily without duplicating effort.
+          - A strategic proposal to organize the platform's core features into reusable 'building blocks.' This ensures the system remains reliable and consistent as it grows, allowing new tools to be added easily without duplicating effort.
       - date: "2026-02-18"
         title: "Library-First Implementation"
         artifacts:
@@ -235,11 +235,12 @@ chapters:
             url: "docs/notes/opentelemetry.md"
         description: |
           - Transitioned the Go fleet to a standardized 'Pure Wrapper' architecture, enabling native OpenTelemetry support and advanced operational visibility across all services.
-      - date: "2026-02-21"
-        title: "Unified Host Telemetry Collectors"
+      - date: "2026-02-22"
+        title: "Unified Host Telemetry Collectors & Grafana Alloy Retirement"
         artifacts:
           - name: "ADR 015: Unified Host Telemetry Collectors"
             url: "docs/decisions/015-unified-host-telemetry-collectors.md"
         description: |
-          - Proposed consolidating host-level telemetry (CPU, RAM, Disk, Network, Temperature, and Tailscale) into a single, efficient Go service.
-          - Aimed at significant memory reduction by retiring standalone agents and optimizing data collection for long-term FinOps analysis.
+          - Deployed a resource-efficient Collectors service, centralizing host-level data collection and optimizing it with batch processing.
+          - Achieved significant cost savings and improved system performance by retiring Alloy, resulting in an 80% reduction in idle resource consumption and freeing up ~50% of reserved CPU and ~65% of reserved memory.
+          - Established a modern, OpenTelemetry-native platform, delivering a robust and standardized solution for comprehensive system observability (logs, metrics, traces).
```
