The Observability Hub currently uses a fragmented approach to host-level telemetry:
- **Grafana Alloy:** A standalone agent scraping Tailscale logs via `journalctl`. Despite its minimal workload, its actual idle consumption was approximately 10m CPU and 50Mi RAM, yet it required significant reserved resources (requests: 20m CPU / 114Mi RAM).
- **Existing `system-metrics`:** A Go service collecting host stats via `gopsutil` every minute, leading to constant database writes and unnecessary resource overhead.
- **`systemd` units:** Managing legacy collection scripts on the host adds operational complexity.
### Key Architectural Shifts
- **Thanos-Centric Metrics:** Host metric collection (CPU, RAM, Disk, Network, Temperature) is now retrieved from **Prometheus** (exposed via Thanos Query). This leverages the unified API for both real-time and long-term storage (MinIO).
- **Batch Processing Model:** Move from 1-minute continuous polling to a **15-minute batch interval** (as a starting point). The service wakes up every 15 minutes, performs a range query with `step=1m` to maintain granularity, and batch-inserts the results into PostgreSQL.
- **Unified Tailscale Collection:** Incorporate Tailscale status and log collection (via `exec.Command`) directly into the Go service, exposing the results via OpenTelemetry and PostgreSQL.
- **Resource Optimization:** Configure the new service with tight resource requests (10m CPU / 40Mi RAM), releasing significant guaranteed memory back to the cluster.
### Rationale
- **Efficiency of Batch Processing:** Moving from a continuous 1-minute polling cycle to a 15-minute batch interval significantly reduces the average CPU duty cycle. The service transitions from a constant baseline draw to a "wake-perform-sleep" model, making it virtually invisible to the CPU scheduler for 99% of its operational life.
- **Optimization of Reserved Resources:** Empirical observation via `kubectl top` shows a significant reduction in both idle and reserved allocation. The specialized Collectors Go service now consumes ~2m CPU / 10Mi RAM when idle, with requests set to 10m CPU / 40Mi RAM, returning significant guaranteed CPU and memory to the cluster nodes.
  | Component | Idle CPU / RAM | Reserved CPU / RAM |
  | --- | --- | --- |
  | Grafana Alloy (before) | ~10m / ~50Mi | 20m / 114Mi |
  | Collectors (after) | ~2m / ~10Mi | 10m / 40Mi |
- **Data Parity & Schema Consistency:** By utilizing PromQL `query_range` with `step=1m`, we maintain the high-resolution data (1-minute granularity) required for accurate FinOps analysis while gaining the operational benefits of batch processing.
- **Surgical Consolidation:** This approach integrates specialized collection (Tailscale, hardware temperatures) into a single path, eliminating the need for three separate management domains (Alloy, `systemd`, and legacy Go services).
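The `exec.Command` approach to Tailscale collection can be sketched as follows. The struct models only one field as an illustration; the real `tailscale status --json` payload is far richer, and the metric mapping at the end is an assumption, not the service's actual export logic.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// statusJSON models only the single field this sketch needs; the CLI's
// JSON output contains many more (peers, self node, tailnet info, ...).
type statusJSON struct {
	BackendState string `json:"BackendState"`
}

// parseBackendState extracts the backend state ("Running", "Stopped", ...)
// from the raw JSON emitted by the Tailscale CLI.
func parseBackendState(raw []byte) (string, error) {
	var st statusJSON
	if err := json.Unmarshal(raw, &st); err != nil {
		return "", err
	}
	return st.BackendState, nil
}

func main() {
	// Shell out to the CLI, exactly the exec.Command pattern named above.
	out, err := exec.Command("tailscale", "status", "--json").Output()
	if err != nil {
		fmt.Println("tailscale CLI unavailable:", err)
		return
	}
	state, err := parseBackendState(out)
	if err != nil {
		fmt.Println("unexpected output:", err)
		return
	}
	// A service could derive an "active" gauge from this, e.g. state == "Running".
	fmt.Println("backend state:", state)
}
```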
## Consequences
### Positive
- **Significant Resource Savings**: Frees up approximately 10m CPU and 74Mi RAM of reserved resources per node, based on the difference between Alloy's prior requests (20m CPU / 114Mi RAM) and the Collectors service's new requests (10m CPU / 40Mi RAM), with even larger savings in actual idle usage.
- **Operational Simplicity**: Replaces three legacy components (Alloy, the old `system-metrics`, and `systemd` units) with one unified Go binary.
- **FinOps Readiness**: Provides a curated, efficient historical data source in PostgreSQL for electricity cost analysis.
- **Architectural Alignment**: Standardizes on Go and the "library-first" pattern.
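In Kubernetes terms, the request figures above correspond to a container `resources` stanza roughly like the following. This is an illustrative fragment, not the actual manifest; the 80Mi memory limit is taken from the Verification section, and leaving CPU unlimited is an assumption.

```yaml
# Hypothetical resources stanza for the collectors container.
resources:
  requests:
    cpu: 10m        # tight request, sized against observed ~2m idle usage
    memory: 40Mi
  limits:
    memory: 80Mi    # 2x the request; no CPU limit, avoiding throttling
```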
## Verification
- [x] **Resource Usage:** Monitor the `collectors` pod via `kubectl top` and ensure it operates within the new 40Mi/80Mi RAM limits.
- [x] **Data Parity:** Confirm the `system_metrics` table in PostgreSQL receives 1-minute interval data for all four metric types plus hardware temperature.
- [x] **Tailscale Flow:** Verify `tailscale_*` logs appear in Grafana (via Loki) and `collectors.tailscale.active` metrics appear in Grafana (via Prometheus/OTel).
- [x] **Decommissioning:** Confirm `alloy` and the legacy `system-metrics` units are stopped and removed.
---

**`page/content/evolution.yaml`** (9 additions, 8 deletions):
    description: |
      - Completed high-fidelity OpenTelemetry instrumentation for the 'proxy' service, transitioning to dynamic span naming and deep-dive diagnostics.
      - Engineered a synthetic validation suite that simulates global traffic (Region, Timezone, Device) to stress-test Grafana Tempo storage and Grafana visualization.
      - Resolved a critical 'silent' failure in the service graph pipeline by enabling Prometheus Remote Write and missing Tempo processors.
  - date: "2026-02-13"
    title: "Storage Scalability & Security Hardening"
    description: |
      - Scaled telemetry persistence by migrating from restricted local disks to professional S3-compatible object storage (MinIO), ensuring long-term data reliability.
      - Established a 'Safe-by-Default' infrastructure baseline, achieving 100% automated security compliance across the platform's core services.
- title: "Platform Maturity & Reusability"
  intro: "Evolving the Hub into a modular platform by organizing core logic into reusable 'building blocks' and executable services, culminating in a unified, OpenTelemetry-native architecture."
  timeline:
  - date: "2026-02-16"
    title: "Proposal: Modular Library Architecture"
    artifacts:
    - name: "ADR 014: Library-First Service Architecture"
    description: |
      - A strategic proposal to organize the platform's core features into reusable 'building blocks.' This ensures the system remains reliable and consistent as it grows, allowing new tools to be added easily without duplicating effort.
  - date: "2026-02-18"
    title: "Library-First Implementation"
    artifacts:
      url: "docs/notes/opentelemetry.md"
    description: |
      - Transitioned the Go fleet to a standardized 'Pure Wrapper' architecture, enabling native OpenTelemetry support and advanced operational visibility across all services.
      - Proposed consolidating host-level telemetry (CPU, RAM, Disk, Network, Temperature, and Tailscale) into a single, efficient Go service.
      - Deployed a resource-efficient Collectors service, centralizing host-level data collection and optimizing it through batch processing.
      - Achieved significant cost savings and improved system performance by retiring Alloy, resulting in an 80% reduction in idle resource consumption and freeing up ~50% of reserved CPU and ~65% of reserved memory.
      - Established a modern, OpenTelemetry-native platform, delivering a robust and standardized solution for comprehensive system observability (logs, metrics, traces).