A resilient, self-hosted platform meticulously engineered to showcase advanced Site Reliability Engineering (SRE) and Platform Engineering principles. It delivers full-stack observability (Logs, Metrics, Traces), GitOps-driven infrastructure management, and standardized telemetry ingestion for complex cloud-native environments.
Built using Go and orchestrated on Kubernetes (K3s), the platform unifies system metrics, application events, and logs into a single queryable layer leveraging OpenTelemetry, PostgreSQL (TimescaleDB), Grafana Loki, Prometheus, and Grafana. It's designed for operational excellence, demonstrating how to build a robust, observable, and maintainable system from the ground up.
Explore Live Telemetry & System Evolution
This project highlights significant accomplishments in building a modern observability and platform engineering solution:
- Unified Go Monorepo: Consolidated fragmented modules into a single root module, eliminating 17 `replace` directives and standardizing dependency management across all services.
- Encapsulated Architecture: Transitioned to an `internal/` and `cmd/` layout, enforcing Go's package visibility rules and adopting the "Thin Main" pattern for better testability and system integrity.
- Full OpenTelemetry (LMT) Implementation: Achieved end-to-end observability with a unified OTel Collector, Tempo (Traces), Prometheus (Metrics), Loki (Logs), and the Go SDK for instrumentation.
- GitOps Reconciliation Engine: Implemented a secure, templated GitOps reconciliation engine for automated state enforcement via webhooks, scaled to support multi-tenant synchronization.
- Kubernetes Migration & Cloud-Native Operations: All core observability stack components (Loki, Grafana, Tempo, Prometheus, Postgres) are running natively in Kubernetes with persistent storage.
- Centralized Secrets Management: Integrated OpenBao for secure, dynamic credential retrieval across all services, replacing insecure static configurations.
- Hybrid Cloud Architecture (Store-and-Forward Bridge): Designed and implemented a secure bridge for ingesting external telemetry without exposing local ports, ensuring reliable data flow from diverse sources.
- Reproducible Local Development: Provides consistent, reproducible developer environments via `shell.nix` and `docker-compose`.
- Formalized Decision-Making & Incident Response: Established an Architectural Decision Record (ADR) process and an Incident Response/RCA framework for structured decision-making and operational excellence.
- Unified Host Telemetry Collectors: Deployed a resource-efficient `collectors` service, centralizing host-level data collection and optimizing processing.
For deeper insights into the project's structure and operational guides:
- Documentation Hub: Central entry point for Architecture, Decisions (ADRs), and Operational Notes.
The platform leverages a robust set of modern technologies for its core functions:
The diagram below illustrates the high-level flow of telemetry data from collection to visualization, highlighting the hybrid orchestration model between host services and the Kubernetes data platform.
```mermaid
flowchart TB
    subgraph ObservabilityHub ["Observability Hub"]
        direction TB
        subgraph Logic ["Data Ingestion"]
            subgraph External ["External Sources"]
                GH(GitHub Webhooks/Journals)
                Mongo(MongoDB Atlas)
            end
            subgraph Security [Security]
                Bao[OpenBao]
                Tailscale[Tailscale]
            end
            GoApps["Go Services (Proxy, Ingestion)"]
            MCP["MCP Telemetry"]
            Collectors["Collectors (Host Metrics & Tailscale)"]
        end
        OTEL[OpenTelemetry Collector]
        Observability["Loki, Tempo, and Prometheus (Thanos)"]
        subgraph Storage ["Data Engines"]
            PG[(PostgreSQL)]
            S3[(MinIO - S3)]
        end
        subgraph Visualization ["Visualization"]
            Grafana[Grafana Dashboards]
        end
    end

    %% Data Pipeline Connections
    GH --> GoApps
    Mongo --> GoApps
    Observability -- "Host Metrics" --> Collectors
    Observability -- "Query Data" --> MCP
    Tailscale -- "Status" --> Collectors
    Collectors -- "Host Metrics Data" --> PG
    GoApps -- Data --> PG

    %% Telemetry Pipeline (OTLP)
    GoApps & MCP & Collectors -- "Logs, Metrics, Traces" --> OTEL
    OTEL --> Observability
    Observability -- "Offload" --> S3

    %% Visualization Connections
    Observability & PG --> Grafana
```
Foundational principles guide every aspect of the platform's development and operation:
- Signals over Noise: Standardizing telemetry signals to provide immediate clarity on service behavior across the entire stack.
- Logic over Plumbing: Decoupling infrastructure boilerplate from service logic using shared Go wrappers to focus on domain value.
- Config as the Truth: Using GitOps to ensure version control remains the ultimate source of truth, with automated state reconciliation.
- Pragmatic Orchestration: Leveraging Kubernetes for persistence and native Systemd for host automation to maximize reliability with minimal overhead.
This guide will help you set up and run the observability-hub locally using Kubernetes (K3s).
Ensure you have the following installed on your system:
The project uses a `.env` file to manage environment variables, especially for database connections and API keys.
```bash
# Start by copying the example file
cp .env.example .env
```

You will need to edit the newly created `.env` file to configure connections for MongoDB Atlas, PostgreSQL (K3s NodePort), and other services.
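The shape of the file might look like the fragment below. The variable names here are illustrative guesses, not the project's actual keys; `.env.example` remains the authoritative reference.

```
# Hypothetical example — check .env.example for the real variable names
POSTGRES_HOST=localhost
POSTGRES_PORT=30432          # K3s NodePort exposed by the data tier
POSTGRES_USER=observability
MONGODB_ATLAS_URI=mongodb+srv://...
```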
The platform utilizes a hybrid orchestration model. You must deploy both the Kubernetes data tier and the native host services.
Deploy the observability backend using OpenTofu (IaC):
```bash
cd tofu
tofu init
tofu apply
```

This will provision PostgreSQL, MinIO, Loki, Tempo, Prometheus, Thanos, Grafana, and the OpenTelemetry Collector in the `observability` namespace.
For the Collectors service (which uses a custom local image), use the Makefile target:
```bash
make k3s-collectors-up
```

Build and initialize the automation and telemetry collectors on the host:
```bash
# Build Go binaries
make proxy-build
make ingestion-build

# Install and start Systemd services (requires sudo)
make install-services
```

Once the stack is running, you can verify the end-to-end telemetry flow:
- Cluster Health: Access Grafana at `http://localhost:30000` (NodePort).
- Service Logs: Check logs for host components via Grafana Loki.
To stop or remove resources, use the standard `kubectl delete` commands targeting the `observability` namespace.
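For example, a teardown might look like the commands below. These are generic operational commands, not a project-provided script, and `tofu destroy` assumes the stack was provisioned via the OpenTofu configuration described earlier:

```
# Remove everything in the data-tier namespace
kubectl delete namespace observability

# Or, if provisioned via OpenTofu, destroy the managed resources instead
cd tofu && tofu destroy
```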