A resilient, self-hosted platform meticulously engineered to showcase advanced Site Reliability Engineering (SRE) and Platform Engineering principles. It delivers full-stack observability (Logs, Metrics, Traces), GitOps-driven infrastructure management, and standardized telemetry ingestion for complex cloud-native environments.
Built using Go and orchestrated on Kubernetes (K3s), the platform unifies system metrics, application events, and logs into a single queryable layer leveraging OpenTelemetry, PostgreSQL (TimescaleDB), Grafana Loki, Prometheus, and Grafana. It's designed for operational excellence, demonstrating how to build a robust, observable, and maintainable system from the ground up.
Explore Live Telemetry & System Evolution
This project highlights significant accomplishments in building a modern observability and platform engineering solution:
- Unified Go Monorepo: Consolidated fragmented modules into a single root module, eliminating 17 `replace` directives and standardizing dependency management across all services.
- Encapsulated Architecture: Transitioned to an `internal/` and `cmd/` layout, enforcing Go's package visibility rules and adopting the "Thin Main" pattern for better testability and system integrity.
- Full OpenTelemetry (LMT) Implementation: Achieved end-to-end observability with a unified OTel Collector, Tempo (Traces), Prometheus (Metrics), Loki (Logs), and the Go SDK for instrumentation.
- GitOps Reconciliation Engine: Implemented a secure, templated GitOps reconciliation engine for automated state enforcement via webhooks, scaled to support multi-tenant synchronization.
- Kubernetes Migration & Cloud-Native Operations: All core observability stack components (Loki, Grafana, Tempo, Prometheus, Postgres) are running natively in Kubernetes with persistent storage.
- Centralized Secrets Management: Integrated OpenBao for secure, dynamic credential retrieval across all services, replacing insecure static configurations.
- Hybrid Cloud Architecture (Store-and-Forward Bridge): Designed and implemented a secure bridge for ingesting external telemetry without exposing local ports, ensuring reliable data flow from diverse sources.
- Reproducible Local Development: Provides consistent, reproducible developer environments via `shell.nix` and `docker-compose`.
- Formalized Decision-Making & Incident Response: Established an Architectural Decision Record (ADR) process and an Incident Response/RCA framework for structured decision-making and operational excellence.
- Unified Host Telemetry Collectors: Deployed a resource-efficient `collectors` service, centralizing host-level data collection and optimizing processing.
For deeper insights into the project's structure and operational guides:
- Documentation Hub: Central entry point for Architecture, Decisions (ADRs), and Operational Notes.
The platform leverages a robust set of modern technologies for its core functions:
The diagram below illustrates the high-level flow of telemetry data from collection to visualization, highlighting the hybrid orchestration model between host services and the Kubernetes data platform.
```mermaid
flowchart TB
    subgraph ObservabilityHub ["Observability Hub"]
        direction TB
        subgraph Logic ["Data Ingestion"]
            subgraph External ["External Sources"]
                GH(GitHub Webhooks/Journals)
                Mongo(MongoDB Atlas)
            end
            subgraph Security [Security]
                Bao[OpenBao]
                Tailscale[Tailscale]
            end
            GoApps["Go Services (Proxy, Ingestion)"]
            MCP["MCP Telemetry"]
            Collectors["Collectors (Host Metrics & Tailscale)"]
        end
        OTEL[OpenTelemetry Collector]
        Observability["Loki, Tempo, and Prometheus (Thanos)"]
        subgraph Storage ["Data Engines"]
            PG[(PostgreSQL)]
            S3[(MinIO - S3)]
        end
        subgraph Visualization ["Visualization"]
            Grafana[Grafana Dashboards]
        end
    end

    %% Data Pipeline Connections
    GH --> GoApps
    Mongo --> GoApps
    Observability -- "Host Metrics" --> Collectors
    Observability -- "Query Data" --> MCP
    Tailscale -- "Status" --> Collectors
    Collectors -- "Host Metrics Data" --> PG
    GoApps -- Data --> PG

    %% Telemetry Pipeline (OTLP)
    GoApps & MCP & Collectors -- "Logs, Metrics, Traces" --> OTEL
    OTEL --> Observability
    Observability -- "Offload" --> S3

    %% Visualization Connections
    Observability & PG --> Grafana
```
Foundational principles guide every aspect of the platform's development and operation:
- Signals over Noise: Standardizing telemetry signals to provide immediate clarity on service behavior across the entire stack.
- Logic over Plumbing: Decoupling infrastructure boilerplate from service logic using shared Go wrappers to focus on domain value.
- Config as the Truth: Using GitOps to ensure version control remains the ultimate source of truth, with automated state reconciliation.
- Pragmatic Orchestration: Leveraging Kubernetes for persistence and native Systemd for host automation to maximize reliability with minimal overhead.
This guide will help you set up and run the observability-hub locally using Kubernetes (K3s).
Ensure you have the following installed on your system:
The project uses a `.env` file to manage environment variables, especially for database connections and API keys.
```bash
# Start by copying the example file
cp .env.example .env
```

You will need to edit the newly created `.env` file to configure connections for MongoDB Atlas, PostgreSQL (K3s NodePort), and other services.
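The shape of the file might look like the fragment below. The variable names here are illustrative guesses, not the project's actual keys; `.env.example` remains the authoritative reference.

```
# Hypothetical example — check .env.example for the real variable names
POSTGRES_HOST=localhost
POSTGRES_PORT=30432          # K3s NodePort exposed by the data tier
POSTGRES_USER=observability
MONGODB_ATLAS_URI=mongodb+srv://...
```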
The platform utilizes a hybrid orchestration model. You must deploy both the Kubernetes data tier and the native host services.
Deploy the observability backend using OpenTofu (IaC):
```bash
cd tofu
tofu init
tofu apply
```

This will provision PostgreSQL, MinIO, Loki, Tempo, Prometheus, Thanos, Grafana, and the OpenTelemetry Collector in the `observability` namespace.
For the Collectors service (which uses a custom local image), use the Makefile target:
```bash
make k3s-collectors-up
```

Build and initialize the automation and telemetry collectors on the host:
```bash
# Build Go binaries
make proxy-build
make ingestion-build

# Install and start Systemd services (requires sudo)
make install-services
```

Once the stack is running, you can verify the end-to-end telemetry flow:
- Cluster Health: Access Grafana at `http://localhost:30000` (NodePort).
- Service Logs: Check logs for host components via Grafana Loki.
To stop or remove resources, use the standard `kubectl delete` commands targeting the `observability` namespace.
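For example, a teardown might look like the commands below. These are generic operational commands, not a project-provided script, and `tofu destroy` assumes the stack was provisioned via the OpenTofu configuration described earlier:

```
# Remove everything in the data-tier namespace
kubectl delete namespace observability

# Or, if provisioned via OpenTofu, destroy the managed resources instead
cd tofu && tofu destroy
```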