
Observability Hub

A resilient, self-hosted platform engineered to showcase Site Reliability Engineering (SRE) and Platform Engineering principles. It delivers full-stack observability (logs, metrics, traces), GitOps-driven infrastructure management, and standardized telemetry ingestion for cloud-native environments.

Built using Go and orchestrated on Kubernetes (K3s), the platform unifies system metrics, application events, and logs into a single queryable layer leveraging OpenTelemetry, PostgreSQL (TimescaleDB), Grafana Loki, Prometheus, and Grafana. It's designed for operational excellence, demonstrating how to build a robust, observable, and maintainable system from the ground up.

Explore Live Telemetry & System Evolution


🚀 Key Achievements & Capabilities

This project highlights significant accomplishments in building a modern observability and platform engineering solution:

  • Unified Go Monorepo: Consolidated fragmented modules into a single root module, eliminating 17 replace directives and standardizing dependency management across all services.
  • Encapsulated Architecture: Transitioned to an internal/ and cmd/ layout, enforcing Go's package visibility rules and adopting the "Thin Main" pattern for better testability and system integrity.
  • Full OpenTelemetry (LMT) Implementation: Achieved end-to-end observability with a unified OTel Collector, Tempo (Traces), Prometheus (Metrics), Loki (Logs), and Go SDK for instrumentation.
  • GitOps Reconciliation Engine: Implemented a secure, templated GitOps reconciliation engine for automated state enforcement via webhooks, scaled to support multi-tenant synchronization.
  • Kubernetes Migration & Cloud-Native Operations: All core observability stack components (Loki, Grafana, Tempo, Prometheus, Postgres) are running natively in Kubernetes with persistent storage.
  • Centralized Secrets Management: Integrated OpenBao for secure, dynamic credential retrieval across all services, replacing insecure static configurations.
  • Hybrid Cloud Architecture (Store-and-Forward Bridge): Designed and implemented a secure bridge for ingesting external telemetry without exposing local ports, ensuring reliable data flow from diverse sources.
  • Reproducible Local Development: Ensures consistent and reproducible developer environments via shell.nix and docker-compose.
  • Formalized Decision-Making & Incident Response: Established an Architectural Decision Record (ADR) process and an Incident Response/RCA framework for structured decision-making and operational excellence.
  • Unified Host Telemetry Collectors: Deployed a resource-efficient collectors service, centralizing host-level data collection and optimizing processing.

📚 Further Documentation

For deeper insights into the project's structure and operational guides:

  • Documentation Hub: Central entry point for Architecture, Decisions (ADRs), and Operational Notes.

πŸ› οΈ Tech Stack & Architecture

The platform leverages a robust set of modern technologies for its core functions:

  • Language: Go
  • Infrastructure & Delivery: OpenTofu, Kubernetes (K3s), Helm, Docker, OpenBao, Tailscale
  • Telemetry: OpenTelemetry, Grafana Loki, Grafana, Grafana Tempo, Prometheus
  • Storage: PostgreSQL, MinIO (S3), MongoDB

System Architecture Overview

The diagram below illustrates the high-level flow of telemetry data from collection to visualization, highlighting the hybrid orchestration model between host services and the Kubernetes data platform.

flowchart TB
    subgraph ObservabilityHub ["Observability Hub"]
        direction TB
        subgraph Logic ["Data Ingestion"]
            subgraph External ["External Sources"]
                GH(GitHub Webhooks/Journals)
                Mongo(MongoDB Atlas)
            end

            subgraph Security [Security]
                Bao[OpenBao]
                Tailscale[Tailscale]
            end

            GoApps["Go Services (Proxy, Ingestion)"]
            MCP["MCP Telemetry"]
            Collectors["Collectors (Host Metrics & Tailscale)"]
        end

        OTEL[OpenTelemetry Collector]

        Observability["Loki, Tempo, and Prometheus (Thanos)"]
        subgraph Storage ["Data Engines"]
            PG[(PostgreSQL)]
            S3[(MinIO - S3)]
        end
        

        subgraph Visualization ["Visualization"]
            Grafana[Grafana Dashboards]
        end
    end

    %% Data Pipeline Connections
    GH --> GoApps
    Mongo --> GoApps
    Observability -- "Host Metrics" --> Collectors
    Observability -- "Query Data" --> MCP
    Tailscale -- "Status" --> Collectors
    Collectors -- "Host Metrics Data" --> PG
    GoApps -- Data --> PG

    %% Telemetry Pipeline (OTLP)
    GoApps & MCP & Collectors -- "Logs, Metrics, Traces" --> OTEL
    OTEL --> Observability
    Observability -- "Offload" --> S3

    %% Visualization Connections
    Observability & PG --> Grafana

πŸ—οΈ Engineering Principles

Foundational principles guide every aspect of the platform's development and operation:

  • Signals over Noise: Standardizing telemetry signals to provide immediate clarity on service behavior across the entire stack.
  • Logic over Plumbing: Decoupling infrastructure boilerplate from service logic using shared Go wrappers to focus on domain value.
  • Config as the Truth: Using GitOps to ensure version control remains the ultimate source of truth, with automated state reconciliation.
  • Pragmatic Orchestration: Leveraging Kubernetes for persistence and native Systemd for host automation to maximize reliability with minimal overhead.

🚀 Getting Started (Local Development)

This guide will help you set up and run the observability-hub locally using Kubernetes (K3s).

Prerequisites

Ensure you have the following installed on your system:

  • Go
  • K3s (Lightweight Kubernetes)
  • Helm
  • make (GNU Make)
  • Nix (for reproducible toolchains)

1. Configuration

The project uses a .env file to manage environment variables, especially for database connections and API keys.

# Start by copying the example file
cp .env.example .env

You will need to edit the newly created .env file to configure connections for MongoDB Atlas, PostgreSQL (K3s NodePort), and other services.
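As a sketch, the file might look like this. The variable names below are illustrative placeholders, not the project's actual keys — copy the real ones from .env.example:

```
# Illustrative placeholders — the real variable names live in .env.example
MONGODB_URI=mongodb+srv://user:password@cluster.example.mongodb.net/telemetry
POSTGRES_HOST=127.0.0.1
POSTGRES_PORT=30432   # K3s NodePort exposing PostgreSQL
```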

2. Build and Run the Stack

The platform utilizes a hybrid orchestration model. You must deploy both the Kubernetes data tier and the native host services.

A. Data Infrastructure (K3s)

Deploy the observability backend using OpenTofu (IaC):

cd tofu
tofu init
tofu apply

This will provision PostgreSQL, MinIO, Loki, Tempo, Prometheus, Thanos, Grafana, and the OpenTelemetry Collector in the observability namespace.

For the Collectors service (which uses a custom local image), use the Makefile target:

make k3s-collectors-up

B. Native Host Services

Build and initialize the automation and telemetry collectors on the host:

# Build Go binaries
make proxy-build
make ingestion-build

# Install and start Systemd services (requires sudo)
make install-services

3. Verification

Once the stack is running, you can verify the end-to-end telemetry flow:

  • Cluster Health: Access Grafana at http://localhost:30000 (NodePort).
  • Service Logs: Check logs for host components via Grafana Loki.
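As a rough programmatic check, a short Go sketch can poll Grafana's standard /api/health endpoint on the NodePort mentioned above (the map of endpoints is illustrative; extend it with your own service URLs):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkEndpoint returns nil if the URL answers HTTP 200 within the timeout.
func checkEndpoint(url string) error {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Grafana's NodePort per the verification step above; adjust as needed.
	endpoints := map[string]string{
		"grafana": "http://localhost:30000/api/health",
	}
	for name, url := range endpoints {
		if err := checkEndpoint(url); err != nil {
			fmt.Printf("%s: DOWN (%v)\n", name, err)
		} else {
			fmt.Printf("%s: UP\n", name)
		}
	}
}
```

The same loop can be dropped into a cron job or a Systemd timer for a lightweight liveness probe outside the cluster.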

4. Managing the Cluster

To stop or remove resources, delete them from the observability namespace with standard kubectl commands (e.g. `kubectl delete namespace observability` tears down the whole data tier).

About

Kubernetes/systemd bridge (Go/OTel) via MCP for AI agents. Features LGTM stack, HMAC GitOps webhooks, and PII masking for secure, agentic incident triage and observability insights.
