bukx/observability-platform

📊 Full-Stack Observability & Incident Response Platform

Prometheus Grafana Kubernetes OpenTelemetry Docker Python

End-to-end observability platform implementing the three pillars (metrics, logs, traces) with SLI/SLO dashboards, automated alerting, structured runbooks, and chaos engineering validation.


🏗 Architecture

Architecture Diagram

🔧 Tech Stack

| Component | Tool | Purpose |
| --- | --- | --- |
| Metrics | Prometheus + Grafana | Collection, SLI/SLO dashboards, alerting |
| Tracing | OpenTelemetry + Jaeger | Distributed trace collection and visualization |
| Logging | ELK Stack (Logstash + Kibana) | Centralized log aggregation and search |
| Alerting | Alertmanager + PagerDuty | Automated incident escalation |
| Chaos | Custom scripts | Fault injection (latency, errors) |
| Runbooks | Markdown | Structured incident response procedures |
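
Centralized log search is most useful when each log line carries the ID of the trace it belongs to, so a log entry in Kibana can be matched to a trace in Jaeger. The sketch below shows one way to do that with only the Python standard library. It is illustrative and not taken from `app/src/`: the logger name `demo-app`, the `trace_id` field name, and the helper functions are all assumptions, and in the real application the trace ID would presumably come from the OpenTelemetry span context and not be passed in by hand.

```python
import io
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id arrives via `extra=`; None if the caller did not set it.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("demo-app")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger, trace_id: str) -> None:
    """Log two events that share one trace ID (hypothetical handler)."""
    logger.info("request started", extra={"trace_id": trace_id})
    logger.info("request finished", extra={"trace_id": trace_id})


if __name__ == "__main__":
    handle_request(build_logger(sys.stdout), trace_id="4bf92f3577b34da6")
```

Because every line is a JSON object with a stable `trace_id` key, a log pipeline can index that field directly and no regex parsing of free-form messages is needed.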

🚀 Quick Start

# Deploy the full stack
kubectl apply -f k8s/app/            # Instrumented application
kubectl apply -f k8s/prometheus/      # Prometheus + Alertmanager
kubectl apply -f k8s/grafana/         # SLI/SLO dashboards
kubectl apply -f k8s/elk/            # ELK logging stack
kubectl apply -f k8s/otel-collector/ # Trace collection

# Inject chaos and observe
./chaos-scripts/chaos-runner.sh latency   # Simulate high latency
./chaos-scripts/chaos-runner.sh errors    # Simulate error spike
./chaos-scripts/chaos-runner.sh reset     # Restore normal state
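
The three modes above (`latency`, `errors`, `reset`) can be pictured as a switch that wraps a request handler. The following is a minimal in-process sketch of that idea, not the contents of `chaos-runner.sh`, whose implementation is not shown here. `ChaosConfig`, `InjectedFault`, `with_chaos`, and the default delay and error rate are all hypothetical.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class ChaosConfig:
    mode: str = "reset"            # "latency", "errors", or "reset"
    latency_seconds: float = 0.5   # assumed injected delay
    error_rate: float = 0.5        # assumed fraction of requests to fail


class InjectedFault(RuntimeError):
    """Raised when the injector deliberately fails a request."""


def with_chaos(config, handler, sleep=time.sleep, rng=random.random):
    """Wrap `handler` so it misbehaves according to `config.mode`.

    `sleep` and `rng` are injectable so the wrapper can be tested
    without real delays or real randomness.
    """
    def wrapped(*args, **kwargs):
        if config.mode == "latency":
            sleep(config.latency_seconds)
        elif config.mode == "errors" and rng() < config.error_rate:
            raise InjectedFault("injected error")
        # "reset" (or any other mode) falls through to normal behavior.
        return handler(*args, **kwargs)
    return wrapped


if __name__ == "__main__":
    config = ChaosConfig(mode="latency", latency_seconds=0.1)
    handler = with_chaos(config, lambda: "ok")
    print(handler())      # delayed by roughly 0.1 s
    config.mode = "reset"
    print(handler())      # no injected delay
```

The config is read on every call, so switching `mode` takes effect immediately, which mirrors running the script with `reset` after an experiment.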

📈 Key Outcomes

| Metric | Result |
| --- | --- |
| Mean Time to Detect | < 3 minutes from fault to alert |
| Mean Time to Resolve | < 30 minutes with structured runbooks |
| Three-pillar coverage | Metrics + traces + logs fully correlated |
| SLO compliance | Tracked via Grafana SLI/SLO dashboards |
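
To make the figures above concrete, here is a small sketch of the arithmetic behind an availability SLI, the remaining error budget, and a single time-to-detect measurement. The function names, the request counts, and the 99.9% target are assumptions chosen for illustration. The project's actual SLO targets live in its Grafana dashboards and are not stated here.

```python
from datetime import datetime


def availability_sli(good: int, total: int) -> float:
    """Fraction of requests that were good; 1.0 when there was no traffic."""
    if total == 0:
        return 1.0
    return good / total


def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Share of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1 - slo_target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # No budget at all: any failure means the budget is exhausted.
        return 1.0 if actual_bad == 0 else 0.0
    return 1 - actual_bad / allowed_bad


def minutes_to_detect(fault_at: datetime, alert_at: datetime) -> float:
    """Minutes between fault injection and the first alert firing."""
    return (alert_at - fault_at).total_seconds() / 60


if __name__ == "__main__":
    # Hypothetical window: 100,000 requests, 25 failures, 99.9% target.
    print(availability_sli(99_975, 100_000))
    print(error_budget_remaining(99_975, 100_000, slo_target=0.999))
    print(minutes_to_detect(datetime(2024, 1, 1, 12, 0, 0),
                            datetime(2024, 1, 1, 12, 2, 30)))
```

In this hypothetical window the target allows about 100 failed requests, so 25 failures consume roughly a quarter of the budget. Averaging `minutes_to_detect` over several chaos runs would give a Mean Time to Detect comparable to the figure in the table.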

📁 Project Structure

├── app/src/                    # Instrumented Python application
├── chaos-scripts/              # Chaos engineering fault injectors
├── dashboards/grafana/         # SLI/SLO dashboard JSON
├── docker/                     # Dockerfiles
├── k8s/
│   ├── alertmanager/           # Alert routing config
│   ├── app/                    # App deployment manifests
│   ├── elk/                    # Logstash pipeline config
│   ├── otel-collector/         # OTel collector config
│   └── prometheus/             # Prometheus + alert rules
├── postmortem-templates/       # Incident postmortem template
└── runbooks/                   # Structured runbooks

📜 License

This project is for portfolio/demonstration purposes.
