An end-to-end observability platform that implements the three pillars of observability (metrics, logs, and traces), with SLI/SLO dashboards, automated alerting, structured runbooks, and chaos engineering validation.

| Component | Tool | Purpose |
|---|---|---|
| Metrics | Prometheus + Grafana | Collection, SLI/SLO dashboards, alerting |
| Tracing | OpenTelemetry + Jaeger | Distributed trace collection and visualization |
| Logging | ELK Stack (Elasticsearch + Logstash + Kibana) | Centralized log aggregation and search |
| Alerting | Alertmanager + PagerDuty | Automated incident escalation |
| Chaos | Custom scripts | Fault injection (latency, errors) |
| Runbooks | Markdown | Structured incident response procedures |
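
Correlating the logging and tracing rows of the table above usually comes down to stamping every log line with the ID of the active trace, so a log entry found in Kibana can be looked up in Jaeger. The following is a minimal sketch using only the Python standard library, not the project's actual instrumentation. The logger name `demo-app`, the field names, and `new_trace_id` are hypothetical; in a real service the ID would come from the tracing SDK rather than being generated locally.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so a log pipeline can index its fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id arrives through `extra=`; None means the line is uncorrelated.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("demo-app")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def new_trace_id() -> str:
    # Stand-in for an ID supplied by a tracing SDK: 32 lowercase hex characters.
    return uuid.uuid4().hex


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.info("checkout failed", extra={"trace_id": new_trace_id()})
    print(buffer.getvalue(), end="")
```

Emitting one JSON object per line keeps the Logstash side simple, since the `trace_id` field can be indexed directly instead of being parsed out of free text.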

```bash
# Deploy the full stack
kubectl apply -f k8s/app/              # Instrumented application
kubectl apply -f k8s/prometheus/       # Prometheus + Alertmanager
kubectl apply -f k8s/grafana/          # SLI/SLO dashboards
kubectl apply -f k8s/elk/              # ELK logging stack
kubectl apply -f k8s/otel-collector/   # Trace collection

# Inject chaos and observe
./chaos-scripts/chaos-runner.sh latency   # Simulate high latency
./chaos-scripts/chaos-runner.sh errors    # Simulate error spike
./chaos-scripts/chaos-runner.sh reset     # Restore normal state
```

| Metric | Result |
|---|---|
| Mean Time to Detect | < 3 minutes from fault to alert |
| Mean Time to Resolve | < 30 minutes with structured runbooks |
| Three-pillar coverage | Metrics + traces + logs fully correlated |
| SLO compliance | Tracked via Grafana SLI/SLO dashboards |
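
To make the "SLO compliance" row concrete, the sketch below shows the arithmetic an availability SLO dashboard typically encodes: the SLI is the fraction of good requests, the error budget is `1 - target`, and budget consumption is the observed failure fraction divided by that budget. This is an illustration only. The function name, the `SloReport` type, and the 99.9% target in the usage note are assumptions, since the project's actual SLO targets are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float              # observed fraction of good requests
    error_budget: float     # allowed fraction of bad requests (1 - target)
    budget_consumed: float  # share of the budget spent (1.0 means exhausted)
    compliant: bool         # True while the SLI meets or exceeds the target


def evaluate_slo(good: int, total: int, target: float) -> SloReport:
    """Evaluate a request-based SLO over one measurement window."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")

    sli = good / total
    error_budget = 1.0 - target
    bad_fraction = (total - good) / total
    return SloReport(
        sli=sli,
        error_budget=error_budget,
        budget_consumed=bad_fraction / error_budget,
        compliant=sli >= target,
    )
```

For example, `evaluate_slo(good=9_995, total=10_000, target=0.999)` describes a window in which 5 of 10,000 requests failed against a hypothetical 99.9% target: the SLO is still met, and roughly half of the error budget has been spent.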

```
├── app/src/                 # Instrumented Python application
├── chaos-scripts/           # Chaos engineering fault injectors
├── dashboards/grafana/      # SLI/SLO dashboard JSON
├── docker/                  # Dockerfiles
├── k8s/
│   ├── alertmanager/        # Alert routing config
│   ├── app/                 # App deployment manifests
│   ├── elk/                 # Logstash pipeline config
│   ├── otel-collector/      # OTel collector config
│   └── prometheus/          # Prometheus + alert rules
├── postmortem-templates/    # Incident postmortem template
└── runbooks/                # Structured runbooks
```

This project is for portfolio/demonstration purposes.
