This repository was archived by the owner on Dec 7, 2025. It is now read-only.
Add dash service - Incident Response Dashboard for apps/ directory #5
Status: Open
Labels: enhancement (New feature or request)
Description
Dash - Incident Response Dashboard Service
Add dash service to the apps/ directory - a real-time incident response dashboard for infrastructure and code observability.
Project Structure Integration
Based on the existing project structure, dash will be added as:
apps/
├── chat/
├── gateway/
├── memory/
├── seo/
├── user/
└── dash/                  # ← New incident response dashboard
    ├── app/
    │   ├── main.py
    │   ├── aggregator.py
    │   ├── correlator.py
    │   └── websockets.py
    ├── frontend/
    │   ├── src/
    │   └── package.json
    ├── tests/
    ├── .env
    ├── Dockerfile
    ├── README.md
    └── requirements.txt
Service Purpose
dash provides real-time incident response capabilities:
- Infrastructure monitoring via Kubernetes API
- Application observability through OpenTelemetry
- Alert correlation with ML-based grouping
- Real-time updates via WebSockets
- Service dependency mapping
- Code-level insights from distributed traces
Integration with Existing Services
- gateway: Route /dash/* to dash service
- user: Authentication/authorization for dashboard access
- chat: Potential integration for incident notifications
- memory: Cache frequently accessed metrics and correlations
Technical Stack
- Backend: FastAPI + asyncio (consistent with project style)
- Frontend: React + TypeScript + Material-UI
- Data Sources: K8s API, Prometheus, Jaeger, AlertManager
- Real-time: WebSocket connections for live updates
- ML: scikit-learn for alert correlation
- Caching: Redis for performance optimization
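To make the alert correlation idea concrete, here is a minimal sketch of what correlator.py might start from. It is a deliberately simple stand-in, not the scikit-learn approach named above: it groups alerts from the same service that fire close together in time. The `Alert` fields, the `group_alerts` function, and the 120-second window are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert shape; real AlertManager payloads carry more fields."""
    name: str
    service: str
    fired_at: float  # seconds since epoch


def group_alerts(alerts, window_seconds=120.0):
    """Group alerts per service by time proximity.

    An alert joins the most recent group for its service when it fired
    within `window_seconds` of that group's latest alert; otherwise it
    starts a new group. Groups are returned in order of first alert.
    """
    groups = []
    latest_group_by_service = {}
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        group = latest_group_by_service.get(alert.service)
        if group is not None and alert.fired_at - group[-1].fired_at <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            latest_group_by_service[alert.service] = group
    return groups
```

A clustering model could later replace the fixed window while keeping the same input and output shape, which would let the rest of the service stay unchanged.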
API Endpoints
GET /dash/health # Service health check
GET /dash/api/v1/overview # Complete infrastructure snapshot
GET /dash/api/v1/service/{name} # Service-specific health details
GET /dash/api/v1/correlations # Active alert correlations
GET /dash/api/v1/timeline # Incident timeline
GET /dash/api/v1/dependencies # Service dependency map
WS /dash/ws # Real-time updates WebSocket
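The WS /dash/ws endpoint implies fanning out each update to every connected client. A framework-agnostic sketch of that fan-out, using only asyncio, could look like the following. The `UpdateHub` class, its method names, and the drop-oldest policy for slow clients are assumptions, not part of the proposal; a WebSocket handler would call `subscribe()` on connect and forward queued messages to its socket.

```python
import asyncio


class UpdateHub:
    """Hypothetical in-process hub: one bounded queue per connected client."""

    def __init__(self, max_queue_size=100):
        self._subscribers = set()
        self._max_queue_size = max_queue_size

    def subscribe(self):
        """Register a client and return the queue it should read from."""
        queue = asyncio.Queue(maxsize=self._max_queue_size)
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue):
        """Remove a client, e.g. when its WebSocket disconnects."""
        self._subscribers.discard(queue)

    def publish(self, message):
        """Deliver a message to every subscriber without blocking.

        If a client's queue is full, its oldest message is dropped so a
        slow client cannot stall the others. Returns the subscriber count.
        """
        for queue in self._subscribers:
            if queue.full():
                queue.get_nowait()
            queue.put_nowait(message)
        return len(self._subscribers)
```

Dropping the oldest message favors freshness over completeness, which seems a reasonable default for a live dashboard; a timeline view that needs every event would want a different policy.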
Configuration
Environment variables in apps/dash/.env:
KUBERNETES_NAMESPACE=default
PROMETHEUS_URL=http://prometheus:9090
JAEGER_QUERY_URL=http://jaeger-query:16686
ALERTMANAGER_URL=http://alertmanager:9093
REDIS_URL=redis://redis:6379
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
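One way the service could read these variables at startup is sketched below, using only the standard library. The `DashSettings` class and `load_settings` function are hypothetical names; the defaults simply mirror the values listed above, and the URL check is an assumed convenience so that misconfiguration fails early.

```python
import os
from dataclasses import dataclass
from urllib.parse import urlparse

# Defaults mirror the values listed in apps/dash/.env above.
DEFAULTS = {
    "KUBERNETES_NAMESPACE": "default",
    "PROMETHEUS_URL": "http://prometheus:9090",
    "JAEGER_QUERY_URL": "http://jaeger-query:16686",
    "ALERTMANAGER_URL": "http://alertmanager:9093",
    "REDIS_URL": "redis://redis:6379",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4317",
}


@dataclass(frozen=True)
class DashSettings:
    """Hypothetical settings container; field names are the lowercased keys."""
    kubernetes_namespace: str
    prometheus_url: str
    jaeger_query_url: str
    alertmanager_url: str
    redis_url: str
    otel_exporter_otlp_endpoint: str


def load_settings(env=None):
    """Build settings from a mapping (os.environ by default), validating URLs."""
    env = os.environ if env is None else env
    values = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    for key, value in values.items():
        if key.endswith(("_URL", "_ENDPOINT")):
            parsed = urlparse(value)
            if not parsed.scheme or not parsed.hostname:
                raise ValueError(f"{key} is not a valid URL: {value!r}")
    return DashSettings(**{key.lower(): value for key, value in values.items()})
```

Passing a plain dict instead of os.environ keeps the loader easy to exercise in tests, for example `load_settings({"KUBERNETES_NAMESPACE": "prod"})`.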
Success Metrics
- MTTR Reduction: 50% faster incident resolution
- Alert Noise Reduction: 70% fewer false positives through correlation
- Team Adoption: Used in 90% of incident responses
- Performance: <200ms API response times
- Reliability: 99.9% uptime for the dashboard service