This repository was archived by the owner on Dec 7, 2025. It is now read-only.

Add dash service - Incident Response Dashboard for apps/ directory #5

@justgithubaccount

Dash - Incident Response Dashboard Service

Add a dash service to the apps/ directory: a real-time incident response dashboard for infrastructure and code observability.

Project Structure Integration

Based on the existing project structure, dash will be added as:

apps/
├── chat/
├── gateway/
├── memory/
├── seo/
├── user/
└── dash/              # ← New incident response dashboard
    ├── app/
    │   ├── main.py
    │   ├── aggregator.py
    │   ├── correlator.py
    │   └── websockets.py
    ├── frontend/
    │   ├── src/
    │   └── package.json
    ├── tests/
    ├── .env
    ├── Dockerfile
    ├── README.md
    └── requirements.txt

Service Purpose

dash provides real-time incident response capabilities:

  • Infrastructure monitoring via Kubernetes API
  • Application observability through OpenTelemetry
  • Alert correlation with ML-based grouping
  • Real-time updates via WebSockets
  • Service dependency mapping
  • Code-level insights from distributed traces
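One way the dependency mapping could be derived from distributed traces is to record an edge whenever a span's parent belongs to a different service. The sketch below assumes a deliberately minimal, hypothetical `Span` shape (real OpenTelemetry spans carry many more fields), and `dependency_map` is an illustrative name, not part of the proposal.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    # Hypothetical minimal span used only for this illustration.
    span_id: str
    parent_id: Optional[str]
    service: str


def dependency_map(spans: list[Span]) -> dict[str, set[str]]:
    """Return caller -> callees edges for spans whose parent is in another service."""
    by_id = {span.span_id: span for span in spans}
    edges: dict[str, set[str]] = defaultdict(set)
    for span in spans:
        parent = by_id.get(span.parent_id) if span.parent_id is not None else None
        # Only cross-service parent/child pairs count as a dependency.
        if parent is not None and parent.service != span.service:
            edges[parent.service].add(span.service)
    return dict(edges)
```

Spans whose parent is missing from the batch are skipped, so partial traces yield a partial map rather than an error.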

Integration with Existing Services

  • gateway: Route /dash/* to dash service
  • user: Authentication/authorization for dashboard access
  • chat: Potential integration for incident notifications
  • memory: Cache frequently accessed metrics and correlations

Technical Stack

  • Backend: FastAPI + asyncio (consistent with project style)
  • Frontend: React + TypeScript + Material-UI
  • Data Sources: K8s API, Prometheus, Jaeger, Alertmanager
  • Real-time: WebSocket connections for live updates
  • ML: scikit-learn for alert correlation
  • Caching: Redis for performance optimization
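The issue names scikit-learn for alert correlation but does not specify a model or feature set. As a stand-in for that step, here is a plain heuristic (not ML) that groups alerts for the same service when each fires within a fixed window of the previous one. The `Alert` fields, the `correlate` name, and the 120-second window are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical shape; real Alertmanager payloads carry more fields.
    name: str
    service: str
    fired_at: float  # Unix timestamp in seconds


def correlate(alerts: list[Alert], window_s: float = 120.0) -> list[list[Alert]]:
    """Group alerts per service when each fires within window_s of the previous one."""
    groups: list[list[Alert]] = []
    open_group: dict[str, list[Alert]] = {}  # service -> most recent group
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        current = open_group.get(alert.service)
        if current is not None and alert.fired_at - current[-1].fired_at <= window_s:
            current.append(alert)
        else:
            # Gap too large (or first alert for this service): start a new group.
            current = [alert]
            groups.append(current)
            open_group[alert.service] = current
    return groups
```

A clustering model could later replace the window rule while keeping the same input and output shape.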

API Endpoints

GET  /dash/health                   # Service health check
GET  /dash/api/v1/overview          # Complete infrastructure snapshot
GET  /dash/api/v1/service/{name}    # Service-specific health details
GET  /dash/api/v1/correlations      # Active alert correlations
GET  /dash/api/v1/timeline          # Incident timeline
GET  /dash/api/v1/dependencies      # Service dependency map
WS   /dash/ws                       # Real-time updates WebSocket
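For the `/dash/ws` endpoint, the backend needs some way to fan a single update out to every connected client. The following is a transport-agnostic sketch using only asyncio queues; `UpdateHub`, the queue size of 100, and the drop-on-full policy are assumptions, and the wiring into a FastAPI WebSocket handler is left out because the issue does not define a message format.

```python
import asyncio


class UpdateHub:
    """Fan out dashboard updates to every subscriber (one queue per client)."""

    def __init__(self) -> None:
        self._subscribers: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        # Bounded queue so one stalled client cannot grow memory without limit.
        queue: asyncio.Queue = asyncio.Queue(maxsize=100)
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        self._subscribers.discard(queue)

    def publish(self, update: dict) -> int:
        """Deliver update to all subscribers; return how many received it."""
        delivered = 0
        for queue in self._subscribers:
            try:
                queue.put_nowait(update)
                delivered += 1
            except asyncio.QueueFull:
                pass  # Slow client: drop this update rather than block the others.
        return delivered
```

Each WebSocket connection would call `subscribe()` on connect, forward items from its queue to the client, and call `unsubscribe()` on disconnect.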

Configuration

Environment variables in apps/dash/.env:

KUBERNETES_NAMESPACE=default
PROMETHEUS_URL=http://prometheus:9090
JAEGER_QUERY_URL=http://jaeger-query:16686
ALERTMANAGER_URL=http://alertmanager:9093
REDIS_URL=redis://redis:6379
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
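A minimal sketch of how the service might read these variables at startup, falling back to the values listed above when a variable is unset. `DashSettings` and `load_settings` are hypothetical names; the project may prefer a settings library instead.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DashSettings:
    kubernetes_namespace: str
    prometheus_url: str
    jaeger_query_url: str
    alertmanager_url: str
    redis_url: str
    otel_exporter_otlp_endpoint: str


# Defaults copied from the example values in this issue.
DEFAULTS = {
    "KUBERNETES_NAMESPACE": "default",
    "PROMETHEUS_URL": "http://prometheus:9090",
    "JAEGER_QUERY_URL": "http://jaeger-query:16686",
    "ALERTMANAGER_URL": "http://alertmanager:9093",
    "REDIS_URL": "redis://redis:6379",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4317",
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> DashSettings:
    """Build settings from env (os.environ by default), using DEFAULTS for gaps."""
    source = os.environ if env is None else env
    values = {key.lower(): source.get(key, default) for key, default in DEFAULTS.items()}
    return DashSettings(**values)
```

Passing an explicit mapping, as in `load_settings({"PROMETHEUS_URL": "http://localhost:9090"})`, keeps tests independent of the real environment.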

Success Metrics

  • MTTR Reduction: 50% faster incident resolution
  • Alert Noise Reduction: 70% fewer false positives through correlation
  • Team Adoption: Used in 90% of incident responses
  • Performance: <200ms API response times
  • Reliability: 99.9% uptime for the dashboard service
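To make the reliability target concrete, the uptime percentage can be converted into a downtime budget. The helper below is illustrative only; the 30-day period is an assumption, since the issue does not state a measurement window.

```python
def allowed_downtime_minutes(uptime_target: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted per period at the given uptime target."""
    return (1.0 - uptime_target) * period_days * 24 * 60
```

At 99.9% over 30 days, `allowed_downtime_minutes(0.999)` comes to roughly 43.2 minutes.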

Metadata

Labels

enhancement: New feature or request
