@@ -481,10 +481,12 @@ subjects:
481481```
482482node-doctor/
483483├── cmd/
484- │ └── node-doctor/ # Main entry point
485- │ ├── main.go # Application bootstrap
486- │ └── options/ # CLI flag definitions
487- │ └── options.go
484+ │ ├── node-doctor/ # DaemonSet entry point
485+ │ │ ├── main.go # Application bootstrap
486+ │ │ └── options/ # CLI flag definitions
487+ │ │ └── options.go
488+ │ └── node-doctor-controller/ # Controller entry point
489+ │ └── main.go # Controller bootstrap
488490│
489491├── pkg/
490492│ ├── types/ # Core type definitions
@@ -544,6 +546,13 @@ node-doctor/
544546│ │ ├── exporter.go
545547│ │ └── metrics.go
546548│ │
549+ │ ├── controller/ # Controller component
550+ │ │ ├── server.go # HTTP server and routes
551+ │ │ ├── storage.go # SQLite storage layer
552+ │ │ ├── correlator.go # Pattern correlation engine
553+ │ │ ├── metrics.go # Prometheus metrics
554+ │ │ └── types.go # API types
555+ │ │
547556│ └── util/ # Utility functions
548557│ ├── config.go # Configuration loading
549558│ ├── kube.go # Kubernetes client helpers
@@ -557,12 +566,20 @@ node-doctor/
557566│ ├── full-featured.yaml
558567│ └── custom-plugins.yaml
559568│
560- ├── deployment/ # Kubernetes manifests
569+ ├── deployment/ # DaemonSet manifests
561570│ ├── daemonset.yaml
562571│ ├── rbac.yaml
563572│ ├── configmap.yaml
564573│ └── service.yaml
565574│
575+ ├── deploy/controller/ # Controller manifests
576+ │ ├── deployment.yaml
577+ │ ├── service.yaml
578+ │ ├── pvc.yaml
579+ │ ├── rbac.yaml
580+ │ ├── configmap.yaml
581+ │ └── kustomization.yaml
582+ │
566583├── test/ # Tests
567584│ ├── e2e/ # End-to-end tests
568585│ ├── integration/ # Integration tests
@@ -572,7 +589,10 @@ node-doctor/
572589 ├── architecture.md # This document
573590 ├── monitors.md # Monitor implementation guide
574591 ├── remediation.md # Remediation guide
575- └── configuration.md # Configuration reference
592+ ├── configuration.md # Configuration reference
593+ ├── controller-deployment.md # Controller deployment guide
594+ ├── testing-guide.md # Developer testing guide
595+ └── testing.md # Operational testing guide
576596```
577597
578598## Threading Model
@@ -753,18 +773,194 @@ Structured JSON logging with fields:
753773- Inject rapid problem flapping
754774- Resource exhaustion scenarios
755775
776+ ## Controller Component
777+
778+ The Node Doctor Controller is an optional central component that provides cluster-wide aggregation, pattern correlation, and remediation coordination.
779+
780+ ### Controller Architecture
781+
782+ ```
783+ ┌─────────────────────────────────────────────────────────────────┐
784+ │ Node 1 (DaemonSet) Node 2 Node N │
785+ │ ┌──────────────────┐ ┌──────────────┐ ┌──────────────┐ │
786+ │ │ Monitors │ │ Monitors │ │ Monitors │ │
787+ │ │ ↓ │ │ ↓ │ │ ↓ │ │
788+ │ │ HTTP Exporter │ │ HTTP Export │ │ HTTP Export │ │
789+ │ │ (webhook push) │ │ (webhook) │ │ (webhook) │ │
790+ │ └────────┬─────────┘ └──────┬───────┘ └──────┬───────┘ │
791+ │ │ │ │ │
792+ │ │ POST /api/v1/reports (every 30s) │ │
793+ │ └────────────────────┼──────────────────┘ │
794+ │ ↓ │
795+ │ ┌─────────────────────────────────────────────────────────┐ │
796+ │ │ Node Doctor Controller (Deployment) │ │
797+ │ │ │ │
798+ │ │ ┌─────────────┐ ┌─────────────┐ ┌───────────────┐ │ │
799+ │ │ │ Aggregator │ → │ Correlator │ → │ Lease Manager │ │ │
800+ │ │ │ (receive) │ │ (patterns) │ │ (coordination)│ │ │
801+ │ │ └─────────────┘ └─────────────┘ └───────────────┘ │ │
802+ │ │ ↓ ↓ ↓ │ │
803+ │ │ ┌─────────────────────────────────────────────────┐ │ │
804+ │ │ │ SQLite Storage (PVC) │ │ │
805+ │ │ │ - Node reports (30 day retention) │ │ │
806+ │ │ │ - Correlation events │ │ │
807+ │ │ │ - Active leases │ │ │
808+ │ │ └─────────────────────────────────────────────────┘ │ │
809+ │ │ ↓ ↓ ↓ │ │
810+ │ │ ┌───────────┐ ┌───────────┐ ┌─────────────────┐ │ │
811+ │ │ │ REST API │ │ Prometheus│ │ K8s Events │ │ │
812+ │ │ │ /api/v1/* │ │ /metrics │ │ (cluster-level) │ │ │
813+ │ │ └───────────┘ └───────────┘ └─────────────────┘ │ │
814+ │ └─────────────────────────────────────────────────────────┘ │
815+ └─────────────────────────────────────────────────────────────────┘
816+ ```
817+
818+ ### Controller Components
819+
820+ #### 1. Aggregator
821+
822+ Receives health reports from all node-doctor agents:
823+
824+ - **Report Ingestion**: `POST /api/v1/reports` endpoint receives node reports
825+ - **Storage**: SQLite database with configurable retention (default 30 days)
826+ - **Deduplication**: Handles repeated reports from the same node
827+ - **State Tracking**: Maintains current state of all nodes
828+
829+ #### 2. Correlation Engine
830+
831+ Detects patterns across multiple nodes:
832+
833+ **Infrastructure Correlation**:
834+ - Triggers when ≥30% of nodes report the same problem type
835+ - Indicates cluster-wide infrastructure issues (DNS, network, storage)
836+
837+ **Common-Cause Correlation**:
838+ - Detects related problems occurring together (e.g., memory + disk pressure)
839+ - Distinguishes root causes from symptoms
840+
841+ **Cascade Correlation**:
842+ - Detects sequential problem chains (e.g., kubelet → pods → node)
843+ - Identifies failure propagation patterns
844+
845+ ```go
846+ // Correlation detection thresholds
847+ type CorrelationConfig struct {
848+ ClusterWideThreshold float64 // Fraction of nodes (default: 0.3)
849+ MinNodesForCorrelation int // Minimum nodes (default: 2)
850+ EvaluationInterval time.Duration // How often to evaluate (default: 30s)
851+ }
852+ ```
853+
854+ #### 3. Lease Manager
855+
856+ Coordinates remediation across the cluster:
857+
858+ **Lease Flow**:
859+ ```
860+ Node requests remediation:
861+ 1. POST /api/v1/leases {node, remediationType, reason}
862+ 2. Controller checks:
863+ - Active leases < maxConcurrent?
864+ - Node doesn't have active lease?
865+ - Cooldown period passed?
866+ 3. Response: 200 OK (approved) or 429 (denied)
867+ 4. If approved, node proceeds with remediation
868+ 5. DELETE /api/v1/leases/{id} (release)
869+ ```
870+
871+ **Safety Features**:
872+ - **Max Concurrent**: Limits simultaneous remediations cluster-wide
873+ - **Cooldown Period**: Prevents repeated remediations on same node
874+ - **Lease Expiration**: Auto-releases leases after timeout
875+ - **Fallback Mode**: Configurable behavior when controller unreachable
876+
877+ ### Controller API
878+
879+ | Endpoint | Method | Description |
880+ |----------|--------|-------------|
881+ | `/healthz` | GET | Liveness probe |
882+ | `/readyz` | GET | Readiness probe |
883+ | `/api/v1/reports` | POST | Receive node health report |
884+ | `/api/v1/cluster/status` | GET | Cluster health summary |
885+ | `/api/v1/cluster/problems` | GET | Active cluster problems |
886+ | `/api/v1/nodes` | GET | List all nodes |
887+ | `/api/v1/nodes/{name}` | GET | Node details |
888+ | `/api/v1/nodes/{name}/history` | GET | Node report history |
889+ | `/api/v1/correlations` | GET | Active correlations |
890+ | `/api/v1/leases` | POST/GET | Request/list leases |
891+ | `/api/v1/leases/{id}` | DELETE | Release lease |
892+ | `/metrics` | GET | Prometheus metrics |
893+
894+ ### Controller Metrics
895+
896+ ```prometheus
897+ # Cluster health
898+ node_doctor_cluster_nodes_total
899+ node_doctor_cluster_nodes_healthy
900+ node_doctor_cluster_nodes_unhealthy
901+ node_doctor_cluster_nodes_unknown
902+
903+ # Problem aggregation
904+ node_doctor_cluster_problem_nodes{problem_type, severity}
905+ node_doctor_cluster_problem_active{problem_type}
906+
907+ # Correlations
908+ node_doctor_correlation_active_total
909+ node_doctor_correlation_detected_total{type}
910+
911+ # Remediation coordination
912+ node_doctor_leases_active_total
913+ node_doctor_leases_granted_total
914+ node_doctor_leases_denied_total{reason}
915+ ```
916+
917+ ### Node-to-Controller Communication
918+
919+ Nodes communicate with the controller via HTTP webhooks:
920+
921+ ```yaml
922+ # Node DaemonSet configuration
923+ exporters:
924+   http:
925+     webhooks:
926+       - name: controller
927+         url: "http://node-doctor-controller.node-doctor:8080/api/v1/reports"
928+         interval: 30s
929+         timeout: 10s
930+
931+ remediation:
932+   coordination:
933+     enabled: true
934+     controllerURL: "http://node-doctor-controller.node-doctor:8080"
935+     leaseTimeout: 5m
936+     fallbackOnUnreachable: false # Block if controller unreachable
937+ ```
938+
939+ ### Controller Deployment
940+
941+ The controller runs as a single-replica Deployment with PVC storage:
942+
943+ - **Deployment**: Single replica (SQLite requires single writer)
944+ - **Storage**: PersistentVolumeClaim for SQLite database
945+ - **Service**: ClusterIP for node-to-controller communication
946+ - **RBAC**: ClusterRole granting read access to nodes and permission to create events
947+
948+ See [Controller Deployment Guide](controller-deployment.md) for detailed setup.
949+
950+ ---
951+
756952## Future Enhancements
757953
758954### Planned Features
759955
7609561. **Dynamic Configuration Reload**: Watch ConfigMap, reload without restart
7619572. **Custom Metrics**: Allow monitors to export custom metrics
7629583. **Notification Webhooks**: Alert external systems on problems
763- 4. **Multi-Cluster Support**: Aggregate status across clusters
959+ 4. **Multi-Cluster Controller**: Aggregate status across clusters
7649605. **ML-Based Anomaly Detection**: Learn normal patterns, detect deviations
7659616. **Advanced Remediation**: Node drain, cordon, reboot
7669627. **Health Check Profiles**: Pre-defined configurations for common use cases
767- 8. **Web UI**: Dashboard for node health visualization
963+ 8. **Web UI**: Dashboard for cluster health visualization
768964
769965### Extension Points
770966