Commit 108a8ad

feat(tests): add controller integration tests and Phase 5 documentation
Add comprehensive testing and documentation for the Node Doctor Controller:

Integration Tests:
- Controller integration tests with lease coordination, correlation detection
- Concurrent lease request and report ingestion tests
- Health endpoint and API error handling tests
- Added Handler() method to server.go for httptest compatibility

E2E Tests:
- Controller coordination E2E tests (lease lifecycle, max concurrent)
- Multi-node correlation E2E tests (infrastructure, common-cause)
- Kubectl utility functions for E2E test operations

Documentation:
- Controller deployment guide (prerequisites, configuration, API reference)
- Developer testing guide (unit/integration/E2E patterns, fixtures, mocks)
- Updated architecture.md with Controller Component section

Completes Phase 5 of the multi-node aggregation implementation.

1 parent 0110e96 commit 108a8ad

8 files changed: +3428 -8 lines changed

docs/architecture.md

Lines changed: 204 additions & 8 deletions
@@ -481,10 +481,12 @@ subjects:
 ```
 node-doctor/
 ├── cmd/
-│   └── node-doctor/  # Main entry point
-│       ├── main.go  # Application bootstrap
-│       └── options/  # CLI flag definitions
-│           └── options.go
+│   ├── node-doctor/  # DaemonSet entry point
+│   │   ├── main.go  # Application bootstrap
+│   │   └── options/  # CLI flag definitions
+│   │       └── options.go
+│   └── node-doctor-controller/  # Controller entry point
+│       └── main.go  # Controller bootstrap

 ├── pkg/
 │   ├── types/  # Core type definitions
@@ -544,6 +546,13 @@ node-doctor/
 │   │   ├── exporter.go
 │   │   └── metrics.go
 │   │
+│   ├── controller/  # Controller component
+│   │   ├── server.go  # HTTP server and routes
+│   │   ├── storage.go  # SQLite storage layer
+│   │   ├── correlator.go  # Pattern correlation engine
+│   │   ├── metrics.go  # Prometheus metrics
+│   │   └── types.go  # API types
+│   │
 │   └── util/  # Utility functions
 │       ├── config.go  # Configuration loading
 │       ├── kube.go  # Kubernetes client helpers
@@ -557,12 +566,20 @@ node-doctor/
 │   ├── full-featured.yaml
 │   └── custom-plugins.yaml

-├── deployment/  # Kubernetes manifests
+├── deployment/  # DaemonSet manifests
 │   ├── daemonset.yaml
 │   ├── rbac.yaml
 │   ├── configmap.yaml
 │   └── service.yaml

+├── deploy/controller/  # Controller manifests
+│   ├── deployment.yaml
+│   ├── service.yaml
+│   ├── pvc.yaml
+│   ├── rbac.yaml
+│   ├── configmap.yaml
+│   └── kustomization.yaml
+
 ├── test/  # Tests
 │   ├── e2e/  # End-to-end tests
 │   ├── integration/  # Integration tests
@@ -572,7 +589,10 @@ node-doctor/
     ├── architecture.md  # This document
     ├── monitors.md  # Monitor implementation guide
     ├── remediation.md  # Remediation guide
-    └── configuration.md  # Configuration reference
+    ├── configuration.md  # Configuration reference
+    ├── controller-deployment.md  # Controller deployment guide
+    ├── testing-guide.md  # Developer testing guide
+    └── testing.md  # Operational testing guide
 ```

 ## Threading Model
@@ -753,18 +773,194 @@ Structured JSON logging with fields:
 - Inject rapid problem flapping
 - Resource exhaustion scenarios

+## Controller Component
+
+The Node Doctor Controller is an optional central component that provides cluster-wide aggregation, pattern correlation, and remediation coordination.
+
+### Controller Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│  Node 1 (DaemonSet)      Node 2              Node N             │
+│  ┌──────────────────┐    ┌──────────────┐    ┌──────────────┐   │
+│  │ Monitors         │    │ Monitors     │    │ Monitors     │   │
+│  │        ↓         │    │      ↓       │    │      ↓       │   │
+│  │ HTTP Exporter    │    │ HTTP Export  │    │ HTTP Export  │   │
+│  │ (webhook push)   │    │ (webhook)    │    │ (webhook)    │   │
+│  └────────┬─────────┘    └──────┬───────┘    └──────┬───────┘   │
+│           │                     │                   │           │
+│           │  POST /api/v1/reports (every 30s)       │           │
+│           └─────────────────────┼───────────────────┘           │
+│                                 ↓                               │
+│  ┌───────────────────────────────────────────────────────────┐  │
+│  │            Node Doctor Controller (Deployment)            │  │
+│  │                                                           │  │
+│  │   ┌─────────────┐   ┌─────────────┐   ┌───────────────┐   │  │
+│  │   │ Aggregator  │ → │ Correlator  │ → │ Lease Manager │   │  │
+│  │   │ (receive)   │   │ (patterns)  │   │ (coordination)│   │  │
+│  │   └─────────────┘   └─────────────┘   └───────────────┘   │  │
+│  │          ↓                 ↓                  ↓           │  │
+│  │   ┌───────────────────────────────────────────────────┐   │  │
+│  │   │               SQLite Storage (PVC)                │   │  │
+│  │   │  - Node reports (30 day retention)                │   │  │
+│  │   │  - Correlation events                             │   │  │
+│  │   │  - Active leases                                  │   │  │
+│  │   └───────────────────────────────────────────────────┘   │  │
+│  │         ↓                ↓                   ↓            │  │
+│  │   ┌───────────┐    ┌───────────┐    ┌─────────────────┐   │  │
+│  │   │ REST API  │    │ Prometheus│    │ K8s Events      │   │  │
+│  │   │ /api/v1/* │    │ /metrics  │    │ (cluster-level) │   │  │
+│  │   └───────────┘    └───────────┘    └─────────────────┘   │  │
+│  └───────────────────────────────────────────────────────────┘  │
+└─────────────────────────────────────────────────────────────────┘
+```
817+
818+
### Controller Components
819+
820+
#### 1. Aggregator
821+
822+
Receives health reports from all node-doctor agents:
823+
824+
- **Report Ingestion**: `POST /api/v1/reports` endpoint receives node reports
825+
- **Storage**: SQLite database with configurable retention (default 30 days)
826+
- **Deduplication**: Handles repeated reports from the same node
827+
- **State Tracking**: Maintains current state of all nodes
828+
829+
#### 2. Correlation Engine
830+
831+
Detects patterns across multiple nodes:
832+
833+
**Infrastructure Correlation**:
834+
- Triggers when ≥30% of nodes report the same problem type
835+
- Indicates cluster-wide infrastructure issues (DNS, network, storage)
836+
837+
**Common-Cause Correlation**:
838+
- Detects related problems occurring together (e.g., memory + disk pressure)
839+
- Identifies root cause vs symptoms
840+
841+
**Cascade Correlation**:
842+
- Detects sequential problem chains (e.g., kubelet → pods → node)
843+
- Identifies failure propagation patterns
844+
845+
```go
846+
// Correlation detection thresholds
847+
type CorrelationConfig struct {
848+
ClusterWideThreshold float64 // Fraction of nodes (default: 0.3)
849+
MinNodesForCorrelation int // Minimum nodes (default: 2)
850+
EvaluationInterval time.Duration // How often to evaluate (default: 30s)
851+
}
852+
```
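
To make the infrastructure threshold concrete, here is a small sketch of how the two numeric fields might combine into a single check. The struct repeats the one from the diff; `isClusterWide` is a hypothetical helper, and treating `MinNodesForCorrelation` as a floor on the affected-node count is an assumption.

```go
package main

import "time"

// CorrelationConfig repeats the struct from the diff above.
type CorrelationConfig struct {
	ClusterWideThreshold   float64
	MinNodesForCorrelation int
	EvaluationInterval     time.Duration
}

// isClusterWide is a hypothetical helper: it reports whether `affected` out
// of `total` nodes reporting the same problem type crosses the threshold.
func isClusterWide(cfg CorrelationConfig, affected, total int) bool {
	// Assumption: too few affected nodes never count as a correlation.
	if total <= 0 || affected < cfg.MinNodesForCorrelation {
		return false
	}
	return float64(affected)/float64(total) >= cfg.ClusterWideThreshold
}
```

With the defaults shown in the diff, 3 affected nodes out of 10 would qualify and 2 out of 10 would not; a single affected node in a two-node cluster is rejected by the minimum even though it is 50% of the cluster.
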
853+
854+
#### 3. Lease Manager
855+
856+
Coordinates remediation across the cluster:
857+
858+
**Lease Flow**:
859+
```
860+
Node requests remediation:
861+
1. POST /api/v1/leases {node, remediationType, reason}
862+
2. Controller checks:
863+
- Active leases < maxConcurrent?
864+
- Node doesn't have active lease?
865+
- Cooldown period passed?
866+
3. Response: 200 OK (approved) or 429 (denied)
867+
4. Node proceeds with remediation
868+
5. DELETE /api/v1/leases/{id} (release)
869+
```
870+
871+
**Safety Features**:
872+
- **Max Concurrent**: Limits simultaneous remediations cluster-wide
873+
- **Cooldown Period**: Prevents repeated remediations on same node
874+
- **Lease Expiration**: Auto-releases leases after timeout
875+
- **Fallback Mode**: Configurable behavior when controller unreachable
876+
877+
### Controller API
878+
879+
| Endpoint | Method | Description |
880+
|----------|--------|-------------|
881+
| `/healthz` | GET | Liveness probe |
882+
| `/readyz` | GET | Readiness probe |
883+
| `/api/v1/reports` | POST | Receive node health report |
884+
| `/api/v1/cluster/status` | GET | Cluster health summary |
885+
| `/api/v1/cluster/problems` | GET | Active cluster problems |
886+
| `/api/v1/nodes` | GET | List all nodes |
887+
| `/api/v1/nodes/{name}` | GET | Node details |
888+
| `/api/v1/nodes/{name}/history` | GET | Node report history |
889+
| `/api/v1/correlations` | GET | Active correlations |
890+
| `/api/v1/leases` | POST/GET | Request/list leases |
891+
| `/api/v1/leases/{id}` | DELETE | Release lease |
892+
| `/metrics` | GET | Prometheus metrics |
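
The commit message mentions a `Handler()` method added for httptest compatibility. As a rough illustration of why exposing an `http.Handler` helps, the sketch below wires two routes from the table onto a `ServeMux` and probes them in memory with `net/http/httptest`. `newHandler` and `statusOf` are hypothetical names, and the `202 Accepted` response for a stored report is an assumption; the diff does not state the real status code.

```go
package main

import (
	"net/http"
	"net/http/httptest"
)

// newHandler registers two of the documented routes on a ServeMux.
func newHandler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/api/v1/reports", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			w.Header().Set("Allow", http.MethodPost)
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		// Assumption: a real handler would decode and store the report here.
		w.WriteHeader(http.StatusAccepted)
	})
	return mux
}

// statusOf sends one in-memory request to h and returns the response code.
func statusOf(h http.Handler, method, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}
```

Because no listener is opened, checks such as `statusOf(newHandler(), http.MethodGet, "/healthz")` run without binding a port, which is the usual motivation for testing against a handler rather than a running server.
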
893+
894+
### Controller Metrics
895+
896+
```prometheus
897+
# Cluster health
898+
node_doctor_cluster_nodes_total
899+
node_doctor_cluster_nodes_healthy
900+
node_doctor_cluster_nodes_unhealthy
901+
node_doctor_cluster_nodes_unknown
902+
903+
# Problem aggregation
904+
node_doctor_cluster_problem_nodes{problem_type, severity}
905+
node_doctor_cluster_problem_active{problem_type}
906+
907+
# Correlations
908+
node_doctor_correlation_active_total
909+
node_doctor_correlation_detected_total{type}
910+
911+
# Remediation coordination
912+
node_doctor_leases_active_total
913+
node_doctor_leases_granted_total
914+
node_doctor_leases_denied_total{reason}
915+
```
916+
917+
### Node-to-Controller Communication
918+
919+
Nodes communicate with the controller via HTTP webhooks:
920+
921+
```yaml
922+
# Node DaemonSet configuration
923+
exporters:
924+
http:
925+
webhooks:
926+
- name: controller
927+
url: "http://node-doctor-controller.node-doctor:8080/api/v1/reports"
928+
interval: 30s
929+
timeout: 10s
930+
931+
remediation:
932+
coordination:
933+
enabled: true
934+
controllerURL: "http://node-doctor-controller.node-doctor:8080"
935+
leaseTimeout: 5m
936+
fallbackOnUnreachable: false # Block if controller unreachable
937+
```
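
The `fallbackOnUnreachable` setting above decides what a node does when it cannot reach the controller. A minimal sketch of that decision on the node side, assuming the lease request is abstracted as a function: `mayRemediate` and `errControllerUnreachable` are hypothetical names, and treating any other error as a denial is an assumption.

```go
package main

import "errors"

// errControllerUnreachable is a hypothetical sentinel for transport failures.
var errControllerUnreachable = errors.New("controller unreachable")

// mayRemediate decides whether the node should proceed with remediation.
// requestLease stands in for the POST /api/v1/leases call.
func mayRemediate(requestLease func() (bool, error), fallbackOnUnreachable bool) bool {
	granted, err := requestLease()
	if errors.Is(err, errControllerUnreachable) {
		// Controller down: proceed only if the fallback is enabled.
		return fallbackOnUnreachable
	}
	if err != nil {
		// Assumption: any other error is treated as a denial.
		return false
	}
	return granted
}
```

With `fallbackOnUnreachable: false`, as in the configuration above, an unreachable controller blocks remediation, which favors safety over availability.
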
938+
939+
### Controller Deployment
940+
941+
The controller runs as a single-replica Deployment with PVC storage:
942+
943+
- **Deployment**: Single replica (SQLite requires single writer)
944+
- **Storage**: PersistentVolumeClaim for SQLite database
945+
- **Service**: ClusterIP for node-to-controller communication
946+
- **RBAC**: ClusterRole for node read, event creation
947+
948+
See [Controller Deployment Guide](controller-deployment.md) for detailed setup.
949+
950+
---
951+
756952
## Future Enhancements
757953
758954
### Planned Features
759955
760956
1. **Dynamic Configuration Reload**: Watch ConfigMap, reload without restart
761957
2. **Custom Metrics**: Allow monitors to export custom metrics
762958
3. **Notification Webhooks**: Alert external systems on problems
763-
4. **Multi-Cluster Support**: Aggregate status across clusters
959+
4. **Multi-Cluster Controller**: Aggregate status across clusters
764960
5. **ML-Based Anomaly Detection**: Learn normal patterns, detect deviations
765961
6. **Advanced Remediation**: Node drain, cordon, reboot
766962
7. **Health Check Profiles**: Pre-defined configurations for common use cases
767-
8. **Web UI**: Dashboard for node health visualization
963+
8. **Web UI**: Dashboard for cluster health visualization
768964
769965
### Extension Points
770966
