Kubernetes node health monitoring and auto-remediation system. Node Doctor runs as a DaemonSet on each node, performing comprehensive health checks and automatically fixing common problems.
- Features
- Architecture
- Quick Start
- Health Checks
- Remediation
- HTTP Endpoints
- Prometheus Metrics
- Examples
- Kubernetes Integration
- Development
- Documentation
- Comparison with Node Problem Detector
- Roadmap
- Contributing
- Security
- License
- Support
Comprehensive Health Checks:
- System resources (CPU, memory, disk)
- Network connectivity (DNS, gateway, external)
- Kubernetes components (kubelet, API server, container runtime)
- Custom plugin execution
- Log pattern matching
Auto-Remediation:
- Automatic problem resolution with safety mechanisms
- Cooldown periods and rate limiting
- Circuit breaker pattern
- Dry-run mode for testing
- Remediation history tracking
Multiple Export Channels:
- Kubernetes node conditions and events
- Node annotations
- HTTP health endpoints on hostPort
- Prometheus metrics
Safety First:
- Multiple layers of protection against remediation storms
- Configurable per-problem remediation strategies
- Audit trail for all remediation actions
- Fail-safe defaults
Node Doctor is based on the proven architecture of Node Problem Detector but extends it with comprehensive auto-remediation capabilities.
┌─────────────────────────────────────────────────────────────┐
│                         Node Doctor                         │
│                                                             │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐     │
│  │   System     │   │   Network    │   │  Kubernetes  │     │
│  │   Monitors   │   │   Monitors   │   │   Monitors   │ ... │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘     │
│         │                  │                  │             │
│         └──────────────────┼──────────────────┘             │
│                            ▼                                │
│                  ┌─────────────────┐                        │
│                  │     Problem     │                        │
│                  │    Detector     │                        │
│                  │  (Orchestrator) │                        │
│                  └─────────┬───────┘                        │
│                            │                                │
│         ┌──────────────────┼──────────────────┐             │
│         ▼                  ▼                  ▼             │
│   ┌──────────┐      ┌─────────────┐    ┌─────────────┐      │
│   │   K8s    │      │ Remediator  │    │ Prometheus  │      │
│   │ Exporter │      │  Registry   │    │  Exporter   │      │
│   └──────────┘      └─────────────┘    └─────────────┘      │
└─────────────────────────────────────────────────────────────┘
See docs/architecture.md for detailed architecture information.
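The flow in the diagram (monitors feed a central detector, which fans problems out to exporters) can be sketched in a few lines of Go. This is an illustration only: the `Problem`, `Monitor`, `Exporter`, and `Detector` names are hypothetical and are not taken from Node Doctor's source.

```go
package main

import "fmt"

// Problem is a hypothetical record that a monitor reports to the detector.
type Problem struct {
	Monitor  string
	Severity string
	Message  string
}

// Monitor and Exporter are illustrative interfaces, not Node Doctor's real ones.
type Monitor interface {
	Check() []Problem
}

type Exporter interface {
	Export(p Problem)
}

// Detector plays the orchestrator role from the diagram: it collects
// problems from every monitor and hands each one to every exporter.
type Detector struct {
	Monitors  []Monitor
	Exporters []Exporter
}

// RunOnce performs a single collection pass and returns the number of
// problems that were fanned out.
func (d *Detector) RunOnce() int {
	count := 0
	for _, m := range d.Monitors {
		for _, p := range m.Check() {
			for _, e := range d.Exporters {
				e.Export(p)
			}
			count++
		}
	}
	return count
}

// staticMonitor always reports the same problems; it stands in for a real check.
type staticMonitor struct{ problems []Problem }

func (s staticMonitor) Check() []Problem { return s.problems }

// memoryExporter records what it receives so the flow can be inspected.
type memoryExporter struct{ seen []Problem }

func (m *memoryExporter) Export(p Problem) { m.seen = append(m.seen, p) }

func main() {
	exp := &memoryExporter{}
	d := &Detector{
		Monitors: []Monitor{
			staticMonitor{problems: []Problem{{Monitor: "disk", Severity: "warning", Message: "low space"}}},
			staticMonitor{}, // a healthy monitor reports nothing
		},
		Exporters: []Exporter{exp},
	}
	fmt.Println(d.RunOnce(), len(exp.seen))
}
```

A real orchestrator would run monitors concurrently on their own intervals; the single synchronous pass here only shows the direction of data flow.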
Detailed Guide: For comprehensive instructions including Grafana dashboard import and configuration options, see the Quick Start Guide.
- Kubernetes 1.20+
- Node OS: Linux
| Component | Minimum Version | Recommended |
|---|---|---|
| Go | 1.21 | 1.22+ |
| Kubernetes | 1.20 | 1.28+ |
| Node OS | Linux kernel 4.15+ | 5.4+ |
| Container Runtime | containerd 1.6+, Docker 20.10+ | containerd 1.7+ |
The easiest way to deploy Node Doctor is using the Helm chart:
# Add the SupportTools Helm repository
helm repo add supporttools https://charts.support.tools
helm repo update
# Install Node Doctor
helm install node-doctor supporttools/node-doctor \
--namespace node-doctor \
--create-namespace
# Verify deployment
kubectl get daemonset -n node-doctor node-doctor
kubectl get pods -n node-doctor -l app.kubernetes.io/name=node-doctor
Custom Installation with Values:
# Install with custom values
helm install node-doctor supporttools/node-doctor \
--namespace node-doctor \
--create-namespace \
-f custom-values.yaml
# Or override specific values
helm install node-doctor supporttools/node-doctor \
--namespace node-doctor \
--create-namespace \
--set settings.logLevel=debug \
--set settings.enableRemediation=false
See helm/node-doctor/README.md for complete Helm chart documentation and configuration options.
- Deploy Node Doctor as a DaemonSet:
kubectl apply -f deployment/rbac.yaml
kubectl apply -f deployment/configmap.yaml
kubectl apply -f deployment/daemonset.yaml
- Verify deployment:
kubectl get daemonset -n kube-system node-doctor
kubectl get pods -n kube-system -l app=node-doctor
- Check node health:
# Via HTTP endpoint (from a node)
curl http://localhost:8080/health
# Via kubectl
kubectl get nodes -o json | jq '.items[].status.conditions'
Download pre-built binaries from the releases page:
# Linux amd64
wget https://github.com/supporttools/node-doctor/releases/latest/download/node-doctor_linux_amd64.tar.gz
tar -xzf node-doctor_linux_amd64.tar.gz
sudo mv node-doctor /usr/local/bin/
node-doctor --version
# Linux arm64
wget https://github.com/supporttools/node-doctor/releases/latest/download/node-doctor_linux_arm64.tar.gz
tar -xzf node-doctor_linux_arm64.tar.gz
sudo mv node-doctor /usr/local/bin/
# Note: macOS builds temporarily unavailable
# For macOS testing, use Docker or build from source
Verify artifact signatures (recommended for production):
All releases are dual-signed for defense-in-depth security:
- Cosign (GitHub OIDC): Proves CI/CD built it
- GPG (Maintainer key): Proves maintainer approved it
# Layer 1: Cosign verification (GitHub Actions)
brew install cosign # macOS / or download from https://github.com/sigstore/cosign/releases
cosign verify-blob \
--signature node-doctor_linux_amd64.tar.gz.cosign.sig \
--certificate node-doctor_linux_amd64.tar.gz.cosign.crt \
--certificate-identity-regexp="https://github.com/supporttools/node-doctor" \
--certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
node-doctor_linux_amd64.tar.gz
# Layer 2: GPG verification (Maintainer approval)
brew install gnupg # macOS / or apt-get install gnupg
gpg --keyserver keyserver.ubuntu.com --recv-keys <MAINTAINER_KEY_ID> # See release notes
gpg --verify node-doctor_linux_amd64.tar.gz.asc node-doctor_linux_amd64.tar.gz
# Both verifications should pass for production deployments ✅
Pull multi-architecture images from Docker Hub:
# Pull latest stable release
docker pull supporttools/node-doctor:latest
# Pull specific version
docker pull supporttools/node-doctor:v1.0.0
# Run locally for testing
docker run --rm supporttools/node-doctor:latest --help
Supported platforms: linux/amd64, linux/arm64
Node Doctor is configured via a ConfigMap. The default configuration is intended to suit most environments.
To customize:
- Edit deployment/configmap.yaml
- Apply changes: kubectl apply -f deployment/configmap.yaml
- Node Doctor will automatically reload (or restart pods if needed)
See docs/configuration.md for complete configuration reference.
- CPU: Load average, thermal throttling
- Memory: Available memory, OOM conditions, swap usage
- Disk: Disk space, inode usage, I/O health, readonly filesystems
- DNS: Cluster DNS and external DNS resolution
- Gateway: Default gateway reachability and latency
- Connectivity: External connectivity checks
- kubelet: Health endpoint, systemd status, PLEG performance
- API Server: Connectivity, latency, authentication
- Container Runtime: Docker/containerd/CRI-O health
- Pod Capacity: Available pod slots
- Plugin Execution: Run custom health check scripts
- Log Patterns: Match problematic log patterns
See docs/monitors.md for detailed monitor documentation.
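As an illustration of the log pattern idea, the sketch below scans log text for regular expressions and reports which problems matched. The `LogPattern` type, the `DefaultPatterns` helper, and the two patterns are assumptions made for this example; Node Doctor's actual pattern set and configuration format are described in docs/monitors.md.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// LogPattern pairs a regular expression with the problem name it signals.
// Both the type and the patterns below are illustrative assumptions.
type LogPattern struct {
	Problem string
	Regex   *regexp.Regexp
}

// DefaultPatterns returns two example patterns for kernel log lines.
func DefaultPatterns() []LogPattern {
	return []LogPattern{
		{Problem: "OOMKill", Regex: regexp.MustCompile(`Out of memory: Killed process \d+`)},
		{Problem: "ReadonlyFilesystem", Regex: regexp.MustCompile(`Remounting filesystem read-only`)},
	}
}

// MatchLog scans log text line by line and returns the names of the
// problems whose pattern matched at least one line, in pattern order.
func MatchLog(log string, patterns []LogPattern) []string {
	var found []string
	lines := strings.Split(log, "\n")
	for _, p := range patterns {
		for _, line := range lines {
			if p.Regex.MatchString(line) {
				found = append(found, p.Problem)
				break // one match per pattern is enough
			}
		}
	}
	return found
}

func main() {
	log := "kernel: Out of memory: Killed process 4242 (java)\nkernel: eth0: link up"
	fmt.Println(MatchLog(log, DefaultPatterns()))
}
```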
Node Doctor can automatically fix detected problems with multiple safety mechanisms:
- Cooldown Periods: Prevent rapid re-remediation (default: 5 minutes)
- Attempt Limiting: Max attempts before giving up (default: 3)
- Circuit Breaker: Stop after repeated failures
- Rate Limiting: Max remediations per hour (default: 10)
- Dry-Run Mode: Test without making changes
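How the cooldown, attempt limit, and hourly rate limit might combine can be sketched as a single guard that is consulted before every remediation. The defaults below are the ones listed above; everything else (the `Policy` and `Guard` types, the reason strings, and the choice to apply the hourly limit across all problems on the node) is an assumption for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Policy holds the limits described above. The type and field names are
// hypothetical; the default values are the documented ones.
type Policy struct {
	Cooldown    time.Duration // minimum gap between attempts on one problem
	MaxAttempts int           // attempts on one problem before giving up
	MaxPerHour  int           // remediations allowed per hour (assumed node-wide)
}

func DefaultPolicy() Policy {
	return Policy{Cooldown: 5 * time.Minute, MaxAttempts: 3, MaxPerHour: 10}
}

// Guard decides whether a remediation may run.
type Guard struct {
	policy     Policy
	perProblem map[string][]time.Time // attempt times per problem, oldest first
	all        []time.Time            // attempt times across all problems
}

func NewGuard(p Policy) *Guard {
	return &Guard{policy: p, perProblem: make(map[string][]time.Time)}
}

// Check returns "ok" if a remediation for problem may run at time now,
// otherwise the reason it is blocked.
func (g *Guard) Check(problem string, now time.Time) string {
	past := g.perProblem[problem]
	if len(past) >= g.policy.MaxAttempts {
		return "max attempts reached"
	}
	if n := len(past); n > 0 && now.Sub(past[n-1]) < g.policy.Cooldown {
		return "cooldown active"
	}
	recent := 0
	for _, t := range g.all {
		if now.Sub(t) < time.Hour {
			recent++
		}
	}
	if recent >= g.policy.MaxPerHour {
		return "rate limit reached"
	}
	return "ok"
}

// start is an arbitrary fixed reference time used by Try.
var start = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// Try checks the guard at a given minute offset from start and records
// the attempt when it is allowed.
func (g *Guard) Try(problem string, minute int) string {
	now := start.Add(time.Duration(minute) * time.Minute)
	verdict := g.Check(problem, now)
	if verdict == "ok" {
		g.perProblem[problem] = append(g.perProblem[problem], now)
		g.all = append(g.all, now)
	}
	return verdict
}

func main() {
	g := NewGuard(DefaultPolicy())
	for _, m := range []int{0, 1, 6, 12, 20} {
		fmt.Printf("t+%dm: %s\n", m, g.Try("kubelet", m))
	}
}
```

In this sketch the attempt counter never resets; a production guard would presumably clear it once the problem resolves.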
- systemd-restart: Restart systemd services (kubelet, docker, containerd)
- network: Network remediation (flush DNS cache, restart interfaces)
- disk: Disk cleanup (clean logs, remove unused images)
- runtime: Container runtime remediation
- custom-script: Execute custom remediation scripts
See docs/remediation.md for detailed remediation documentation.
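The sketch below shows one way strategy dispatch, dry-run mode, and a circuit breaker could fit together. It assumes a breaker that opens after a fixed number of consecutive failures; the `Runner` type, the threshold value, and the error names are hypothetical, and only the strategy name `systemd-restart` comes from the list above.

```go
package main

import (
	"errors"
	"fmt"
)

// Remediator is a hypothetical function type: it attempts a fix on a
// target and returns an error if the fix failed.
type Remediator func(target string) error

var (
	ErrBreakerOpen     = errors.New("circuit breaker open")
	ErrUnknownStrategy = errors.New("unknown strategy")
)

// Runner dispatches to remediators by strategy name and stops calling
// them once too many consecutive failures have been seen.
type Runner struct {
	DryRun     bool
	Threshold  int // consecutive failures that open the breaker (assumed)
	strategies map[string]Remediator
	failures   int
}

func NewRunner(dryRun bool, threshold int) *Runner {
	return &Runner{DryRun: dryRun, Threshold: threshold, strategies: make(map[string]Remediator)}
}

// Register associates a strategy name with its implementation.
func (r *Runner) Register(name string, fn Remediator) { r.strategies[name] = fn }

// Run applies the named strategy to target unless the breaker is open.
// In dry-run mode it only reports what it would do.
func (r *Runner) Run(strategy, target string) error {
	if r.failures >= r.Threshold {
		return ErrBreakerOpen
	}
	fn, ok := r.strategies[strategy]
	if !ok {
		return ErrUnknownStrategy
	}
	if r.DryRun {
		fmt.Printf("dry-run: would apply %s to %s\n", strategy, target)
		return nil
	}
	if err := fn(target); err != nil {
		r.failures++
		return err
	}
	r.failures = 0 // a success closes the failure streak
	return nil
}

// alwaysFail and alwaysOK stand in for real remediators.
func alwaysFail(target string) error { return fmt.Errorf("could not fix %s", target) }
func alwaysOK(target string) error   { return nil }

func main() {
	r := NewRunner(false, 2)
	r.Register("systemd-restart", alwaysFail)
	for i := 0; i < 3; i++ {
		fmt.Println(r.Run("systemd-restart", "kubelet"))
	}
}
```

Circuit breakers usually also move to a half-open state after a timeout so that remediation can resume; that step is left out here to keep the example short.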
Node Doctor exposes several HTTP endpoints on port 8080 (configurable via hostPort):
- GET /health - Overall node health (200=healthy, 503=unhealthy)
- GET /ready - Node readiness status
- GET /metrics - Prometheus metrics
- GET /status - Detailed status of all monitors (JSON)
- GET /remediation/history - Remediation action history (JSON)
- GET /conditions - Current node conditions (JSON)
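A client that consumes the `/health` endpoint only needs the status code. The sketch below maps 200 and 503 to a verdict as described above and treats anything else as an error. The function names are hypothetical, and `main` starts a stand-in test server so the example runs without a cluster; on a node the base URL would be `http://localhost:8080`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Interpret maps the /health status code to a verdict, following the
// 200=healthy, 503=unhealthy convention of the endpoint.
func Interpret(status int) (string, error) {
	switch status {
	case http.StatusOK:
		return "healthy", nil
	case http.StatusServiceUnavailable:
		return "unhealthy", nil
	default:
		return "", fmt.Errorf("unexpected status %d", status)
	}
}

// CheckHealth queries <baseURL>/health with a short timeout.
func CheckHealth(baseURL string) (string, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(baseURL + "/health")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return Interpret(resp.StatusCode)
}

func main() {
	// Stand-in server that always reports an unhealthy node.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusServiceUnavailable)
	}))
	defer srv.Close()

	verdict, err := CheckHealth(srv.URL)
	fmt.Println(verdict, err)
}
```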
Example:
# From the node
curl http://localhost:8080/health
# From another pod (requires hostNetwork or NodePort service)
curl http://<node-ip>:8080/health
Node Doctor exposes comprehensive Prometheus metrics on port 9100:
# Monitor execution
node_doctor_monitor_checks_total{monitor,result}
node_doctor_monitor_duration_seconds{monitor}
# Problems detected
node_doctor_problems_detected_total{problem,severity}
# Remediation
node_doctor_remediations_attempted_total{problem,remediator,result}
node_doctor_remediations_succeeded_total{problem,remediator}
node_doctor_circuit_breaker_state{problem}
# API operations
node_doctor_api_requests_total{operation,result}
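The braces in the list above name each metric's labels. To make that concrete, this sketch renders a single sample the way it would appear in the Prometheus text exposition format when scraped. The label values (`kubelet-health`, `success`) and the sample value are made up for the example, and `FormatSample` is a hypothetical helper, not part of Node Doctor.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelEscaper applies the escaping the text exposition format requires
// inside label values: backslash, double quote, and newline.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// FormatSample renders one sample as name{label="value",...} value.
// Labels are sorted by name so the output is stable.
func FormatSample(name string, labels map[string]string, value float64) string {
	if len(labels) == 0 {
		return fmt.Sprintf("%s %g", name, value)
	}
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+`="`+labelEscaper.Replace(labels[k])+`"`)
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	fmt.Println(FormatSample(
		"node_doctor_monitor_checks_total",
		map[string]string{"monitor": "kubelet-health", "result": "success"},
		42,
	))
}
```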
# Minimal configuration: one monitor, Kubernetes and HTTP exporters
apiVersion: node-doctor.io/v1alpha1
kind: NodeDoctorConfig
metadata:
  name: minimal
settings:
  nodeName: "${NODE_NAME}"
monitors:
  - name: kubelet-health
    type: kubernetes-kubelet-check
    enabled: true
    interval: 30s
exporters:
  kubernetes:
    enabled: true
  http:
    enabled: true
    hostPort: 8080

# Monitor with remediation enabled, plus a global remediation limit
monitors:
  - name: kubelet-health
    type: kubernetes-kubelet-check
    enabled: true
    interval: 30s
    remediation:
      enabled: true
      strategy: systemd-restart
      service: kubelet
      cooldown: 5m
      maxAttempts: 3
remediation:
  enabled: true
  maxRemediationsPerHour: 10

# Custom plugin monitor with a custom remediation script
monitors:
  - name: custom-app-health
    type: custom-plugin-check
    enabled: true
    interval: 2m
    config:
      path: /opt/myapp/health_check.sh
      timeout: 30s
    remediation:
      enabled: true
      strategy: custom-script
      scriptPath: /opt/myapp/remediate.sh

Node Doctor updates standard Kubernetes node conditions:
kubectl get nodes -o json | jq '.items[].status.conditions[] | select(.type | startswith("NodeDoctor"))'
Example conditions:
- KubeletUnhealthy
- DiskPressure
- MemoryPressure
- NetworkUnreachable
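The same filtering can be done programmatically. The sketch below parses the JSON produced by `kubectl get nodes -o json` and keeps conditions whose type starts with a given prefix, mirroring the jq filter. Only the fields needed here are modelled, and the embedded sample document, including the condition type `NodeDoctorKubeletUnhealthy`, is invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Condition models the subset of a node condition used in this sketch.
type Condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
	Reason string `json:"reason"`
}

// nodeList follows the shape of `kubectl get nodes -o json` output.
type nodeList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Status struct {
			Conditions []Condition `json:"conditions"`
		} `json:"status"`
	} `json:"items"`
}

// ConditionsWithPrefix returns, per node name, the conditions whose type
// starts with prefix.
func ConditionsWithPrefix(raw []byte, prefix string) (map[string][]Condition, error) {
	var list nodeList
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, err
	}
	out := make(map[string][]Condition)
	for _, item := range list.Items {
		for _, c := range item.Status.Conditions {
			if strings.HasPrefix(c.Type, prefix) {
				out[item.Metadata.Name] = append(out[item.Metadata.Name], c)
			}
		}
	}
	return out, nil
}

// sample is made-up input; the second condition type is hypothetical.
const sample = `{"items":[{"metadata":{"name":"node-a"},"status":{"conditions":[
  {"type":"Ready","status":"True","reason":"KubeletReady"},
  {"type":"NodeDoctorKubeletUnhealthy","status":"False","reason":"KubeletHealthy"}]}}]}`

func main() {
	byNode, err := ConditionsWithPrefix([]byte(sample), "NodeDoctor")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for node, conds := range byNode {
		for _, c := range conds {
			fmt.Printf("%s %s=%s (%s)\n", node, c.Type, c.Status, c.Reason)
		}
	}
}
```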
Node Doctor creates events for:
- Problem detection
- Remediation attempts
- Circuit breaker state changes
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>
Node Doctor updates node annotations:
kubectl get node <node-name> -o jsonpath='{.metadata.annotations}' | jq
Example annotations:
- node-doctor.io/status: Overall health status
- node-doctor.io/version: Node Doctor version
- node-doctor.io/last-check: Last check timestamp
- node-doctor.io/last-remediation: Last remediation timestamp
# Clone repository
git clone https://github.com/supporttools/node-doctor.git
cd node-doctor
# Build binary
make build
# Build Docker image
make docker-build
# Create local config
cp config/node-doctor.yaml /tmp/config.yaml
# Run with local config
./bin/node-doctor --config=/tmp/config.yaml --log-level=debug
# Unit tests
make test
# Integration tests
make test-integration
# E2E tests (requires Kubernetes cluster)
make test-e2e
# All tests with coverage
make test-all
node-doctor/
├── cmd/node-doctor/      # Main entry point
├── pkg/
│   ├── types/            # Core types
│   ├── detector/         # Problem detection
│   ├── monitors/         # Health monitors
│   ├── remediators/      # Remediation implementations
│   ├── exporters/        # Problem exporters
│   └── util/             # Utilities
├── config/               # Example configurations
├── deployment/           # Kubernetes manifests
├── docs/                 # Documentation
└── test/                 # Tests
- Quick Start Guide - Deploy, configure, and import dashboards
- Architecture - System architecture and design
- Monitors - Health check monitor implementations
- Remediation - Auto-remediation system
- Configuration - Configuration reference
- Deployment - Deployment guide for production clusters
- Release Process - Release management and versioning
- Troubleshooting - Common issues and debugging guide
Node Doctor is based on Node Problem Detector but adds significant enhancements:
| Feature | Node Problem Detector | Node Doctor |
|---|---|---|
| Health Checks | ✅ | ✅ Enhanced |
| Node Conditions | ✅ | ✅ |
| Events | ✅ | ✅ |
| Prometheus Metrics | ✅ | ✅ Enhanced |
| Node Annotations | ❌ | ✅ |
| HTTP Health Endpoints | | ✅ Comprehensive |
| Auto-Remediation | ❌ | ✅ Full-featured |
| Safety Mechanisms | ❌ | ✅ Multi-layer |
| Circuit Breaker | ❌ | ✅ |
| Rate Limiting | ❌ | ✅ |
| Dry-Run Mode | ❌ | ✅ |
| Remediation History | ❌ | ✅ |
| Network Health Checks | ❌ | ✅ |
| Disk Cleanup | ❌ | ✅ |
- ✅ Architecture design
- ✅ Documentation
- ✅ Core types implementation
- ✅ Monitor interface
- ✅ Problem detector (orchestrator)
- ✅ System health monitors (CPU, memory, disk)
- ✅ Network health monitors (DNS, gateway, connectivity)
- ✅ Kubernetes component monitors (kubelet, API server, runtime)
- ✅ Custom plugin execution
- ✅ Log pattern matching
- ✅ Kubernetes exporter (conditions, events, annotations)
- ✅ HTTP exporter (health endpoints)
- ✅ Prometheus exporter (metrics)
- ✅ Remediation framework
- ✅ Safety mechanisms (circuit breaker, rate limiting, cooldowns)
- ✅ Remediator implementations (systemd, network, disk, runtime, custom)
- ✅ Unit tests
- ✅ Integration tests
- ✅ E2E tests
- ⏳ Performance optimization
- ⏳ Production hardening
- ⏳ Dynamic configuration reload
- Machine learning for anomaly detection
- Advanced remediation (node drain, cordon, reboot)
- Multi-cluster aggregation
- Web UI dashboard
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
See SECURITY.md for our security policy, vulnerability reporting process, and documentation of Node Doctor's security model including privilege requirements.
Apache License 2.0 - see LICENSE for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: docs/
Node Doctor is inspired by and based on the architecture of Kubernetes Node Problem Detector. We're grateful to the NPD community for their excellent work.
- Node Problem Detector - Kubernetes node problem detection
- node-healthcheck-operator - Node health checking operator
- node-maintenance-operator - Node maintenance orchestration
Made with ❤️ by the SupportTools team