Node Doctor

Node Doctor is a Kubernetes node health monitoring and auto-remediation system. It runs as a DaemonSet on each node, performs health checks, and automatically fixes common problems.

Features

Comprehensive Health Checks

- System resources (CPU, memory, disk)
- Network connectivity (DNS, gateway, external)
- Kubernetes components (kubelet, API server, container runtime)
- Custom plugin execution
- Log pattern matching

Auto-Remediation

- Automatic problem resolution with safety mechanisms
- Cooldown periods and rate limiting
- Circuit breaker pattern
- Dry-run mode for testing
- Remediation history tracking

Multiple Export Channels

- Kubernetes node conditions and events
- Node annotations
- HTTP health endpoints on hostPort
- Prometheus metrics

Safety First

- Multiple layers of protection against remediation storms
- Configurable per-problem remediation strategies
- Audit trail for all remediation actions
- Fail-safe defaults

Architecture

Node Doctor is based on the proven architecture of Node Problem Detector but extends it with comprehensive auto-remediation capabilities.

┌─────────────────────────────────────────────────────────────┐
│                        Node Doctor                           │
│                                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   System     │  │   Network    │  │ Kubernetes   │      │
│  │   Monitors   │  │   Monitors   │  │   Monitors   │ ...  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘      │
│         │                  │                  │               │
│         └──────────────────┼──────────────────┘              │
│                            ▼                                  │
│                  ┌─────────────────┐                         │
│                  │     Problem     │                         │
│                  │    Detector     │                         │
│                  │  (Orchestrator) │                         │
│                  └─────────┬───────┘                         │
│                            │                                  │
│         ┌──────────────────┼──────────────────┐             │
│         ▼                  ▼                   ▼             │
│  ┌──────────┐      ┌─────────────┐    ┌─────────────┐      │
│  │    K8s   │      │ Remediator  │    │ Prometheus  │      │
│  │ Exporter │      │   Registry  │    │  Exporter   │      │
│  └──────────┘      └─────────────┘    └─────────────┘      │
└─────────────────────────────────────────────────────────────┘

See docs/architecture.md for detailed architecture information.
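
The flow in the diagram (monitors feed the problem detector, which hands results to the exporters) can be sketched in a few lines of Go. This is a minimal illustration only: the `Status`, `Monitor`, `Exporter`, `staticMonitor`, `memoryExporter`, and `RunOnce` names are hypothetical and are not taken from the Node Doctor source, which may be organized quite differently.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a hypothetical result type; an empty Problem means "healthy".
type Status struct {
	Monitor string
	Problem string
}

// Monitor and Exporter are illustrative interfaces, not the project's real ones.
type Monitor interface {
	Check() Status
}

type Exporter interface {
	Export(Status)
}

// staticMonitor always reports the same status, which keeps the sketch deterministic.
type staticMonitor struct {
	name, problem string
}

func (m staticMonitor) Check() Status {
	return Status{Monitor: m.name, Problem: m.problem}
}

// memoryExporter records everything it receives.
type memoryExporter struct {
	seen []Status
}

func (e *memoryExporter) Export(s Status) {
	e.seen = append(e.seen, s)
}

// RunOnce runs every monitor concurrently (fan-in), then hands each result
// to every exporter (fan-out). It returns the number of problems found.
func RunOnce(monitors []Monitor, exporters []Exporter) int {
	results := make(chan Status, len(monitors)) // buffered so no monitor blocks
	var wg sync.WaitGroup
	for _, m := range monitors {
		wg.Add(1)
		go func(m Monitor) {
			defer wg.Done()
			results <- m.Check()
		}(m)
	}
	wg.Wait()
	close(results)

	problems := 0
	for s := range results {
		if s.Problem != "" {
			problems++
		}
		for _, e := range exporters {
			e.Export(s)
		}
	}
	return problems
}

func main() {
	monitors := []Monitor{
		staticMonitor{name: "disk", problem: ""},
		staticMonitor{name: "kubelet", problem: "KubeletUnhealthy"},
	}
	exporter := &memoryExporter{}
	problems := RunOnce(monitors, []Exporter{exporter})
	fmt.Printf("%d problem(s) across %d report(s)\n", problems, len(exporter.seen))
}
```

Keeping the exporters behind a single interface is what would let the Kubernetes, HTTP, and Prometheus channels be enabled independently in such a design.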

Quick Start

Detailed Guide: For comprehensive instructions including Grafana dashboard import and configuration options, see the Quick Start Guide.

Prerequisites

Kubernetes 1.20+
Node OS: Linux

Version Compatibility

Component	Minimum Version	Recommended
Go	1.21	1.22+
Kubernetes	1.20	1.28+
Node OS	Linux kernel 4.15+	5.4+
Container Runtime	containerd 1.6+, Docker 20.10+	containerd 1.7+

Installation

Option 1: Helm Chart (Recommended)

The easiest way to deploy Node Doctor is using the Helm chart:

# Add the SupportTools Helm repository
helm repo add supporttools https://charts.support.tools
helm repo update

# Install Node Doctor
helm install node-doctor supporttools/node-doctor \
  --namespace node-doctor \
  --create-namespace

# Verify deployment
kubectl get daemonset -n node-doctor node-doctor
kubectl get pods -n node-doctor -l app.kubernetes.io/name=node-doctor

Custom Installation with Values:

# Install with custom values
helm install node-doctor supporttools/node-doctor \
  --namespace node-doctor \
  --create-namespace \
  -f custom-values.yaml

# Or override specific values
helm install node-doctor supporttools/node-doctor \
  --namespace node-doctor \
  --create-namespace \
  --set settings.logLevel=debug \
  --set settings.enableRemediation=false

See helm/node-doctor/README.md for complete Helm chart documentation and configuration options.

Option 2: Kubernetes DaemonSet (Manual)

Deploy Node Doctor as a DaemonSet:

kubectl apply -f deployment/rbac.yaml
kubectl apply -f deployment/configmap.yaml
kubectl apply -f deployment/daemonset.yaml

Verify deployment:

kubectl get daemonset -n kube-system node-doctor
kubectl get pods -n kube-system -l app=node-doctor

Check node health:

# Via HTTP endpoint (from a node)
curl http://localhost:8080/health

# Via kubectl
kubectl get nodes -o json | jq '.items[].status.conditions'

Option 3: Standalone Binary (For Testing/Development)

Download pre-built binaries from the releases page:

# Linux amd64
wget https://github.com/supporttools/node-doctor/releases/latest/download/node-doctor_linux_amd64.tar.gz
tar -xzf node-doctor_linux_amd64.tar.gz
sudo mv node-doctor /usr/local/bin/
node-doctor --version

# Linux arm64
wget https://github.com/supporttools/node-doctor/releases/latest/download/node-doctor_linux_arm64.tar.gz
tar -xzf node-doctor_linux_arm64.tar.gz
sudo mv node-doctor /usr/local/bin/

# Note: macOS builds temporarily unavailable
# For macOS testing, use Docker or build from source

Verify artifact signatures (recommended for production):

All releases are dual-signed for defense-in-depth security:

Cosign (GitHub OIDC): Proves CI/CD built it
GPG (Maintainer key): Proves maintainer approved it

# Layer 1: Cosign verification (GitHub Actions)
brew install cosign  # macOS / or download from https://github.com/sigstore/cosign/releases
cosign verify-blob \
  --signature node-doctor_linux_amd64.tar.gz.cosign.sig \
  --certificate node-doctor_linux_amd64.tar.gz.cosign.crt \
  --certificate-identity-regexp="https://github.com/supporttools/node-doctor" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  node-doctor_linux_amd64.tar.gz

# Layer 2: GPG verification (Maintainer approval)
brew install gnupg  # macOS / or apt-get install gnupg
gpg --keyserver keyserver.ubuntu.com --recv-keys <MAINTAINER_KEY_ID>  # See release notes
gpg --verify node-doctor_linux_amd64.tar.gz.asc node-doctor_linux_amd64.tar.gz

# Both verifications should pass for production deployments ✅

Option 4: Docker Image

Pull multi-architecture images from Docker Hub:

# Pull latest stable release
docker pull supporttools/node-doctor:latest

# Pull specific version
docker pull supporttools/node-doctor:v1.0.0

# Run locally for testing
docker run --rm supporttools/node-doctor:latest --help

Supported platforms: linux/amd64, linux/arm64

Configuration

Node Doctor is configured via a ConfigMap. The default configuration is suitable for most environments.

To customize:

1. Edit deployment/configmap.yaml
2. Apply changes: kubectl apply -f deployment/configmap.yaml
3. Node Doctor reloads the configuration automatically; restart the pods if the change does not take effect

See docs/configuration.md for complete configuration reference.

Health Checks

System Health

CPU: Load average, thermal throttling
Memory: Available memory, OOM conditions, swap usage
Disk: Disk space, inode usage, I/O health, readonly filesystems

Network Health

DNS: Cluster DNS and external DNS resolution
Gateway: Default gateway reachability and latency
Connectivity: External connectivity checks

Kubernetes Components

kubelet: Health endpoint, systemd status, PLEG performance
API Server: Connectivity, latency, authentication
Container Runtime: Docker/containerd/CRI-O health
Pod Capacity: Available pod slots

Custom Checks

Plugin Execution: Run custom health check scripts
Log Patterns: Match problematic log patterns

See docs/monitors.md for detailed monitor documentation.
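
As a rough illustration of the log pattern check described above, the sketch below counts log lines that match a set of regular expressions. The `Pattern` type, the `Scan` function, and the two entries in `examplePatterns` are assumptions made for this example; they are not Node Doctor's built-in patterns or configuration format.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// Pattern pairs a problem name with the expression that detects it.
// Both the type and the names below are hypothetical.
type Pattern struct {
	Problem string
	Re      *regexp.Regexp
}

// examplePatterns are illustrative only, not the project's defaults.
var examplePatterns = []Pattern{
	{Problem: "oom-kill", Re: regexp.MustCompile(`Out of memory: Kill(ed)? process \d+`)},
	{Problem: "filesystem-error", Re: regexp.MustCompile(`EXT4-fs error`)},
}

// Scan reads log text line by line and counts matches per problem name.
func Scan(log string, patterns []Pattern) map[string]int {
	counts := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		for _, p := range patterns {
			if p.Re.MatchString(line) {
				counts[p.Problem]++
			}
		}
	}
	return counts
}

func main() {
	log := "kernel: Out of memory: Killed process 42 (java)\n" +
		"kernel: EXT4-fs error (device sda1): bad block\n" +
		"systemd: Started kubelet.\n"
	for problem, n := range Scan(log, examplePatterns) {
		fmt.Printf("%s: %d\n", problem, n)
	}
}
```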

Remediation

Node Doctor can automatically fix detected problems with multiple safety mechanisms:

Safety Mechanisms

Cooldown Periods: Prevent rapid re-remediation (default: 5 minutes)
Attempt Limiting: Max attempts before giving up (default: 3)
Circuit Breaker: Stop after repeated failures
Rate Limiting: Max remediations per hour (default: 10)
Dry-Run Mode: Test without making changes
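
To make the interaction between these limits concrete, here is a minimal sketch of a gate that applies the cooldown, the attempt limit, and the hourly rate limit with the defaults listed above. The `Gate` type, its fields, and the `at` helper are hypothetical names chosen for this example, and the order in which the checks run is an assumption. The circuit breaker and dry-run mode are left out to keep it short.

```go
package main

import (
	"fmt"
	"time"
)

// Gate decides whether a remediation may run. Hypothetical type for illustration.
type Gate struct {
	Cooldown    time.Duration // minimum gap between remediations of one problem
	MaxAttempts int           // per-problem attempts before giving up
	MaxPerHour  int           // node-wide remediations per rolling hour

	attempts map[string]int
	lastRun  map[string]time.Time
	recent   []time.Time
}

// NewGate uses the defaults from the list above: 5 minutes, 3 attempts, 10 per hour.
func NewGate() *Gate {
	return &Gate{
		Cooldown:    5 * time.Minute,
		MaxAttempts: 3,
		MaxPerHour:  10,
		attempts:    map[string]int{},
		lastRun:     map[string]time.Time{},
	}
}

// Allow reports whether a remediation for problem may run at time now.
// When it returns true, the attempt is recorded.
func (g *Gate) Allow(problem string, now time.Time) (bool, string) {
	if g.attempts[problem] >= g.MaxAttempts {
		return false, "max attempts reached"
	}
	if last, ok := g.lastRun[problem]; ok && now.Sub(last) < g.Cooldown {
		return false, "cooldown active"
	}
	// Keep only remediations from the last hour, then apply the rate limit.
	cutoff := now.Add(-time.Hour)
	kept := g.recent[:0]
	for _, t := range g.recent {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	g.recent = kept
	if len(g.recent) >= g.MaxPerHour {
		return false, "hourly rate limit reached"
	}
	g.attempts[problem]++
	g.lastRun[problem] = now
	g.recent = append(g.recent, now)
	return true, ""
}

// at returns a fixed reference time plus the given number of minutes.
func at(minutes int) time.Time {
	return time.Unix(0, 0).Add(time.Duration(minutes) * time.Minute)
}

func main() {
	g := NewGate()
	for _, minute := range []int{0, 1, 6, 12, 20} {
		ok, reason := g.Allow("kubelet-unhealthy", at(minute))
		fmt.Printf("t+%02dm allowed=%v %s\n", minute, ok, reason)
	}
}
```

Passing the current time in as an argument, rather than calling `time.Now()` inside `Allow`, keeps the gate easy to test without sleeping.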

Remediation Strategies

systemd-restart: Restart systemd services (kubelet, docker, containerd)
network: Network remediation (flush DNS cache, restart interfaces)
disk: Disk cleanup (clean logs, remove unused images)
runtime: Container runtime remediation
custom-script: Execute custom remediation scripts

See docs/remediation.md for detailed remediation documentation.
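
One way to picture how named strategies, dry-run mode, and the remediation history fit together is a small registry that maps a strategy name to a function. The sketch below is an assumption-laden illustration: `Registry`, `Remediator`, `Remediate`, and `ErrUnknownStrategy` are hypothetical names, and the registered function only prints instead of touching systemd.

```go
package main

import (
	"errors"
	"fmt"
)

// Remediator is a hypothetical signature for one remediation strategy.
type Remediator func(target string) error

// ErrUnknownStrategy is returned when no remediator is registered under a name.
var ErrUnknownStrategy = errors.New("unknown remediation strategy")

// Registry maps strategy names to remediators and keeps a simple history.
type Registry struct {
	DryRun     bool
	History    []string
	strategies map[string]Remediator
}

func NewRegistry(dryRun bool) *Registry {
	return &Registry{DryRun: dryRun, strategies: map[string]Remediator{}}
}

func (r *Registry) Register(strategy string, fn Remediator) {
	r.strategies[strategy] = fn
}

// Remediate looks up the strategy and runs it, unless DryRun is set,
// in which case it only records what would have been done.
func (r *Registry) Remediate(strategy, target string) error {
	fn, ok := r.strategies[strategy]
	if !ok {
		return fmt.Errorf("%w: %q", ErrUnknownStrategy, strategy)
	}
	if r.DryRun {
		r.History = append(r.History, "dry-run "+strategy+" "+target)
		return nil
	}
	err := fn(target)
	result := "ok"
	if err != nil {
		result = "failed"
	}
	r.History = append(r.History, result+" "+strategy+" "+target)
	return err
}

func main() {
	reg := NewRegistry(true) // start in dry-run mode
	reg.Register("systemd-restart", func(service string) error {
		// A real remediator would restart the service here.
		fmt.Println("restarting", service)
		return nil
	})
	if err := reg.Remediate("systemd-restart", "kubelet"); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(reg.History)
}
```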

HTTP Endpoints

Node Doctor exposes several HTTP endpoints on port 8080 (configurable via hostPort):

GET /health - Overall node health (200=healthy, 503=unhealthy)
GET /ready - Node readiness status
GET /metrics - Prometheus metrics
GET /status - Detailed status of all monitors (JSON)
GET /remediation/history - Remediation action history (JSON)
GET /conditions - Current node conditions (JSON)

Example:

# From the node
curl http://localhost:8080/health

# From another pod (requires hostNetwork or NodePort service)
curl http://<node-ip>:8080/health
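
For callers that prefer Go over curl, the following sketch polls `/health` and maps the documented status codes (200 = healthy, 503 = unhealthy) to a boolean. The `InterpretStatus` and `NodeHealthy` function names and the 5-second timeout are choices made for this example; treating any other status code as an error is an assumption, since the README documents only those two codes.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// InterpretStatus maps the documented /health status codes to a result.
// Anything other than 200 or 503 is treated as an error (an assumption).
func InterpretStatus(code int) (bool, error) {
	switch code {
	case http.StatusOK:
		return true, nil
	case http.StatusServiceUnavailable:
		return false, nil
	default:
		return false, fmt.Errorf("unexpected status %d from /health", code)
	}
}

// NodeHealthy queries baseURL + "/health" and interprets the response code.
func NodeHealthy(client *http.Client, baseURL string) (bool, error) {
	resp, err := client.Get(baseURL + "/health")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return InterpretStatus(resp.StatusCode)
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	healthy, err := NodeHealthy(client, "http://localhost:8080")
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	fmt.Println("healthy:", healthy)
}
```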

Prometheus Metrics

Node Doctor exposes comprehensive Prometheus metrics on port 9100:

# Monitor execution
node_doctor_monitor_checks_total{monitor,result}
node_doctor_monitor_duration_seconds{monitor}

# Problems detected
node_doctor_problems_detected_total{problem,severity}

# Remediation
node_doctor_remediations_attempted_total{problem,remediator,result}
node_doctor_remediations_succeeded_total{problem,remediator}
node_doctor_circuit_breaker_state{problem}

# API operations
node_doctor_api_requests_total{operation,result}

Examples

Minimal Configuration

apiVersion: node-doctor.io/v1alpha1
kind: NodeDoctorConfig
metadata:
  name: minimal

settings:
  nodeName: "${NODE_NAME}"

monitors:
  - name: kubelet-health
    type: kubernetes-kubelet-check
    enabled: true
    interval: 30s

exporters:
  kubernetes:
    enabled: true
  http:
    enabled: true
    hostPort: 8080

Enable Auto-Remediation

monitors:
  - name: kubelet-health
    type: kubernetes-kubelet-check
    enabled: true
    interval: 30s
    remediation:
      enabled: true
      strategy: systemd-restart
      service: kubelet
      cooldown: 5m
      maxAttempts: 3

remediation:
  enabled: true
  maxRemediationsPerHour: 10

Custom Plugin Check

monitors:
  - name: custom-app-health
    type: custom-plugin-check
    enabled: true
    interval: 2m
    config:
      path: /opt/myapp/health_check.sh
      timeout: 30s
    remediation:
      enabled: true
      strategy: custom-script
      scriptPath: /opt/myapp/remediate.sh
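
The `path` and `timeout` settings above suggest that the monitor runs an external script and gives up after a deadline. A minimal sketch of that pattern with the standard library follows. The `RunPlugin` function is hypothetical, and the exit-code convention (0 = healthy, non-zero = problem) is an assumption; check docs/monitors.md for the contract Node Doctor actually expects from plugins.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// defaultTimeout mirrors the 30s value from the example configuration.
const defaultTimeout = 30 * time.Second

// RunPlugin executes the program at path and stops it after timeout.
// Assumed convention: exit code 0 is healthy, any other exit code is a problem.
// An error is returned only when the plugin could not run or timed out.
func RunPlugin(path string, timeout time.Duration, args ...string) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	err := exec.CommandContext(ctx, path, args...).Run()
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return false, fmt.Errorf("plugin %s timed out after %s", path, timeout)
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return false, nil // the plugin ran and reported a problem
	}
	if err != nil {
		return false, err // the plugin could not be started
	}
	return true, nil
}

func main() {
	healthy, err := RunPlugin("/opt/myapp/health_check.sh", defaultTimeout)
	if err != nil {
		fmt.Println("plugin error:", err)
		return
	}
	fmt.Println("healthy:", healthy)
}
```

Separating "the script reported a problem" from "the script could not run" matters for remediation: only the first case says anything about the health of the application.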

Kubernetes Integration

Node Conditions

Node Doctor updates standard Kubernetes node conditions:

kubectl get nodes -o json | jq '.items[].status.conditions[] | select(.type | startswith("NodeDoctor"))'

Example conditions:

KubeletUnhealthy
DiskPressure
MemoryPressure
NetworkUnreachable

Events

Node Doctor creates events for:

Problem detection
Remediation attempts
Circuit breaker state changes

kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>

Annotations

Node Doctor updates node annotations:

kubectl get node <node-name> -o jsonpath='{.metadata.annotations}' | jq

Example annotations:

node-doctor.io/status: Overall health status
node-doctor.io/version: Node Doctor version
node-doctor.io/last-check: Last check timestamp
node-doctor.io/last-remediation: Last remediation timestamp

Development

Building from Source

# Clone repository
git clone https://github.com/supporttools/node-doctor.git
cd node-doctor

# Build binary
make build

# Build Docker image
make docker-build

Running Locally

# Create local config
cp config/node-doctor.yaml /tmp/config.yaml

# Run with local config
./bin/node-doctor --config=/tmp/config.yaml --log-level=debug

Running Tests

# Unit tests
make test

# Integration tests
make test-integration

# E2E tests (requires Kubernetes cluster)
make test-e2e

# All tests with coverage
make test-all

Project Structure

node-doctor/
├── cmd/node-doctor/          # Main entry point
├── pkg/
│   ├── types/                # Core types
│   ├── detector/             # Problem detection
│   ├── monitors/             # Health monitors
│   ├── remediators/          # Remediation implementations
│   ├── exporters/            # Problem exporters
│   └── util/                 # Utilities
├── config/                   # Example configurations
├── deployment/               # Kubernetes manifests
├── docs/                     # Documentation
└── test/                     # Tests

Documentation

Quick Start Guide - Deploy, configure, and import dashboards
Architecture - System architecture and design
Monitors - Health check monitor implementations
Remediation - Auto-remediation system
Configuration - Configuration reference
Deployment - Deployment guide for production clusters
Release Process - Release management and versioning
Troubleshooting - Common issues and debugging guide

Comparison with Node Problem Detector

Node Doctor is based on Node Problem Detector but adds significant enhancements:

Feature	Node Problem Detector	Node Doctor
Health Checks	✅	✅ Enhanced
Node Conditions	✅	✅
Events	✅	✅
Prometheus Metrics	✅	✅ Enhanced
Node Annotations	❌	✅
HTTP Health Endpoints	⚠️ Basic	✅ Comprehensive
Auto-Remediation	⚠️ Limited	✅ Full-featured
Safety Mechanisms	⚠️ Basic	✅ Multi-layer
Circuit Breaker	❌	✅
Rate Limiting	❌	✅
Dry-Run Mode	❌	✅
Remediation History	❌	✅
Network Health Checks	❌	✅
Disk Cleanup	❌	✅

Roadmap

Phase 1: Core Framework ✅

✅ Architecture design
✅ Documentation
✅ Core types implementation
✅ Monitor interface
✅ Problem detector (orchestrator)

Phase 2: Basic Monitors ✅

✅ System health monitors (CPU, memory, disk)
✅ Network health monitors (DNS, gateway, connectivity)
✅ Kubernetes component monitors (kubelet, API server, runtime)
✅ Custom plugin execution
✅ Log pattern matching

Phase 3: Exporters ✅

✅ Kubernetes exporter (conditions, events, annotations)
✅ HTTP exporter (health endpoints)
✅ Prometheus exporter (metrics)

Phase 4: Remediation ✅

✅ Remediation framework
✅ Safety mechanisms (circuit breaker, rate limiting, cooldowns)
✅ Remediator implementations (systemd, network, disk, runtime, custom)

Phase 5: Testing & Polish (In Progress)

✅ Unit tests
✅ Integration tests
✅ E2E tests
⏳ Performance optimization
⏳ Production hardening

Future Enhancements

⏳ Dynamic configuration reload
Machine learning for anomaly detection
Advanced remediation (node drain, cordon, reboot)
Multi-cluster aggregation
Web UI dashboard

Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.

Security

See SECURITY.md for our security policy, vulnerability reporting process, and documentation of Node Doctor's security model including privilege requirements.

License

Apache License 2.0 - see LICENSE for details.

Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: docs/

Acknowledgments

Node Doctor is inspired by and based on the architecture of Kubernetes Node Problem Detector. We're grateful to the NPD community for their excellent work.

Related Projects

Node Problem Detector - Kubernetes node problem detection
node-healthcheck-operator - Node health checking operator
node-maintenance-operator - Node maintenance orchestration

Made with ❤️ by the SupportTools team
