A Kubernetes controller for managing automated upgrades of Talos Linux and Kubernetes.
- Automated Talos node upgrades with intelligent orchestration
- Kubernetes upgrades - upgrade Kubernetes to newer versions
- Safe upgrade execution - upgrades always run from healthy nodes (never self-upgrade)
- Built-in health checks - CEL-based expressions for custom cluster validation
- Configurable reboot modes - default or powercycle options
- Comprehensive status tracking with real-time progress reporting
- Resilient job execution with automatic retry and pod replacement
- Prometheus metrics - detailed monitoring of upgrade progress and health
- Per-node overrides - use annotations to specify unique versions or schematics for specific nodes
- Node labeling - automatic labels during upgrades for integration with remediation systems
- Talos cluster with API access configured
- Namespace for the controller (e.g., `system-upgrade`)
Allow Talos API access from the desired namespace by applying this config to all of your nodes:

```yaml
machine:
  features:
    kubernetesTalosAPIAccess:
      allowedKubernetesNamespaces:
        - system-upgrade # or the namespace the controller will be installed to
      allowedRoles:
        - os:admin
      enabled: true
```
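One way to roll this out is with `talosctl patch machineconfig`; this is a sketch where the node IP and patch filename are placeholders:

```sh
# Apply the machine config patch to a node (repeat per node, or pass several IPs with -n)
talosctl patch machineconfig -n 10.0.0.10 --patch @talos-api-access.yaml
```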
Install the Helm chart:

```sh
# Install via Helm
helm install tuppr oci://ghcr.io/home-operations/charts/tuppr \
  --version 0.1.0 \
  --namespace system-upgrade
```
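Before creating any upgrade resources, you can confirm the controller is running (the label below is the same one used in the troubleshooting section):

```sh
kubectl get pods -n system-upgrade -l app.kubernetes.io/name=tuppr
```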
Create a TalosUpgrade resource:

```yaml
apiVersion: tuppr.home-operations.com/v1alpha1
kind: TalosUpgrade
metadata:
  name: cluster
spec:
  talos:
    # renovate: datasource=docker depName=ghcr.io/siderolabs/installer
    version: v1.11.0 # Required - target Talos version

  policy:
    debug: true         # Optional, verbose logging
    force: false        # Optional, skip etcd health checks
    rebootMode: default # Optional, default|powercycle
    placement: soft     # Optional, hard|soft
    stage: false        # Optional, stage upgrade
    timeout: 30m        # Optional, per-node upgrade timeout

  # Custom health checks (optional)
  healthChecks:
    - apiVersion: v1
      kind: Node
      expr: status.conditions.exists(c, c.type == "Ready" && c.status == "True")

  # Talosctl configuration (optional)
  talosctl:
    image:
      repository: ghcr.io/siderolabs/talosctl # Optional, default
      tag: v1.11.0                            # Optional, auto-detected
      pullPolicy: IfNotPresent                # Optional, default

  # Maintenance windows (optional)
  maintenance:
    windows:
      - start: "0 2 * * 0" # Cron expression (Sunday 02:00)
        duration: "4h"     # How long the window stays open
        timezone: "UTC"    # IANA timezone, default UTC

  # Node selector (optional)
  nodeSelector:
    matchExpressions:
      # Only upgrade nodes that have opted in via this label
      - {key: tuppr.home-operations.com/upgrade, operator: In, values: ["enabled"]}
      # Exclude control plane nodes from this specific plan
      - {key: node-role.kubernetes.io/control-plane, operator: DoesNotExist}

  # Configure drain behavior (optional)
  drain:
    # Continue even if there are pods using emptyDir (local data)
    deleteLocalData: true
    # Ignore DaemonSet-managed pods
    ignoreDaemonSets: true
    # Force drain even if pods do not declare a controller
    force: true
    # Optional: force delete instead of eviction
    # disableEviction: false
    # Optional: skip waiting for delete timeout (seconds)
    # skipWaitForDeleteTimeout: 0
```

Create a KubernetesUpgrade resource:
```yaml
apiVersion: tuppr.home-operations.com/v1alpha1
kind: KubernetesUpgrade
metadata:
  name: kubernetes
spec:
  kubernetes:
    # renovate: datasource=docker depName=ghcr.io/siderolabs/kubelet
    version: v1.34.0 # Required - target Kubernetes version

  # Custom health checks (optional)
  healthChecks:
    - apiVersion: v1
      kind: Node
      expr: status.conditions.exists(c, c.type == "Ready" && c.status == "True")
      timeout: 10m

  # Talosctl configuration (optional)
  talosctl:
    image:
      repository: ghcr.io/siderolabs/talosctl # Optional, default
      tag: v1.11.0                            # Optional, auto-detected
      pullPolicy: IfNotPresent                # Optional, default

  # Maintenance windows (optional)
  maintenance:
    windows:
      - start: "0 2 * * 0" # Cron expression (Sunday 02:00)
        duration: "4h"     # How long the window stays open
        timezone: "UTC"    # IANA timezone, default UTC
```

Define custom health checks using CEL expressions. These health checks are evaluated before each upgrade and run concurrently.
```yaml
healthChecks:
  # Check all nodes are ready
  - apiVersion: v1
    kind: Node
    expr: |
      status.conditions.filter(c, c.type == "Ready").all(c, c.status == "True")
    timeout: 10m

  # Check specific deployment replicas
  - apiVersion: apps/v1
    kind: Deployment
    name: critical-app
    namespace: production
    expr: status.readyReplicas == status.replicas

  # Check custom resources
  - apiVersion: ceph.rook.io/v1
    kind: CephCluster
    name: rook-ceph
    namespace: rook-ceph
    expr: status.ceph.health in ["HEALTH_OK"]
```
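The same pattern extends to other workload kinds. As a sketch (the DaemonSet name and namespace below are hypothetical), a node agent could be gated on all of its scheduled pods being ready:

```yaml
healthChecks:
  # Hypothetical DaemonSet check: every scheduled pod is ready
  - apiVersion: apps/v1
    kind: DaemonSet
    name: node-agent       # placeholder name
    namespace: kube-system # placeholder namespace
    expr: status.numberReady == status.desiredNumberScheduled
```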
Fine-tune upgrade behavior:

```yaml
policy:
  # Enable debug logging for troubleshooting
  debug: true

  # Force upgrade even if etcd is unhealthy (dangerous!)
  force: true

  # Controls how strictly upgrade jobs avoid the target node
  placement: hard # or "soft"

  # Use powercycle reboot for problematic nodes
  rebootMode: powercycle # or "default"

  # Stage the upgrade, then reboot to apply (2 reboots total)
  stage: false
```

Control when upgrades start using cron-based maintenance windows. Upgrades that are already running always complete without interruption.
```yaml
maintenance:
  windows:
    - start: "0 2 * * 0"       # Sunday 02:00
      duration: "4h"           # Max 168h; warns if under 1h
      timezone: "Europe/Paris" # IANA timezone, default UTC
```

- Upgrades only start during open windows (otherwise they stay `Pending`)
- Multiple windows form a union: any open window allows a start (see the sketch below)
- In-progress upgrades always run to completion (never interrupted)
- TalosUpgrade re-checks the window between nodes
- Empty config: upgrades start immediately (backwards compatible)
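For example, a union of a weeknight window and a weekend window (the schedules here are illustrative) could look like:

```yaml
maintenance:
  windows:
    # Weeknights: 01:00 for 3 hours, Monday through Friday
    - start: "0 1 * * 1-5"
      duration: "3h"
      timezone: "UTC"
    # Weekend: Saturday 22:00 for 8 hours
    - start: "0 22 * * 6"
      duration: "8h"
      timezone: "UTC"
```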
Tuppr supports overriding the global TalosUpgrade configuration on a per-node basis using Kubernetes annotations. This is useful for testing new versions on a canary node or handling nodes with different hardware schematics.
| Annotation | Description | Example |
|---|---|---|
| `tuppr.home-operations.com/version` | Overrides the target Talos version for this node. | `v1.12.1` |
| `tuppr.home-operations.com/schematic` | Overrides the Talos schematic hash for this node. | `b55fbf...` |
Example: Applying an override

```sh
# Upgrade a specific node to a different version than the global policy
kubectl annotate node worker-01 tuppr.home-operations.com/version="v1.12.1"

# Apply a custom schematic (with specific extensions) to one node
kubectl annotate node worker-02 tuppr.home-operations.com/schematic="314b18a3f89d..."
```

How it works:

- For annotated nodes, the controller compares the node's current version or schematic against the annotation instead of the global TalosUpgrade spec.
- If a mismatch is found, an upgrade job is triggered for that node using the override values. To return a node to the global policy, remove the annotations as shown below.
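Clearing an override uses kubectl's trailing-dash syntax for deleting annotations:

```sh
# Remove the per-node overrides; the node falls back to the global spec
kubectl annotate node worker-01 tuppr.home-operations.com/version-
kubectl annotate node worker-02 tuppr.home-operations.com/schematic-
```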
Tuppr exposes comprehensive Prometheus metrics for monitoring upgrade progress, health check performance, and job execution:
```text
# Current phase of Talos upgrades (0=Pending, 1=InProgress, 2=Completed, 3=Failed)
tuppr_talos_upgrade_phase{name="cluster", phase="InProgress"} 1

# Node counts for Talos upgrades
tuppr_talos_upgrade_nodes_total{name="cluster"} 5
tuppr_talos_upgrade_nodes_completed{name="cluster"} 3
tuppr_talos_upgrade_nodes_failed{name="cluster"} 0

# Duration of Talos upgrade phases (histogram)
tuppr_talos_upgrade_duration_seconds{name="cluster", phase="InProgress"}

# Current phase of Kubernetes upgrades (0=Pending, 1=InProgress, 2=Completed, 3=Failed)
tuppr_kubernetes_upgrade_phase{name="kubernetes", phase="Completed"} 2

# Duration of Kubernetes upgrade phases (histogram)
tuppr_kubernetes_upgrade_duration_seconds{name="kubernetes", phase="InProgress"}

# Time taken for health checks to pass (histogram)
tuppr_health_check_duration_seconds{upgrade_type="talos", upgrade_name="cluster"}

# Total number of health check failures (counter)
tuppr_health_check_failures_total{upgrade_type="talos", upgrade_name="cluster", check_index="0"}

# Number of active upgrade jobs
tuppr_upgrade_jobs_active{upgrade_type="talos"} 1

# Time taken for upgrade jobs to complete (histogram)
tuppr_upgrade_job_duration_seconds{upgrade_type="talos", node_name="worker-01", result="success"}

# Whether upgrade is currently inside a maintenance window (1=inside, 0=outside)
tuppr_maintenance_window_active{upgrade_type="talos", name="talos"} 0

# Unix timestamp of the next maintenance window start
tuppr_maintenance_window_next_open_timestamp{upgrade_type="talos", name="talos"} 1735603200
```
```text
# Upgrade phase status
tuppr_talos_upgrade_phase or tuppr_kubernetes_upgrade_phase

# Node upgrade progress for Talos (percent complete)
tuppr_talos_upgrade_nodes_completed / tuppr_talos_upgrade_nodes_total * 100

# Health checks passing (no failures in the last 5 minutes)
rate(tuppr_health_check_failures_total[5m]) == 0

# 95th percentile health check duration
histogram_quantile(0.95, rate(tuppr_health_check_duration_seconds_bucket[5m]))

# Job completion rate
rate(tuppr_upgrade_job_duration_seconds_count{result="success"}[5m])

# Active jobs by type
sum(tuppr_upgrade_jobs_active) by (upgrade_type)
```
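If you run the Prometheus Operator, a ServiceMonitor along these lines could scrape the controller; the Service label and port name are assumptions, so adjust them to whatever the Helm chart actually creates:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tuppr
  namespace: system-upgrade
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: tuppr # assumed Service label
  endpoints:
    - port: metrics # assumed port name; the metrics endpoint listens on 8080
      interval: 30s
```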
Example Prometheus alerting rules:
```yaml
groups:
  - name: tuppr
    rules:
      - alert: TupprUpgradeFailed
        expr: tuppr_talos_upgrade_phase{phase="Failed"} == 3 or tuppr_kubernetes_upgrade_phase{phase="Failed"} == 3
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Tuppr upgrade failed"
          description: "Upgrade {{ $labels.name }} has failed"

      - alert: TupprUpgradeStuck
        expr: |
          (
            tuppr_talos_upgrade_phase{phase="InProgress"} == 1 and
            increase(tuppr_talos_upgrade_nodes_completed[30m]) == 0
          ) or (
            tuppr_kubernetes_upgrade_phase{phase="InProgress"} == 1
          )
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Tuppr upgrade appears stuck"
          description: "Upgrade {{ $labels.name }} has been in progress for 30+ minutes without completing nodes"

      - alert: TupprHealthCheckFailures
        expr: rate(tuppr_health_check_failures_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High health check failure rate"
          description: "Health checks for {{ $labels.upgrade_name }} are failing frequently"
```
```sh
# Watch Talos upgrade progress
kubectl get talosupgrade -w

# Watch Kubernetes upgrade progress
kubectl get kubernetesupgrade -w

# Check detailed status
kubectl describe talosupgrade cluster
kubectl describe kubernetesupgrade kubernetes

# View upgrade logs
kubectl logs -f deployment/tuppr -n system-upgrade

# Force a node to a specific version/schematic
kubectl annotate node <node-name> tuppr.home-operations.com/version="v1.10.7"
kubectl annotate node <node-name> tuppr.home-operations.com/schematic="<hash>"

# Check if a node has overrides applied
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION-OVERRIDE:.metadata.annotations."tuppr\.home-operations\.com/version"

# Check metrics endpoint
kubectl port-forward -n system-upgrade deployment/tuppr 8080:8080
curl http://localhost:8080/metrics | grep tuppr_
```

Suspending upgrades can be useful if you want to upgrade manually without the controller interfering.
```sh
# Suspend Talos upgrade
kubectl annotate talosupgrade cluster tuppr.home-operations.com/suspend="true"

# Suspend Kubernetes upgrade
kubectl annotate kubernetesupgrade kubernetes tuppr.home-operations.com/suspend="true"

# Remove the suspend annotation to resume
kubectl annotate talosupgrade cluster tuppr.home-operations.com/suspend-
kubectl annotate kubernetesupgrade kubernetes tuppr.home-operations.com/suspend-
```
```sh
# Reset a failed Talos upgrade
kubectl annotate talosupgrade cluster tuppr.home-operations.com/reset="$(date)"

# Reset a failed Kubernetes upgrade
kubectl annotate kubernetesupgrade kubernetes tuppr.home-operations.com/reset="$(date)"

# Check job logs
kubectl logs job/tuppr-xyz -n system-upgrade

# Check controller health
kubectl get pods -n system-upgrade -l app.kubernetes.io/name=tuppr

# View metrics for debugging
kubectl port-forward -n system-upgrade deployment/tuppr 8080:8080
curl http://localhost:8080/metrics | grep -E "(tuppr_.*_phase|tuppr_.*_duration)"
```
```sh
# Pause all upgrades (scale down the controller)
kubectl scale deployment tuppr --replicas=0 -n system-upgrade

# Emergency cleanup
kubectl delete talosupgrade --all
kubectl delete kubernetesupgrade --all
kubectl delete jobs -l app.kubernetes.io/name=talos-upgrade -n system-upgrade
kubectl delete jobs -l app.kubernetes.io/name=kubernetes-upgrade -n system-upgrade

# Resume operations
kubectl scale deployment tuppr --replicas=1 -n system-upgrade
```
| Feature | TalosUpgrade | KubernetesUpgrade |
|---|---|---|
| Scope | Talos nodes | Kubernetes cluster |
| Multiple CRs | ✅ Multiple plans, queued (FCFS) | ❌ Only one per cluster |
| Execution | Sequential, node-by-node | Single controller node |
| Reboot Required | ✅ Yes | ❌ No |
| Health Checks | ✅ Before each node | ✅ Before upgrade |
| Concurrent Execution | ❌ Blocked by other upgrades | ❌ Blocked by other upgrades |
| Handling Failures | Manual | Manual |
| Metrics | ✅ Comprehensive | ✅ Comprehensive |
- **TalosUpgrade**: You can define multiple TalosUpgrade resources to target different groups of nodes (e.g., "workers-west" vs "workers-east"). While multiple plans can exist simultaneously, only one plan executes at a time (first-come, first-served). The controller automatically queues subsequent plans to ensure safe, sequential orchestration across the cluster.
- **KubernetesUpgrade**: Only one KubernetesUpgrade resource is allowed per cluster. This constraint exists because Kubernetes upgrades affect the entire cluster, and multiple concurrent upgrades would conflict with each other. The admission webhook rejects attempts to create additional KubernetesUpgrade resources.
- **Cross-Upgrade Coordination**: TalosUpgrade and KubernetesUpgrade resources cannot run concurrently. If one upgrade is in progress (`status.phase == "InProgress"`), the other waits in a `Pending` state until the active upgrade completes. This prevents conflicts between Talos node changes and Kubernetes cluster changes that could destabilize the cluster.
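To see which upgrade currently holds the lock, list both resource types with their phases (`status.phase` is the field referenced throughout this section):

```sh
# Show every upgrade resource and its current phase
kubectl get talosupgrade,kubernetesupgrade \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase
```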
```yaml
# ✅ Valid: Multiple TalosUpgrade plans (queued execution)
# Plan 1: Upgrade worker nodes in the west zone
apiVersion: tuppr.home-operations.com/v1alpha1
kind: TalosUpgrade
metadata:
  name: workers-west
spec:
  talos:
    version: v1.12.4
  nodeSelector:
    matchLabels:
      topology.kubernetes.io/zone: west
---
# Plan 2: Upgrade worker nodes in the east zone
apiVersion: tuppr.home-operations.com/v1alpha1
kind: TalosUpgrade
metadata:
  name: workers-east
spec:
  talos:
    version: v1.12.4
  nodeSelector:
    matchLabels:
      topology.kubernetes.io/zone: east
---
# ✅ Valid: Single KubernetesUpgrade resource
apiVersion: tuppr.home-operations.com/v1alpha1
kind: KubernetesUpgrade
metadata:
  name: kubernetes
spec:
  kubernetes:
    version: v1.34.0
---
# ❌ Invalid: A second KubernetesUpgrade will be rejected by the webhook
apiVersion: tuppr.home-operations.com/v1alpha1
kind: KubernetesUpgrade
metadata:
  name: another-kubernetes # This will fail validation
spec:
  kubernetes:
    version: v1.35.0
```

If two active plans target the same node (e.g., Plan A selects `role: worker` and Plan B selects `zone: west`, and a node has both labels), the webhook issues a Warning upon creation. While allowed, this configuration is discouraged because it can cause conflicting upgrade cycles where a node is repeatedly upgraded by alternating plans.
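If you suspect overlap, one way to eyeball it is to print the labels both selectors key on; the label keys below match the hypothetical plans in the warning above and are illustrative:

```sh
# Show each node's zone and worker-role labels side by side
kubectl get nodes -L topology.kubernetes.io/zone -L node-role.kubernetes.io/worker
```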
Scenario 1: TalosUpgrade starts first

```sh
kubectl apply -f talos-upgrade.yaml
# ✅ TalosUpgrade starts immediately (phase: InProgress)

kubectl apply -f kubernetes-upgrade.yaml
# ⏳ KubernetesUpgrade waits (phase: Pending)
# message: "Waiting for Talos upgrade 'talos' to complete before starting Kubernetes upgrade"

# After TalosUpgrade completes (phase: Completed)
# ✅ KubernetesUpgrade starts automatically (phase: InProgress)
```

Scenario 2: KubernetesUpgrade starts first

```sh
kubectl apply -f kubernetes-upgrade.yaml
# ✅ KubernetesUpgrade starts immediately (phase: InProgress)

kubectl apply -f talos-upgrade.yaml
# ⏳ TalosUpgrade waits (phase: Pending)
# message: "Waiting for Kubernetes upgrade 'kubernetes' to complete before starting Talos upgrade"

# After KubernetesUpgrade completes (phase: Completed)
# ✅ TalosUpgrade starts automatically (phase: InProgress)
```

Scenario 3: Only one upgrade type needed

```sh
# If you only need Talos upgrades
kubectl apply -f talos-upgrade.yaml
# ✅ Starts immediately - no blocking

# If you only need Kubernetes upgrades
kubectl apply -f kubernetes-upgrade.yaml
# ✅ Starts immediately - no blocking
```

We welcome contributions! Please see our Contributing Guide for details.
This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details.
- Talos Linux - The modern OS for Kubernetes that inspired this project
- System Upgrade Controller - Inspiration for upgrade orchestration patterns
- Kubebuilder - Excellent framework for building Kubernetes controllers
- Controller Runtime - Powerful runtime for Kubernetes controllers
- CEL - Common Expression Language for flexible health checks
- Prometheus - Monitoring and alerting toolkit for metrics collection
⭐ If this project helps you, please consider giving it a star!
For questions, issues, or feature requests, please visit our GitHub Issues.