Trademark Notice: Backup-monitor is an independent, third-party tool. It is not affiliated with, endorsed by, or sponsored by Veeam Software or Kasten, Inc. "Kasten," "K10," "Kasten K10," and "Veeam Kasten" are trademarks of Veeam Software. All other trademarks are the property of their respective owners.
Backup-monitor is a smart stuck-action detection and cancellation tool for Veeam Kasten K10 backup environments running on Kubernetes (EKS, AKS, GKE, OpenShift, K3s, and more).
Kasten K10 backup, export, and restore actions can get stuck in Pending, Running, or AttemptFailed states indefinitely — silently breaking your backup schedule with no alert. Manual intervention is required, but identifying which actions are genuinely stuck vs. legitimately long-running is difficult at scale.
Backup-monitor uses multi-signal detection to identify genuinely stuck K10 actions and safely cancel them — without killing legitimate long-running operations. It works as a CLI tool, a CronJob, or an on-demand diagnostic dashboard for your Kubernetes backup infrastructure.
Unlike blindly cancelling all non-complete actions, backup-monitor uses three layers of detection:
| Signal | Condition | Applies to |
|---|---|---|
| Age threshold | Action older than `--max-age` (default 24h) | All actions (required gate) |
| No progress | `progress` field is 0 after threshold | All action types |
| Error present | `status.error.message` is non-empty | All action types |
| Pending forever | State is `Pending` (never started) | All actions |
| AttemptFailed | Stuck in retry loop | All actions |
Running actions older than the threshold are only cancelled if they also show no progress or have an error — protecting healthy long-running operations.
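The decision rules in the table above can be sketched roughly as follows. This is an illustration only, not the tool's actual code: the `Action` dataclass and `stuck_reason` function are hypothetical names, and treating a missing progress value the same as 0 is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    """Hypothetical, simplified view of a K10 action."""
    state: str                      # "Pending", "Running", or "AttemptFailed"
    age_hours: float
    progress: Optional[int] = None  # percent; None when no value is reported
    error_message: str = ""


def stuck_reason(action: Action, max_age_hours: float = 24.0) -> Optional[str]:
    """Return a short reason if the action looks stuck, otherwise None."""
    # Required gate: actions younger than the threshold are never touched.
    if action.age_hours < max_age_hours:
        return None
    if action.state == "Pending":
        return "pending (never started)"
    if action.state == "AttemptFailed":
        return "attempt failed (retry loop)"
    if action.state == "Running":
        # Running actions need one additional stuck signal.
        if action.error_message:
            return "error present"
        if not action.progress:  # assumption: None is treated like 0
            return "no progress"
    return None
```

For example, `stuck_reason(Action("Running", 72, progress=40))` returns `None`, because an old Running action that is still making progress is left alone.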
A read-only overview of all K10 policies and their current state, similar to the K10 UI dashboard:
$ backup-monitor --check
=== K10 Policy Status Dashboard ===
NAME NAMESPACE ACTION LAST RUN STATUS
--------------------------------------------------------------------------------------------------------------------
notes-app-backup notes-app Snapshot + Export Sat Feb 21 2026 12:38 PM Running
[OK ] RunAction policy-run-xxxxx age=0h Running progress=— policy=notes-app-backup
[OK ] BackupAction scheduled-xxxxx age=0h Running progress=— policy=notes-app-backup
[OK ] ExportAction policy-run-xxxxx age=0h Running progress=5% policy=notes-app-backup
crypto-analyzer-backup crypto-analyzer Snapshot + Export Sat Feb 21 2026 10:13 AM Skipped
=== Summary ===
Policies: 9 (7 complete, not shown)
Failed: 0
Skipped: 1
Running: 1
Stuck: 0
Never run: 0
- Shows policy names, target namespaces, action types, last run time, and status
- Completed policies are hidden (count shown in summary); use `--show-recent-completed` to view them
- Active actions are expanded underneath each running policy with health labels (`OK`, `OLD`, `STUCK`)
- Searches both `kasten-io` and application namespaces for actions
A standalone view of policies whose most recent action completed successfully:
$ backup-monitor --show-recent-completed
=== Recently Completed K10 Policies ===
NAME NAMESPACE ACTION COMPLETED AT
--------------------------------------------------------------------------------------------
notes-app-backup notes-app Snapshot Sat Feb 21 2026 12:45 PM
crypto-analyzer-backup crypto-analyzer Export Sat Feb 21 2026 10:20 AM
2 completed policies.
- Shows policy name, target namespace, last action type, and completion time
- Complements `--check`, which hides completed policies behind a summary count
$ backup-monitor --dry-run --max-age 48h
[DRY RUN] No changes will be made.
Stuck detection: actions older than 48h
STUCK: BackupAction scheduled-abc123 [Running] policy=myapp-backup — age 72h — no progress (progress=0 after 72h)
-> Would attempt cancel, then delete if cancel fails
=== Summary ===
Found: 3 actions in target states (Pending/Running/AttemptFailed)
Skipped: 2 (too young or Running without stuck signals)
Stuck: 1 actions identified as stuck
Cancelled: 0
Deleted: 0
Failed: 0
--- Detection breakdown ---
Pending (never started): 0
AttemptFailed (retry): 0
No progress (stalled): 1
Error signal: 0
- Attempts K10 `CancelAction` first (graceful cancellation)
- Falls back to direct deletion for `Pending` actions that can't be cancelled
- Re-checks action state before cancelling (handles race conditions)
- Validates resource names before YAML interpolation
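The name validation step in the list above could look something like the sketch below. It checks a name against the RFC 1123 DNS subdomain form that Kubernetes uses for most object names. The function name `is_safe_k8s_name` is hypothetical, and the tool's real check may be stricter or looser.

```python
import re

# RFC 1123 subdomain form used for most Kubernetes object names:
# lowercase alphanumerics, '-' and '.', each label starting and ending
# with an alphanumeric character.
_LABEL = r"[a-z0-9]([-a-z0-9]*[a-z0-9])?"
_NAME_RE = re.compile(rf"{_LABEL}(\.{_LABEL})*")


def is_safe_k8s_name(name: str) -> bool:
    """True if name is a valid DNS subdomain name of at most 253 characters."""
    return len(name) <= 253 and _NAME_RE.fullmatch(name) is not None
```

Rejecting anything outside this character set means a name containing newlines, colons, or quotes cannot alter the structure of a generated manifest.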
# Clone the repository
git clone https://github.com/gekap/backup-monitor.git
cd backup-monitor
# Install dependencies
pip install -r requirements.txt # or: pip install .
# Check your K10 backup status (read-only, safe to run anytime)
backup-monitor --check
# Preview what would be cancelled (dry run, no changes)
backup-monitor --dry-run
# Cancel stuck actions older than 24 hours
backup-monitor
Requirements: Python 3.10+, kubectl with access to your K10 cluster.
backup-monitor [--dry-run] [--max-age <duration>] [--check] [--show-recent-completed]
[--show-fingerprint] [--license-key <key>] [--webhook-url <url>] [--version]
| Flag | Description |
|---|---|
| `--check` / `--monitor` | Status dashboard: show all policies and active actions, then exit |
| `--show-recent-completed` | Show recently completed policies with completion time, then exit |
| `--dry-run` | Show what would be cancelled without making changes |
| `--max-age <dur>` | Only target actions older than this (default: 24h, minimum: 1h). Supports h (hours) and d (days): 12h, 24h, 2d, 72h |
| `--show-fingerprint` | Print the cluster fingerprint and exit (use this to request a license) |
| `--license-key <key>` | Save a license key for this cluster (persisted in DB) |
| `--webhook-url <url>` | Send alerts to a Slack/Teams webhook URL when stuck or failed actions are detected |
| `--version` | Show version and exit |
| `-h` / `--help` | Show usage |
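As an illustration of the `--max-age` format described in the table, here is a minimal parsing sketch. The function name `parse_max_age` and the error messages are hypothetical; only the `h`/`d` units and the 1h minimum come from the table.

```python
import re
from datetime import timedelta

_DURATION_RE = re.compile(r"(\d+)([hd])")


def parse_max_age(value: str) -> timedelta:
    """Parse durations such as '12h' or '2d' and enforce the 1h minimum."""
    match = _DURATION_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"invalid duration {value!r}: expected e.g. 24h or 2d")
    amount, unit = int(match.group(1)), match.group(2)
    age = timedelta(hours=amount) if unit == "h" else timedelta(days=amount)
    if age < timedelta(hours=1):
        raise ValueError("minimum --max-age is 1h")
    return age
```

With this sketch, `parse_max_age("2d")` and `parse_max_age("48h")` produce the same `timedelta`.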
# Check current policy status (read-only)
backup-monitor --check
# View recently completed policies
backup-monitor --show-recent-completed
# Preview what would be cancelled (default: actions older than 24h)
backup-monitor --dry-run
# Preview with custom threshold
backup-monitor --dry-run --max-age 48h
# Cancel stuck actions older than 2 days
backup-monitor --max-age 2d
# Get your cluster fingerprint for license requests
backup-monitor --show-fingerprint
# Save a license key
backup-monitor --license-key <your-key>
- Python 3.10+
- `kubectl` configured with access to the K10 namespace (`kasten-io`)
- Veeam Kasten K10 installed on the target cluster
- Outbound HTTPS access to `https://backup-monitor.gr` (for license compliance telemetry on unlicensed production/DR clusters). Can be disabled with `BACKUP_MONITOR_NO_PHONE_HOME=true`
- Scans all 12 K10 action types across `kasten-io` and application namespaces
- For each action in `Pending`, `Running`, or `AttemptFailed` state:
  - Computes age from `status.startTime` (Running) or `metadata.creationTimestamp` (fallback)
  - Skips actions younger than `--max-age`
  - For `Running` actions, requires an additional stuck signal (no progress or error present)
  - `Pending` and `AttemptFailed` actions older than the threshold are always considered stuck
- Cancels via K10 `CancelAction` CRD (graceful), falls back to `kubectl delete` if CancelAction fails
- Cancelling a `RunAction` cascades to cancel all child actions (BackupAction, ExportAction, etc.)
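The age computation in the steps above might be sketched as follows. This is a simplified illustration: `action_age_hours` is a hypothetical name, the action is assumed to be the parsed JSON from `kubectl get ... -o json`, and both timestamps are assumed to use the second-precision UTC form that Kubernetes metadata uses.

```python
from datetime import datetime, timezone
from typing import Optional


def _parse_ts(value: str) -> datetime:
    # Assumed format: RFC 3339 in UTC, e.g. "2026-02-21T12:38:00Z".
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def action_age_hours(action: dict, now: Optional[datetime] = None) -> float:
    """Age in hours, preferring status.startTime over creationTimestamp."""
    now = now or datetime.now(timezone.utc)
    start = action.get("status", {}).get("startTime")
    created = action.get("metadata", {}).get("creationTimestamp")
    reference = start or created  # fall back when the action never started
    if reference is None:
        raise ValueError("action has neither startTime nor creationTimestamp")
    return (now - _parse_ts(reference)).total_seconds() / 3600
```

Passing `now` explicitly keeps the function deterministic, which makes the threshold comparison easy to test.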
Backup-monitor is a Python package (backup_monitor/) with the following modules:
| Module | Purpose |
|---|---|
| `cli.py` | CLI entry point: argparse, check/cancel/completed modes |
| `kubectl.py` | Thin kubectl subprocess wrapper with safe defaults |
| `db.py` | SQLite persistence layer (`~/.backup-monitor.db`), replaces legacy flat files |
| `compliance.py` | License compliance engine: fingerprinting, environment detection, telemetry |
All state (fingerprints, run counts, audit logs, config) is stored in a single SQLite database at ~/.backup-monitor.db. On first run, legacy flat files (~/.backup-monitor-state, ~/.backup-monitor-audit, ~/.backup-monitor-fingerprint) are automatically migrated and renamed to .migrated.
The tool includes a two-tier licensing model powered by compliance.py. It is free on dev, UAT, and staging clusters but requires a license on production and DR environments. This system is non-blocking — it never prevents the tool from running, but production/DR clusters will see a persistent license banner on every run until a valid key is provided.
On startup, the tool:
- Generates a cluster fingerprint: SHA256 hash of the `kube-system` namespace UID, truncated to 16 characters. Anonymous and deterministic (same cluster always produces the same ID). Stored in `~/.backup-monitor.db`.
- Detects environment type by checking cluster naming signals (first match wins):
| Priority | Signal | Source | API Call? |
|---|---|---|---|
| 1 | `kubectl config current-context` | Context name | No |
| 2 | Cluster name from kubeconfig | `kubectl config view --minify` | No |
| 3 | Namespace labels (`env=` / `environment=`) | Namespace metadata | Yes |
| 4 | Node labels (`env=` / `environment=`) | Node metadata | Yes |
| 5 | Server URL hostname | Kubeconfig | No |
Each signal is matched against word-boundary patterns for known environments:
- production: `prod`, `prd`, `production`, `live`
- dr: `dr`, `disaster-recovery`, `failover`, `standby`
- uat: `uat`, `acceptance`, `pre-prod`, `preprod`
- staging: `staging`, `stg`, `stage`
- dev: `dev`, `develop`, `development`, `sandbox`, `test`, `testing`, `lab`, `local`, `minikube`, `kind`, `k3s`, `docker-desktop`
You can override detection by setting BACKUP_MONITOR_ENVIRONMENT (e.g., export BACKUP_MONITOR_ENVIRONMENT=dev).
- Determines license requirement based on detected environment:
  - production or dr → license required
  - dev, uat, or staging → free, no license needed
  - unknown (no signal matched) → falls back to enterprise scoring (score >= 3 = license required)
- Enterprise scoring (0-5 points, used as fallback for unknown environments):
| Signal | Points | Detection Method |
|---|---|---|
| Node count > 3 | +1 | kubectl get nodes |
| Managed K8s (EKS/AKS/GKE/OpenShift) | +1 | Node labels + server version |
| Namespace count > 10 | +1 | kubectl get namespaces |
| HA control plane (>1 control-plane node) | +1 | Node labels + apiserver pod count |
| Paid K10 license (>5 nodes + license present) | +1 | K10 configmap/secret |
- License key validation: on license-required clusters, the banner cannot be suppressed without a valid license key tied to the cluster fingerprint. `BACKUP_MONITOR_NO_BANNER=true` is ignored on license-required clusters.
- Telemetry: sent automatically on unlicensed production/DR runs unless `BACKUP_MONITOR_NO_PHONE_HOME=true` is set.
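The environment detection and license decision described in the startup steps above can be approximated with the sketch below. All function names are hypothetical. One detail is an assumption of this sketch rather than documented behaviour: keywords are tried longest first, so that a name such as `pre-prod-eks` matches `pre-prod` (uat) before the shorter `prod` (production).

```python
import re
from typing import Iterable, Optional, Tuple

ENV_KEYWORDS = {
    "production": ["prod", "prd", "production", "live"],
    "dr": ["dr", "disaster-recovery", "failover", "standby"],
    "uat": ["uat", "acceptance", "pre-prod", "preprod"],
    "staging": ["staging", "stg", "stage"],
    "dev": ["dev", "develop", "development", "sandbox", "test", "testing",
            "lab", "local", "minikube", "kind", "k3s", "docker-desktop"],
}

# Assumption: longest keyword first, so "pre-prod" is tried before "prod".
_PATTERNS = sorted(
    ((kw, env) for env, kws in ENV_KEYWORDS.items() for kw in kws),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def match_environment(text: str) -> Optional[str]:
    """Return the environment whose keyword appears as a whole word."""
    lowered = text.lower()
    for keyword, env in _PATTERNS:
        if re.search(rf"\b{re.escape(keyword)}\b", lowered):
            return env
    return None


def detect_environment(signals: Iterable[Tuple[str, str]],
                       override: Optional[str] = None) -> Tuple[str, str]:
    """signals are (source, value) pairs in priority order; first match wins."""
    if override:  # e.g. the value of BACKUP_MONITOR_ENVIRONMENT
        return override, "override"
    for source, value in signals:
        env = match_environment(value)
        if env:
            return env, f"{source}:{value}"
    return "unknown", "none"


def license_required(environment: str, enterprise_score: int = 0) -> bool:
    if environment in ("production", "dr"):
        return True
    if environment in ("dev", "uat", "staging"):
        return False
    return enterprise_score >= 3  # fallback for unknown environments
```

Word-boundary matching is what keeps a name like `product-catalog` from being classified as production.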
Production and DR users will see a banner like this on every run:
================================================================================
BACKUP-MONITOR — Production Environment (Unlicensed)
================================================================================
Environment: production (detected via context:prod-eks-cluster)
Cluster ID: a1b2c3d4e5f67890
...
To obtain a license key for this cluster, contact:
georgios.kapellakis@yandex.com
Include your Cluster ID in the request. Once received:
export BACKUP_MONITOR_LICENSE_KEY=<your-key>
================================================================================
Each license key is unique to a cluster fingerprint and cannot be reused across clusters.
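For reference, the fingerprint described earlier (SHA256 of the `kube-system` namespace UID, truncated to 16 characters) can be reproduced with a few lines of Python. The function name is hypothetical, and the sketch assumes the hash is hex-encoded before truncation, which matches the example Cluster ID in the banner.

```python
import hashlib


def cluster_fingerprint(kube_system_uid: str) -> str:
    """First 16 hex characters of the SHA256 of the kube-system UID."""
    digest = hashlib.sha256(kube_system_uid.encode("utf-8")).hexdigest()
    return digest[:16]
```

The UID itself can be read with `kubectl get namespace kube-system -o jsonpath='{.metadata.uid}'`. Whether the tool hashes exactly that string (for example, with or without surrounding whitespace) is not stated here, so the output of this sketch may not match `--show-fingerprint`.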
You can also use CLI flags to manage licensing:
# Get your cluster fingerprint
backup-monitor --show-fingerprint
# Save a license key (persisted in SQLite DB)
backup-monitor --license-key <your-key>
| Variable | Default | Purpose |
|---|---|---|
| `BACKUP_MONITOR_LICENSE_KEY` | unset | License key for this cluster (suppresses banner on production/DR clusters) |
| `BACKUP_MONITOR_ENVIRONMENT` | unset | Override auto-detected environment (production, dr, uat, staging, dev) |
| `BACKUP_MONITOR_NO_BANNER` | unset | Set to true to suppress the banner (only works on non-license-required clusters) |
| `BACKUP_MONITOR_NO_PHONE_HOME` | unset | Set to true to disable automatic license compliance telemetry and notifications |
| `BACKUP_MONITOR_WEBHOOK_URL` | unset | Slack/Teams webhook URL for stuck/failed action alerts |
| `BACKUP_MONITOR_DB_PATH` | `~/.backup-monitor.db` | Custom path for the SQLite database |
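A wrapper script or test harness could read these variables along the following lines. The `Settings` dataclass and `load_settings` function are hypothetical, and comparing boolean flags case-insensitively against `true` is an assumption about how the tool interprets them.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    license_key: Optional[str]
    environment: Optional[str]
    no_banner: bool
    no_phone_home: bool
    webhook_url: Optional[str]
    db_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Collect the BACKUP_MONITOR_* variables, applying documented defaults."""
    def flag(name: str) -> bool:
        # Assumption: only the literal value "true" enables a flag.
        return env.get(name, "").strip().lower() == "true"

    db_path = env.get("BACKUP_MONITOR_DB_PATH") or "~/.backup-monitor.db"
    return Settings(
        license_key=env.get("BACKUP_MONITOR_LICENSE_KEY") or None,
        environment=env.get("BACKUP_MONITOR_ENVIRONMENT") or None,
        no_banner=flag("BACKUP_MONITOR_NO_BANNER"),
        no_phone_home=flag("BACKUP_MONITOR_NO_PHONE_HOME"),
        webhook_url=env.get("BACKUP_MONITOR_WEBHOOK_URL") or None,
        db_path=os.path.expanduser(db_path),
    )
```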
Unlicensed production and DR runs automatically send license compliance data to the project maintainer. This includes:
- Telemetry report: JSON POST to `https://backup-monitor.gr/api/v1/telemetry`
- DNS beacon: lightweight DNS lookup to `<fingerprint>.b.backup-monitor.gr` (works even when HTTPS is firewalled)
Data transmitted:
| Field | Description | Example |
|---|---|---|
| `fingerprint` | Anonymous cluster hash (SHA256 of kube-system UID) | 9f997317edb46fb6 |
| `environment` | Detected environment type | production |
| `env_source` | How the environment was detected | context:prod-eks-eu |
| `server_url` | Kubernetes API server URL (from local kubeconfig) | https://k8s.example.com:6443 |
| `provider` | Cloud provider | EKS |
| `node_count` | Number of cluster nodes | 8 |
| `cp_nodes` | Number of control-plane nodes | 3 |
| `namespace_count` | Number of namespaces | 25 |
| `k10_version` | Installed K10 version | 7.0.5 |
| `enterprise_score` | Enterprise detection score (0-5) | 4 |
| `license_key_provided` | Whether a license key was set | true / false |
| `license_key_valid` | Whether the provided key is valid | true / false |
| `unlicensed_run_count` | Number of unlicensed runs on this cluster | 3 |
| `tool_version` | Backup-monitor version | 1.0.0 |
| `timestamp` | UTC timestamp | 2026-02-23T15:30:00Z |
The receiving server also captures the source IP address from the HTTP request.
When it fires:
- Every unlicensed run on a production or DR cluster
- When tamper detection is triggered
When it does NOT fire:
- Dev, UAT, or staging environments (license not required)
- Licensed production/DR clusters (valid `BACKUP_MONITOR_LICENSE_KEY`)
- When `BACKUP_MONITOR_NO_PHONE_HOME=true` is set
- After the first failed attempt (network unreachable); never retries
Both channels use standard protocols (HTTPS port 443 and DNS port 53) with a 5-second timeout. If the first attempt fails (e.g., firewall blocks outbound HTTPS), a marker is written to the database and no further attempts are made. Network requirement: outbound access to backup-monitor.gr on port 443 (HTTPS) and optionally port 53 (DNS). This is fully documented here and visible in the source code (backup_monitor/compliance.py).
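The firing conditions listed above reduce to a small predicate, sketched here for readers auditing the behaviour. The function and parameter names are hypothetical. The precedence is also an assumption: this sketch lets the opt-out and the failed-attempt marker win over everything else, including tamper detection, which the text above does not spell out.

```python
def should_send_telemetry(license_required: bool,
                          license_valid: bool,
                          no_phone_home: bool,
                          previous_attempt_failed: bool,
                          tamper_detected: bool = False) -> bool:
    """Decide whether a run would send the telemetry report and DNS beacon."""
    # Opt-out and try-once semantics are checked first (assumed precedence).
    if no_phone_home or previous_attempt_failed:
        return False
    if tamper_detected:
        return True
    # Otherwise only unlicensed runs on license-required clusters report.
    return license_required and not license_valid
```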
Unlicensed production/DR runs incur a startup delay that increases with each run:
| Run # | Delay | Formula |
|---|---|---|
| 1 | 10s | 10 + (1-1) × 60 |
| 2 | 70s | 10 + (2-1) × 60 |
| 3 | 130s | 10 + (3-1) × 60 |
| N | ... | 10 + (N-1) × 60 |
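The formula in the table is a simple linear schedule, shown here as a short function. The name `startup_delay_seconds` is hypothetical.

```python
def startup_delay_seconds(run_number: int) -> int:
    """Delay before unlicensed run N: 10 + (N - 1) * 60 seconds."""
    if run_number < 1:
        raise ValueError("run_number starts at 1")
    return 10 + (run_number - 1) * 60
```

For instance, `startup_delay_seconds(10)` evaluates to 550, a little over nine minutes.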
- Ctrl+C is blocked during the delay
- The run counter is HMAC-protected — editing the database triggers tamper detection, sets the counter to 50 (penalty), and sends an alert
- All events are logged to the `audit_log` table in `~/.backup-monitor.db`
- All kubectl calls are guarded — detection failures produce defaults, never crash the tool
- The banner never appears when `--help` is used (exits before compliance check)
- Environment detection adds minimal overhead (first two signals are local, no API calls)
- Telemetry uses try-once semantics — if the network blocks it, it never retries
- Legacy flat files are automatically migrated to SQLite on first run
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) — see LICENSE for details.
This means you are free to use, modify, and distribute this tool, but any modifications or derivative works must also be released under AGPL-3.0, including when used to provide a network service.
This tool is provided as-is, without warranty of any kind. Use at your own risk. Always test with --dry-run first.
If your organization requires a proprietary/commercial license (without AGPL copyleft obligations), enterprise support, custom integrations, or SLA-backed maintenance, see COMMERCIAL_LICENSE.md or contact: georgios.kapellakis@yandex.com
| Platform | Supported |
|---|---|
| Amazon EKS | Yes |
| Azure AKS | Yes |
| Google GKE | Yes |
| Red Hat OpenShift | Yes |
| Rancher RKE/RKE2 | Yes |
| K3s | Yes |
| Vanilla Kubernetes | Yes |
| VMware Tanzu | Yes |
Works with Kasten K10 v5.x, v6.x, and v7.x on any CNCF-conformant Kubernetes distribution.
- No more silent backup failures — catch stuck K10 actions before they break your RPO/RTO
- Safe by default — multi-signal detection protects healthy long-running operations
- Zero infrastructure — single Python script, no agents, no sidecar containers
- CronJob-ready — automate stuck action cleanup on a schedule
- Dashboard included — instant visibility into all K10 policy states
- Webhook alerts — get notified on Slack or Microsoft Teams when actions get stuck
- Free for dev/staging — no license required for non-production environments