
fix: stale data directories cleanup #177

Open

chefsale wants to merge 4 commits into master from bounty-fix/add-cleanup-for-stale-data-directories-mlrvhwkt

Conversation


@chefsale chefsale commented Feb 18, 2026

Adds an option to the nuke command to clean up stale data directories that do not correspond to any running node.




Note

Medium Risk
Touches destructive cleanup logic and adds new heuristics/fail-safe checks around what gets deleted; while guarded by detection/validation, edge cases in node detection or directory validation could still lead to unexpected retention or deletion.

Overview
Adds a --stale mode to merobox nuke (and a stale_only path in execute_nuke) to delete only orphaned data/* node directories that don’t match any currently running node.

Introduces safer running-node detection across Docker containers and binary PID-tracked processes, including a fail-safe that aborts stale cleanup if both detection methods fail (override via --force), plus extra validation to avoid deleting non-Calimero or very recently-created empty directories and a final re-check to avoid TOCTOU races before deletion.

Refactors nuke internals into _stop_running_services and _cleanup_auth_services, and adds comprehensive unit tests covering stale discovery, detection failures/overrides, and deletion behavior.

Written by Cursor Bugbot for commit f3c9045. This will update automatically on new commits.


cursor bot commented Feb 18, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.

@chefsale chefsale marked this pull request as ready for review February 18, 2026 10:21
@meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 89% | Review time: 233.0s

🟡 5 warnings, 💡 1 suggestion, 📝 3 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-c989f828

@meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 93% | Review time: 219.2s

🟡 6 warnings, 💡 2 suggestions, 📝 1 nitpick. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-0449a885

@meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 96% | Review time: 205.5s

🔴 1 critical, 🟡 3 warnings, 💡 3 suggestions, 📝 1 nitpick. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-a7753912

binary_manager = BinaryManager(
    require_binary=False, enable_signal_handlers=False
)
pid_dir = Path("./data/.pids")

💡 Hardcoded PID directory path may diverge from actual data path

The path ./data/.pids is hardcoded while find_calimero_data_dirs uses Path("data") - if the data directory logic ever changes, these could get out of sync.

Suggested fix:

Consider extracting the base data path as a constant or parameter shared between functions.
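The suggestion above can be sketched as a single shared module-level constant. The names `DATA_BASE_DIR`, `PID_DIR`, and the simplified `find_calimero_data_dirs` below are illustrative, not merobox's actual API:

```python
from pathlib import Path

DATA_BASE_DIR = Path("data")       # single source of truth for the data root
PID_DIR = DATA_BASE_DIR / ".pids"  # derived, so it follows any future change


def find_calimero_data_dirs(base: Path = DATA_BASE_DIR) -> list:
    """Return per-node data directories under the shared base path."""
    if not base.is_dir():
        return []
    # Skip the PID bookkeeping directory itself.
    return [p for p in base.iterdir() if p.is_dir() and p.name != ".pids"]
```

Because `PID_DIR` is derived from `DATA_BASE_DIR`, changing the data root updates both call sites at once.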

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

    docker_check_failed = True
except Exception as e:
    warnings.append(f"Docker check failed unexpectedly: {e}")
    docker_check_failed = True

DockerManager exits process, bypassing fail-safe detection

High Severity

When Docker is unavailable, DockerManager.__init__ internally catches the connection error and calls sys.exit(1), which raises SystemExit. Since SystemExit inherits from BaseException (not Exception), neither the except docker.errors.DockerException nor the except Exception handler in _detect_running_nodes will catch it. The program terminates immediately instead of setting docker_check_failed = True. This completely defeats the fail-safe mechanism — nuke --stale will crash when Docker isn't running, even with --force, instead of gracefully degrading as intended.
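A minimal reproduction of this failure mode and one possible fix, using a stand-in for `DockerManager` (the real class and `_detect_running_nodes` live in merobox):

```python
import sys


class FakeDockerManager:
    """Stand-in that mimics DockerManager calling sys.exit(1) when Docker is down."""

    def __init__(self):
        sys.exit(1)  # raises SystemExit, which subclasses BaseException


def detect_docker_nodes():
    warnings, docker_check_failed = [], False
    try:
        FakeDockerManager()
    except SystemExit:
        # Must be caught explicitly: `except Exception` never matches SystemExit.
        warnings.append("Docker unavailable; skipping container detection")
        docker_check_failed = True
    except Exception as e:
        warnings.append(f"Docker check failed unexpectedly: {e}")
        docker_check_failed = True
    return docker_check_failed, warnings
```

With the `except SystemExit` clause in place the fail-safe degrades gracefully instead of terminating the process.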


@meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 95% | Review time: 252.3s

🟡 3 warnings, 💡 4 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-79cbfc39

return False

# Check for expected Calimero node subdirectories or files
# A valid Calimero data directory typically contains a subdirectory

🟡 TOCTOU race condition between node detection and deletion

The time gap between get_running_node_names() and nuke_all_data_dirs() allows a node to start after detection but before deletion, potentially causing data loss for a newly-started node.

Suggested fix:

Consider re-checking that directories are still stale immediately before deletion, or add a brief lock/signal mechanism; the documented --dry-run advice is helpful but doesn't prevent the race in automated workflows.

def _is_valid_calimero_data_dir(data_dir: str) -> bool:
    """
    Validate that a directory appears to be legitimate Calimero node data.


🟡 Empty directory validation may race with node startup

Returning True for empty directories means a directory from a node that just started (created dir but hasn't written files yet) could be incorrectly marked as valid for deletion.

Suggested fix:

Consider checking the directory's mtime and skipping directories created very recently (e.g., within the last few seconds) to reduce this race window.
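One way to implement the mtime grace period (the `EMPTY_DATA_DIR_MIN_AGE_SECONDS` name matches the constant quoted later in this review; the helper itself is a sketch):

```python
import time
from pathlib import Path

EMPTY_DATA_DIR_MIN_AGE_SECONDS = 60


def empty_dir_is_stale(dir_path: Path, now=None) -> bool:
    """Treat an empty directory as deletable only once it has aged past a
    grace period, narrowing the race with a node that just created it."""
    st = dir_path.stat()  # stat first: survives a file appearing mid-check
    if any(dir_path.iterdir()):
        return False  # not empty; other validation rules apply
    age = (now or time.time()) - st.st_mtime
    return age >= EMPTY_DATA_DIR_MIN_AGE_SECONDS
```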

for data_dir in data_dirs:
    node_name = os.path.basename(data_dir)
    try:
        container = manager.client.containers.get(node_name)

🟡 Silent swallowing of APIError may hide operational failures

Catching docker.errors.APIError with just pass could mask real issues like permission denied or network failures during container stop operations.

Suggested fix:

Log a debug-level message or add to warnings list when APIError occurs, similar to how warnings are collected in `_detect_running_nodes`.
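A sketch of the warning-collecting variant. `APIError` is a local stand-in for `docker.errors.APIError`, and the container lookup is injected as a callable for illustration:

```python
class APIError(Exception):
    """Stand-in for docker.errors.APIError."""


def stop_container(get_container, name, warnings):
    """Best-effort stop that records API failures instead of swallowing them."""
    try:
        get_container(name).stop()
    except APIError as e:
        # Surfacing the failure lets the caller report permission or
        # network problems rather than silently leaving the node running.
        warnings.append(f"Failed to stop container {name}: {e}")
```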


Returns:
DetectionResult containing nodes found and any warnings/errors.


💡 DockerManager created but not explicitly closed

When manager is None, a new DockerManager is created but never closed; if DockerManager holds socket connections, this could leak file descriptors over repeated calls.

Suggested fix:

Consider using a context manager pattern or explicitly closing the manager when created locally, or ensure DockerManager handles cleanup via `__del__`.
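One context-manager shape for this (the manager interface here is a minimal stand-in for whatever `DockerManager` exposes):

```python
from contextlib import contextmanager


@contextmanager
def managed(manager_factory, existing=None):
    """Yield a manager, closing it only if this function created it."""
    if existing is not None:
        yield existing  # caller owns it; do not close
        return
    manager = manager_factory()
    try:
        yield manager
    finally:
        manager.close()  # always release the underlying socket
```

Callers that already hold a manager pass it via `existing`; locally created ones are guaranteed to be closed even on error.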


assert len(results) == 1
assert results[0]["status"] == "not_found"


💡 Missing test for partial detection failure scenario

Tests cover complete failure and individual mechanism success, but no test verifies that stale_only mode continues correctly when only one detection method fails (partial failure).

Suggested fix:

Add a test case where Docker succeeds but binary detection fails (or vice versa) to ensure stale cleanup proceeds with warnings.
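A self-contained sketch of the missing test case. The `detect` function below only models the shape of the real detection logic (both probe names are hypothetical), to show that a partial failure should yield warnings without a complete-failure abort:

```python
def detect(docker_probe, binary_probe):
    """Model of dual-mechanism detection: collect nodes, warnings, and
    whether both mechanisms failed."""
    nodes, warnings = set(), []
    docker_failed = binary_failed = False
    try:
        nodes |= docker_probe()
    except Exception as e:
        warnings.append(str(e))
        docker_failed = True
    try:
        nodes |= binary_probe()
    except Exception as e:
        warnings.append(str(e))
        binary_failed = True
    return nodes, warnings, docker_failed and binary_failed


def test_partial_failure_proceeds_with_warning():
    def docker_probe():
        raise RuntimeError("docker down")
    nodes, warnings, complete = detect(docker_probe, lambda: {"node-1"})
    assert nodes == {"node-1"}   # binary detection still found the node
    assert warnings              # the Docker failure was recorded
    assert not complete          # not a complete failure, so cleanup proceeds
```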

nodes=running_nodes,
docker_failed=docker_check_failed,
binary_failed=binary_check_failed,
warnings=warnings,

💡 Weak directory validation accepts directories with common subdirectory names

Accepting any directory containing a logs/ subdirectory as valid Calimero data is permissive and could match unrelated directories if the data path is misconfigured or contains unexpected content.

Suggested fix:

Consider requiring a more specific marker file (e.g., a `.calimero` marker or config file) rather than relying solely on generic subdirectory names.
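The marker-file check could look like the following. Note the `.calimero` marker is the reviewer's proposal, not an existing merobox convention, so nodes would need to start writing it before validation could require it:

```python
from pathlib import Path

MARKER = ".calimero"  # hypothetical marker file written by the node on startup


def looks_like_calimero_dir(dir_path: Path) -> bool:
    """Accept a directory only if it carries the explicit marker file,
    rather than matching on generic names like logs/."""
    return (dir_path / MARKER).is_file()
```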

auth_container = manager.client.containers.get("auth")
if not silent:
    console.print("[yellow]Stopping auth service...[/yellow]")
auth_container.stop(timeout=NUKE_STOP_TIMEOUT)

💡 Repeated try/except pattern for Docker operations violates DRY

The same try/except pattern (NotFound + APIError) is repeated for auth, proxy, and volume operations; extracting a helper would reduce duplication.

Suggested fix:

Consider a small helper like `_safe_remove_container(client, name, silent)` and `_safe_remove_volume(client, name, silent)` that encapsulates this pattern.

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


    docker_check_failed = True
except Exception as e:
    warnings.append(f"Docker check failed unexpectedly: {e}")
    docker_check_failed = True

SystemExit from DockerManager escapes detection error handling

High Severity

DockerManager.__init__ calls sys.exit(1) when Docker is unavailable, raising SystemExit which inherits from BaseException, not Exception. The except Exception handler in _detect_running_nodes won't catch it, so the entire process terminates instead of gracefully setting docker_check_failed = True. This defeats the fail-safe mechanism — when Docker is down and no pre-existing manager is passed, stale cleanup silently exits rather than falling back to binary-only detection or raising NodeDetectionError.


Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
@meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 240.6s

🟡 2 warnings, 💡 4 suggestions, 📝 1 nitpick. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-a72685dd

        warnings.append(f"Failed to check binary process {node_name}: {e}")
except Exception as e:
    warnings.append(f"Binary process check failed: {e}")
    binary_check_failed = True

🟡 Uncaught exceptions from stat() and iterdir() can crash stale cleanup

If dir_path.iterdir() or dir_path.stat() fail due to permission errors or race conditions (directory deleted mid-check), the exception propagates and aborts the entire stale cleanup operation.

Suggested fix:

Wrap in try/except and return False on OSError/PermissionError to skip problematic directories gracefully.
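A sketch of the guarded version; `OSError` covers both `FileNotFoundError` (directory deleted mid-check) and `PermissionError`:

```python
from pathlib import Path


def safe_is_empty_and_old(dir_path: Path, min_age: float, now: float) -> bool:
    """Like the empty-dir age check, but unreadable or vanished directories
    are treated as 'not deletable' instead of crashing the cleanup."""
    try:
        st = dir_path.stat()
        if any(dir_path.iterdir()):
            return False
    except OSError:
        # Directory removed between checks, or we lack permission: skip it.
        return False
    return (now - st.st_mtime) >= min_age
```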

"[yellow]⚠️ Binary detection failed - binary processes may not be detected[/yellow]"
)

# If both detection mechanisms failed and fail_safe is enabled, raise an error

💡 Exception message couples library code to CLI flag

The NodeDetectionError message references --force, a CLI-specific flag; library-level exceptions should remain UI-agnostic.

Suggested fix:

Move the `--force` suggestion to the CLI layer where the exception is caught, keeping the exception message generic.
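The split could look like this: the library exception stays flag-agnostic, and only the CLI handler (the `cli_nuke_stale` and `run_stale_cleanup` names below are illustrative) adds the `--force` hint:

```python
class NodeDetectionError(RuntimeError):
    """Library-level error; deliberately says nothing about CLI flags."""


def run_stale_cleanup():
    # Placeholder for the library call that performs detection + cleanup.
    raise NodeDetectionError("could not reliably detect running nodes")


def cli_nuke_stale(cleanup, emit):
    """CLI boundary: the layer that owns --force adds the actionable hint."""
    try:
        cleanup()
    except NodeDetectionError as e:
        emit(f"Error: {e}. Re-run with --force to override.")
```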

# Accept old empty directories, but skip very recent ones to reduce
# deletion risk during node startup races.
if not any(dir_path.iterdir()):
    dir_age_seconds = time.time() - dir_path.stat().st_mtime

🟡 Potential TOCTOU race in empty directory check

The check not any(dir_path.iterdir()) followed by dir_path.stat() can race if a file is created between the two calls, causing incorrect age calculation or validation.

Suggested fix:

Consider calling `stat()` first to capture mtime, then check `iterdir()`, or use a single atomic operation if possible.

# Accept if we find the expected node subdirectory structure
if expected_subdir.is_dir():
    return True


💡 Hardcoded silent=True ignores caller's silent parameter

get_running_node_names(fail_safe=fail_safe, silent=True) always passes silent=True but the function has its own silent parameter that goes unused for this call.

Suggested fix:

Change to `get_running_node_names(fail_safe=fail_safe, silent=silent)` to respect caller's preference.

console.print("[green]✓ Traefik proxy stopped and removed[/green]")
except Exception:
pass
console.print("[yellow]No stale data directories found.[/yellow]")

💡 Double detection when using precomputed_data_dirs with stale_only

When CLI provides precomputed_data_dirs, _filter_still_stale_dirs still runs full node detection again. For the CLI flow this is intentional safety, but the API contract could be clearer that precomputed dirs still get re-verified.

Suggested fix:

Consider documenting this behavior in the docstring or adding a `skip_recheck` parameter for trusted callers.

auth_volume = manager.client.volumes.get("calimero_auth_data")
if not silent:
    console.print("[yellow]Removing auth data volume...[/yellow]")
auth_volume.remove()

💡 execute_nuke has grown to 7 parameters

The function signature has accumulated many optional parameters making it harder to reason about; consider grouping related flags into a config object or dataclass.

Suggested fix:

Create a `NukeOptions` dataclass with fields `prefix`, `verbose`, `silent`, `stale_only`, `force`, `precomputed_data_dirs` and pass a single options object.
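A sketch of the proposed options object. The field names mirror the keyword arguments quoted in this review, but the dataclass itself is hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NukeOptions:
    prefix: Optional[str] = None
    verbose: bool = False
    silent: bool = False
    stale_only: bool = False
    force: bool = False
    precomputed_data_dirs: Optional[list] = None


def execute_nuke(options=None) -> dict:
    """Accept a single options object instead of seven parameters."""
    opts = options or NukeOptions()
    # A real implementation would branch on opts.stale_only, opts.force, etc.;
    # here we only echo the resolved flags.
    return {"stale_only": opts.stale_only, "force": opts.force}
```

This keeps the signature stable as flags accumulate and makes call sites self-documenting.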

    filters={"label": "calimero.node=true", "status": "running"}
)
for container in containers:
    running_nodes.add(container.name)

📝 Nit: Catching bare Exception in detection code

Catching Exception is intentionally broad here for robustness, but adding a brief inline comment explaining this choice would help future readers.

Suggested fix:

Add a comment like `# Broad catch: detection must not crash the cleanup flow`.

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
@meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 316.3s

🟡 1 warning, 💡 5 suggestions, 📝 1 nitpick. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-79b0c48b


return DetectionResult(
nodes=running_nodes,
docker_failed=docker_check_failed,

🟡 TOCTOU race can cause unhandled FileNotFoundError

If a directory is deleted between is_dir() check at line 221 and iterdir()/stat() calls here, an unhandled FileNotFoundError will be raised, causing the entire stale cleanup to fail rather than gracefully skipping the deleted directory.

Suggested fix:

Wrap the `iterdir()` and `stat()` calls in a try/except block that catches FileNotFoundError and returns False.

binary_nodes_stopped = 0
# Stop running Docker containers (if manager is DockerManager)
docker_nodes_stopped = 0
if manager and hasattr(manager, "client"):

💡 Exception handling may miss non-API Docker errors

Catching only docker.errors.NotFound and docker.errors.APIError will let network timeouts or connection errors propagate and crash the cleanup; the original code used a broad except Exception which was safer for a best-effort cleanup operation.

Suggested fix:

Consider adding a fallback `except Exception as e` with a warning to handle unexpected Docker errors gracefully.

)


def execute_nuke(

💡 precomputed_data_dirs bypasses validation when called programmatically

When execute_nuke is called directly with precomputed_data_dirs, it bypasses _is_valid_calimero_data_dir validation; while the CLI flow is safe, callers of this internal API must ensure paths are pre-validated.

Suggested fix:

Consider adding a docstring note that `precomputed_data_dirs` must be pre-validated, or optionally re-validate paths when this parameter is used.

pass

# Check binary processes via PID files
# Note: If PID directory doesn't exist, we treat it as "no binary nodes"

💡 Symlink validation not re-checked before deletion (TOCTOU)

The symlink check in _is_valid_calimero_data_dir is not re-verified immediately before shutil.rmtree; on platforms without fd-based functions, rmtree may be susceptible to symlink attacks in the window between validation and deletion.

Suggested fix:

Consider re-checking `Path(data_dir).is_symlink()` in `nuke_all_data_dirs` immediately before calling `shutil.rmtree`, or use `shutil.rmtree` with `dir_fd` parameter on supported platforms.
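A last-moment symlink re-check could be sketched as follows (the `remove_data_dir` helper and the result-dict shape are illustrative):

```python
import shutil
from pathlib import Path


def remove_data_dir(data_dir: str, results: list) -> None:
    """Re-verify the path is still a real directory, not a symlink swapped
    in after validation, immediately before the destructive call."""
    path = Path(data_dir)
    if path.is_symlink() or not path.is_dir():
        results.append({"dir": data_dir, "status": "skipped_unsafe"})
        return
    shutil.rmtree(path)
    results.append({"dir": data_dir, "status": "deleted"})
```

On Python 3.12+ `shutil.rmtree` resists symlink races on platforms with fd-based directory access, but the explicit re-check costs little and documents the intent.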

if not any(dir_path.iterdir()):
    dir_age_seconds = time.time() - dir_path.stat().st_mtime
    return dir_age_seconds >= EMPTY_DATA_DIR_MIN_AGE_SECONDS


📝 Nit: Docstring suggests --dry-run mitigates TOCTOU but it doesn't

The docstring advises using --dry-run to preview deletions for TOCTOU safety, but dry-run doesn't mitigate the race; the actual mitigation is _filter_still_stale_dirs re-check which should be mentioned instead.

Suggested fix:

Update the docstring to explain that the code re-checks running nodes immediately before deletion to reduce TOCTOU risk.

return False

# Check for expected Calimero node subdirectories or files
# A valid Calimero data directory typically contains a subdirectory

💡 Redundant node detection when precomputed_data_dirs is used

The CLI flow calls find_stale_data_dirs() which detects running nodes, then passes precomputed_data_dirs to execute_nuke() which calls _filter_still_stale_dirs() detecting nodes again; this is intentional for safety but results in two full Docker/binary detection cycles.

Suggested fix:

Consider caching the running nodes set between detection calls if the time window is small, or document why double-detection is acceptable.

@@ -118,125 +396,215 @@ def nuke_all_data_dirs(data_dirs: list, dry_run: bool = False) -> dict:
return results

💡 Function has too many parameters (7)

The execute_nuke function accepts 7 parameters which makes it harder to maintain and test; consider grouping related flags into a configuration object.

Suggested fix:

Create a `NukeOptions` dataclass containing `verbose`, `silent`, `stale_only`, `force` and pass that instead.

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


        warnings.append(f"Failed to check binary process {node_name}: {e}")
except Exception as e:
    warnings.append(f"Binary process check failed: {e}")
    binary_check_failed = True

Fail-safe bypassed in Docker-only environments

Medium Severity

The fail_safe mechanism in get_running_node_names only raises NodeDetectionError when complete_failure is true (both docker_failed AND binary_failed). However, when the PID directory doesn't exist, binary detection is skipped and binary_failed stays False. In Docker-only environments (where the PID directory is never created), complete_failure can never be True — even if Docker detection fails. This means fail_safe=True provides no protection: Docker fails, binary is "skipped-not-failed," all directories appear stale, and running container data gets deleted with only a partial-failure warning.
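One way to close the gap is to track "binary detection never ran" as its own state, so the fail-safe can still trip when Docker fails in a Docker-only environment. The helper and error names below are illustrative:

```python
from pathlib import Path


class NodeDetectionError(RuntimeError):
    pass


def check_fail_safe(docker_failed: bool, pid_dir: Path, binary_failed: bool,
                    fail_safe: bool = True) -> None:
    """Abort stale cleanup when no detection method gave a trustworthy answer:
    either both mechanisms failed, or Docker failed and binary detection
    was skipped because the PID directory does not exist."""
    binary_skipped = not pid_dir.is_dir()
    if fail_safe and docker_failed and (binary_failed or binary_skipped):
        raise NodeDetectionError(
            "node detection unreliable; aborting stale cleanup"
        )
```

This preserves the existing behavior when both mechanisms are available, while refusing to treat "skipped" as "succeeded".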

Additional Locations (2)

