
fix: stale data directories cleanup #177

Open

chefsale wants to merge 4 commits into master from bounty-fix/add-cleanup-for-stale-data-directories-mlrvhwkt

Conversation


@chefsale chefsale commented Feb 18, 2026

Adds an option to the nuke command to clean up stale data directories that do not correspond to any running node.




Note

Medium Risk
Touches destructive cleanup logic and adds new heuristics/fail-safe checks around what gets deleted; while guarded by detection/validation, edge cases in node detection or directory validation could still lead to unexpected retention or deletion.

Overview
Adds a --stale mode to merobox nuke (and a stale_only path in execute_nuke) to delete only orphaned data/* node directories that don’t match any currently running node.

Introduces safer running-node detection across Docker containers and binary PID-tracked processes, including a fail-safe that aborts stale cleanup if both detection methods fail (override via --force), plus extra validation to avoid deleting non-Calimero or very recently-created empty directories and a final re-check to avoid TOCTOU races before deletion.

Refactors nuke internals into _stop_running_services and _cleanup_auth_services, and adds comprehensive unit tests covering stale discovery, detection failures/overrides, and deletion behavior.

Written by Cursor Bugbot for commit f3c9045. This will update automatically on new commits.


cursor bot commented Feb 18, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.

@chefsale chefsale marked this pull request as ready for review February 18, 2026 10:21
@meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 89% | Review time: 233.0s

🟡 5 warnings, 💡 1 suggestion, 📝 3 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-c989f828

@meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 93% | Review time: 219.2s

🟡 6 warnings, 💡 2 suggestions, 📝 1 nitpick. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-0449a885

@meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 96% | Review time: 205.5s

🔴 1 critical, 🟡 3 warnings, 💡 3 suggestions, 📝 1 nitpick. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-a7753912

binary_manager = BinaryManager(
    require_binary=False, enable_signal_handlers=False
)
pid_dir = Path("./data/.pids")

💡 Hardcoded PID directory path may diverge from actual data path

The path ./data/.pids is hardcoded while find_calimero_data_dirs uses Path("data") - if the data directory logic ever changes, these could get out of sync.

Suggested fix:

Consider extracting the base data path as a constant or parameter shared between functions.
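The suggestion above can be sketched as a single shared module-level constant. The names `DATA_BASE_DIR`, `PID_DIR`, and the simplified `find_calimero_data_dirs` below are illustrative, not merobox's actual API:

```python
from pathlib import Path

DATA_BASE_DIR = Path("data")       # single source of truth for the data root
PID_DIR = DATA_BASE_DIR / ".pids"  # derived, so it follows any future change


def find_calimero_data_dirs(base: Path = DATA_BASE_DIR) -> list:
    """Return per-node data directories under the shared base path."""
    if not base.is_dir():
        return []
    # Skip the PID bookkeeping directory itself.
    return [p for p in base.iterdir() if p.is_dir() and p.name != ".pids"]
```

Because `PID_DIR` is derived from `DATA_BASE_DIR`, changing the data root updates both call sites at once.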

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

    docker_check_failed = True
except Exception as e:
    warnings.append(f"Docker check failed unexpectedly: {e}")
    docker_check_failed = True

DockerManager exits process, bypassing fail-safe detection

High Severity

When Docker is unavailable, DockerManager.__init__ internally catches the connection error and calls sys.exit(1), which raises SystemExit. Since SystemExit inherits from BaseException (not Exception), neither the except docker.errors.DockerException nor the except Exception handler in _detect_running_nodes will catch it. The program terminates immediately instead of setting docker_check_failed = True. This completely defeats the fail-safe mechanism — nuke --stale will crash when Docker isn't running, even with --force, instead of gracefully degrading as intended.
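A minimal reproduction of this failure mode and one possible fix, using a stand-in for `DockerManager` (the real class and `_detect_running_nodes` live in merobox):

```python
import sys


class FakeDockerManager:
    """Stand-in that mimics DockerManager calling sys.exit(1) when Docker is down."""

    def __init__(self):
        sys.exit(1)  # raises SystemExit, which subclasses BaseException


def detect_docker_nodes():
    warnings, docker_check_failed = [], False
    try:
        FakeDockerManager()
    except SystemExit:
        # Must be caught explicitly: `except Exception` never matches SystemExit.
        warnings.append("Docker unavailable; skipping container detection")
        docker_check_failed = True
    except Exception as e:
        warnings.append(f"Docker check failed unexpectedly: {e}")
        docker_check_failed = True
    return docker_check_failed, warnings
```

With the `except SystemExit` clause in place the fail-safe degrades gracefully instead of terminating the process.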


@meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 95% | Review time: 252.3s

🟡 3 warnings, 💡 4 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-79cbfc39

return False

# Check for expected Calimero node subdirectories or files
# A valid Calimero data directory typically contains a subdirectory

🟡 TOCTOU race condition between node detection and deletion

The time gap between get_running_node_names() and nuke_all_data_dirs() allows a node to start after detection but before deletion, potentially causing data loss for a newly-started node.

Suggested fix:

Consider re-checking that directories are still stale immediately before deletion, or add a brief lock/signal mechanism; the documented --dry-run advice is helpful but doesn't prevent the race in automated workflows.

def _is_valid_calimero_data_dir(data_dir: str) -> bool:
    """
    Validate that a directory appears to be legitimate Calimero node data.


🟡 Empty directory validation may race with node startup

Returning True for empty directories means a directory from a node that just started (created dir but hasn't written files yet) could be incorrectly marked as valid for deletion.

Suggested fix:

Consider checking the directory's mtime and skipping directories created very recently (e.g., within the last few seconds) to reduce this race window.
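One way to implement the mtime grace period (the `EMPTY_DATA_DIR_MIN_AGE_SECONDS` name matches the constant quoted later in this review; the helper itself is a sketch):

```python
import time
from pathlib import Path

EMPTY_DATA_DIR_MIN_AGE_SECONDS = 60


def empty_dir_is_stale(dir_path: Path, now=None) -> bool:
    """Treat an empty directory as deletable only once it has aged past a
    grace period, narrowing the race with a node that just created it."""
    st = dir_path.stat()  # stat first: survives a file appearing mid-check
    if any(dir_path.iterdir()):
        return False  # not empty; other validation rules apply
    age = (now or time.time()) - st.st_mtime
    return age >= EMPTY_DATA_DIR_MIN_AGE_SECONDS
```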

for data_dir in data_dirs:
    node_name = os.path.basename(data_dir)
    try:
        container = manager.client.containers.get(node_name)

🟡 Silent swallowing of APIError may hide operational failures

Catching docker.errors.APIError with just pass could mask real issues like permission denied or network failures during container stop operations.

Suggested fix:

Log a debug-level message or add to warnings list when APIError occurs, similar to how warnings are collected in `_detect_running_nodes`.
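A sketch of the warning-collecting variant. `APIError` is a local stand-in for `docker.errors.APIError`, and the container lookup is injected as a callable for illustration:

```python
class APIError(Exception):
    """Stand-in for docker.errors.APIError."""


def stop_container(get_container, name, warnings):
    """Best-effort stop that records API failures instead of swallowing them."""
    try:
        get_container(name).stop()
    except APIError as e:
        # Surfacing the failure lets the caller report permission or
        # network problems rather than silently leaving the node running.
        warnings.append(f"Failed to stop container {name}: {e}")
```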


Returns:
DetectionResult containing nodes found and any warnings/errors.


💡 DockerManager created but not explicitly closed

When manager is None, a new DockerManager is created but never closed; if DockerManager holds socket connections, this could leak file descriptors over repeated calls.

Suggested fix:

Consider using a context manager pattern or explicitly closing the manager when created locally, or ensure DockerManager handles cleanup via `__del__`.
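One context-manager shape for this (the manager interface here is a minimal stand-in for whatever `DockerManager` exposes):

```python
from contextlib import contextmanager


@contextmanager
def managed(manager_factory, existing=None):
    """Yield a manager, closing it only if this function created it."""
    if existing is not None:
        yield existing  # caller owns it; do not close
        return
    manager = manager_factory()
    try:
        yield manager
    finally:
        manager.close()  # always release the underlying socket
```

Callers that already hold a manager pass it via `existing`; locally created ones are guaranteed to be closed even on error.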


assert len(results) == 1
assert results[0]["status"] == "not_found"


💡 Missing test for partial detection failure scenario

Tests cover complete failure and individual mechanism success, but no test verifies that stale_only mode continues correctly when only one detection method fails (partial failure).

Suggested fix:

Add a test case where Docker succeeds but binary detection fails (or vice versa) to ensure stale cleanup proceeds with warnings.
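A self-contained sketch of the missing test case. The `detect` function below only models the shape of the real detection logic (both probe names are hypothetical), to show that a partial failure should yield warnings without a complete-failure abort:

```python
def detect(docker_probe, binary_probe):
    """Model of dual-mechanism detection: collect nodes, warnings, and
    whether both mechanisms failed."""
    nodes, warnings = set(), []
    docker_failed = binary_failed = False
    try:
        nodes |= docker_probe()
    except Exception as e:
        warnings.append(str(e))
        docker_failed = True
    try:
        nodes |= binary_probe()
    except Exception as e:
        warnings.append(str(e))
        binary_failed = True
    return nodes, warnings, docker_failed and binary_failed


def test_partial_failure_proceeds_with_warning():
    def docker_probe():
        raise RuntimeError("docker down")
    nodes, warnings, complete = detect(docker_probe, lambda: {"node-1"})
    assert nodes == {"node-1"}   # binary detection still found the node
    assert warnings              # the Docker failure was recorded
    assert not complete          # not a complete failure, so cleanup proceeds
```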

nodes=running_nodes,
docker_failed=docker_check_failed,
binary_failed=binary_check_failed,
warnings=warnings,

💡 Weak directory validation accepts directories with common subdirectory names

Accepting any directory containing a logs/ subdirectory as valid Calimero data is permissive and could match unrelated directories if the data path is misconfigured or contains unexpected content.

Suggested fix:

Consider requiring a more specific marker file (e.g., a `.calimero` marker or config file) rather than relying solely on generic subdirectory names.
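The marker-file check could look like the following. Note the `.calimero` marker is the reviewer's proposal, not an existing merobox convention, so nodes would need to start writing it before validation could require it:

```python
from pathlib import Path

MARKER = ".calimero"  # hypothetical marker file written by the node on startup


def looks_like_calimero_dir(dir_path: Path) -> bool:
    """Accept a directory only if it carries the explicit marker file,
    rather than matching on generic names like logs/."""
    return (dir_path / MARKER).is_file()
```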

auth_container = manager.client.containers.get("auth")
if not silent:
    console.print("[yellow]Stopping auth service...[/yellow]")
auth_container.stop(timeout=NUKE_STOP_TIMEOUT)

💡 Repeated try/except pattern for Docker operations violates DRY

The same try/except pattern (NotFound + APIError) is repeated for auth, proxy, and volume operations; extracting a helper would reduce duplication.

Suggested fix:

Consider a small helper like `_safe_remove_container(client, name, silent)` and `_safe_remove_volume(client, name, silent)` that encapsulates this pattern.

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


    docker_check_failed = True
except Exception as e:
    warnings.append(f"Docker check failed unexpectedly: {e}")
    docker_check_failed = True

SystemExit from DockerManager escapes detection error handling

High Severity

DockerManager.__init__ calls sys.exit(1) when Docker is unavailable, raising SystemExit which inherits from BaseException, not Exception. The except Exception handler in _detect_running_nodes won't catch it, so the entire process terminates instead of gracefully setting docker_check_failed = True. This defeats the fail-safe mechanism — when Docker is down and no pre-existing manager is passed, stale cleanup silently exits rather than falling back to binary-only detection or raising NodeDetectionError.


Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
@meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 240.6s

🟡 2 warnings, 💡 4 suggestions, 📝 1 nitpick. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-a72685dd

        warnings.append(f"Failed to check binary process {node_name}: {e}")
except Exception as e:
    warnings.append(f"Binary process check failed: {e}")
    binary_check_failed = True

🟡 Uncaught exceptions from stat() and iterdir() can crash stale cleanup

If dir_path.iterdir() or dir_path.stat() fail due to permission errors or race conditions (directory deleted mid-check), the exception propagates and aborts the entire stale cleanup operation.

Suggested fix:

Wrap in try/except and return False on OSError/PermissionError to skip problematic directories gracefully.
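A sketch of the guarded version; `OSError` covers both `FileNotFoundError` (directory deleted mid-check) and `PermissionError`:

```python
from pathlib import Path


def safe_is_empty_and_old(dir_path: Path, min_age: float, now: float) -> bool:
    """Like the empty-dir age check, but unreadable or vanished directories
    are treated as 'not deletable' instead of crashing the cleanup."""
    try:
        st = dir_path.stat()
        if any(dir_path.iterdir()):
            return False
    except OSError:
        # Directory removed between checks, or we lack permission: skip it.
        return False
    return (now - st.st_mtime) >= min_age
```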

"[yellow]⚠️ Binary detection failed - binary processes may not be detected[/yellow]"
)

# If both detection mechanisms failed and fail_safe is enabled, raise an error

💡 Exception message couples library code to CLI flag

The NodeDetectionError message references --force, a CLI-specific flag; library-level exceptions should remain UI-agnostic.

Suggested fix:

Move the `--force` suggestion to the CLI layer where the exception is caught, keeping the exception message generic.
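The split could look like this: the library exception stays flag-agnostic, and only the CLI handler (the `cli_nuke_stale` and `run_stale_cleanup` names below are illustrative) adds the `--force` hint:

```python
class NodeDetectionError(RuntimeError):
    """Library-level error; deliberately says nothing about CLI flags."""


def run_stale_cleanup():
    # Placeholder for the library call that performs detection + cleanup.
    raise NodeDetectionError("could not reliably detect running nodes")


def cli_nuke_stale(cleanup, emit):
    """CLI boundary: the layer that owns --force adds the actionable hint."""
    try:
        cleanup()
    except NodeDetectionError as e:
        emit(f"Error: {e}. Re-run with --force to override.")
```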

# Accept old empty directories, but skip very recent ones to reduce
# deletion risk during node startup races.
if not any(dir_path.iterdir()):
    dir_age_seconds = time.time() - dir_path.stat().st_mtime

🟡 Potential TOCTOU race in empty directory check

The check not any(dir_path.iterdir()) followed by dir_path.stat() can race if a file is created between the two calls, causing incorrect age calculation or validation.

Suggested fix:

Consider calling `stat()` first to capture mtime, then check `iterdir()`, or use a single atomic operation if possible.

# Accept if we find the expected node subdirectory structure
if expected_subdir.is_dir():
    return True


💡 Hardcoded silent=True ignores caller's silent parameter

get_running_node_names(fail_safe=fail_safe, silent=True) always passes silent=True but the function has its own silent parameter that goes unused for this call.

Suggested fix:

Change to `get_running_node_names(fail_safe=fail_safe, silent=silent)` to respect caller's preference.

console.print("[green]✓ Traefik proxy stopped and removed[/green]")
except Exception:
pass
console.print("[yellow]No stale data directories found.[/yellow]")

💡 Double detection when using precomputed_data_dirs with stale_only

When CLI provides precomputed_data_dirs, _filter_still_stale_dirs still runs full node detection again. For the CLI flow this is intentional safety, but the API contract could be clearer that precomputed dirs still get re-verified.

Suggested fix:

Consider documenting this behavior in the docstring or adding a `skip_recheck` parameter for trusted callers.

auth_volume = manager.client.volumes.get("calimero_auth_data")
if not silent:
    console.print("[yellow]Removing auth data volume...[/yellow]")
auth_volume.remove()

💡 execute_nuke has grown to 7 parameters

The function signature has accumulated many optional parameters making it harder to reason about; consider grouping related flags into a config object or dataclass.

Suggested fix:

Create a `NukeOptions` dataclass with fields `prefix`, `verbose`, `silent`, `stale_only`, `force`, `precomputed_data_dirs` and pass a single options object.
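A sketch of the proposed options object. The field names mirror the keyword arguments quoted in this review, but the dataclass itself is hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NukeOptions:
    prefix: Optional[str] = None
    verbose: bool = False
    silent: bool = False
    stale_only: bool = False
    force: bool = False
    precomputed_data_dirs: Optional[list] = None


def execute_nuke(options=None) -> dict:
    """Accept a single options object instead of seven parameters."""
    opts = options or NukeOptions()
    # A real implementation would branch on opts.stale_only, opts.force, etc.;
    # here we only echo the resolved flags.
    return {"stale_only": opts.stale_only, "force": opts.force}
```

This keeps the signature stable as flags accumulate and makes call sites self-documenting.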

    filters={"label": "calimero.node=true", "status": "running"}
)
for container in containers:
    running_nodes.add(container.name)

📝 Nit: Catching bare Exception in detection code

Catching Exception is intentionally broad here for robustness, but adding a brief inline comment explaining this choice would help future readers.

Suggested fix:

Add a comment like `# Broad catch: detection must not crash the cleanup flow`.

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
@meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 316.3s

🟡 1 warning, 💡 5 suggestions, 📝 1 nitpick. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-79b0c48b


return DetectionResult(
nodes=running_nodes,
docker_failed=docker_check_failed,

🟡 TOCTOU race can cause unhandled FileNotFoundError

If a directory is deleted between is_dir() check at line 221 and iterdir()/stat() calls here, an unhandled FileNotFoundError will be raised, causing the entire stale cleanup to fail rather than gracefully skipping the deleted directory.

Suggested fix:

Wrap the `iterdir()` and `stat()` calls in a try/except block that catches FileNotFoundError and returns False.

binary_nodes_stopped = 0
# Stop running Docker containers (if manager is DockerManager)
docker_nodes_stopped = 0
if manager and hasattr(manager, "client"):

💡 Exception handling may miss non-API Docker errors

Catching only docker.errors.NotFound and docker.errors.APIError will let network timeouts or connection errors propagate and crash the cleanup; the original code used a broad except Exception which was safer for a best-effort cleanup operation.

Suggested fix:

Consider adding a fallback `except Exception as e` with a warning to handle unexpected Docker errors gracefully.

)


def execute_nuke(

💡 precomputed_data_dirs bypasses validation when called programmatically

When execute_nuke is called directly with precomputed_data_dirs, it bypasses _is_valid_calimero_data_dir validation; while the CLI flow is safe, callers of this internal API must ensure paths are pre-validated.

Suggested fix:

Consider adding a docstring note that `precomputed_data_dirs` must be pre-validated, or optionally re-validate paths when this parameter is used.

pass

# Check binary processes via PID files
# Note: If PID directory doesn't exist, we treat it as "no binary nodes"

💡 Symlink validation not re-checked before deletion (TOCTOU)

The symlink check in _is_valid_calimero_data_dir is not re-verified immediately before shutil.rmtree; on platforms without fd-based functions, rmtree may be susceptible to symlink attacks in the window between validation and deletion.

Suggested fix:

Consider re-checking `Path(data_dir).is_symlink()` in `nuke_all_data_dirs` immediately before calling `shutil.rmtree`, or use `shutil.rmtree` with `dir_fd` parameter on supported platforms.
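A last-moment symlink re-check could be sketched as follows (the `remove_data_dir` helper and the result-dict shape are illustrative):

```python
import shutil
from pathlib import Path


def remove_data_dir(data_dir: str, results: list) -> None:
    """Re-verify the path is still a real directory, not a symlink swapped
    in after validation, immediately before the destructive call."""
    path = Path(data_dir)
    if path.is_symlink() or not path.is_dir():
        results.append({"dir": data_dir, "status": "skipped_unsafe"})
        return
    shutil.rmtree(path)
    results.append({"dir": data_dir, "status": "deleted"})
```

On Python 3.12+ `shutil.rmtree` resists symlink races on platforms with fd-based directory access, but the explicit re-check costs little and documents the intent.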

if not any(dir_path.iterdir()):
    dir_age_seconds = time.time() - dir_path.stat().st_mtime
    return dir_age_seconds >= EMPTY_DATA_DIR_MIN_AGE_SECONDS


📝 Nit: Docstring suggests --dry-run mitigates TOCTOU but it doesn't

The docstring advises using --dry-run to preview deletions for TOCTOU safety, but dry-run doesn't mitigate the race; the actual mitigation is _filter_still_stale_dirs re-check which should be mentioned instead.

Suggested fix:

Update the docstring to explain that the code re-checks running nodes immediately before deletion to reduce TOCTOU risk.

return False

# Check for expected Calimero node subdirectories or files
# A valid Calimero data directory typically contains a subdirectory

💡 Redundant node detection when precomputed_data_dirs is used

The CLI flow calls find_stale_data_dirs() which detects running nodes, then passes precomputed_data_dirs to execute_nuke() which calls _filter_still_stale_dirs() detecting nodes again; this is intentional for safety but results in two full Docker/binary detection cycles.

Suggested fix:

Consider caching the running nodes set between detection calls if the time window is small, or document why double-detection is acceptable.

@@ -118,125 +396,215 @@ def nuke_all_data_dirs(data_dirs: list, dry_run: bool = False) -> dict:
return results

💡 Function has too many parameters (7)

The execute_nuke function accepts 7 parameters which makes it harder to maintain and test; consider grouping related flags into a configuration object.

Suggested fix:

Create a `NukeOptions` dataclass containing `verbose`, `silent`, `stale_only`, `force` and pass that instead.

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


        warnings.append(f"Failed to check binary process {node_name}: {e}")
except Exception as e:
    warnings.append(f"Binary process check failed: {e}")
    binary_check_failed = True

Fail-safe bypassed in Docker-only environments

Medium Severity

The fail_safe mechanism in get_running_node_names only raises NodeDetectionError when complete_failure is true (both docker_failed AND binary_failed). However, when the PID directory doesn't exist, binary detection is skipped and binary_failed stays False. In Docker-only environments (where the PID directory is never created), complete_failure can never be True — even if Docker detection fails. This means fail_safe=True provides no protection: Docker fails, binary is "skipped-not-failed," all directories appear stale, and running container data gets deleted with only a partial-failure warning.
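One way to close the gap is to track "binary detection never ran" as its own state, so the fail-safe can still trip when Docker fails in a Docker-only environment. The helper and error names below are illustrative:

```python
from pathlib import Path


class NodeDetectionError(RuntimeError):
    pass


def check_fail_safe(docker_failed: bool, pid_dir: Path, binary_failed: bool,
                    fail_safe: bool = True) -> None:
    """Abort stale cleanup when no detection method gave a trustworthy answer:
    either both mechanisms failed, or Docker failed and binary detection
    was skipped because the PID directory does not exist."""
    binary_skipped = not pid_dir.is_dir()
    if fail_safe and docker_failed and (binary_failed or binary_skipped):
        raise NodeDetectionError(
            "node detection unreliable; aborting stale cleanup"
        )
```

This preserves the existing behavior when both mechanisms are available, while refusing to treat "skipped" as "succeeded".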

Additional Locations (2)

