feat: signal handler race condition #170

Merged
chefsale merged 8 commits into master from
bounty-fix/fix-race-condition-in-signal-handler-cal-mlrugkmd
Feb 18, 2026
Conversation


@chefsale chefsale commented Feb 18, 2026

refactor: Fix race condition in signal handler calling sys.exit during async execution

This PR addresses a race condition where sys.exit() in signal handlers could corrupt async execution state, and prevents double cleanup by atexit handlers.

The _signal_handler now uses os._exit() for immediate termination without raising SystemExit, and _cleanup_resources is made thread-safe with a lock to ensure it runs only once.




Note

Medium Risk
Touches shutdown/cleanup paths for both Docker- and process-based node management, so regressions could leave containers/processes running or exit unexpectedly; changes are mitigated by added concurrency-focused unit tests.

Overview
Introduces a shared CleanupMixin used by both DockerManager and BinaryManager to centralize SIGINT/SIGTERM + atexit cleanup, with an RLock-based guard so cleanup runs at most once and reports status via new CleanupResult.

Refactors both managers to implement _do_cleanup() and route cleanup through the guarded mechanism; DockerManager._cleanup_resources() now returns CleanupResult and supports re-entrant calls without raising SystemExit, with new unit tests covering double-cleanup and concurrent cleanup races for both managers.
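The at-most-once guard described above can be sketched as follows. This is an illustrative reconstruction from the PR description, not the actual code; class and attribute names follow the ones mentioned in the review thread:

```python
import threading
from enum import Enum


class CleanupResult(Enum):
    PERFORMED = "performed"
    ALREADY_DONE = "already_done"
    IN_PROGRESS = "in_progress"


class GuardedCleanup:
    """Illustrative at-most-once cleanup guard (simplified sketch)."""

    def __init__(self):
        # RLock so a signal handler firing in the same thread can re-enter
        # without deadlocking; the flag below detects that re-entry.
        self._cleanup_lock = threading.RLock()
        self._cleanup_in_progress = False
        self._cleanup_done = False

    def _cleanup_resources_guarded(self, cleanup_fn):
        with self._cleanup_lock:
            if self._cleanup_done:
                return CleanupResult.ALREADY_DONE
            if self._cleanup_in_progress:
                # Re-entrant call while cleanup is running in this thread.
                return CleanupResult.IN_PROGRESS
            self._cleanup_in_progress = True
            try:
                cleanup_fn()
            finally:
                # Marked done even on failure; whether that is the right
                # trade-off is debated in the review comments below.
                self._cleanup_done = True
                self._cleanup_in_progress = False
            return CleanupResult.PERFORMED
```

Other threads block on the RLock until cleanup finishes and then observe ALREADY_DONE, so IN_PROGRESS is only seen by a re-entrant call from the same thread.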

Written by Cursor Bugbot for commit 82482fb. This will update automatically on new commits.


@chefsale chefsale marked this pull request as ready for review February 18, 2026 09:50
@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 278.6s

💡 1 suggestion. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-6608a2d7

@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 299.5s

💡 2 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-14782b8b

@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 89% | Review time: 237.6s

📝 3 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-54b5b2bd

@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 83% | Review time: 242.7s

💡 1 suggestion, 📝 1 nitpick. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-8c2d72b5

@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 89% | Review time: 183.4s

💡 2 suggestions, 📝 1 nitpick. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-5d4e1b4e

def test_docker_manager_cleanup_concurrent_access(mock_docker):
"""Test that _cleanup_resources handles concurrent access correctly.

Spawns multiple threads calling _cleanup_resources simultaneously to verify

💡 Concurrent test could verify return value distribution

The test verifies cleanup ran once but doesn't capture return values to assert exactly one thread got True and the rest got False.

Suggested fix:

Collect return values from each thread and assert that exactly one is True and the rest are False.
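One hypothetical way to collect per-thread return values for that assertion (the helper name and shape are illustrative, not from the PR):

```python
import threading


def run_concurrently(fn, n_threads=8):
    """Run fn from several threads and collect each thread's return value."""
    results = []
    results_lock = threading.Lock()

    def worker():
        value = fn()
        with results_lock:
            results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The test could then assert that exactly one collected value is True and the rest are False.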

sys.exit(0)
# Only call sys.exit() if cleanup completed or was already done.
# If cleanup was in progress (re-entrant call), return without exit
# to avoid interrupting the ongoing cleanup with SystemExit.

💡 _cleanup_on_exit skips cleanup based only on _shutting_down flag

If signal handler sets _shutting_down=True but cleanup returns None (in-progress from elsewhere), atexit won't call cleanup either, relying on the original caller to complete - which works but could be fragile if caller is interrupted.

Suggested fix:

Consider checking _cleanup_done directly instead of _shutting_down, or call _cleanup_resources unconditionally since it's idempotent anyway.

@cursor cursor bot force-pushed the bounty-fix/fix-race-condition-in-signal-handler-cal-mlrugkmd branch from 2d3bbbf to ce95709 on February 18, 2026 13:29
@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 368.6s

🟡 2 warnings, 💡 2 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-ed876f94

# to avoid interrupting the ongoing cleanup with SystemExit.
if cleanup_result is not None:
sys.exit(0)


💡 Consider using an enum for cleanup state return values

The tri-state return (True/False/None) where None means 'in progress' is non-obvious; an enum like CleanupResult.PERFORMED | ALREADY_DONE | IN_PROGRESS would be more self-documenting.

Suggested fix:

Define a simple `CleanupResult` enum and return its members instead of True/False/None.

@chefsale chefsale changed the title from "Signal handler race condition" to "feat: signal handler race condition" Feb 18, 2026
@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 325.4s

💡 2 suggestions, 📝 2 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-35f721ab


# Verify return value distribution: exactly one PERFORMED, rest ALREADY_DONE
performed_count = sum(1 for r in results if r == CleanupResult.PERFORMED)
done_count = sum(1 for r in results if r == CleanupResult.ALREADY_DONE)

📝 Nit: Test docstring slightly mischaracterizes RLock behavior

The comment says 'With RLock, the same thread can re-enter, and should get IN_PROGRESS' - but IN_PROGRESS is returned because of the _cleanup_in_progress flag check, not the RLock itself; the RLock actually allows re-entry.

Suggested fix:

Clarify that the RLock allows re-entry but the `_cleanup_in_progress` flag detects it and returns early.

cursoragent and others added 6 commits February 18, 2026 15:38
…g async execu

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
- Revert os._exit() to sys.exit() to allow proper stack unwinding and
  finally block execution (prevents leaking external resources)
- Update docstrings to clarify 'at-most-once' execution semantics
- Add concurrent cleanup test to verify thread-safety under parallel access
- The lock now correctly ensures cleanup runs only once while preserving
  proper Python cleanup semantics

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
- Change threading.Lock to threading.RLock to prevent deadlock when signal
  arrives while atexit holds the lock (signal handlers run in same thread)
- Remove test_docker_manager_has_cleanup_lock (tests implementation details)
- Clean up dead code in concurrent test (unused tracking variables)
- Update docstring to document RLock usage

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
- Add _cleanup_in_progress flag to detect re-entrant calls during cleanup
- Move entire cleanup operation inside the lock to prevent partial cleanup
- Return status from _cleanup_resources (True=done, False=already done, None=in progress)
- Signal handler returns without sys.exit() if cleanup is in progress,
  allowing ongoing cleanup to complete without SystemExit interruption
- Add sys.stdout.flush() before os._exit(1) to ensure message is printed
- Set _cleanup_done after cleanup completes (not before) for accurate state
- Add test for in-progress cleanup detection

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
- Apply same thread-safe cleanup pattern to BinaryManager:
  - Add RLock, _cleanup_in_progress, _cleanup_done flags
  - Update signal handler to check cleanup result before sys.exit()
  - Add sys.stdout.flush() before os._exit(1)
  - Move cleanup operations inside lock with try/finally
- Fix _cleanup_on_exit in both managers to call cleanup unconditionally
  (it's idempotent and returns immediately if already done/in progress)
- Update concurrent test to verify return value distribution:
  exactly one True (performed cleanup), rest False (already done)

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
- Add CleanupResult enum (PERFORMED, ALREADY_DONE, IN_PROGRESS) to constants.py
  for self-documenting cleanup return values
- Update DockerManager and BinaryManager to use CleanupResult enum
- Update DockerManager tests to use CleanupResult enum
- Add BinaryManager tests for:
  - Double cleanup prevention
  - Concurrent access handling with return value distribution
  - In-progress state detection

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
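The handler wiring described in these commits might look roughly like this. The manager object and its method names are stand-ins inferred from the commit messages:

```python
import atexit
import signal


def setup_handlers(manager):
    """Illustrative wiring of SIGINT/SIGTERM plus atexit cleanup.

    Per the commits above, it is safe to register the atexit hook
    unconditionally: the guarded cleanup is idempotent and returns
    immediately if cleanup is already done or in progress.
    """
    signal.signal(signal.SIGINT, manager._signal_handler)
    signal.signal(signal.SIGTERM, manager._signal_handler)
    atexit.register(manager._cleanup_on_exit)
```

Because signal handlers in CPython run in the main thread, a signal arriving while atexit cleanup holds the lock re-enters from the same thread, which is why the commits switch from Lock to RLock.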
@cursor cursor bot force-pushed the bounty-fix/fix-race-condition-in-signal-handler-cal-mlrugkmd branch from 7b0770f to 24cf063 on February 18, 2026 15:43
@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 86% | Review time: 300.6s

🟡 3 warnings, 💡 2 suggestions, 📝 2 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-933142b1

@@ -184,424 +179,114 @@ def test_docker_manager_remove_signal_handlers(mock_docker):


🟡 Deleted tests reduce coverage of graceful shutdown functionality

The PR removes ~15 tests covering SIGTERM handling, graceful stop, batch shutdown, and CORS configuration that still exist in production code, creating regression risk for these paths.

Suggested fix:

Keep the existing tests (test_graceful_stop_container_sends_sigterm, test_stop_node_uses_graceful_shutdown, etc.) and add the new thread-safety tests alongside them.

@@ -184,424 +179,114 @@ def test_docker_manager_remove_signal_handlers(mock_docker):


🟡 Removal of CORS security validation tests

Security-relevant tests like test_validate_cors_origins_rejects_wildcard, test_validate_cors_origins_rejects_comma, and test_cors_uses_explicit_origins_not_wildcard were deleted; these tests prevent CORS misconfigurations that could allow unauthorized cross-origin access.

Suggested fix:

Restore the CORS validation tests or move them to a separate test file to ensure CORS security properties remain verified.

@@ -141,33 +146,54 @@ def _setup_signal_handlers(self):
atexit.register(self._cleanup_on_exit)

💡 DRY violation: duplicated cleanup logic between managers

The _signal_handler, _cleanup_on_exit, and _cleanup_resources implementations are nearly identical between DockerManager and BinaryManager, including the same instance variables (_cleanup_lock, _cleanup_in_progress, _cleanup_done).

Suggested fix:

Extract the common cleanup coordination logic into a base class or mixin (e.g., CleanupMixin) that both managers inherit from, with an abstract method for the actual resource cleanup.

@@ -1,16 +1,11 @@
import signal

🟡 Unrelated test coverage deleted

The PR removed 15+ tests for CORS validation, graceful shutdown, and batch container stopping that are unrelated to the race condition fix; these tests covered functionality that still exists in manager.py.

Suggested fix:

Restore the deleted tests (test_graceful_stop_*, test_cors_*, test_validate_cors_origins_*, test_get_node_hostname, test_default_cors_origins_constant) in a separate file or keep them in this file.

mock_container = MagicMock()
manager.nodes = {"node1": mock_container}
manager.node_rpc_ports = {"node1": 2528}


💡 Concurrent test doesn't verify IN_PROGRESS is never returned

The test asserts exactly 1 PERFORMED and n-1 ALREADY_DONE, but doesn't check that no thread received IN_PROGRESS, which could mask re-entrancy bugs if the lock semantics change.

Suggested fix:

Add assertion: `in_progress_count = sum(1 for r in results if r == CleanupResult.IN_PROGRESS); assert in_progress_count == 0`

try:
if self.processes:
console.print("[cyan]Stopping managed processes...[/cyan]")
for node_name in list(self.processes.keys()):

📝 Nit: cleanup_done set True even on partial failure

The finally block marks cleanup as done regardless of whether an exception occurred during process termination, preventing retry attempts on partial failure.

Suggested fix:

Consider only setting `_cleanup_done = True` outside the exception handler, or document this as intentional fail-fast behavior.


def test_binary_manager_cleanup_in_progress_returns_in_progress():
"""Test that re-entrant cleanup call returns IN_PROGRESS when cleanup is running.


📝 Nit: Test name/docstring could clarify it tests flag state, not true re-entrancy

The test manually sets _cleanup_in_progress=True without holding the lock, which tests the flag check behavior but not actual re-entrant signal handling during cleanup.

Suggested fix:

Rename to test_cleanup_returns_in_progress_when_flag_set or add a comment clarifying this tests the flag check logic specifically.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

self.node_rpc_ports.clear()
finally:
self._cleanup_done = True
self._cleanup_in_progress = False

Cleanup done flag set even on exception

Medium Severity

The _cleanup_done flag is set in the finally block, marking cleanup as complete even if an exception occurs during cleanup (e.g., from console.print() if stdout is broken, or from line 154 before any processes are stopped). This prevents future cleanup attempts since subsequent calls return ALREADY_DONE, leaving managed processes or containers running.


assert "http://localhost:3000" in DEFAULT_CORS_ORIGINS
result = manager._cleanup_resources()
assert result == CleanupResult.PERFORMED
assert mock_container.stop.call_count == 1

Significant test coverage removed without justification

Medium Severity

Approximately 15 tests were removed from test_docker_manager.py, including comprehensive tests for graceful shutdown behavior (_graceful_stop_container, _graceful_stop_containers_batch) and CORS configuration (_validate_cors_origins, _get_node_hostname). The underlying functionality still exists in manager.py, but lost test coverage. This significantly reduces confidence in these features and makes future regressions harder to catch.


- Restore all graceful shutdown tests from master that were lost during rebase
- Restore all CORS validation and configuration tests
- Add IN_PROGRESS assertion to concurrent access tests to verify no re-entrancy bugs
- Rename flag state tests to clarify they test flag check logic specifically

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 323.7s

🟡 1 warning, 💡 1 suggestion. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-e16e5e50

try:
if self.processes:
console.print("[cyan]Stopping managed processes...[/cyan]")
for node_name in list(self.processes.keys()):

💡 Cleanup marked done even on partial failure

If an exception occurs during cleanup, _cleanup_done is still set to True in the finally block, meaning subsequent calls return ALREADY_DONE and orphan processes may remain running.

Suggested fix:

Consider only setting _cleanup_done = True after successful cleanup, or catch exceptions and re-raise after clearing _cleanup_in_progress but not setting _cleanup_done.

@@ -69,6 +71,9 @@ def __init__(
self.pid_file_dir = Path("./data/.pids")

🟡 DRY violation: Duplicated cleanup guard pattern

The thread-safe cleanup pattern (_cleanup_lock, _cleanup_in_progress, _cleanup_done, _signal_handler, _cleanup_on_exit) is duplicated nearly identically between BinaryManager and DockerManager; consider extracting to a common base class or mixin.

Suggested fix:

Create a CleanupMixin or BaseManager class containing the signal handling and thread-safe cleanup guard logic, then have both managers inherit from it.

- Create CleanupMixin ABC in cleanup_mixin.py with:
  - _init_cleanup_state(): Initialize cleanup-related instance variables
  - _setup_signal_handlers(): Register SIGINT/SIGTERM handlers and atexit
  - _signal_handler(): Handle signals with graceful shutdown
  - _cleanup_on_exit(): Atexit handler
  - _cleanup_resources_guarded(): Thread-safe cleanup wrapper with RLock
  - _cleanup_resources(): Default implementation calling _do_cleanup()
  - remove_signal_handlers(): Restore original signal handlers
  - Abstract _do_cleanup(): Subclasses implement actual cleanup

- Update DockerManager to inherit from CleanupMixin:
  - Call _init_cleanup_state() in __init__
  - Implement _do_cleanup() with container graceful shutdown
  - Override _cleanup_resources() to add drain_timeout/stop_timeout params

- Update BinaryManager to inherit from CleanupMixin:
  - Call _init_cleanup_state() in __init__
  - Implement _do_cleanup() with process termination logic
  - Use default _cleanup_resources() from mixin

This eliminates ~90 lines of duplicated code between the managers while
maintaining identical behavior and thread-safety guarantees.

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
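The inheritance shape this commit describes can be sketched as below. This is a simplified stand-in, not the PR's cleanup_mixin.py; the real mixin also carries the RLock guard, signal handlers, and atexit registration:

```python
from abc import ABC, abstractmethod


class CleanupMixin(ABC):
    """Simplified shape of the shared cleanup mixin."""

    @abstractmethod
    def _do_cleanup(self):
        """Subclasses implement the actual resource cleanup.

        Subclasses may declare their own parameters (e.g. Docker timeouts),
        since the guarded wrapper passes arguments through.
        """

    def _cleanup_resources(self):
        # Default implementation: route through the subclass hook.
        return self._do_cleanup()


class BinaryManagerSketch(CleanupMixin):
    """Hypothetical stand-in for BinaryManager's process cleanup."""

    def __init__(self):
        self.processes = {"node1": object(), "node2": object()}

    def _do_cleanup(self):
        # Stop and forget every managed process (stand-ins here).
        stopped = sorted(self.processes)
        self.processes.clear()
        return stopped
```

DockerManager would follow the same pattern but override _cleanup_resources() to accept drain_timeout/stop_timeout, as the commit message notes.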
@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 83% | Review time: 427.1s

💡 4 suggestions, 📝 2 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-23611439

cleanup_fn: The cleanup function to execute (typically self._do_cleanup)
*args, **kwargs: Arguments to pass to cleanup_fn
Returns:

💡 Cleanup marked done even on exception - behavior could be documented

The finally block sets _cleanup_done = True even if cleanup_fn raises, meaning failed cleanup won't be retried; this is likely intentional for shutdown robustness but worth documenting.

Suggested fix:

Add a brief comment in the finally block explaining that marking done on failure is intentional to prevent retry loops during shutdown.

Thread-safe guard ensuring at-most-once execution semantics. Uses RLock
to allow re-entrant calls from signal handlers in the same thread.

💡 Consider logging exceptions in cleanup to prevent silent failures

If cleanup_fn() raises an exception, the cleanup is marked as done via finally but the exception propagates up - callers may not know cleanup partially failed.

Suggested fix:

Add try/except inside _cleanup_resources_guarded to log exceptions before re-raising, or document that partial cleanup exceptions are intentionally propagated.

CleanupResult.PERFORMED: Cleanup was executed by this call
CleanupResult.ALREADY_DONE: Cleanup was already completed previously
CleanupResult.IN_PROGRESS: Cleanup is currently in progress (re-entrant call)
"""

💡 Abstract method signature uses *args/**kwargs but subclasses differ

DockerManager._do_cleanup takes explicit timeout parameters while BinaryManager._do_cleanup takes none; consider using a Protocol or documenting this variance in the abstract method.

Suggested fix:

Add a note to the _do_cleanup docstring that subclasses may define their own signature since _cleanup_resources_guarded passes through arguments.

cleanup with SystemExit.
"""
if self._shutting_down:
console.print("\n[red]Forced exit requested, terminating...[/red]")

💡 Signal handler returns silently when cleanup is IN_PROGRESS

When cleanup returns IN_PROGRESS, the signal handler returns without calling sys.exit(), relying on the existing cleanup to complete; this is correct but the implicit control flow could confuse maintainers.

Suggested fix:

Consider adding a debug log or comment explaining that the process will exit via the ongoing cleanup path.
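The handler control flow under discussion, reconstructed as a hedged sketch from the commit messages (the manager object and its attributes are illustrative stand-ins):

```python
import os
import sys


def make_signal_handler(manager):
    """Hypothetical handler following the flow described in this PR."""

    def _signal_handler(signum, frame):
        if manager._shutting_down:
            # Second signal: force immediate termination, flushing output
            # first so the message is not lost by os._exit().
            print("Forced exit requested, terminating...")
            sys.stdout.flush()
            os._exit(1)
        manager._shutting_down = True
        result = manager._cleanup_resources()
        # An in-progress result (None in the pre-enum version) means cleanup
        # is already running; returning without sys.exit() lets it finish
        # instead of interrupting it with SystemExit. The process then exits
        # via that ongoing cleanup path.
        if result is not None:
            sys.exit(0)

    return _signal_handler
```

This makes the implicit control flow the reviewer flags explicit: the no-exit branch is the deliberate IN_PROGRESS path.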

result = manager._cleanup_resources()
assert result == CleanupResult.PERFORMED
assert mock_container1.stop.call_count == 1
assert manager._cleanup_done is True

📝 Nit: Test docstring references outdated return types

Docstring says 'exactly one True, rest False' but the test actually verifies CleanupResult.PERFORMED and CleanupResult.ALREADY_DONE enum values.

Suggested fix:

Update docstring to: 'Also verifies return value distribution: exactly one PERFORMED, rest ALREADY_DONE.'


Spawns multiple threads calling _cleanup_resources simultaneously to verify
the lock ensures at-most-once execution under concurrent access.
Also verifies return value distribution: exactly one True, rest False.

📝 Nit: Test docstring mentions "True/False" but tests CleanupResult enum

Comment says "exactly one True, rest False" but the test checks for CleanupResult.PERFORMED and CleanupResult.ALREADY_DONE - the BinaryManager equivalent has the correct wording.

Suggested fix:

Update comment to: "Also verifies return value distribution: exactly one PERFORMED, rest ALREADY_DONE."

@chefsale chefsale merged commit fd3eb9f into master Feb 18, 2026
20 checks passed