feat: signal handler race condition #170

Merged
chefsale merged 8 commits into master from
bounty-fix/fix-race-condition-in-signal-handler-cal-mlrugkmd
Feb 18, 2026
Conversation


@chefsale chefsale commented Feb 18, 2026

refactor: Fix race condition in signal handler calling sys.exit during async execution

This PR addresses a race condition where sys.exit() in signal handlers could corrupt async execution state, and prevents double cleanup by atexit handlers.

The _signal_handler now uses os._exit() for immediate termination without raising SystemExit, and _cleanup_resources is made thread-safe with a lock to ensure it runs only once.




Note

Medium Risk
Touches shutdown/cleanup paths for both Docker- and process-based node management, so regressions could leave containers/processes running or exit unexpectedly; changes are mitigated by added concurrency-focused unit tests.

Overview
Introduces a shared CleanupMixin used by both DockerManager and BinaryManager to centralize SIGINT/SIGTERM + atexit cleanup, with an RLock-based guard so cleanup runs at most once and reports status via new CleanupResult.

Refactors both managers to implement _do_cleanup() and route cleanup through the guarded mechanism; DockerManager._cleanup_resources() now returns CleanupResult and supports re-entrant calls without raising SystemExit, with new unit tests covering double-cleanup and concurrent cleanup races for both managers.
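The at-most-once guard described above can be sketched as follows. This is an illustrative reconstruction from the PR description, not the actual code; class and attribute names follow the ones mentioned in the review thread:

```python
import threading
from enum import Enum


class CleanupResult(Enum):
    PERFORMED = "performed"
    ALREADY_DONE = "already_done"
    IN_PROGRESS = "in_progress"


class GuardedCleanup:
    """Illustrative at-most-once cleanup guard (simplified sketch)."""

    def __init__(self):
        # RLock so a signal handler firing in the same thread can re-enter
        # without deadlocking; the flag below detects that re-entry.
        self._cleanup_lock = threading.RLock()
        self._cleanup_in_progress = False
        self._cleanup_done = False

    def _cleanup_resources_guarded(self, cleanup_fn):
        with self._cleanup_lock:
            if self._cleanup_done:
                return CleanupResult.ALREADY_DONE
            if self._cleanup_in_progress:
                # Re-entrant call while cleanup is running in this thread.
                return CleanupResult.IN_PROGRESS
            self._cleanup_in_progress = True
            try:
                cleanup_fn()
            finally:
                # Marked done even on failure; whether that is the right
                # trade-off is debated in the review comments below.
                self._cleanup_done = True
                self._cleanup_in_progress = False
            return CleanupResult.PERFORMED
```

Other threads block on the RLock until cleanup finishes and then observe ALREADY_DONE, so IN_PROGRESS is only seen by a re-entrant call from the same thread.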

Written by Cursor Bugbot for commit 82482fb. This will update automatically on new commits.


@chefsale chefsale marked this pull request as ready for review February 18, 2026 09:50
@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 278.6s

💡 1 suggestion. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-6608a2d7

@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 299.5s

💡 2 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-14782b8b

@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 89% | Review time: 237.6s

📝 3 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-54b5b2bd

@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 83% | Review time: 242.7s

💡 1 suggestion, 📝 1 nitpick. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-8c2d72b5

@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 89% | Review time: 183.4s

💡 2 suggestions, 📝 1 nitpick. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-5d4e1b4e

def test_docker_manager_cleanup_concurrent_access(mock_docker):
"""Test that _cleanup_resources handles concurrent access correctly.

Spawns multiple threads calling _cleanup_resources simultaneously to verify

💡 Concurrent test could verify return value distribution

The test verifies cleanup ran once but doesn't capture return values to assert exactly one thread got True and the rest got False.

Suggested fix:

Collect return values from each thread and assert that exactly one is True and the rest are False.
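One hypothetical way to collect per-thread return values for that assertion (the helper name and shape are illustrative, not from the PR):

```python
import threading


def run_concurrently(fn, n_threads=8):
    """Run fn from several threads and collect each thread's return value."""
    results = []
    results_lock = threading.Lock()

    def worker():
        value = fn()
        with results_lock:
            results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The test could then assert that exactly one collected value is True and the rest are False.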

sys.exit(0)
# Only call sys.exit() if cleanup completed or was already done.
# If cleanup was in progress (re-entrant call), return without exit
# to avoid interrupting the ongoing cleanup with SystemExit.

💡 _cleanup_on_exit skips cleanup based only on _shutting_down flag

If signal handler sets _shutting_down=True but cleanup returns None (in-progress from elsewhere), atexit won't call cleanup either, relying on the original caller to complete - which works but could be fragile if caller is interrupted.

Suggested fix:

Consider checking _cleanup_done directly instead of _shutting_down, or call _cleanup_resources unconditionally since it's idempotent anyway.

@cursor cursor bot force-pushed the bounty-fix/fix-race-condition-in-signal-handler-cal-mlrugkmd branch from 2d3bbbf to ce95709 on February 18, 2026 13:29
@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 368.6s

🟡 2 warnings, 💡 2 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-ed876f94

# to avoid interrupting the ongoing cleanup with SystemExit.
if cleanup_result is not None:
sys.exit(0)


💡 Consider using an enum for cleanup state return values

The tri-state return (True/False/None) where None means 'in progress' is non-obvious; an enum like CleanupResult.PERFORMED | ALREADY_DONE | IN_PROGRESS would be more self-documenting.

Suggested fix:

Define a simple `CleanupResult` enum and return its members instead of True/False/None.

@chefsale chefsale changed the title from "Signal handler race condition" to "feat: signal handler race condition" Feb 18, 2026
@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 325.4s

💡 2 suggestions, 📝 2 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-35f721ab


# Verify return value distribution: exactly one PERFORMED, rest ALREADY_DONE
performed_count = sum(1 for r in results if r == CleanupResult.PERFORMED)
done_count = sum(1 for r in results if r == CleanupResult.ALREADY_DONE)

📝 Nit: Test docstring slightly mischaracterizes RLock behavior

The comment says 'With RLock, the same thread can re-enter, and should get IN_PROGRESS' - but IN_PROGRESS is returned because of the _cleanup_in_progress flag check, not the RLock itself; the RLock actually allows re-entry.

Suggested fix:

Clarify that the RLock allows re-entry but the `_cleanup_in_progress` flag detects it and returns early.

cursoragent and others added 6 commits February 18, 2026 15:38
…g async execu

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
- Revert os._exit() to sys.exit() to allow proper stack unwinding and
  finally block execution (prevents leaking external resources)
- Update docstrings to clarify 'at-most-once' execution semantics
- Add concurrent cleanup test to verify thread-safety under parallel access
- The lock now correctly ensures cleanup runs only once while preserving
  proper Python cleanup semantics

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
- Change threading.Lock to threading.RLock to prevent deadlock when signal
  arrives while atexit holds the lock (signal handlers run in same thread)
- Remove test_docker_manager_has_cleanup_lock (tests implementation details)
- Clean up dead code in concurrent test (unused tracking variables)
- Update docstring to document RLock usage

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
- Add _cleanup_in_progress flag to detect re-entrant calls during cleanup
- Move entire cleanup operation inside the lock to prevent partial cleanup
- Return status from _cleanup_resources (True=done, False=already done, None=in progress)
- Signal handler returns without sys.exit() if cleanup is in progress,
  allowing ongoing cleanup to complete without SystemExit interruption
- Add sys.stdout.flush() before os._exit(1) to ensure message is printed
- Set _cleanup_done after cleanup completes (not before) for accurate state
- Add test for in-progress cleanup detection

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
- Apply same thread-safe cleanup pattern to BinaryManager:
  - Add RLock, _cleanup_in_progress, _cleanup_done flags
  - Update signal handler to check cleanup result before sys.exit()
  - Add sys.stdout.flush() before os._exit(1)
  - Move cleanup operations inside lock with try/finally
- Fix _cleanup_on_exit in both managers to call cleanup unconditionally
  (it's idempotent and returns immediately if already done/in progress)
- Update concurrent test to verify return value distribution:
  exactly one True (performed cleanup), rest False (already done)

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
- Add CleanupResult enum (PERFORMED, ALREADY_DONE, IN_PROGRESS) to constants.py
  for self-documenting cleanup return values
- Update DockerManager and BinaryManager to use CleanupResult enum
- Update DockerManager tests to use CleanupResult enum
- Add BinaryManager tests for:
  - Double cleanup prevention
  - Concurrent access handling with return value distribution
  - In-progress state detection

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
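The handler wiring described in these commits might look roughly like this. The manager object and its method names are stand-ins inferred from the commit messages:

```python
import atexit
import signal


def setup_handlers(manager):
    """Illustrative wiring of SIGINT/SIGTERM plus atexit cleanup.

    Per the commits above, it is safe to register the atexit hook
    unconditionally: the guarded cleanup is idempotent and returns
    immediately if cleanup is already done or in progress.
    """
    signal.signal(signal.SIGINT, manager._signal_handler)
    signal.signal(signal.SIGTERM, manager._signal_handler)
    atexit.register(manager._cleanup_on_exit)
```

Because signal handlers in CPython run in the main thread, a signal arriving while atexit cleanup holds the lock re-enters from the same thread, which is why the commits switch from Lock to RLock.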
@cursor cursor bot force-pushed the bounty-fix/fix-race-condition-in-signal-handler-cal-mlrugkmd branch from 7b0770f to 24cf063 on February 18, 2026 15:43
@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 86% | Review time: 300.6s

🟡 3 warnings, 💡 2 suggestions, 📝 2 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-933142b1

@@ -184,424 +179,114 @@ def test_docker_manager_remove_signal_handlers(mock_docker):


🟡 Deleted tests reduce coverage of graceful shutdown functionality

The PR removes ~15 tests covering SIGTERM handling, graceful stop, batch shutdown, and CORS configuration that still exist in production code, creating regression risk for these paths.

Suggested fix:

Keep the existing tests (test_graceful_stop_container_sends_sigterm, test_stop_node_uses_graceful_shutdown, etc.) and add the new thread-safety tests alongside them.

@@ -184,424 +179,114 @@ def test_docker_manager_remove_signal_handlers(mock_docker):


🟡 Removal of CORS security validation tests

Security-relevant tests like test_validate_cors_origins_rejects_wildcard, test_validate_cors_origins_rejects_comma, and test_cors_uses_explicit_origins_not_wildcard were deleted; these tests prevent CORS misconfigurations that could allow unauthorized cross-origin access.

Suggested fix:

Restore the CORS validation tests or move them to a separate test file to ensure CORS security properties remain verified.

@@ -141,33 +146,54 @@ def _setup_signal_handlers(self):
atexit.register(self._cleanup_on_exit)

💡 DRY violation: duplicated cleanup logic between managers

The _signal_handler, _cleanup_on_exit, and _cleanup_resources implementations are nearly identical between DockerManager and BinaryManager, including the same instance variables (_cleanup_lock, _cleanup_in_progress, _cleanup_done).

Suggested fix:

Extract the common cleanup coordination logic into a base class or mixin (e.g., CleanupMixin) that both managers inherit from, with an abstract method for the actual resource cleanup.

@@ -1,16 +1,11 @@
import signal

🟡 Unrelated test coverage deleted

The PR removed 15+ tests for CORS validation, graceful shutdown, and batch container stopping that are unrelated to the race condition fix; these tests covered functionality that still exists in manager.py.

Suggested fix:

Restore the deleted tests (test_graceful_stop_*, test_cors_*, test_validate_cors_origins_*, test_get_node_hostname, test_default_cors_origins_constant) in a separate file or keep them in this file.

mock_container = MagicMock()
manager.nodes = {"node1": mock_container}
manager.node_rpc_ports = {"node1": 2528}


💡 Concurrent test doesn't verify IN_PROGRESS is never returned

The test asserts exactly 1 PERFORMED and n-1 ALREADY_DONE, but doesn't check that no thread received IN_PROGRESS, which could mask re-entrancy bugs if the lock semantics change.

Suggested fix:

Add assertion: `in_progress_count = sum(1 for r in results if r == CleanupResult.IN_PROGRESS); assert in_progress_count == 0`

try:
if self.processes:
console.print("[cyan]Stopping managed processes...[/cyan]")
for node_name in list(self.processes.keys()):

📝 Nit: cleanup_done set True even on partial failure

The finally block marks cleanup as done regardless of whether an exception occurred during process termination, preventing retry attempts on partial failure.

Suggested fix:

Consider only setting `_cleanup_done = True` outside the exception handler, or document this as intentional fail-fast behavior.


def test_binary_manager_cleanup_in_progress_returns_in_progress():
"""Test that re-entrant cleanup call returns IN_PROGRESS when cleanup is running.


📝 Nit: Test name/docstring could clarify it tests flag state, not true re-entrancy

The test manually sets _cleanup_in_progress=True without holding the lock, which tests the flag check behavior but not actual re-entrant signal handling during cleanup.

Suggested fix:

Rename to test_cleanup_returns_in_progress_when_flag_set or add a comment clarifying this tests the flag check logic specifically.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

self.node_rpc_ports.clear()
finally:
self._cleanup_done = True
self._cleanup_in_progress = False

Cleanup done flag set even on exception

Medium Severity

The _cleanup_done flag is set in the finally block, marking cleanup as complete even if an exception occurs during cleanup (e.g., from console.print() if stdout is broken, or from line 154 before any processes are stopped). This prevents future cleanup attempts since subsequent calls return ALREADY_DONE, leaving managed processes or containers running.


assert "http://localhost:3000" in DEFAULT_CORS_ORIGINS
result = manager._cleanup_resources()
assert result == CleanupResult.PERFORMED
assert mock_container.stop.call_count == 1

Significant test coverage removed without justification

Medium Severity

Approximately 15 tests were removed from test_docker_manager.py, including comprehensive tests for graceful shutdown behavior (_graceful_stop_container, _graceful_stop_containers_batch) and CORS configuration (_validate_cors_origins, _get_node_hostname). The underlying functionality still exists in manager.py, but lost test coverage. This significantly reduces confidence in these features and makes future regressions harder to catch.


- Restore all graceful shutdown tests from master that were lost during rebase
- Restore all CORS validation and configuration tests
- Add IN_PROGRESS assertion to concurrent access tests to verify no re-entrancy bugs
- Rename flag state tests to clarify they test flag check logic specifically

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 323.7s

🟡 1 warning, 💡 1 suggestion. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-e16e5e50

try:
if self.processes:
console.print("[cyan]Stopping managed processes...[/cyan]")
for node_name in list(self.processes.keys()):

💡 Cleanup marked done even on partial failure

If an exception occurs during cleanup, _cleanup_done is still set to True in the finally block, meaning subsequent calls return ALREADY_DONE and orphan processes may remain running.

Suggested fix:

Consider only setting _cleanup_done = True after successful cleanup, or catch exceptions and re-raise after clearing _cleanup_in_progress but not setting _cleanup_done.

@@ -69,6 +71,9 @@ def __init__(
self.pid_file_dir = Path("./data/.pids")

🟡 DRY violation: Duplicated cleanup guard pattern

The thread-safe cleanup pattern (_cleanup_lock, _cleanup_in_progress, _cleanup_done, _signal_handler, _cleanup_on_exit) is duplicated nearly identically between BinaryManager and DockerManager; consider extracting to a common base class or mixin.

Suggested fix:

Create a CleanupMixin or BaseManager class containing the signal handling and thread-safe cleanup guard logic, then have both managers inherit from it.

- Create CleanupMixin ABC in cleanup_mixin.py with:
  - _init_cleanup_state(): Initialize cleanup-related instance variables
  - _setup_signal_handlers(): Register SIGINT/SIGTERM handlers and atexit
  - _signal_handler(): Handle signals with graceful shutdown
  - _cleanup_on_exit(): Atexit handler
  - _cleanup_resources_guarded(): Thread-safe cleanup wrapper with RLock
  - _cleanup_resources(): Default implementation calling _do_cleanup()
  - remove_signal_handlers(): Restore original signal handlers
  - Abstract _do_cleanup(): Subclasses implement actual cleanup

- Update DockerManager to inherit from CleanupMixin:
  - Call _init_cleanup_state() in __init__
  - Implement _do_cleanup() with container graceful shutdown
  - Override _cleanup_resources() to add drain_timeout/stop_timeout params

- Update BinaryManager to inherit from CleanupMixin:
  - Call _init_cleanup_state() in __init__
  - Implement _do_cleanup() with process termination logic
  - Use default _cleanup_resources() from mixin

This eliminates ~90 lines of duplicated code between the managers while
maintaining identical behavior and thread-safety guarantees.

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
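The inheritance shape this commit describes can be sketched as below. This is a simplified stand-in, not the PR's cleanup_mixin.py; the real mixin also carries the RLock guard, signal handlers, and atexit registration:

```python
from abc import ABC, abstractmethod


class CleanupMixin(ABC):
    """Simplified shape of the shared cleanup mixin."""

    @abstractmethod
    def _do_cleanup(self):
        """Subclasses implement the actual resource cleanup.

        Subclasses may declare their own parameters (e.g. Docker timeouts),
        since the guarded wrapper passes arguments through.
        """

    def _cleanup_resources(self):
        # Default implementation: route through the subclass hook.
        return self._do_cleanup()


class BinaryManagerSketch(CleanupMixin):
    """Hypothetical stand-in for BinaryManager's process cleanup."""

    def __init__(self):
        self.processes = {"node1": object(), "node2": object()}

    def _do_cleanup(self):
        # Stop and forget every managed process (stand-ins here).
        stopped = sorted(self.processes)
        self.processes.clear()
        return stopped
```

DockerManager would follow the same pattern but override _cleanup_resources() to accept drain_timeout/stop_timeout, as the commit message notes.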
@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 83% | Review time: 427.1s

💡 4 suggestions, 📝 2 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-23611439

cleanup_fn: The cleanup function to execute (typically self._do_cleanup)
*args, **kwargs: Arguments to pass to cleanup_fn
Returns:

💡 Cleanup marked done even on exception - behavior could be documented

The finally block sets _cleanup_done = True even if cleanup_fn raises, meaning failed cleanup won't be retried; this is likely intentional for shutdown robustness but worth documenting.

Suggested fix:

Add a brief comment in the finally block explaining that marking done on failure is intentional to prevent retry loops during shutdown.

Thread-safe guard ensuring at-most-once execution semantics. Uses RLock
to allow re-entrant calls from signal handlers in the same thread.

💡 Consider logging exceptions in cleanup to prevent silent failures

If cleanup_fn() raises an exception, the cleanup is marked as done via finally but the exception propagates up - callers may not know cleanup partially failed.

Suggested fix:

Add try/except inside _cleanup_resources_guarded to log exceptions before re-raising, or document that partial cleanup exceptions are intentionally propagated.

CleanupResult.PERFORMED: Cleanup was executed by this call
CleanupResult.ALREADY_DONE: Cleanup was already completed previously
CleanupResult.IN_PROGRESS: Cleanup is currently in progress (re-entrant call)
"""

💡 Abstract method signature uses *args/**kwargs but subclasses differ

DockerManager._do_cleanup takes explicit timeout parameters while BinaryManager._do_cleanup takes none; consider using a Protocol or documenting this variance in the abstract method.

Suggested fix:

Add a note to the _do_cleanup docstring that subclasses may define their own signature since _cleanup_resources_guarded passes through arguments.

cleanup with SystemExit.
"""
if self._shutting_down:
console.print("\n[red]Forced exit requested, terminating...[/red]")

💡 Signal handler returns silently when cleanup is IN_PROGRESS

When cleanup returns IN_PROGRESS, the signal handler returns without calling sys.exit(), relying on the existing cleanup to complete; this is correct but the implicit control flow could confuse maintainers.

Suggested fix:

Consider adding a debug log or comment explaining that the process will exit via the ongoing cleanup path.
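The handler control flow under discussion, reconstructed as a hedged sketch from the commit messages (the manager object and its attributes are illustrative stand-ins):

```python
import os
import sys


def make_signal_handler(manager):
    """Hypothetical handler following the flow described in this PR."""

    def _signal_handler(signum, frame):
        if manager._shutting_down:
            # Second signal: force immediate termination, flushing output
            # first so the message is not lost by os._exit().
            print("Forced exit requested, terminating...")
            sys.stdout.flush()
            os._exit(1)
        manager._shutting_down = True
        result = manager._cleanup_resources()
        # An in-progress result (None in the pre-enum version) means cleanup
        # is already running; returning without sys.exit() lets it finish
        # instead of interrupting it with SystemExit. The process then exits
        # via that ongoing cleanup path.
        if result is not None:
            sys.exit(0)

    return _signal_handler
```

This makes the implicit control flow the reviewer flags explicit: the no-exit branch is the deliberate IN_PROGRESS path.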

result = manager._cleanup_resources()
assert result == CleanupResult.PERFORMED
assert mock_container1.stop.call_count == 1
assert manager._cleanup_done is True

📝 Nit: Test docstring references outdated return types

Docstring says 'exactly one True, rest False' but the test actually verifies CleanupResult.PERFORMED and CleanupResult.ALREADY_DONE enum values.

Suggested fix:

Update docstring to: 'Also verifies return value distribution: exactly one PERFORMED, rest ALREADY_DONE.'


Spawns multiple threads calling _cleanup_resources simultaneously to verify
the lock ensures at-most-once execution under concurrent access.
Also verifies return value distribution: exactly one True, rest False.

📝 Nit: Test docstring mentions "True/False" but tests CleanupResult enum

Comment says "exactly one True, rest False" but the test checks for CleanupResult.PERFORMED and CleanupResult.ALREADY_DONE - the BinaryManager equivalent has the correct wording.

Suggested fix:

Update comment to: "Also verifies return value distribution: exactly one PERFORMED, rest ALREADY_DONE."

@chefsale chefsale merged commit fd3eb9f into master Feb 18, 2026
20 checks passed