achimnol commented on Jan 5, 2026

Summary

  • Implements automatic tokio runtime cleanup when the last client context exits
  • Uses reference counting to track active client contexts
  • Shutdown completes within the async context to avoid deadlocks
  • Users no longer need to call cleanup_runtime() explicitly

Changes

pyo3-async-runtimes (vendored)

  • Changed RUNTIME_WRAPPER from OnceLock to RwLock<Option<...>> for re-initialization support
  • Added request_shutdown_background() - signals shutdown without blocking
  • Added join_pending_shutdown() - blocks until runtime thread terminates (GIL released)
  • Removed atexit handler approach (caused issues with multi-threaded scenarios)

etcd-client-py

Runtime management (src/runtime.rs):

  • Added SHUTDOWN_TIMEOUT_MS constant (5000ms)
  • Added ACTIVE_CONTEXTS atomic counter for reference counting
  • enter_context() / exit_context() - internal functions to track contexts
  • active_context_count() - public function for debugging/testing
  • _trigger_shutdown() / _join_pending_shutdown() - internal helpers for __aexit__
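The counter semantics can be sketched in Python as follows. The function names mirror the Rust ones listed above, but this is an illustrative stand-in (the real code uses an atomic counter, not a lock):

```python
import threading

# Sketch of the enter/exit counting; names mirror the Rust functions.
_count = 0
_count_lock = threading.Lock()

def enter_context() -> int:
    """Increment and return the new number of active contexts."""
    global _count
    with _count_lock:
        _count += 1
        return _count

def exit_context() -> bool:
    """Decrement; return True iff this was the last active context."""
    global _count
    with _count_lock:
        _count -= 1
        return _count == 0

def active_context_count() -> int:
    with _count_lock:
        return _count
```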

Client (src/client.rs):

  • __aenter__ increments context count before async work
  • __aexit__ uses a 3-phase shutdown sequence:
    1. Await tokio cleanup task (returns is_last_context flag)
    2. If last context: call _trigger_shutdown() from Python (after tokio task completes)
    3. Await asyncio.to_thread(_join_pending_shutdown) to block until runtime terminates
  • This ensures shutdown happens within the async context, avoiding deadlocks
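A Python-side sketch of the 3-phase sequence, with the three helpers replaced by trivial stand-ins for the Rust-backed functions of the same role:

```python
import asyncio

# Sketch of the 3-phase __aexit__ sequence; the helpers below are
# trivial stand-ins, not the real Rust-backed functions.
async def _rust_cleanup_task() -> bool:
    await asyncio.sleep(0)   # phase 1: await the tokio-side cleanup task
    return True              # the is_last_context flag

def _trigger_shutdown() -> None:
    pass                     # phase 2: signal runtime shutdown (non-blocking)

def _join_pending_shutdown() -> None:
    pass                     # blocks until the runtime thread terminates

async def aexit() -> bool:
    is_last = await _rust_cleanup_task()                 # phase 1
    if is_last:
        _trigger_shutdown()                              # phase 2: after the task returned
        await asyncio.to_thread(_join_pending_shutdown)  # phase 3: block in a worker thread
    return is_last
```

Phase 3 uses `asyncio.to_thread` so the blocking join happens off the event loop, which is what keeps the sequence deadlock-free.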

Tests:

  • Added test_shutdown_stress.py with comprehensive stress tests:
    • test_shutdown_multi_async_tasks - multiple async tasks sharing one event loop
    • test_shutdown_multi_threaded - multiple threads with separate event loops
    • test_shutdown_mixed_concurrency - threads × async tasks (most complex)
  • Added timeout constants for CI stability

Documentation:

  • Updated README with reorganized structure
  • Added "Automatic runtime cleanup" section explaining the mechanism

Key Design Decision

The shutdown trigger is called from Python after the tokio task completes, not from within the task. This avoids a race condition where the runtime could start shutting down while a task is still returning its result to Python.

Test Plan

  • test_single_client_context_count - single client lifecycle
  • test_multiple_concurrent_clients - cleanup only when all clients exit
  • test_nested_contexts_same_client - nested contexts counted separately
  • test_exception_during_context - count decremented on exception
  • test_sequential_clients_reinit - runtime re-initialization works
  • test_no_explicit_cleanup_needed - no segfaults without explicit cleanup
  • test_shutdown_multi_async_tasks - 5 concurrent tasks, 20 iterations
  • test_shutdown_multi_threaded - 4 threads, 10 iterations
  • test_shutdown_mixed_concurrency - 3 threads × 3 tasks, 10 iterations
  • All 24 tests pass on Python 3.11-3.14 (including 3.14t free-threaded)
  • All tests pass on both x86_64 and aarch64

Related

Add fetch-depth: 0 to checkout action to allow fetching submodule
commits that are on non-default branches.
Explicitly set the branch for the pyo3-async-runtimes submodule
to help Git fetch from the correct branch.
The CI was failing on Python 3.14 with:
  undefined symbol: PyUnstable_Module_SetGIL

This symbol only exists in free-threaded Python (3.14t), not in
regular Python 3.14. The issue was that uv may not properly
distinguish between 3.14 and 3.14t when both are available.

Using the +gil variant specifier (e.g., "3.14+gil") explicitly
requests the GIL-enabled Python interpreter, preventing uv from
accidentally selecting the free-threaded variant.
The +gil variant specifier is only for selecting from installed
interpreters, not for installation. uv python install 3.14 should
install the GIL-enabled version by default.
ARM64 + Python 3.14 (GIL-enabled) has a bug where uv's Python build
incorrectly triggers PyO3 to generate free-threaded code, causing
'undefined symbol: PyUnstable_Module_SetGIL' errors at runtime.

This is specific to:
- Platform: ARM64 (aarch64)
- Python: 3.14 (GIL-enabled)

The following combinations work correctly:
- x86_64 + Python 3.14 (GIL-enabled): OK
- ARM64 + Python 3.14t (free-threaded): OK

Excluding this specific combination until the upstream issue is resolved.
Updates the submodule to include fixes for compilation errors in the
upstream PR #71:
- Add tokio `sync` feature for Notify support
- Restore missing public API functions
- Fix stream module function references
Fixes CI build failure caused by deprecated function warnings treated
as errors with RUSTFLAGS="-D warnings".
Fixes stress test performance regression from 8+ minutes to ~7 seconds.

The issue was that request_shutdown() was blocking on thread.join() when
called from within a tokio task (in __aexit__), causing a 5-second
timeout per subprocess iteration.
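The hazard is essentially a self-join: the caller blocks waiting for the very thread it is running on. Pure Python refuses the direct form of this mistake outright, which makes for a compact illustration (an analogy only, unrelated to the actual etcd-client-py code):

```python
import threading

# A thread cannot block waiting for itself; CPython raises immediately.
# Analogously, the blocking request_shutdown() waited on thread.join()
# for the runtime thread its own caller was running on.
def try_self_join():
    try:
        threading.current_thread().join()
        return None
    except RuntimeError as exc:
        return str(exc)  # "cannot join current thread"
```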
achimnol force-pushed the refactor/automate-cleanup-runtime branch 5 times, most recently from 949538a to 36a0783 on January 8, 2026 16:06
achimnol force-pushed the refactor/automate-cleanup-runtime branch from 36a0783 to f35221d on January 8, 2026 16:24
…nditions

The exit_context() function is called from within a future_into_py block,
which means it runs inside a tokio task. Using the blocking request_shutdown()
could cause race conditions where in-flight tasks try to access a runtime
that is being torn down.

Added new request_shutdown_background() to pyo3-async-runtimes that signals
shutdown without blocking or immediately clearing the runtime slot, allowing
the current task to complete gracefully before the runtime shuts down.
The previous request_shutdown_background() implementation left the runtime
wrapper in storage, causing potential deadlocks. The new implementation:
1. Atomically clears the wrapper from storage (new ops get fresh runtime)
2. Spawns a background thread to properly join the runtime with timeout
3. Avoids blocking the calling async task
…utdown

The previous approach of spawning a detached join thread caused SIGSEGV
because it raced with Python's interpreter shutdown. Now we just signal
shutdown and let the runtime thread complete independently.
Added register_atexit_cleanup(py) call during module init to ensure tokio
runtime threads are properly joined before Python finalizes. This prevents
SIGSEGV crashes on Python 3.11 and 3.12 when tokio threads run during
interpreter shutdown.
Redesigned the shutdown mechanism to ensure the tokio runtime thread is
fully terminated before Python's event loop closes:

1. __aexit__ now returns a Python coroutine that:
   - Awaits the Rust async cleanup (tokio task)
   - If shutdown was triggered, awaits asyncio.to_thread() to block-join
     the runtime thread with GIL released

2. Added _join_pending_shutdown() Python function that:
   - Takes the pending thread handle from storage
   - Joins it with GIL released
   - Is called via asyncio.to_thread() from __aexit__

3. Added comprehensive stress tests for:
   - Multi-async-task scenario (5 concurrent tasks per process)
   - Multi-threaded scenario (4 threads with separate event loops)
   - Mixed concurrency (3 threads × 3 async tasks each)

This approach ensures:
- No atexit dependency - everything completes in async context
- Automatic cleanup when last client exits (ref-counting)
- No deadlocks - blocking happens outside tokio via to_thread
- Thread-safe and async-task-safe

Fixes the SIGSEGV that occurred when tokio threads outlived Python.
The multi-threaded and mixed concurrency tests can take longer on CI
due to thread setup overhead and variable machine performance.

- Add configurable timeout parameter to _run_subprocess_test
- Use 20s timeout for multi-threaded test (4 threads)
- Use 30s timeout for mixed concurrency test (3 threads × 3 tasks)
The previous implementation triggered shutdown from within the tokio task
(exit_context called request_shutdown_background). This created a race
condition where the runtime could start shutting down while the task was
still trying to return its result to Python, causing hangs in multi-threaded
scenarios.

The fix separates the two operations:
1. exit_context() now only returns a flag indicating if this was the last context
2. _trigger_shutdown() is a new function called from Python AFTER the tokio
   task has completed and returned
3. Then _join_pending_shutdown() blocks until the runtime thread terminates

This ensures the tokio task completes successfully before shutdown begins.
- runtime.rs: Add SHUTDOWN_TIMEOUT_MS constant, restrict internal
  functions to pub(crate), simplify docstrings, add section comments
- client.rs: Extract Python wrapper code to AEXIT_WRAPPER_CODE constant,
  simplify __aexit__ method, remove redundant comments
- lib.rs: Reorganize exports with section comments
- test_shutdown_stress.py: Add timeout constants (DEFAULT_TIMEOUT,
  THREADED_TIMEOUT, MIXED_CONCURRENCY_TIMEOUT), simplify embedded scripts

No functional changes - all 24 tests pass.
- Add "Working with key prefixes" subsection under Basic usage
- Move "Automatic runtime cleanup" to its own top-level section
- Add "Lock timeout" and "Lock TTL" subsections under Etcd lock
- Add "Watch with prefix" subsection under Watch
- Condense code quality section
- Fix typo: http::// → http://
achimnol added a commit to lablup/pyo3-async-runtimes that referenced this pull request Jan 9, 2026
This commit introduces graceful shutdown support for the tokio runtime,
addressing PyO3#40.

## Motivation

When Python extensions built with pyo3-async-runtimes are used in
subprocesses or short-lived contexts, tokio tasks may still be running
when the Python interpreter's finalization begins. This causes fatal errors:

  Fatal Python error: PyGILState_Release: thread state...must be current

This implementation enables proper shutdown coordination, as demonstrated
in lablup/etcd-client-py#17, which uses the new APIs to implement
automatic runtime cleanup through reference counting and async-compatible
shutdown sequences.

## Implementation

The tokio runtime now lives in a dedicated thread (inspired by valkey-glide):

- RuntimeWrapper manages the runtime in a dedicated "pyo3-tokio-runtime" thread
- The runtime is accessed via Handle (thread-safe, cloneable)
- Shutdown is signaled through tokio::sync::Notify and blocks until complete
- Runtime slot is cleared after shutdown, allowing re-initialization

## New APIs

tokio module:
- get_handle() -> Handle: Returns cloneable handle (recommended)
- spawn(fut) / spawn_blocking(f): Convenience spawning functions
- request_shutdown(timeout_ms) -> bool: Blocking shutdown
- request_shutdown_background(timeout_ms) -> bool: Non-blocking shutdown
- join_pending_shutdown(py) -> bool: Join pending background shutdown

async-std module (for API consistency):
- spawn(fut) / spawn_blocking(f): Convenience spawning functions
- request_shutdown(timeout_ms) -> bool: Sets flag only (cannot shut down)

## Deprecated APIs

- tokio::get_runtime(): Cannot be gracefully shut down; use get_handle()

## Dependency Changes

- Replace `futures` with `futures-channel` + `futures-util`
- Add `parking_lot` for RwLock
- Add tokio `sync` feature for Notify

## Macro Updates

- tokio_test macro now uses spawn_blocking() instead of get_runtime()
- tokio_main macro uses #[allow(deprecated)] for block_on() usage

Fixes PyO3#40
Updates vendored pyo3-async-runtimes (PyO3/pyo3-async-runtimes#71) with
cleaner commit history:

1. deps: Replace futures with futures-channel/futures-util, add parking_lot
2. feat(tokio): Add RuntimeWrapper with graceful shutdown support
3. feat(async-std): Add spawn/spawn_blocking/request_shutdown for API consistency
4. refactor(macros): Update to use new spawn_blocking API
5. test: Add shutdown tests and update existing tests for deprecated API
achimnol force-pushed the refactor/automate-cleanup-runtime branch from 29c7b4e to f1be35f on January 9, 2026 09:45
achimnol merged commit 7a0896d into main on January 9, 2026 (17 checks passed)
achimnol deleted the refactor/automate-cleanup-runtime branch on January 9, 2026 09:47