PSv2: Use connection pooling and retries for NATS #1130
Draft
carlosgjs wants to merge 21 commits into RolnickLab:main from uw-ssec:carlosg/natsconn
+999 −285
Changes from 19 commits
b60eab0
merge
carlos-irreverentlabs 644927f
Merge remote-tracking branch 'upstream/main'
carlosgjs 218f7aa
Merge remote-tracking branch 'upstream/main'
carlosgjs 90da389
Merge remote-tracking branch 'upstream/main'
carlosgjs 842e9b3
PSv2: Use connection pooling and retries for NATS
carlosgjs 227a8db
Refactor and fix nats tests
carlosgjs 2acf620
Tighten formatting
carlosgjs 0632ce0
format
carlosgjs c5f8106
CR feedback
carlosgjs 8805dbe
Apply suggestions from code review
carlosgjs 8618d3c
Merge remote-tracking branch 'upstream/main'
carlosgjs c384199
refactor: simplify NATS connection handling — keep retry decorator, d…
mihow 98a17f1
Merge branch 'main' into carlosg/natsconn
carlosgjs cf42506
revert: restore NATS connection pool — avoid per-operation connection…
mihow dc798ea
refactor: add switchable NATS connection strategies
mihow 4d66c07
refactor: simplify NATS connection module — pool-only, archive original
mihow 41bbeb3
docs: clarify where connection pool provides reuse vs. single-use
mihow ead53d1
fix: use `from None` to suppress noisy exception chain in _get_pool
mihow 9737301
docs: update AGENTS.md test commands to use docker-compose.ci.yml
mihow fa0f84b
fix: correct mock setup in NATS task tests to match plain instantiation
mihow c7b2014
fix: address PR review feedback for NATS connection module
mihow
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,227 @@ | ||
| """ | ||
| NATS connection management for both Celery workers and Django processes. | ||
|
|
||
| Provides a ConnectionPool keyed by event loop. The pool reuses a single NATS | ||
| connection for all async operations *within* one async_to_sync() boundary. | ||
| It does NOT provide reuse across separate async_to_sync() calls — each call | ||
| creates a new event loop, so a new connection is established. | ||
|
|
||
| Where the pool helps: | ||
| The main beneficiary is queue_images_to_nats() in jobs.py, which wraps | ||
| 1000+ publish_task() awaits in a single async_to_sync() call. All of those | ||
| awaits share one event loop and therefore one NATS connection. Without the | ||
| pool, each publish would open its own TCP connection (~1500 per job). | ||
| Similarly, JobViewSet.tasks() batches multiple reserve_task() calls in one | ||
| async_to_sync() boundary. | ||
|
|
||
| Where it doesn't help: | ||
| Single-operation boundaries like _ack_task_via_nats() (one ACK per call) | ||
| get no reuse — the pool is effectively single-use there. The overhead is | ||
| negligible (one dict lookup), and the retry_on_connection_error decorator | ||
| provides resilience regardless. | ||
|
|
||
| Why keyed by event loop: | ||
| asyncio.Lock and nats.Client are bound to the loop they were created on. | ||
| Sharing them across loops causes "attached to a different loop" errors. | ||
| Keying by loop ensures isolation. WeakKeyDictionary auto-cleans when loops | ||
| are garbage collected, so short-lived loops don't leak. | ||
|
|
||
| Archived alternative: | ||
| ContextManagerConnection preserves the original pre-pool implementation | ||
| (one connection per `async with` block) as a drop-in fallback. | ||
| """ | ||
|
|
||
| import asyncio | ||
| import logging | ||
| import threading | ||
| from typing import TYPE_CHECKING | ||
| from weakref import WeakKeyDictionary | ||
|
|
||
| import nats | ||
| from django.conf import settings | ||
| from nats.js import JetStreamContext | ||
|
|
||
| if TYPE_CHECKING: | ||
| from nats.aio.client import Client as NATSClient | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class ConnectionPool: | ||
| """ | ||
| Manages a single persistent NATS connection per event loop. | ||
|
|
||
| This is safe because: | ||
| - asyncio.Lock and NATS Client are bound to the event loop they were created on | ||
| - Each event loop gets its own isolated connection and lock | ||
| - Works correctly with async_to_sync(), which runs each call on its own event loop | ||
| - Prevents "attached to a different loop" errors in Celery tasks and Django views | ||
|
|
||
| Instantiating TaskQueueManager() is cheap — multiple instances share the same | ||
| underlying connection via this pool. | ||
| """ | ||
|
|
||
| def __init__(self): | ||
| self._nc: "NATSClient | None" = None | ||
| self._js: JetStreamContext | None = None | ||
| self._lock: asyncio.Lock | None = None # Lazy-initialized when needed | ||
|
|
||
| def _ensure_lock(self) -> asyncio.Lock: | ||
| """Lazily create lock bound to current event loop.""" | ||
| if self._lock is None: | ||
| self._lock = asyncio.Lock() | ||
| return self._lock | ||
|
|
||
| async def get_connection(self) -> tuple["NATSClient", JetStreamContext]: | ||
| """ | ||
| Get or create the event loop's NATS connection. Checks connection health | ||
| and recreates if stale. | ||
|
|
||
| Returns: | ||
| Tuple of (NATS connection, JetStream context) | ||
| Raises: | ||
| RuntimeError: If connection cannot be established | ||
| """ | ||
| # Fast path (no lock needed): connection exists, is open, and is connected. | ||
| # This is the hot path — most calls hit this and return immediately. | ||
| if self._nc is not None and self._js is not None and not self._nc.is_closed and self._nc.is_connected: | ||
|
||
| return self._nc, self._js | ||
|
|
||
| # Connection is stale or doesn't exist — clear references before reconnecting | ||
| if self._nc is not None: | ||
| logger.warning("NATS connection is closed or disconnected, will reconnect") | ||
| self._nc = None | ||
| self._js = None | ||
|
|
||
| # Slow path: acquire lock to prevent concurrent reconnection attempts | ||
| lock = self._ensure_lock() | ||
| async with lock: | ||
| # Double-check after acquiring lock (another coroutine may have reconnected) | ||
| if self._nc is not None and self._js is not None and not self._nc.is_closed and self._nc.is_connected: | ||
| return self._nc, self._js | ||
|
|
||
| nats_url = settings.NATS_URL | ||
| try: | ||
| logger.info(f"Creating NATS connection to {nats_url}") | ||
| self._nc = await nats.connect(nats_url) | ||
| self._js = self._nc.jetstream() | ||
| logger.info(f"Successfully connected to NATS at {nats_url}") | ||
| return self._nc, self._js | ||
| except Exception as e: | ||
| logger.error(f"Failed to connect to NATS: {e}") | ||
| raise RuntimeError(f"Could not establish NATS connection: {e}") from e | ||
|
|
||
| async def close(self): | ||
| """Close the NATS connection if it exists.""" | ||
| if self._nc is not None and not self._nc.is_closed: | ||
| logger.info("Closing NATS connection") | ||
| await self._nc.close() | ||
| self._nc = None | ||
| self._js = None | ||
|
|
||
| async def reset(self): | ||
| """ | ||
| Close the current connection and clear all state so the next call to | ||
| get_connection() creates a fresh one. | ||
|
|
||
| Called by retry_on_connection_error when an operation hits a connection | ||
| error (e.g. network blip, NATS restart). The lock is also cleared so it | ||
| gets recreated bound to the current event loop. | ||
| """ | ||
| logger.warning("Resetting NATS connection pool due to connection error") | ||
| if self._nc is not None: | ||
| try: | ||
| if not self._nc.is_closed: | ||
| await self._nc.close() | ||
| logger.debug("Successfully closed existing NATS connection during reset") | ||
| except Exception as e: | ||
| # Swallow errors - connection may already be broken | ||
| logger.debug(f"Error closing connection during reset (expected): {e}") | ||
| self._nc = None | ||
| self._js = None | ||
| self._lock = None # Clear lock so new one is created for fresh connection | ||
|
||
|
|
||
|
|
||
| class ContextManagerConnection: | ||
| """ | ||
| Archived pre-pool implementation: one NATS connection per `async with` block. | ||
|
|
||
| This was the original approach before the connection pool was added. It creates | ||
| a fresh connection on get_connection() and expects the caller to close it when | ||
| done. There is no connection reuse and no retry logic at this layer. | ||
|
|
||
| Trade-offs vs ConnectionPool: | ||
| - Simpler: no shared state, no locking, no event-loop keying | ||
| - Expensive: ~1500 TCP connections per 1000-image job vs 1 with the pool | ||
| - No automatic reconnection — caller must handle connection failures | ||
|
|
||
| Kept as a drop-in fallback. To switch, change the class used in | ||
| _get_pool() below from ConnectionPool to ContextManagerConnection. | ||
| """ | ||
|
|
||
| async def get_connection(self) -> tuple["NATSClient", JetStreamContext]: | ||
| """Create a fresh NATS connection.""" | ||
| nats_url = settings.NATS_URL | ||
| try: | ||
| logger.debug(f"Creating per-operation NATS connection to {nats_url}") | ||
| nc = await nats.connect(nats_url) | ||
| js = nc.jetstream() | ||
| return nc, js | ||
| except Exception as e: | ||
| logger.error(f"Failed to connect to NATS: {e}") | ||
| raise RuntimeError(f"Could not establish NATS connection: {e}") from e | ||
|
|
||
| async def close(self): | ||
| """No-op — connections are not tracked.""" | ||
| pass | ||
|
|
||
| async def reset(self): | ||
| """No-op — connections are not tracked.""" | ||
| pass | ||
|
|
||
|
|
||
| # Event-loop-keyed pools: one ConnectionPool per event loop. | ||
| # WeakKeyDictionary automatically cleans up when event loops are garbage collected. | ||
| _pools: WeakKeyDictionary[asyncio.AbstractEventLoop, ConnectionPool] = WeakKeyDictionary() | ||
| _pools_lock = threading.Lock() | ||
|
|
||
|
|
||
| def _get_pool() -> ConnectionPool: | ||
| """Get or create the ConnectionPool for the current event loop.""" | ||
| try: | ||
| loop = asyncio.get_running_loop() | ||
| except RuntimeError: | ||
| raise RuntimeError( | ||
| "get_connection() must be called from an async context with a running event loop. " | ||
| "If calling from sync code, use async_to_sync() to wrap the async function." | ||
| ) from None | ||
|
|
||
| with _pools_lock: | ||
| if loop not in _pools: | ||
| _pools[loop] = ConnectionPool() | ||
| logger.debug(f"Created NATS connection pool for event loop {id(loop)}") | ||
| return _pools[loop] | ||
|
|
||
|
|
||
| async def get_connection() -> tuple["NATSClient", JetStreamContext]: | ||
| """ | ||
| Get or create a NATS connection for the current event loop. | ||
|
|
||
| Returns: | ||
| Tuple of (NATS connection, JetStream context) | ||
| Raises: | ||
| RuntimeError: If called outside of an async context (no running event loop), or if the NATS connection cannot be established | ||
| """ | ||
| pool = _get_pool() | ||
| return await pool.get_connection() | ||
|
|
||
|
|
||
| async def reset_connection() -> None: | ||
| """ | ||
| Reset the NATS connection for the current event loop. | ||
|
|
||
| Closes the current connection and clears all state so the next call to | ||
| get_connection() creates a fresh one. | ||
| """ | ||
| pool = _get_pool() | ||
| await pool.reset() | ||
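To make the reuse boundary described in the module docstring concrete, here is a minimal sketch of the same loop-keyed pattern with the NATS client swapped out for a stand-in. `FakePool`, `get_pool`, `publish_many`, and `run_batches` are hypothetical names used only for illustration and are not part of this PR. The sketch also assumes that each `asyncio.run()` call behaves like a separate `async_to_sync()` boundary, in that it runs on a fresh event loop.

```python
import asyncio
import threading
from typing import Optional
from weakref import WeakKeyDictionary


class FakePool:
    """Hypothetical stand-in for ConnectionPool: counts 'connect' calls."""

    def __init__(self) -> None:
        self.conn: Optional[object] = None
        self.connects = 0
        self._lock: Optional[asyncio.Lock] = None

    async def get_connection(self) -> object:
        # Fast path: a connection already exists, no lock needed.
        if self.conn is not None:
            return self.conn
        # Slow path: lazily create the lock inside the running loop.
        if self._lock is None:
            self._lock = asyncio.Lock()
        async with self._lock:
            # Double-check: another coroutine may have connected while we waited.
            if self.conn is None:
                await asyncio.sleep(0)  # stands in for the network round trip
                self.conn = object()
                self.connects += 1
            return self.conn


# One pool per event loop, as in the diff above.
_pools: WeakKeyDictionary = WeakKeyDictionary()
_pools_lock = threading.Lock()


def get_pool() -> FakePool:
    loop = asyncio.get_running_loop()
    with _pools_lock:
        if loop not in _pools:
            _pools[loop] = FakePool()
        return _pools[loop]


async def publish_many(n: int) -> int:
    """Run n concurrent 'publishes' on one loop; return the connect count."""

    async def publish() -> object:
        return await get_pool().get_connection()

    conns = await asyncio.gather(*(publish() for _ in range(n)))
    assert all(c is conns[0] for c in conns)  # every publish shared one connection
    return get_pool().connects


def run_batches() -> list:
    # Two separate asyncio.run() calls means two loops, so two pools.
    return [asyncio.run(publish_many(100)) for _ in range(2)]
```

Under these assumptions, `run_batches()` returns `[1, 1]`: the 100 concurrent calls inside each batch share a single connect, while the second batch connects again because it runs on a new loop. That matches the docstring's claim that reuse happens within one boundary but not across boundaries.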