
Support cancelling ML async jobs#1144

Open
carlosgjs wants to merge 19 commits into RolnickLab:main from uw-ssec:carlos/asynccancel3

Conversation

@carlosgjs
Collaborator

@carlosgjs carlosgjs commented Feb 20, 2026

Summary

This pull request introduces improvements to job cancellation and cleanup logic, particularly for asynchronous jobs using NATS/Redis. It adds better handling for unknown progress states, ensures cleanup routines are consistently invoked, and improves code clarity by renaming and refactoring functions.

Job cancellation and cleanup improvements:

  • The cancel method in the Job model now calls cleanup_async_job_if_needed to ensure async resources are cleaned up when a job is cancelled.

  • The cleanup function _cleanup_job_if_needed has been renamed to cleanup_async_job_if_needed and its type signature clarified; all references throughout the codebase have been updated to use the new name.

Handling unknown progress states:

  • The JobStateProgress class now includes an unknown flag to indicate when progress cannot be determined, such as missing Redis keys.

  • When progress information is unknown during NATS pipeline result processing, a warning is logged and the task returns early, preventing further processing; the NATS task is retried as needed.

  • The _commit_update method returns a JobStateProgress with unknown=True when Redis keys are missing, making this state distinguishable from other errors.

Resource cleanup enhancements:

  • The cleanup method for async job state now deletes the lock key in addition to other cache keys, ensuring all Redis resources are properly released.
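Roughly, with a plain dict standing in for Redis and a hypothetical key layout:

```python
def cleanup_async_job_state(cache: dict, job_id: int) -> None:
    """Delete every per-job key; the lock key is now removed as well."""
    keys = [
        f"job:{job_id}:total",    # hypothetical key names
        f"job:{job_id}:pending",
        f"job:{job_id}:failed",
        f"job:{job_id}:lock",     # previously left behind after cleanup
    ]
    for key in keys:
        cache.pop(key, None)  # tolerate already-missing keys
```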

Testing

  • Created an async ML job with a 1000 image collection
  • Waited until results started to come in
  • Clicked cancel
  • Verified in the debugger and logs that tasks stop being processed
celeryworker-1  | [2026-02-20 17:22:33,724: WARNING/ForkPoolWorker-1] Progress info is unknown for job 111 when processing results. Job may be cancelled.Or this could be a transient Redis error and the NATS task will be retried.
  • Verified NATS stream/consumers are removed
  • Job status is set to revoked
  • Verified clicking Retry is able to restart the job

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests. FYI, test_cancel_job() is currently just a stub
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved async job cleanup and cancellation flows to reliably handle edge cases.
    • Enhanced handling of unknown job progress states to prevent incomplete updates.
  • Refactor

    • Simplified job cancellation logic for clearer state transitions.
    • Optimized task queue stream management to avoid unnecessary resource creation.

@netlify

netlify bot commented Feb 20, 2026

👷 Deploy request for antenna-ssec pending review.

Visit the deploys page to approve it

🔨 Latest commit: b43db16

@netlify

netlify bot commented Feb 20, 2026

Deploy Preview for antenna-preview canceled.

🔨 Latest commit: b43db16
🔍 Latest deploy log: https://app.netlify.com/projects/antenna-preview/deploys/699deac9576dbc000857980a

@coderabbitai
Contributor

coderabbitai bot commented Feb 20, 2026

📝 Walkthrough

Walkthrough

This pull request refactors async job cleanup handling by making the cleanup function public with proper type annotations, enhancing progress tracking with an unknown state indicator, and improving stream existence checks in the NATS queue manager to prevent unnecessary stream creation during task reservation.

Changes

Cohort / File(s) Summary
Job Cleanup Refactoring
ami/jobs/models.py, ami/jobs/tasks.py
Rename _cleanup_job_if_needed to cleanup_async_job_if_needed (making it public), add Job type annotation, update all call sites, and integrate cleanup call in cancel() flow before task revocation. Simplify cancel flow by removing conditional branch for missing task_id.
Async Job State Progress
ami/ml/orchestration/async_job_state.py
Add unknown: bool field to JobStateProgress dataclass to distinguish when progress cannot be determined. Update _commit_update to return JobStateProgress with unknown=True when Redis keys are missing. Extend cleanup to delete per-job Redis lock key.
NATS Queue Stream Management
ami/ml/orchestration/nats_queue.py
Introduce _stream_exists() helper to check stream existence without side effects. Refactor _ensure_stream() to delegate existence checking. Modify reserve_tasks() to log debug message and return empty list if stream does not exist, avoiding eager stream creation.
Test Updates
ami/ml/tests.py
Rename test class from TestTaskStateManager to TestAsyncJobStateManager. Update cleanup test assertion to accept either None progress or progress with unknown state.

Sequence Diagram

sequenceDiagram
    participant Client
    participant JobModel as Job Model
    participant TaskQueue as TaskQueue Manager
    participant AsyncState as AsyncJobState
    participant Redis

    Client->>JobModel: cancel()
    JobModel->>JobModel: set status to CANCELING
    JobModel->>TaskQueue: cleanup_async_job_if_needed(self)
    
    TaskQueue->>Redis: check task exists
    alt Task Found
        TaskQueue->>Redis: revoke task
    else No Task
        TaskQueue->>Redis: skip revocation
    end
    
    TaskQueue->>AsyncState: cleanup progress state
    AsyncState->>Redis: delete stage keys
    AsyncState->>Redis: delete lock key
    
    JobModel->>JobModel: set status to REVOKED
    JobModel->>JobModel: save
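In Python, the cancel flow in the diagram might be sketched like this (a simplification: the helpers are stand-ins passed as callables, and only the states touched by cancellation appear):

```python
class JobState:
    CANCELING = "CANCELING"
    REVOKED = "REVOKED"


def cancel(job, revoke_task, cleanup_async_job_if_needed) -> None:
    job.status = JobState.CANCELING
    job.save()
    # Clean up NATS stream/consumers and Redis keys for async jobs;
    # a no-op for sync jobs.
    cleanup_async_job_if_needed(job)
    if job.task_id:
        revoke_task(job.task_id)
    # For async jobs the Celery task already finished (it only queued the
    # images), so the final status is set here rather than by the task.
    job.status = JobState.REVOKED
    job.save()
```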

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

backend, ml

Suggested reviewers

  • mihow

Poem

A cleanup most tidy, the rabbit did cheer,
With streams that check first, no creation in fear,
Progress unknown? We handle with grace,
Locks all deleted, jobs in their place. 🐰✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Support cancelling ML async jobs' directly and specifically describes the primary feature added in the changeset: enabling job cancellation for async ML jobs via NATS/Redis.
Description check ✅ Passed The PR description covers all major required sections: a clear summary, detailed list of changes across multiple files, related issue reference, testing methodology with evidence, and a comprehensive checklist addressing all template requirements.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.


@carlosgjs carlosgjs changed the title from Carlos/asynccancel3 to Support cancelling ML async jobs Feb 20, 2026
@carlosgjs carlosgjs requested a review from mihow February 20, 2026 22:18
carlosgjs and others added 11 commits February 20, 2026 14:41
* fix: prevent NATS connection flooding and stale job task fetching

- Add connect_timeout=5, allow_reconnect=False to NATS connections to
  prevent leaked reconnection loops from blocking Django's event loop
- Guard /tasks endpoint against terminal-status jobs (return empty tasks
  instead of attempting NATS reserve)
- IncompleteJobFilter now excludes jobs by top-level status in addition
  to progress JSON stages
- Add stale worker cleanup to integration test script

Found during PSv2 integration testing where stale ADC workers with
default DataLoader parallelism overwhelmed the single uvicorn worker
thread by flooding /tasks with concurrent NATS reserve requests.

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: PSv2 integration test session notes and NATS flooding findings

Session notes from 2026-02-16 integration test including root cause
analysis of stale worker task competition and NATS connection issues.
Findings doc tracks applied fixes and remaining TODOs with priorities.

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: update session notes with successful test run #3

PSv2 integration test passed end-to-end (job 1380, 20/20 images).
Identified ack_wait=300s as cause of ~5min idle time when GPU
processes race for NATS tasks.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: batch NATS task fetch to prevent HTTP timeouts

Replace N×1 reserve_task() calls with single reserve_tasks() batch
fetch. The previous implementation created a new pull subscription per
message (320 NATS round trips for batch=64), causing the /tasks endpoint
to exceed HTTP client timeouts. The new approach uses one psub.fetch()
call for the entire batch.

Co-Authored-By: Claude <noreply@anthropic.com>
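A sketch of the batched approach this commit describes, using the nats-py pull_subscribe/fetch pattern with stand-in names:

```python
import asyncio


async def reserve_tasks_batch(js, subject: str, durable: str, batch: int, timeout: float) -> list:
    """One pull subscription and one fetch for the whole batch, replacing
    the per-message subscribe/fetch loop."""
    psub = await js.pull_subscribe(subject, durable=durable)
    try:
        return await psub.fetch(batch, timeout=timeout)
    except asyncio.TimeoutError:
        # An empty queue within the timeout is not an error.
        return []
```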

* docs: add next session prompt

* feat: add pipeline__slug__in filter for multi-pipeline job queries

Workers that handle multiple pipelines can now fetch jobs for all of them
in a single request: ?pipeline__slug__in=slug1,slug2

Co-Authored-By: Claude <noreply@anthropic.com>

* chore: remove local-only docs and scripts from branch

These files are session notes, planning docs, and test scripts that
should stay local rather than be part of the PR.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat: set job dispatch_mode at creation time based on project feature flags

ML jobs with a pipeline now get dispatch_mode set during setup() instead
of waiting until run() is called by the Celery worker. This lets the UI
show the correct mode immediately after job creation.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: add timeouts to all JetStream operations and restore reconnect policy

Add NATS_JETSTREAM_TIMEOUT (10s) to all JetStream metadata operations
via asyncio.wait_for() so a hung NATS connection fails fast instead of
blocking the caller's thread indefinitely. Also restore the intended
reconnect policy (2 attempts, 1s wait) that was lost in a prior force push.

Co-Authored-By: Claude <noreply@anthropic.com>
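The timeout wrapper described here might look roughly like the following (the constant name comes from the commit message; the wrapper itself is illustrative):

```python
import asyncio

NATS_JETSTREAM_TIMEOUT = 10.0  # seconds, per the commit message


async def with_timeout(coro, timeout: float = NATS_JETSTREAM_TIMEOUT):
    """Bound a JetStream metadata call so a hung NATS connection fails
    fast instead of blocking the caller's thread indefinitely."""
    return await asyncio.wait_for(coro, timeout=timeout)
```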

* fix: propagate NATS timeouts as 503 instead of swallowing them

asyncio.TimeoutError from _ensure_stream() and _ensure_consumer() was
caught by the broad `except Exception` in reserve_tasks(), silently
returning [] and making NATS outages indistinguishable from empty queues.
Workers would then poll immediately, recreating the flooding problem.

- Add explicit `except asyncio.TimeoutError: raise` in reserve_tasks()
- Catch TimeoutError and OSError in the /tasks view, return 503
- Restore allow_reconnect=False (fail-fast on connection issues)
- Add return type annotation to get_connection()

Co-Authored-By: Claude <noreply@anthropic.com>
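A minimal sketch of the view-side mapping, with a plain callable standing in for the DRF view and the reserve call:

```python
import asyncio


def tasks_endpoint(reserve) -> tuple:
    """Map queue-infrastructure failures to 503 so workers back off,
    rather than returning [] that looks like an empty queue."""
    try:
        return 200, reserve()
    except (asyncio.TimeoutError, OSError) as exc:
        # A later commit also catches nats.errors.Error here.
        return 503, {"detail": f"task queue unavailable: {exc}"}
```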

* fix: address review comments (log level, fetch timeout, docstring)

- Downgrade reserve_tasks log to DEBUG when zero tasks reserved (avoid
  log spam from frequent polling)
- Pass timeout=0.5 from /tasks endpoint to avoid blocking the worker
  for 5s on empty queues
- Fix docstring examples using string 'job123' for int-typed job_id

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: catch nats.errors.Error in /tasks endpoint for proper 503 responses

NoServersError, ConnectionClosedError, and other NATS exceptions inherit
from nats.errors.Error (not OSError), so they escaped the handler and
returned 500 instead of 503.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
…olnickLab#1142)

* feat: configurable NATS tuning and gunicorn worker management

Rebase onto main after RolnickLab#1135 merge. Keep only the additions unique to
this branch:

- Make TASK_TTR configurable via NATS_TASK_TTR Django setting (default 30s)
- Make max_ack_pending configurable via NATS_MAX_ACK_PENDING setting (default 100)
- Local dev: switch to gunicorn+UvicornWorker by default for production
  parity, with USE_UVICORN=1 escape hatch for raw uvicorn
- Production: auto-detect WEB_CONCURRENCY from CPU cores (capped at 8)
  when not explicitly set in the environment

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: address PR review comments

- Fix max_ack_pending falsy-zero guard (use `is not None` instead of `or`)
- Update TaskQueueManager docstring with Args section
- Simplify production WEB_CONCURRENCY fallback (just use nproc)

Co-Authored-By: Claude <noreply@anthropic.com>
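The falsy-zero pitfall the first bullet fixes, as a small illustration (the function name is hypothetical):

```python
def resolve_max_ack_pending(configured, default: int = 100) -> int:
    # `configured or default` would silently replace an explicit 0 with
    # the default; `is not None` keeps falsy-but-valid values.
    return configured if configured is not None else default
```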

---------

Co-authored-by: Michael Bunsen <notbot@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
* fix: include pipeline_slug in MinimalJobSerializer (ids_only response)

The ADC worker fetches jobs with ids_only=1 and expects pipeline_slug in
the response to know which pipeline to run. Without it, Pydantic
validation fails and the worker skips the job.

Co-Authored-By: Claude <noreply@anthropic.com>

* Update ami/jobs/serializers.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@carlosgjs carlosgjs marked this pull request as ready for review February 24, 2026 18:15
Copilot AI review requested due to automatic review settings February 24, 2026 18:15
Contributor

Copilot AI left a comment


Pull request overview

This PR enhances the cancellation mechanism for asynchronous ML jobs that use NATS/Redis for distributed processing. It improves resource cleanup, adds better handling for edge cases when progress tracking state is unavailable, and refactors the cleanup function for better clarity.

Changes:

  • Introduced an unknown flag in JobStateProgress to distinguish missing Redis keys from lock contention
  • Enhanced job cancellation to call cleanup for async resources and properly handle status transitions
  • Refactored _cleanup_job_if_needed to cleanup_async_job_if_needed with explicit type signature

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
ami/ml/tests.py Renamed test class to TestAsyncJobStateManager and updated cleanup test assertion to handle unknown progress state
ami/ml/orchestration/nats_queue.py Refactored stream existence checking into separate _stream_exists method and added early return in reserve_tasks when stream doesn't exist
ami/ml/orchestration/async_job_state.py Added unknown field to JobStateProgress, updated _commit_update to return unknown state for missing keys, added lock key cleanup
ami/jobs/tasks.py Added handling for unknown progress state with early return, renamed cleanup function and added type hint
ami/jobs/models.py Updated cancel method to call cleanup before revoking task and unconditionally set status to REVOKED


Comment on lines +979 to +982
# For sync jobs the task revoke will update the job status. However, for async jobs we need to set the status
# to revoked here since the task already finished (it only queues the images)
self.status = JobState.REVOKED
self.save()

Copilot AI Feb 24, 2026


The unconditional status change to REVOKED on lines 981-982 could create a race condition with the task_postrun signal handler. For sync jobs with task_id, task.revoke(terminate=True) triggers the task_postrun signal which calls update_job_status and sets status to REVOKED. However, if the signal handler runs between line 978 and line 981, this code would still overwrite any status the signal handler set. Consider checking the dispatch mode or current status before unconditionally setting to REVOKED, or removing the duplicate status update for jobs with task_id.

Suggested change
# For sync jobs the task revoke will update the job status. However, for async jobs we need to set the status
# to revoked here since the task already finished (it only queues the images)
self.status = JobState.REVOKED
self.save()
# For sync jobs the task revoke will update the job status via the task_postrun signal.
# However, for async jobs we need to set the status to revoked here since the task already
# finished (it only queues the images).
if not self.task_id or self.dispatch_mode == JobDispatchMode.ASYNC_API:
    self.status = JobState.REVOKED
    self.save()


class TestTaskStateManager(TestCase):
class TestAsyncJobStateManager(TestCase):
"""Test TaskStateManager for job progress tracking."""

Copilot AI Feb 24, 2026


The docstring still refers to "TaskStateManager" but the class has been renamed to "TestAsyncJobStateManager". Update the docstring to match the new class name for consistency.

Suggested change
"""Test TaskStateManager for job progress tracking."""
"""Test AsyncJobStateManager for job progress tracking."""

if progress_info.unknown:
logger.warning(
f"Progress info is unknown for job {job_id} when processing results. Job may be cancelled."
f"Or this could be a transient Redis error and the NATS task will be retried."

Copilot AI Feb 24, 2026


Missing space between sentences in the log message. The second f-string should start with a space: " Or this could be a transient Redis error..." to properly separate it from the previous sentence.

Suggested change
f"Or this could be a transient Redis error and the NATS task will be retried."
f" Or this could be a transient Redis error and the NATS task will be retried."

Comment on lines 169 to 172

# update complete state based on latest progress info after saving results
complete_state = JobState.SUCCESS
if progress_info.total > 0 and (progress_info.failed / progress_info.total) > FAILURE_THRESHOLD:

Copilot AI Feb 24, 2026


After the second update_state call for the "results" stage, there is no check for progress_info.unknown like there is for the "process" stage (lines 97-102). If the job is cancelled between the two update_state calls, progress_info could have unknown=True, and line 172 would attempt to access progress_info.total on a JobStateProgress with default values (total=0). While this might not crash, it could lead to incorrect behavior. Consider adding a similar check for progress_info.unknown after line 168.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
ami/ml/orchestration/nats_queue.py (1)

118-122: Consider downgrading the log level to info.

A stream not existing before creation is expected on the publish path (the first task for every new job). logger.warning suggests something abnormal, but this is normal first-publish behavior. logger.info would be more appropriate.

Proposed fix
-            logger.warning(f"Stream {stream_name} does not exist")
+            logger.info(f"Stream {stream_name} does not exist, creating it")
ami/ml/orchestration/async_job_state.py (1)

159-174: Return type annotation is now slightly inaccurate.

With the change at lines 173–174, _commit_update can no longer return None — it always returns a JobStateProgress. The signature -> JobStateProgress | None is stale. This is cosmetic, but keeping annotations accurate helps tooling and future readers.

Proposed fix
     def _commit_update(
         self,
         processed_image_ids: set[str],
         stage: str,
         failed_image_ids: set[str] | None = None,
-    ) -> JobStateProgress | None:
+    ) -> JobStateProgress:
ami/jobs/models.py (1)

967-982: Cancel flow looks correct for both sync and async jobs.

The ordering (CANCELING → cleanup → revoke → REVOKED) is sound. For async jobs, the Celery task is already done so cleanup runs first; for non-async jobs, cleanup_async_job_if_needed is a no-op.

One concern: if cleanup_async_job_if_needed raises an unexpected exception, the job stays stuck in CANCELING and never reaches REVOKED. Consider wrapping the cleanup call in a try/except to ensure the status always transitions to REVOKED.

Proposed defensive wrapper
         self.status = JobState.CANCELING
         self.save()
 
-        cleanup_async_job_if_needed(self)
+        try:
+            cleanup_async_job_if_needed(self)
+        except Exception:
+            logger.exception(f"Failed to clean up async resources for job {self.pk}, proceeding with cancellation")
         if self.task_id:
             task = run_job.AsyncResult(self.task_id)
             if task:
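As a standalone illustration (with stand-in names, not the actual Django model), the wrapper guarantees the terminal transition even when cleanup raises:

```python
import logging

logger = logging.getLogger(__name__)


class Job:
    """Minimal stand-in for the real model; names are illustrative only."""

    def __init__(self):
        self.status = "PENDING"

    def cancel(self, cleanup):
        self.status = "CANCELING"
        try:
            cleanup(self)  # may touch Redis/NATS and fail
        except Exception:
            logger.exception("Cleanup failed, proceeding with cancellation")
        # Reached even when cleanup raised, so the job never sticks in CANCELING.
        self.status = "REVOKED"


def broken_cleanup(job):
    raise RuntimeError("Redis unavailable")


job = Job()
job.cancel(broken_cleanup)
assert job.status == "REVOKED"
```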
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/jobs/models.py` around lines 967 - 982, The call to
cleanup_async_job_if_needed(self) in cancel() can raise and leave the job stuck
in CANCELING; wrap that call in a try/except Exception block so any exception is
caught and logged (use the module logger or a suitable logging mechanism, e.g.,
logger.exception) and then continue with the revoke and final status update;
ensure the code always sets self.status = JobState.REVOKED and calls self.save()
in a finally-like path so REVOKED is persisted even if cleanup fails, while
keeping the existing revoke logic using run_job.AsyncResult(self.task_id)
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ami/jobs/tasks.py`:
- Around line 97-102: The logger.warning message inside the
progress_info.unknown branch concatenates two adjacent f-strings and drops the
space between sentences; update the warning in the block that checks
progress_info.unknown (where job_id is referenced) so the two sentences are
separated (e.g., add a trailing space to the first f-string, a leading space on
the second, or merge into one f-string) ensuring the logged text reads
"...cancelled. Or this could..." when calling logger.warning.

In `@ami/ml/orchestration/nats_queue.py`:
- Around line 98-111: The _stream_exists method currently catches all exceptions
and treats them as "stream missing", which masks real NATS errors; update the
error handling in _stream_exists (which calls self.js.stream_info and uses
_get_stream_name and NATS_JETSTREAM_TIMEOUT) to specifically catch the
NotFoundError from the nats client (nats.js.errors.NotFoundError) and return
False only in that case, keep the existing asyncio.TimeoutError re-raise, and
let any other exceptions propagate (i.e., remove the broad except Exception and
replace it with except NotFoundError: return False).

---

Nitpick comments:
In `@ami/jobs/models.py`:
- Around line 967-982: The call to cleanup_async_job_if_needed(self) in cancel()
can raise and leave the job stuck in CANCELING; wrap that call in a try/except
Exception block so any exception is caught and logged (use the module logger or
a suitable logging mechanism, e.g., logger.exception) and then continue with the
revoke and final status update; ensure the code always sets self.status =
JobState.REVOKED and calls self.save() in a finally-like path so REVOKED is
persisted even if cleanup fails, while keeping the existing revoke logic using
run_job.AsyncResult(self.task_id) unchanged.

In `@ami/ml/orchestration/async_job_state.py`:
- Around line 159-174: The return type annotation of _commit_update is stale: it
currently declares "-> JobStateProgress | None" but the implementation always
returns a JobStateProgress (e.g., returns JobStateProgress(unknown=True) when
pending_images or total_images is None). Update the function signature of
_commit_update to "-> JobStateProgress" and ensure any associated docstring or
type hints that reference a possible None are updated accordingly; locate the
method named _commit_update and the JobStateProgress type to make this change.

In `@ami/ml/orchestration/nats_queue.py`:
- Around line 118-122: The log in the publish path uses logger.warning when a
stream is absent, but that is expected behavior; change the call in the branch
where _stream_exists(job_id) returns False to use logger.info instead of
logger.warning (the block that computes stream_name via _get_stream_name(job_id)
and subject via _get_subject(job_id)). Keep the same message text and context,
only lowering the level to info so first-publish behavior is not signaled as a
warning.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e06dd22 and b43db16.

📒 Files selected for processing (5)
  • ami/jobs/models.py
  • ami/jobs/tasks.py
  • ami/ml/orchestration/async_job_state.py
  • ami/ml/orchestration/nats_queue.py
  • ami/ml/tests.py

Comment on lines +97 to +102
    if progress_info.unknown:
        logger.warning(
            f"Progress info is unknown for job {job_id} when processing results. Job may be cancelled."
            f"Or this could be a transient Redis error and the NATS task will be retried."
        )
        return

⚠️ Potential issue | 🟡 Minor

Missing space between implicitly concatenated f-strings.

Lines 99–100 use adjacent f-strings that Python concatenates without any separator. The result is "...cancelled.Or this could..." — missing a space after the period.

Proposed fix
     if progress_info.unknown:
         logger.warning(
             f"Progress info is unknown for job {job_id} when processing results. Job may be cancelled."
-            f"Or this could be a transient Redis error and the NATS task will be retried."
+            f" Or this could be a transient Redis error and the NATS task will be retried."
         )
         return
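The pitfall is easy to reproduce in isolation: adjacent string literals (f-strings included) are joined by the parser with no separator inserted between them.

```python
# Adjacent string literals are concatenated by the parser as-is:
msg = (
    "Job may be cancelled."
    "Or this could be a transient Redis error."
)
assert msg == "Job may be cancelled.Or this could be a transient Redis error."

# Adding the space back to either literal fixes the joined text:
fixed = (
    "Job may be cancelled."
    " Or this could be a transient Redis error."
)
assert "cancelled. Or" in fixed
```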
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/jobs/tasks.py` around lines 97 - 102, The logger.warning message inside
the progress_info.unknown branch concatenates two adjacent f-strings and drops
the space between sentences; update the warning in the block that checks
progress_info.unknown (where job_id is referenced) so the two sentences are
separated (e.g., add a trailing space to the first f-string, a leading space on
the second, or merge into one f-string) ensuring the logged text reads
"...cancelled. Or this could..." when calling logger.warning.

Comment on lines +98 to +111
    async def _stream_exists(self, job_id: int) -> bool:
        """Check if stream exists for the given job."""
        if self.js is None:
            raise RuntimeError("Connection is not open. Use TaskQueueManager as an async context manager.")

        stream_name = self._get_stream_name(job_id)
        subject = self._get_subject(job_id)

        try:
            await asyncio.wait_for(self.js.stream_info(stream_name), timeout=NATS_JETSTREAM_TIMEOUT)
            logger.debug(f"Stream {stream_name} already exists")
            return True
        except asyncio.TimeoutError:
            raise  # NATS unreachable — let caller handle it rather than creating a stream blindly
        except Exception as e:
            logger.warning(f"Stream {stream_name} does not exist: {e}")
            return False

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check what exception nats.py raises for missing streams
cd /tmp && pip install nats-py >/dev/null 2>&1

# Try to find the exception types in nats-py
python3 << 'EOF'
try:
    # Try different import paths
    try:
        from nats.js.errors import NotFoundError
        print("Found: nats.js.errors.NotFoundError")
    except ImportError:
        pass
    
    try:
        from nats.errors import NotFoundError
        print("Found: nats.errors.NotFoundError")
    except ImportError:
        pass
    
    # List all exceptions in nats module
    import nats
    import inspect
    
    # Check nats.js module
    try:
        import nats.js
        print("\n=== nats.js exceptions ===")
        for name in dir(nats.js):
            obj = getattr(nats.js, name)
            if inspect.isclass(obj) and issubclass(obj, BaseException):
                print(f"  {name}")
    except Exception as e:
        print(f"Error inspecting nats.js: {e}")
    
    # Check for errors submodule
    try:
        import nats.js.errors as errors
        print("\n=== nats.js.errors exceptions ===")
        for name in dir(errors):
            if not name.startswith('_'):
                obj = getattr(errors, name)
                if inspect.isclass(obj) and issubclass(obj, BaseException):
                    print(f"  {name}")
    except Exception as e:
        print(f"Error inspecting nats.js.errors: {e}")
        
except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()
EOF

Repository: RolnickLab/antenna

Length of output: 712


🏁 Script executed:

# Also check the actual file to see the context
cat -n ami/ml/orchestration/nats_queue.py | sed -n '95,135p'

Repository: RolnickLab/antenna

Length of output: 2088


🏁 Script executed:

# Check for other usages of stream_info or exception handling in the file
rg "stream_info|except" ami/ml/orchestration/nats_queue.py -n

Repository: RolnickLab/antenna

Length of output: 665


Narrow exception handling to avoid masking transient NATS errors.

The nats.js library provides nats.js.errors.NotFoundError for missing streams. The current broad except Exception catch treats all non-timeout errors as "stream doesn't exist", which could mask transient NATS failures like ServiceUnavailableError or ServerError. This could cause reserve_tasks to silently return an empty list or _ensure_stream to unnecessarily retry stream creation when the real issue is a temporary NATS connectivity problem.

Catch nats.js.errors.NotFoundError specifically and let other exceptions propagate to surface real NATS issues.
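A runnable sketch of the narrowed handler (using a stand-in `NotFoundError` class and a hypothetical `fetch_stream_info` callable so it runs without a NATS server; the real code would import `nats.js.errors.NotFoundError`):

```python
import asyncio


class NotFoundError(Exception):
    """Stand-in for nats.js.errors.NotFoundError so the sketch runs offline."""


async def stream_exists(fetch_stream_info, stream_name: str, timeout: float = 2.0) -> bool:
    # Only a genuinely missing stream maps to False; other errors propagate.
    try:
        await asyncio.wait_for(fetch_stream_info(stream_name), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        raise  # NATS unreachable: surface to the caller
    except NotFoundError:
        return False


async def main():
    async def missing(name):
        raise NotFoundError(name)

    async def flaky(name):
        raise ConnectionError("transient NATS failure")

    assert await stream_exists(missing, "job-1") is False
    try:
        await stream_exists(flaky, "job-1")
        raise AssertionError("ConnectionError should have propagated")
    except ConnectionError:
        pass  # transient errors are no longer swallowed as "stream missing"


asyncio.run(main())
```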

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 101-101: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 107-107: Consider moving this statement to an else block

(TRY300)


[warning] 110-110: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/ml/orchestration/nats_queue.py` around lines 98 - 111, The _stream_exists
method currently catches all exceptions and treats them as "stream missing",
which masks real NATS errors; update the error handling in _stream_exists (which
calls self.js.stream_info and uses _get_stream_name and NATS_JETSTREAM_TIMEOUT)
to specifically catch the NotFoundError from the nats client
(nats.js.errors.NotFoundError) and return False only in that case, keep the
existing asyncio.TimeoutError re-raise, and let any other exceptions propagate
(i.e., remove the broad except Exception and replace it with except
NotFoundError: return False).
