-
Notifications
You must be signed in to change notification settings - Fork 11
PSv2: Async backend docs and diagram #1137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
carlosgjs
wants to merge
35
commits into
RolnickLab:main
Choose a base branch
from
uw-ssec:carlos/trackcounts
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 33 commits
Commits
Show all changes
35 commits
Select commit
Hold shift + click to select a range
b60eab0
merge
carlos-irreverentlabs 644927f
Merge remote-tracking branch 'upstream/main'
carlosgjs 218f7aa
Merge remote-tracking branch 'upstream/main'
carlosgjs 867201d
Update ML job counts in async case
carlosgjs 90da389
Merge remote-tracking branch 'upstream/main'
carlosgjs cdf57ea
Update date picker version and tweak layout logic (#1105)
annavik 6837ad6
fix: Properly handle async job state with celery tasks (#1114)
carlosgjs f1cd62d
PSv2: Implement queue clean-up upon job completion (#1113)
carlosgjs 74df9ea
fix: PSv2: Workers should not try to fetch tasks from v1 jobs (#1118)
carlosgjs 4a082d3
PSv2 cleanup: use is_complete() and dispatch_mode in job progress han…
mihow 9d560cf
Merge branch 'main' into carlos/trackcounts
carlosgjs e43536b
track captures and failures
carlosgjs 50df5f6
Update tests, CR feedback, log error images
carlosgjs 3287fe2
CR feedback
carlosgjs a87b05a
fix type checking
carlosgjs 89bf950
Merge remote-tracking branch 'origin/main' into carlos/trackcounts
mihow a5ff6f8
refactor: rename _get_progress to _commit_update in TaskStateManager
mihow 337b7fc
fix: unify FAILURE_THRESHOLD and convert TaskProgress to dataclass
mihow 8618d3c
Merge remote-tracking branch 'upstream/main'
carlosgjs 4331dee
Merge branch 'main' into carlos/trackcounts
carlosgjs afee6e7
refactor: rename TaskProgress to JobStateProgress
mihow 65d77cb
docs: update NATS todo and planning docs with session learnings
mihow 8e8cd80
Rename TaskStateManager to AsyncJobStateManager
carlosgjs 34af787
Merge branch 'carlos/trackcounts' of github.com:uw-ssec/antenna into …
carlosgjs afc4472
Track results counts in the job itself vs Redis
carlosgjs b6c3c6a
small simplification
carlosgjs b15024f
Reset counts to 0 on reset
carlosgjs b2e4a72
chore: remove local planning docs from PR branch
mihow a15ebda
docs: clarify three-layer job state architecture in docstrings
mihow 14f6a63
Add async ML backend diagrma
carlosgjs bd1be5f
Merge remote-tracking branch 'upstream/main'
carlosgjs 02cdc8d
Merge branch 'main' into carlos/trackcounts
carlosgjs 1c559ce
cleanup
carlosgjs dcd27c0
Apply suggestions from code review
carlosgjs 0830b78
update path
carlosgjs File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,171 @@ | ||
| # Async ML Backend Architecture | ||
|
|
||
| This document describes how async ML jobs work in the Antenna system, showing the flow of data between Django, Celery workers, NATS JetStream, and external ML processing workers. | ||
|
|
||
| ## System Components | ||
|
|
||
| - **Django**: Web application serving the REST API | ||
| - **Postgres**: Database for persistent storage | ||
| - **Celery Worker**: Background task processor for job orchestration | ||
| - **NATS JetStream**: Distributed task queue with acknowledgment support | ||
| - **ML Worker**: External processing service that runs ML models on images | ||
| - **Redis**: State management for job progress tracking | ||
|
|
||
| ## Async Job Flow | ||
|
|
||
| ```mermaid | ||
| sequenceDiagram | ||
| autonumber | ||
|
|
||
| participant User | ||
| participant Django | ||
| participant Celery | ||
| participant Redis | ||
| participant NATS | ||
| participant Worker as ML Worker | ||
| participant DB as Postgres | ||
|
|
||
| %% Job Creation and Setup | ||
| User->>Django: POST /jobs/ (create job) | ||
| Django->>DB: Create Job instance | ||
| User->>Django: POST /jobs/{id}/run/ | ||
| Django->>Celery: enqueue run_job.delay(job_id) | ||
| Note over Django,Celery: Celery task queued | ||
|
|
||
| %% Job Execution and Image Queuing | ||
| Celery->>Celery: Collect images to process | ||
| Note over Celery: Set dispatch_mode = ASYNC_API | ||
|
|
||
| Celery->>Redis: Initialize job state (image IDs) | ||
|
|
||
| loop For each image | ||
| Celery->>NATS: publish_task(job_id, PipelineProcessingTask) | ||
| Note over NATS: Task contains:<br/>- image_id<br/>- image_url<br/>- reply_subject (for ACK) | ||
| end | ||
|
|
||
| Note over Celery: Celery task completes<br/>(images queued, not processed) | ||
|
|
||
| %% Worker Processing Loop | ||
| loop Worker polling loop | ||
| Worker->>Django: GET /jobs/{id}/tasks?batch=10 | ||
| Django->>NATS: reserve_task(job_id) x10 | ||
| NATS-->>Django: PipelineProcessingTask[] (with reply_subjects) | ||
| Django-->>Worker: {"tasks": [...]} | ||
|
|
||
| %% Process each task | ||
| loop For each task | ||
| Worker->>Worker: Download image from image_url | ||
| Worker->>Worker: Run ML model (detection/classification) | ||
| Worker->>Django: POST /jobs/{id}/result/<br/>PipelineTaskResult{<br/> reply_subject,<br/> result: PipelineResultsResponse<br/>} | ||
|
|
||
| %% Result Processing | ||
| Django->>Celery: process_nats_pipeline_result.delay(<br/>job_id, result_data, reply_subject) | ||
| Django-->>Worker: {"status": "queued", "task_id": "..."} | ||
carlosgjs marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| %% Celery processes result | ||
| Celery->>Redis: Acquire lock for job | ||
| Celery->>Redis: Update pending images (remove processed) | ||
| Celery->>Redis: Calculate progress percentage | ||
| Redis-->>Celery: JobStateProgress | ||
|
|
||
| Celery->>DB: Update job.progress (process stage) | ||
| Celery->>DB: Save detections, classifications | ||
| Celery->>NATS: acknowledge_task(reply_subject) | ||
| Note over NATS: Task removed from queue<br/>(won't be redelivered) | ||
|
|
||
| Celery->>Redis: Update pending images (results stage) | ||
| Celery->>DB: Update job.progress (results stage) | ||
| Celery->>Redis: Release lock | ||
| end | ||
| end | ||
| ``` | ||
|
|
||
| ## Key Design Decisions | ||
|
|
||
| ### 1. Asynchronous Task Queue (NATS JetStream) | ||
|
|
||
| - **Why NATS?** Supports disconnected pull model - workers don't need persistent connections | ||
| - **Visibility Timeout (TTR)**: 300 seconds (5 minutes) - tasks auto-requeue if not ACK'd | ||
| - **Max Retries**: 5 attempts before giving up on a task | ||
| - **Per-Job Streams**: Each job gets its own stream (`job_{job_id}`) for isolation | ||
|
|
||
| ### 2. Redis-Based State Management | ||
|
|
||
| - **Purpose**: Track pending images across distributed workers | ||
| - **Atomicity**: Uses distributed locks to prevent race conditions | ||
| - **Lock Duration**: 360 seconds (matches Celery task timeout) | ||
| - **Cleanup**: Automatic cleanup when job completes | ||
|
|
||
| ### 3. Reply Subject for Acknowledgment | ||
|
|
||
| - NATS generates unique `reply_subject` for each reserved task | ||
| - Worker receives `reply_subject` in task data | ||
| - Worker includes `reply_subject` in result POST | ||
| - Celery acknowledges via NATS after successful save | ||
|
|
||
| This pattern enables: | ||
|
|
||
| - Workers don't need direct NATS access | ||
| - HTTP-only communication for workers | ||
| - Proper task acknowledgment through Django API | ||
|
|
||
| ### 4. Error Handling | ||
|
|
||
| **Worker Errors:** | ||
|
|
||
| - Worker posts `PipelineResultsError` instead of `PipelineResultsResponse` | ||
| - Error is logged but task is still ACK'd (prevents infinite retries for bad data) | ||
| - Failed images tracked separately in Redis | ||
| - If worker crashes and never reports a result or error NATS will redeliver after visibility timeout | ||
|
|
||
| **Database Errors:** | ||
|
|
||
| - If `save_results()` fails, task is NOT ACK'd | ||
| - NATS will redeliver after visibility timeout | ||
| - Celery task has no retries (relies on NATS retry mechanism) | ||
|
|
||
| **Job Cancellation:** | ||
|
|
||
| - Celery task terminated immediately | ||
| - NATS stream and consumer deleted | ||
| - Redis state cleaned up | ||
|
|
||
| ## API Endpoints | ||
|
|
||
| ### GET /jobs/{id}/tasks | ||
|
|
||
carlosgjs marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| Worker endpoint to fetch tasks from NATS queue. | ||
|
|
||
| **Query Parameters:** | ||
|
|
||
| - `batch`: Number of tasks to fetch | ||
|
|
||
| **Response:** | ||
|
|
||
| ```json | ||
| { | ||
| "tasks": [ | ||
| { | ||
| "id": "123", | ||
| "image_id": "123", | ||
| "image_url": "https://minio:9000/...", | ||
| "reply_subject": "$JS.ACK.job_1.job-1-consumer.1.2.3" | ||
| } | ||
| ] | ||
| } | ||
| ``` | ||
|
|
||
| ### POST /jobs/{id}/result | ||
carlosgjs marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| Worker endpoint to post processing results. | ||
|
|
||
| **Request Body:** | ||
|
|
||
| ```json | ||
| { | ||
| "reply_subject": "$JS.ACK.job_1.job-1-consumer.1.2.3", | ||
| "result": { | ||
| // PipelineResultsResponse or PipelineResultsError | ||
| } | ||
| } | ||
| ``` | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.