Commit 17bf10f
fix: PSv2 follow-up fixes from integration tests (RolnickLab#1135)
* fix: prevent NATS connection flooding and stale job task fetching
- Add connect_timeout=5, allow_reconnect=False to NATS connections to
prevent leaked reconnection loops from blocking Django's event loop
- Guard /tasks endpoint against terminal-status jobs (return empty tasks
instead of attempting NATS reserve)
- IncompleteJobFilter now excludes jobs by top-level status in addition
to progress JSON stages
- Add stale worker cleanup to integration test script
Found during PSv2 integration testing where stale ADC workers with
default DataLoader parallelism overwhelmed the single uvicorn worker
thread by flooding /tasks with concurrent NATS reserve requests.
Co-Authored-By: Claude <noreply@anthropic.com>
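The terminal-status guard described above can be sketched roughly as follows. This is an illustration only: the status names, `TERMINAL_STATUSES`, `tasks_for_job`, and `fake_reserve` are hypothetical and not taken from the codebase.

```python
from typing import Callable, List

# Assumed set of terminal statuses; the project's real status names may differ.
TERMINAL_STATUSES = {"SUCCESS", "FAILURE", "REVOKED"}


def tasks_for_job(status: str, reserve: Callable[[], List[dict]]) -> List[dict]:
    """Return reserved tasks, or an empty list if the job has already finished."""
    if status.upper() in TERMINAL_STATUSES:
        # Short-circuit before any NATS traffic is generated.
        return []
    return reserve()


calls = []


def fake_reserve() -> List[dict]:
    """Hypothetical stand-in for the NATS reserve call; records each invocation."""
    calls.append(1)
    return [{"id": 1}]


finished = tasks_for_job("SUCCESS", fake_reserve)
running = tasks_for_job("STARTED", fake_reserve)
```

With this shape, a stale worker polling a finished job gets an empty list without the reserve call ever running.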
* docs: PSv2 integration test session notes and NATS flooding findings
Session notes from 2026-02-16 integration test including root cause
analysis of stale worker task competition and NATS connection issues.
Findings doc tracks applied fixes and remaining TODOs with priorities.
Co-Authored-By: Claude <noreply@anthropic.com>
* docs: update session notes with successful test run #3
PSv2 integration test passed end-to-end (job 1380, 20/20 images).
Identified ack_wait=300s as cause of ~5min idle time when GPU
processes race for NATS tasks.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: batch NATS task fetch to prevent HTTP timeouts
Replace N×1 reserve_task() calls with single reserve_tasks() batch
fetch. The previous implementation created a new pull subscription per
message (320 NATS round trips for batch=64), causing the /tasks endpoint
to exceed HTTP client timeouts. The new approach uses one psub.fetch()
call for the entire batch.
Co-Authored-By: Claude <noreply@anthropic.com>
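To illustrate why one batched fetch is cheaper than N single fetches, here is a small simulation. `FakePullSubscription`, `reserve_one_by_one`, and `reserve_batch` are hypothetical stand-ins, not the project's code; the fake only counts `fetch()` calls, so it understates the old cost, which per the message above also created a new pull subscription per message.

```python
import asyncio
from typing import List


class FakePullSubscription:
    """Stand-in for a JetStream pull subscription that counts fetch calls."""

    def __init__(self, messages: List[str]) -> None:
        self._messages = list(messages)
        self.fetch_calls = 0

    async def fetch(self, batch: int = 1, timeout: float = 5.0) -> List[str]:
        # Each call represents one request/response round trip to the server.
        self.fetch_calls += 1
        taken, self._messages = self._messages[:batch], self._messages[batch:]
        return taken


async def reserve_one_by_one(psub: FakePullSubscription, n: int) -> List[str]:
    """Old shape: one fetch per message."""
    out: List[str] = []
    for _ in range(n):
        out.extend(await psub.fetch(batch=1))
    return out


async def reserve_batch(psub: FakePullSubscription, n: int) -> List[str]:
    """New shape: a single fetch for the whole batch."""
    return await psub.fetch(batch=n)


messages = [f"task-{i}" for i in range(64)]
slow = FakePullSubscription(messages)
fast = FakePullSubscription(messages)
slow_result = asyncio.run(reserve_one_by_one(slow, 64))
fast_result = asyncio.run(reserve_batch(fast, 64))
```

Both paths return the same 64 messages; the difference is only in how many round trips the counter records.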
* docs: add next session prompt
* feat: add pipeline__slug__in filter for multi-pipeline job queries
Workers that handle multiple pipelines can now fetch jobs for all of them
in a single request: ?pipeline__slug__in=slug1,slug2
Co-Authored-By: Claude <noreply@anthropic.com>
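As a rough illustration of the `__in` semantics, the sketch below parses the comma-separated value and filters an in-memory list. It uses only the standard library; the `/jobs/` path, the `pipeline_slug` key, and both helper functions are hypothetical, and the real filter presumably runs as a database query.

```python
from typing import Dict, List
from urllib.parse import parse_qs, urlparse


def parse_slug_in(url: str) -> List[str]:
    """Extract the comma-separated pipeline__slug__in values from a URL."""
    query = parse_qs(urlparse(url).query)
    raw = query.get("pipeline__slug__in", [""])[0]
    return [slug.strip() for slug in raw.split(",") if slug.strip()]


def filter_jobs(jobs: List[Dict[str, str]], slugs: List[str]) -> List[Dict[str, str]]:
    """Keep jobs whose pipeline slug is in the requested set (no filter if empty)."""
    if not slugs:
        return list(jobs)
    wanted = set(slugs)
    return [job for job in jobs if job["pipeline_slug"] in wanted]


jobs = [
    {"name": "a", "pipeline_slug": "slug1"},
    {"name": "b", "pipeline_slug": "slug2"},
    {"name": "c", "pipeline_slug": "slug3"},
]
slugs = parse_slug_in("/jobs/?pipeline__slug__in=slug1,slug2")
matched = filter_jobs(jobs, slugs)
```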
* chore: remove local-only docs and scripts from branch
These files are session notes, planning docs, and test scripts that
should stay local rather than be part of the PR.
Co-Authored-By: Claude <noreply@anthropic.com>
* feat: set job dispatch_mode at creation time based on project feature flags
ML jobs with a pipeline now get dispatch_mode set during setup() instead
of waiting until run() is called by the Celery worker. This lets the UI
show the correct mode immediately after job creation.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: add timeouts to all JetStream operations and restore reconnect policy
Add NATS_JETSTREAM_TIMEOUT (10s) to all JetStream metadata operations
via asyncio.wait_for() so a hung NATS connection fails fast instead of
blocking the caller's thread indefinitely. Also restore the intended
reconnect policy (2 attempts, 1s wait) that was lost in a prior force push.
Co-Authored-By: Claude <noreply@anthropic.com>
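A minimal sketch of the fail-fast pattern, assuming a thin wrapper around `asyncio.wait_for()`. The `with_timeout` helper and `hung_operation` are hypothetical names; only `NATS_JETSTREAM_TIMEOUT` and its 10s value come from the message above.

```python
import asyncio

NATS_JETSTREAM_TIMEOUT = 10  # seconds, as stated in the commit message


async def with_timeout(awaitable, timeout: float = NATS_JETSTREAM_TIMEOUT):
    """Await an operation, raising asyncio.TimeoutError if it takes too long."""
    return await asyncio.wait_for(awaitable, timeout=timeout)


async def hung_operation() -> None:
    """Simulates a JetStream call on a connection that never answers."""
    await asyncio.sleep(3600)


async def demo() -> str:
    try:
        # A short timeout keeps the demonstration fast.
        await with_timeout(hung_operation(), timeout=0.05)
    except asyncio.TimeoutError:
        return "timed out"
    return "completed"


result = asyncio.run(demo())
```

`asyncio.wait_for()` cancels the wrapped operation when the deadline passes, so the caller regains control instead of waiting on the hung connection.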
* fix: propagate NATS timeouts as 503 instead of swallowing them
asyncio.TimeoutError from _ensure_stream() and _ensure_consumer() was
caught by the broad `except Exception` in reserve_tasks(), silently
returning [] and making NATS outages indistinguishable from empty queues.
Workers would then poll immediately, recreating the flooding problem.
- Add explicit `except asyncio.TimeoutError: raise` in reserve_tasks()
- Catch TimeoutError and OSError in the /tasks view, return 503
- Restore allow_reconnect=False (fail-fast on connection issues)
- Add return type annotation to get_connection()
Co-Authored-By: Claude <noreply@anthropic.com>
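The exception ordering described above might look roughly like this. It is a simplified sketch: the `fetch` parameter, the `tasks_view_status` helper, and the tuple return value are assumptions made for illustration, not the real view code.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def reserve_tasks(fetch: Callable[[], Awaitable[List[dict]]]) -> List[dict]:
    try:
        return await fetch()
    except asyncio.TimeoutError:
        # Must come before the broad handler so timeouts are not swallowed.
        raise
    except Exception:
        # Other failures still degrade to "no tasks" in this sketch.
        return []


def tasks_view_status(fetch: Callable[[], Awaitable[List[dict]]]) -> Tuple[int, List[dict]]:
    """Map a NATS outage to 503 so it is distinguishable from an empty queue."""
    try:
        tasks = asyncio.run(reserve_tasks(fetch))
    except (asyncio.TimeoutError, OSError):
        return 503, []
    return 200, tasks


async def timing_out() -> List[dict]:
    raise asyncio.TimeoutError()


async def empty_queue() -> List[dict]:
    return []


outage = tasks_view_status(timing_out)
idle = tasks_view_status(empty_queue)
```

A worker can then back off on 503 rather than treating the outage as an empty queue and polling again immediately.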
* fix: address review comments (log level, fetch timeout, docstring)
- Downgrade reserve_tasks log to DEBUG when zero tasks reserved (avoid
log spam from frequent polling)
- Pass timeout=0.5 from /tasks endpoint to avoid blocking the worker
for 5s on empty queues
- Fix docstring examples using string 'job123' for int-typed job_id
Co-Authored-By: Claude <noreply@anthropic.com>
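The log-level change can be sketched as below. The logger name, message text, and `log_reserved` helper are hypothetical, and using INFO for the non-empty case is an assumption; the message above only states that the zero-task case is logged at DEBUG.

```python
import logging

logger = logging.getLogger("reserve_tasks_sketch")


def log_reserved(count: int, job_id: int) -> int:
    """Log at DEBUG when nothing was reserved, INFO (assumed) otherwise."""
    level = logging.INFO if count else logging.DEBUG
    logger.log(level, "Reserved %d tasks for job %d", count, job_id)
    return level


empty_level = log_reserved(0, 123)
busy_level = log_reserved(3, 123)
```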
* fix: catch nats.errors.Error in /tasks endpoint for proper 503 responses
NoServersError, ConnectionClosedError, and other NATS exceptions inherit
from nats.errors.Error (not OSError), so they escaped the handler and
returned 500 instead of 503.
Co-Authored-By: Claude <noreply@anthropic.com>
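The class-hierarchy issue can be reproduced with stand-in exceptions. The classes below only mimic the inheritance described above (a library base error deriving from `Exception`, not `OSError`); they are not imported from nats-py, and the two status helpers are hypothetical.

```python
class NatsError(Exception):
    """Stand-in for nats.errors.Error."""


class NoServersError(NatsError):
    """Stand-in for the library's NoServersError."""


class ConnectionClosedError(NatsError):
    """Stand-in for the library's ConnectionClosedError."""


def status_before(exc: Exception) -> int:
    """Handler that only knows about TimeoutError and OSError."""
    if isinstance(exc, (TimeoutError, OSError)):
        return 503
    return 500


def status_after(exc: Exception) -> int:
    """Handler that also catches the library's base error class."""
    if isinstance(exc, (TimeoutError, OSError, NatsError)):
        return 503
    return 500


before = status_before(NoServersError())
after = status_after(NoServersError())
closed = status_after(ConnectionClosedError())
```

Catching the base class covers every current and future subclass, so individual NATS exceptions do not need to be listed one by one.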
---------
Co-authored-by: Claude <noreply@anthropic.com>
1 parent cf3a59e commit 17bf10f
File tree
5 files changed, +220 -94 lines changed
- ami
- jobs
- ml/orchestration
- tests