Testing: Management command for e2e test of ML jobs #1143
carlosgjs wants to merge 14 commits into RolnickLab:main from
Conversation
👷 Deploy request for antenna-ssec pending review.
✅ Deploy Preview for antenna-preview canceled.
📝 Walkthrough

Adds a new Django management command to create, enqueue, and monitor an ML Job end-to-end by locating a SourceImageCollection and Pipeline, creating a Job with a dispatch mode, polling status every 2 seconds, and printing live progress and final results until completion or interruption.
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant CMD as test_ml_job_e2e Command
    participant DB as Database
    participant JobSvc as Job Creation/Model
    participant Queue as Dispatcher/Queue
    participant Monitor as Monitor Loop
    participant Out as Console Output
    User->>CMD: run with collection, pipeline, dispatch args
    CMD->>DB: query SourceImageCollection
    alt not found
        CMD->>Out: print error & exit
    end
    CMD->>DB: query Pipeline
    alt not found
        CMD->>Out: print error & exit
    end
    CMD->>JobSvc: create/save Job (with dispatch mode)
    JobSvc->>DB: persist Job
    CMD->>Queue: enqueue Job
    CMD->>Monitor: start _monitor_job(job, start_time)
    loop every 2s until final state or interrupt
        Monitor->>DB: poll job status & stages
        Monitor->>Out: update single-line progress (elapsed, status, per-stage)
        alt final state reached
            Monitor->>Out: call _show_final_results
            Monitor->>Monitor: exit loop
        end
    end
    alt KeyboardInterrupt
        Monitor->>Out: print Job ID and stop monitoring
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 3
🧹 Nitpick comments (2)

ami/jobs/management/commands/test_ml_job_e2e.py (2)

40-44: Same `from None` chaining needed here. Per Ruff B904, chain the re-raised `CommandError` to suppress the noisy inner traceback.

Proposed fix

```diff
 except Pipeline.DoesNotExist:
-    raise CommandError(f"Pipeline '{pipeline_slug}' not found")
+    raise CommandError(f"Pipeline '{pipeline_slug}' not found") from None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/jobs/management/commands/test_ml_job_e2e.py` around lines 40 - 44, In the try/except that looks up Pipeline via Pipeline.objects.get(slug=pipeline_slug), update the exception re-raise to suppress the inner traceback by chaining to None: replace the current "raise CommandError(f\"Pipeline '{pipeline_slug}' not found\")" with raising CommandError from None (i.e., raise CommandError(...) from None) so the caught Pipeline.DoesNotExist does not emit a noisy inner traceback when CommandError is thrown.
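The `from None` behavior the prompt describes can be seen in a standalone sketch. The class names below (`DoesNotExist`, `CommandError`) are stand-ins for the Django originals, not the project's code:

```python
# Sketch of the B904 pattern: re-raising a friendlier error while
# controlling exception chaining. All names here are illustrative.

class DoesNotExist(Exception):
    pass

class CommandError(Exception):
    pass

def lookup(slug):
    raise DoesNotExist(slug)

def run_plain(slug):
    try:
        lookup(slug)
    except DoesNotExist:
        # Implicit chaining: the DoesNotExist traceback is attached as
        # __context__ and printed as a noisy inner traceback.
        raise CommandError(f"Pipeline '{slug}' not found")

def run_chained(slug):
    try:
        lookup(slug)
    except DoesNotExist:
        # `from None` suppresses the inner traceback entirely.
        raise CommandError(f"Pipeline '{slug}' not found") from None

try:
    run_plain("x")
except CommandError as e:
    print(e.__context__ is not None)   # True: implicit context kept

try:
    run_chained("x")
except CommandError as e:
    print(e.__suppress_context__)      # True: context suppressed
```

When the underlying cause is diagnostic (as in the MultipleObjectsReturned case below), `raise ... from err` keeps the chain visible instead of hiding it.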
71-85: Consider adding an optional timeout to prevent indefinite monitoring. For an E2E test utility, an unbounded polling loop is a minor concern (Ctrl+C works), but a `--timeout` argument would make this more robust for CI/scripted usage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/jobs/management/commands/test_ml_job_e2e.py` around lines 71 - 85, Add an optional timeout to the E2E job monitor so it cannot poll forever: update the _monitor_job method to accept a timeout (e.g., timeout=None) and while looping check if elapsed >= timeout (when timeout is set) and abort/raise a clear TimeoutError or mark the job as failed and call _show_final_results; wire this timeout into the management command by adding a --timeout CLI option in the command class (parse as seconds, default None) and pass it into _monitor_job when invoked so CI or scripts can limit wait time.
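A sketch of the suggested `--timeout` behavior. The function name, terminal-state strings, and `fetch_status` callback are assumptions for illustration, not the command's real API:

```python
import time

def monitor_job(fetch_status, interval=2.0, timeout=None):
    """Poll fetch_status() until a terminal state; raise TimeoutError after `timeout` seconds."""
    start = time.monotonic()
    while True:
        status = fetch_status()
        if status in {"SUCCESS", "FAILURE", "CANCELED"}:
            return status
        elapsed = time.monotonic() - start
        if timeout is not None and elapsed >= timeout:
            # Fail loudly instead of spinning forever in CI.
            raise TimeoutError(f"job still {status!r} after {elapsed:.1f}s")
        time.sleep(interval)

# Simulated job that succeeds on its third poll.
states = iter(["PENDING", "STARTED", "SUCCESS"])
print(monitor_job(lambda: next(states), interval=0.01))  # SUCCESS
```

Wiring this up would only need `parser.add_argument("--timeout", type=float, default=None)` in the command and passing the parsed value through to the monitor loop.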
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ami/jobs/management/commands/test_ml_job_e2e.py`:
- Around line 109-114: The three self.stdout.write calls that log job outcomes (the branches checking JobState.SUCCESS, JobState.FAILURE, and the else branch using job.status) use f-strings with no interpolation; remove the extraneous leading "f" from those string literals in the block where job.status is evaluated so they become normal strings (e.g., change f"✅ Job completed successfully" to a plain string); update the three occurrences where these self.stdout.write(...) calls appear to silence Ruff F541.
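The F541 finding is easy to see in isolation: without placeholders, the `f` prefix is inert and the literal is just a plain string. The messages below mirror the review text:

```python
# Ruff F541 flags f-string literals that contain no {placeholders}.
with_prefix = f"✅ Job completed successfully"  # F541: `f` does nothing here
plain = "✅ Job completed successfully"          # identical value, no lint warning

status = "FAILURE"
interpolated = f"Job ended with status {status}"  # fine: has a placeholder

print(with_prefix == plain)  # True
print(interpolated)
```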
```python
try:
    collection = SourceImageCollection.objects.get(name=collection_name)
    self.stdout.write(self.style.SUCCESS(f"✓ Found collection: {collection.name}"))
    self.stdout.write(f"  Project: {collection.project.name}")
except SourceImageCollection.DoesNotExist:
    raise CommandError(f"SourceImageCollection '{collection_name}' not found")
```
`SourceImageCollection.objects.get(name=...)` may raise MultipleObjectsReturned.

The name field on SourceImageCollection is not unique. If multiple collections share the same name, this will throw an unhandled MultipleObjectsReturned exception with a raw traceback. Either catch MultipleObjectsReturned or use `.filter(...).first()` with an appropriate message.

Also, per static analysis (B904), chain the original exception for better diagnostics.

Proposed fix

```diff
 try:
     collection = SourceImageCollection.objects.get(name=collection_name)
     self.stdout.write(self.style.SUCCESS(f"✓ Found collection: {collection.name}"))
     self.stdout.write(f"  Project: {collection.project.name}")
 except SourceImageCollection.DoesNotExist:
-    raise CommandError(f"SourceImageCollection '{collection_name}' not found")
+    raise CommandError(f"SourceImageCollection '{collection_name}' not found") from None
+except SourceImageCollection.MultipleObjectsReturned:
+    raise CommandError(
+        f"Multiple SourceImageCollections found with name '{collection_name}'. Use a unique name."
+    ) from None
```

🧰 Tools
🪛 Ruff (0.15.1)
[warning] 37-37: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
[warning] 37-37: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ami/jobs/management/commands/test_ml_job_e2e.py` around lines 32 - 37, The
current use of SourceImageCollection.objects.get(name=collection_name) can raise
MultipleObjectsReturned; modify the lookup to handle duplicates by either
catching SourceImageCollection.MultipleObjectsReturned (or
MultipleObjectsReturned from django.core.exceptions) and raising a CommandError
chaining the original exception (raise CommandError(...) from e) with a clear
message, or replace .get(...) with
SourceImageCollection.objects.filter(name=collection_name).first() and then
raise CommandError if result is None or if you detect multiple matches and want
to surface that as an error; update the block around
SourceImageCollection.objects.get, the except path, and ensure you chain the
original exception when re-raising CommandError.
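The two lookup strategies the prompt contrasts can be sketched without the ORM. The toy `rows` list and helper names below are illustrative stand-ins for SourceImageCollection querysets:

```python
class CommandError(Exception):
    pass

# Toy rows standing in for collections; names need not be unique.
rows = [
    {"name": "moths-2024", "id": 1},
    {"name": "moths-2024", "id": 2},
    {"name": "beetles", "id": 3},
]

def get_unique(name):
    """Mimics QuerySet.get(): error on zero matches AND on multiple matches."""
    matches = [r for r in rows if r["name"] == name]
    if not matches:
        raise CommandError(f"Collection '{name}' not found")
    if len(matches) > 1:
        raise CommandError(f"Multiple collections named '{name}'; use a unique name")
    return matches[0]

def get_first(name):
    """Mimics .filter(...).first(): quietly picks the first match."""
    matches = [r for r in rows if r["name"] == name]
    if not matches:
        raise CommandError(f"Collection '{name}' not found")
    return matches[0]

print(get_first("moths-2024")["id"])  # 1 (first of two duplicates, silently)
print(get_unique("beetles")["id"])    # 3
```

For a test utility, surfacing duplicates as an error (`get_unique`) is usually preferable to silently picking one (`get_first`), since the chosen collection determines what the job processes.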
```python
for param in stage.params:
    if param.value:
        self.stdout.write(f"  {param.name}: {param.value}")
```
`if param.value:` silently hides falsy values like 0, False, or "".

If a stage parameter can legitimately hold 0 or another falsy value, it will be skipped. Use `if param.value is not None:` to only skip truly absent values.
Proposed fix

```diff
 for param in stage.params:
-    if param.value:
+    if param.value is not None:
         self.stdout.write(f"  {param.name}: {param.value}")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ami/jobs/management/commands/test_ml_job_e2e.py` around lines 123 - 125, The
loop that writes stage parameters skips falsy but valid values because it tests
"if param.value:"; change that check to explicitly test for None so values like
0, False or empty string are preserved — update the conditional in the loop
iterating over stage.params (the block that calls self.stdout.write(f"
{param.name}: {param.value}")) to use "if param.value is not None:" so only
absent values are skipped.
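A quick demonstration of why the truthiness check drops legitimate values; the parameter names and values below are hypothetical:

```python
# Hypothetical stage parameters: several hold valid-but-falsy values.
params = [
    ("confidence", 0.0),
    ("enabled", False),
    ("label", ""),
    ("batch", 64),
    ("extra", None),
]

# Truthiness check: drops 0.0, False, and "" along with None.
shown_truthy = [name for name, value in params if value]

# Explicit None check: only truly absent values are skipped.
shown_not_none = [name for name, value in params if value is not None]

print(shown_truthy)    # ['batch']
print(shown_not_none)  # ['confidence', 'enabled', 'label', 'batch']
```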
Pull request overview

This pull request adds a new Django management command test_ml_job_e2e for end-to-end testing of ML job processing. The command creates ML jobs with specified collections and pipelines, monitors their progress in real-time, and displays final results in the console. This is a useful debugging and testing tool for developers working with the ML job pipeline.

Changes:
- Added `test_ml_job_e2e.py` management command with support for creating and monitoring ML jobs through different dispatch modes (internal, sync_api, async_api)
- Implements real-time progress monitoring with console updates every 2 seconds showing job status and per-stage progress
- Provides final results summary including stage completion status and parameter values
* fix: prevent NATS connection flooding and stale job task fetching
  - Add connect_timeout=5, allow_reconnect=False to NATS connections to prevent leaked reconnection loops from blocking Django's event loop
  - Guard /tasks endpoint against terminal-status jobs (return empty tasks instead of attempting NATS reserve)
  - IncompleteJobFilter now excludes jobs by top-level status in addition to progress JSON stages
  - Add stale worker cleanup to integration test script
  Found during PSv2 integration testing where stale ADC workers with default DataLoader parallelism overwhelmed the single uvicorn worker thread by flooding /tasks with concurrent NATS reserve requests.
  Co-Authored-By: Claude <noreply@anthropic.com>
* docs: PSv2 integration test session notes and NATS flooding findings
  Session notes from 2026-02-16 integration test including root cause analysis of stale worker task competition and NATS connection issues. Findings doc tracks applied fixes and remaining TODOs with priorities.
* docs: update session notes with successful test run #3
  PSv2 integration test passed end-to-end (job 1380, 20/20 images). Identified ack_wait=300s as cause of ~5min idle time when GPU processes race for NATS tasks.
* fix: batch NATS task fetch to prevent HTTP timeouts
  Replace N×1 reserve_task() calls with a single reserve_tasks() batch fetch. The previous implementation created a new pull subscription per message (320 NATS round trips for batch=64), causing the /tasks endpoint to exceed HTTP client timeouts. The new approach uses one psub.fetch() call for the entire batch.
* docs: add next session prompt
* feat: add pipeline__slug__in filter for multi-pipeline job queries
  Workers that handle multiple pipelines can now fetch jobs for all of them in a single request: ?pipeline__slug__in=slug1,slug2
* chore: remove local-only docs and scripts from branch
  These files are session notes, planning docs, and test scripts that should stay local rather than be part of the PR.
* feat: set job dispatch_mode at creation time based on project feature flags
  ML jobs with a pipeline now get dispatch_mode set during setup() instead of waiting until run() is called by the Celery worker. This lets the UI show the correct mode immediately after job creation.
* fix: add timeouts to all JetStream operations and restore reconnect policy
  Add NATS_JETSTREAM_TIMEOUT (10s) to all JetStream metadata operations via asyncio.wait_for() so a hung NATS connection fails fast instead of blocking the caller's thread indefinitely. Also restore the intended reconnect policy (2 attempts, 1s wait) that was lost in a prior force push.
* fix: propagate NATS timeouts as 503 instead of swallowing them
  asyncio.TimeoutError from _ensure_stream() and _ensure_consumer() was caught by the broad `except Exception` in reserve_tasks(), silently returning [] and making NATS outages indistinguishable from empty queues. Workers would then poll immediately, recreating the flooding problem.
  - Add explicit `except asyncio.TimeoutError: raise` in reserve_tasks()
  - Catch TimeoutError and OSError in the /tasks view, return 503
  - Restore allow_reconnect=False (fail-fast on connection issues)
  - Add return type annotation to get_connection()
* fix: address review comments (log level, fetch timeout, docstring)
  - Downgrade reserve_tasks log to DEBUG when zero tasks reserved (avoid log spam from frequent polling)
  - Pass timeout=0.5 from /tasks endpoint to avoid blocking the worker for 5s on empty queues
  - Fix docstring examples using string 'job123' for int-typed job_id
* fix: catch nats.errors.Error in /tasks endpoint for proper 503 responses
  NoServersError, ConnectionClosedError, and other NATS exceptions inherit from nats.errors.Error (not OSError), so they escaped the handler and returned 500 instead of 503.

Co-authored-by: Claude <noreply@anthropic.com>
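One of the fixes above wraps JetStream metadata operations in `asyncio.wait_for()` so a hung connection fails fast. A minimal sketch of that fail-fast pattern follows; the coroutine and the short demo timeout are stand-ins, not the project's code (the commit message's `NATS_JETSTREAM_TIMEOUT` default is 10 s):

```python
import asyncio

DEMO_TIMEOUT = 0.05  # short so the sketch runs quickly; production would use ~10 s

async def slow_metadata_call():
    # Stand-in for a JetStream stream/consumer metadata call on a hung connection.
    await asyncio.sleep(60)
    return "never reached"

async def ensure_stream():
    try:
        # wait_for cancels the inner coroutine and raises once the timeout elapses,
        # instead of blocking the caller's thread indefinitely.
        return await asyncio.wait_for(slow_metadata_call(), timeout=DEMO_TIMEOUT)
    except asyncio.TimeoutError:
        return "timed out fast"

print(asyncio.run(ensure_stream()))  # timed out fast
```

The companion fix is equally important: a caller must re-raise (or map to a 503) rather than swallow the `TimeoutError`, or an outage becomes indistinguishable from an empty queue.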
…olnickLab#1142)
* feat: configurable NATS tuning and gunicorn worker management
  Rebase onto main after RolnickLab#1135 merge. Keep only the additions unique to this branch:
  - Make TASK_TTR configurable via NATS_TASK_TTR Django setting (default 30s)
  - Make max_ack_pending configurable via NATS_MAX_ACK_PENDING setting (default 100)
  - Local dev: switch to gunicorn+UvicornWorker by default for production parity, with USE_UVICORN=1 escape hatch for raw uvicorn
  - Production: auto-detect WEB_CONCURRENCY from CPU cores (capped at 8) when not explicitly set in the environment
* fix: address PR review comments
  - Fix max_ack_pending falsy-zero guard (use `is not None` instead of `or`)
  - Update TaskQueueManager docstring with Args section
  - Simplify production WEB_CONCURRENCY fallback (just use nproc)

Co-authored-by: Michael Bunsen <notbot@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
* fix: include pipeline_slug in MinimalJobSerializer (ids_only response)
  The ADC worker fetches jobs with ids_only=1 and expects pipeline_slug in the response to know which pipeline to run. Without it, Pydantic validation fails and the worker skips the job.
* Update ami/jobs/serializers.py

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 2
♻️ Duplicate comments (2)
ami/jobs/management/commands/test_ml_job_e2e.py (2)
32-37: ⚠️ Potential issue | 🟡 Minor — Handle duplicate collection names and chain CommandError.

`SourceImageCollection.objects.get(name=...)` can raise `MultipleObjectsReturned` when names aren't unique; also, raising `CommandError` without chaining triggers B904. This was already flagged before.

✅ Proposed fix

```diff
 except SourceImageCollection.DoesNotExist:
-    raise CommandError(f"SourceImageCollection '{collection_name}' not found")
+    raise CommandError(f"SourceImageCollection '{collection_name}' not found") from None
+except SourceImageCollection.MultipleObjectsReturned:
+    raise CommandError(
+        f"Multiple SourceImageCollections found with name '{collection_name}'. Use a unique name."
+    ) from None
```

```shell
#!/bin/bash
# Inspect SourceImageCollection.name uniqueness to confirm duplicate risk
rg -n "class SourceImageCollection" -C4
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/jobs/management/commands/test_ml_job_e2e.py` around lines 32 - 37, The current lookup using SourceImageCollection.objects.get(name=collection_name) can raise MultipleObjectsReturned and the CommandError is raised without exception chaining; update the try/except to catch both SourceImageCollection.DoesNotExist and SourceImageCollection.MultipleObjectsReturned (or Exception as e) and when raising CommandError use "raise CommandError(... ) from e" so the original exception is chained; reference the lookup call (SourceImageCollection.objects.get) and the raised CommandError to locate and modify the handling.
40-44: ⚠️ Potential issue | 🟡 Minor — Chain CommandError exceptions to avoid B904 violations.

Lines 37 and 44 should use `raise ... from None` to explicitly handle exception chaining per Ruff B904. When intentionally replacing a caught exception with a more informative one, explicit chaining syntax distinguishes intentional replacement from accidental context loss.

Affected locations

Line 37 (SourceImageCollection):

```diff
 except SourceImageCollection.DoesNotExist:
-    raise CommandError(f"SourceImageCollection '{collection_name}' not found")
+    raise CommandError(f"SourceImageCollection '{collection_name}' not found") from None
```

Line 44 (Pipeline):

```diff
 except Pipeline.DoesNotExist:
-    raise CommandError(f"Pipeline '{pipeline_slug}' not found")
+    raise CommandError(f"Pipeline '{pipeline_slug}' not found") from None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/jobs/management/commands/test_ml_job_e2e.py` around lines 40 - 44, The except handlers that replace underlying exceptions with CommandError (specifically the block calling Pipeline.objects.get(slug=pipeline_slug) and the similar SourceImageCollection lookup) must explicitly suppress implicit exception chaining; change the raise statements to use "raise CommandError(...) from None" so the new CommandError replaces the original exception intentionally (use the exact symbols Pipeline.objects.get and SourceImageCollection lookup/exception handlers to find the two places to update).
```python
            help="Job dispatch mode",
        )

    def handle(self, *args, **options):
```
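Django management commands delegate CLI parsing to argparse, so the interface around this snippet can be mirrored in a plain-argparse sketch. The positional names, option name, and choices below are assumptions based on this PR's description, not the command's verified signature:

```python
import argparse

def build_parser():
    # Mirrors what a Django Command.add_arguments(parser) would register.
    parser = argparse.ArgumentParser(prog="test_ml_job_e2e")
    parser.add_argument("collection", help="SourceImageCollection name")
    parser.add_argument("pipeline", help="Pipeline slug")
    parser.add_argument(
        "--dispatch-mode",
        choices=["internal", "sync_api", "async_api"],
        default="internal",
        help="Job dispatch mode",
    )
    return parser

args = build_parser().parse_args(["moths-2024", "panama-moths", "--dispatch-mode", "async_api"])
print(args.dispatch_mode)  # async_api
```

In a real command the parsed values arrive in `handle(self, *args, **options)` via the `options` dict (e.g. `options["dispatch_mode"]`), which is also why the unused `*args` can safely be renamed `*_args`.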
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Check if the file exists and view content around line 26
if [ -f "ami/jobs/management/commands/test_ml_job_e2e.py" ]; then
    echo "=== File exists, showing lines 20-35 ==="
    sed -n '20,35p' ami/jobs/management/commands/test_ml_job_e2e.py | cat -n
else
    echo "File not found"
fi
```

Repository: RolnickLab/antenna
Length of output: 849

🏁 Script executed:

```shell
#!/bin/bash
# Find Ruff configuration files
echo "=== Looking for Ruff configuration files ==="
fd -t f '(pyproject\.toml|ruff\.toml|\.ruff\.toml|setup\.cfg|tox\.ini)' -x ls -lh {}
echo -e "\n=== Searching for Ruff/ARG configuration ==="
fd -t f '(pyproject\.toml|ruff\.toml|\.ruff\.toml)' -x rg -A 20 "^\[tool\.ruff\]|extend-ignore|ignore.*ARG|ARG002"
```

Repository: RolnickLab/antenna
Length of output: 330

🏁 Script executed:

```shell
#!/bin/bash
echo "=== Content of ruff.toml ==="
cat ruff.toml
echo -e "\n=== Content of pyproject.toml ==="
cat pyproject.toml
echo -e "\n=== Content of setup.cfg ==="
cat setup.cfg
```

Repository: RolnickLab/antenna
Length of output: 3366
Rename unused *args to *_args to satisfy Ruff ARG002 enforcement.

Line 26 contains an unused *args parameter that violates the ARG002 rule, which is actively enforced in this repository's Ruff configuration. Rename to *_args to indicate the parameter is intentionally unused.

Suggested fix

```diff
-    def handle(self, *args, **options):
+    def handle(self, *_args, **options):
```
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 26-26: Unused method argument: args (ARG002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ami/jobs/management/commands/test_ml_job_e2e.py` at line 26, The handle
method currently defines an unused variadic positional parameter as *args which
violates Ruff ARG002; rename it to *_args in the function signature for handle
to mark it intentionally unused and satisfy the linter (ensure any internal
references to the old name are updated or removed). Keep the rest of the
signature and behavior unchanged.
```python
# Check if finished
if job.status in JobState.final_states():
    self._show_final_results(job, start_time)
    break
```
Terminate monitoring on UNKNOWN to avoid infinite loop.

`JobState.final_states()` excludes UNKNOWN, yet `JobState.failed_states()` includes it. If the job ever becomes UNKNOWN, this loop never ends. Treat UNKNOWN as terminal (or use failed_states() in addition to final_states()).

🔧 Suggested fix

```diff
-            if job.status in JobState.final_states():
+            if job.status in JobState.final_states() or job.status == JobState.UNKNOWN:
                 self._show_final_results(job, start_time)
                 break
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ami/jobs/management/commands/test_ml_job_e2e.py` around lines 78 - 81, The
loop termination misses the UNKNOWN state because JobState.final_states()
doesn't include it; update the check inside the monitoring loop to treat UNKNOWN
as terminal by changing the condition to check job.status against both
JobState.final_states() and JobState.failed_states() (e.g., if job.status in
JobState.final_states() or job.status in JobState.failed_states(): call
self._show_final_results(job, start_time) and break) so the loop cannot spin
forever when status becomes UNKNOWN.
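The suggested terminal-state check can be sketched with a toy enum. The real `JobState` lives in the project's jobs app; the member names below follow the review text, but the state sets are otherwise assumed:

```python
from enum import Enum

class JobState(str, Enum):
    PENDING = "PENDING"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    UNKNOWN = "UNKNOWN"

    @classmethod
    def final_states(cls):
        # Per the review, UNKNOWN is NOT in this set...
        return {cls.SUCCESS, cls.FAILURE}

    @classmethod
    def failed_states(cls):
        # ...but it IS in this one, hence the union below.
        return {cls.FAILURE, cls.UNKNOWN}

def is_terminal(status):
    # Union of both sets so UNKNOWN also stops the monitor loop.
    return status in JobState.final_states() or status in JobState.failed_states()

print(is_terminal(JobState.UNKNOWN))  # True
print(is_terminal(JobState.STARTED))  # False
```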
This pull request introduces a new management command for end-to-end testing of ML jobs in the Django project. The command allows users to create and monitor ML jobs using specified collections and pipelines, providing real-time progress updates and final results in the console.
New ML job testing command:
test_ml_job_e2e.pymanagement command for end-to-end testing of ML job processing, including job creation, progress monitoring, and final results display.SourceImageCollection,Pipeline, andJobDispatchModevia command-line arguments, with informative error handling for missing resources.Summary by CodeRabbit