Temporarily remove interest code from monitor loop#17
Temporarily remove interest code from monitor loop#17transientlunatic merged 12 commits intov0.7-previewfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR removes interest-checking and priority-based submission logic from the monitor loop that was preventing LVK-like workflows from completing, as these workflows were incorrectly being judged as low interest.
Key Changes
- Removed conditional submission logic based on interest status and priority methods from the submit function
- Simplified the workflow to always attempt DAG build and submission for ready productions
- Updated test workflow comments to reflect that result files (not PSD files) are being waited for
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
asimov/cli/manage.py |
Removed ~84 lines of interest checking and priority method logic; simplified submission flow to directly build and submit DAGs for productions not in special states |
.github/workflows/htcondor-tests.yml |
Updated comment from "PSD files" to "result file" and commented out a monitor command |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| pipe.build_dag(dryrun=dryrun) | ||
| except PipelineException as e: | ||
| logger.error( | ||
| "failed to build a DAG file.", | ||
| ) | ||
| logger.exception(e) | ||
| click.echo( | ||
| click.style("●", fg="red") | ||
| + f" Unable to submit {production.name}" | ||
| ) | ||
| except ValueError: | ||
| logger.info("Unable to submit an unbuilt production") | ||
| click.echo( | ||
| click.style("●", fg="red") | ||
| + f" Unable to submit {production.name} as it hasn't been built yet." | ||
| ) | ||
| click.echo("Try running `asimov manage build` first.") | ||
| try: | ||
| pipe.submit_dag(dryrun=dryrun) | ||
| if not dryrun: | ||
| click.echo( | ||
| click.style("●", fg="red") | ||
| + f" Unable to submit {production.name} as it hasn't been built yet." | ||
| click.style("●", fg="green") | ||
| + f" Submitted {production.event.name}/{production.name}" | ||
| ) | ||
| click.echo("Try running `asimov manage build` first.") | ||
| try: | ||
| pipe.submit_dag(dryrun=dryrun) | ||
| if not dryrun: | ||
| click.echo( | ||
| click.style("●", fg="green") | ||
| + f" Submitted {production.event.name}/{production.name}" | ||
| ) | ||
| production.status = "running" | ||
| production.status = "running" | ||
|
|
||
| except PipelineException as e: | ||
| production.status = "stuck" | ||
| click.echo( | ||
| click.style("●", fg="red") | ||
| + f" Unable to submit {production.name}" | ||
| ) | ||
| logger.exception(e) | ||
| ledger.update_event(event) | ||
| logger.error( | ||
| f"The pipeline failed to submit the DAG file to the cluster. {e}", | ||
| ) | ||
| if not dryrun: | ||
| # Refresh the job list | ||
| job_list = condor.CondorJobList() | ||
| job_list.refresh() | ||
| # Update the ledger | ||
| ledger.update_event(event) | ||
| else: | ||
| except PipelineException as e: | ||
| production.status = "stuck" | ||
| click.echo( | ||
| click.style("●", fg="yellow") | ||
| + f"Production {production.name} not ready to submit" | ||
| click.style("●", fg="red") | ||
| + f" Unable to submit {production.name}" | ||
| ) | ||
| logger.exception(e) | ||
| ledger.update_event(event) | ||
| logger.error( | ||
| f"The pipeline failed to submit the DAG file to the cluster. {e}", | ||
| ) |
There was a problem hiding this comment.
The exception handling flow has a critical bug. When pipe.build_dag() fails (lines 461-469 or 470-476), execution continues to the second try block at line 477, which attempts to call pipe.submit_dag(). This means the code will try to submit a DAG file that failed to build. The second try block should only execute if the first try block succeeds. Consider adding an early return or continue statement after each exception block, or restructure the code so that pipe.submit_dag() is only called when pipe.build_dag() succeeds.
.github/workflows/htcondor-tests.yml
Outdated
| TIMEOUT=1200; ELAPSED=0; INTERVAL=30 | ||
| while [ \$ELAPSED -lt \$TIMEOUT ]; do | ||
| condor_q || true | ||
| # $ASIMOV_CMD asimov monitor --chain |
There was a problem hiding this comment.
The commented-out command suggests it was intended to be removed or may need to be re-enabled later. If this command is not needed for testing, consider removing it entirely rather than leaving it commented. If it needs to be re-enabled as part of future work, add a TODO comment explaining why it's commented out and when it should be re-enabled.
| # $ASIMOV_CMD asimov monitor --chain | |
| # TODO: Re-enable "$ASIMOV_CMD asimov monitor --chain" here if detailed chain monitoring is needed during polling. |
This code prevents normal LVK-like workflows from ever completing, as it always appears to judge them to be of low interest.
It needs to be properly refactored.