Add checks for invalid resource definitions (#243)
* Change default execution mode to direct
The Slurm execution mode does not work in all HPC environments.
- In one case, the sacct command submitted after job completion always
  failed.
- In another case, the HPC admin stated a strong preference for direct
  mode to reduce the load on the Slurm servers.
* Validate that job runtime does not exceed Slurm scheduler walltime (#238)
Add validation during workflow creation that checks resource_requirements
runtime against Slurm scheduler walltime. For jobs with an explicit
scheduler, runtime must not exceed that scheduler's walltime. For jobs
without an explicit scheduler, at least one scheduler must have a
sufficient walltime, since any scheduler can pick up unassigned jobs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
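The rule above can be sketched as follows; the `Scheduler` and `Job` types and field names here are illustrative stand-ins, not torc's actual API:

```rust
// Hypothetical stand-ins for torc's scheduler and job definitions.
struct Scheduler {
    name: &'static str,
    walltime_secs: u64,
}

struct Job {
    runtime_secs: u64,
    scheduler: Option<&'static str>,
}

fn job_fits(job: &Job, schedulers: &[Scheduler]) -> bool {
    match job.scheduler {
        // Explicit scheduler: runtime must not exceed that scheduler's walltime.
        Some(name) => schedulers
            .iter()
            .any(|s| s.name == name && job.runtime_secs <= s.walltime_secs),
        // Unassigned job: any scheduler can pick it up, so at least one
        // must have a sufficient walltime.
        None => schedulers.iter().any(|s| job.runtime_secs <= s.walltime_secs),
    }
}

fn main() {
    let scheds = [
        Scheduler { name: "short", walltime_secs: 4 * 3600 },
        Scheduler { name: "long", walltime_secs: 8 * 3600 },
    ];
    // A 6-hour job pinned to the 4-hour scheduler is rejected...
    assert!(!job_fits(&Job { runtime_secs: 6 * 3600, scheduler: Some("short") }, &scheds));
    // ...but the same job left unassigned passes: the 8-hour scheduler suffices.
    assert!(job_fits(&Job { runtime_secs: 6 * 3600, scheduler: None }, &scheds));
}
```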
* Expand scheduler resource validation to memory and GPUs with interactive prompts
Replaces the hard-error runtime-vs-walltime check with a broader
validate_scheduler_resources that checks runtime, memory, and GPUs
against Slurm scheduler allocations. Resource warnings are now shown
interactively with a y/N prompt in CLI commands (create, run, submit).
Non-interactive contexts (MCP server, piped stdin) hard-fail with a
message suggesting --skip-checks. Scheduler fields that are not set
(mem, gres) are skipped for that dimension. Unassigned jobs must have
at least one scheduler suitable across all dimensions simultaneously.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
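The "suitable across all dimensions simultaneously" rule is worth making concrete: an unassigned job needs one scheduler that satisfies every dimension at once, not merely some scheduler per dimension. A minimal sketch, with hypothetical types rather than torc's real definitions:

```rust
// Hypothetical models of a scheduler allocation and a job's requirements.
struct Sched {
    walltime_secs: u64,
    mem_mb: Option<u64>, // None models an unset `mem` field
    gpus: Option<u32>,   // None models an unset `gres` field
}

struct Req {
    runtime_secs: u64,
    mem_mb: u64,
    gpus: u32,
}

// A dimension the scheduler leaves unset is skipped for that scheduler.
fn fits(s: &Sched, r: &Req) -> bool {
    r.runtime_secs <= s.walltime_secs
        && s.mem_mb.map_or(true, |m| r.mem_mb <= m)
        && s.gpus.map_or(true, |g| r.gpus <= g)
}

// One scheduler must satisfy every dimension at once; a long-walltime
// scheduler plus a separate big-memory scheduler does not count.
fn any_suitable(scheds: &[Sched], r: &Req) -> bool {
    scheds.iter().any(|s| fits(s, r))
}

fn main() {
    let long_small = Sched { walltime_secs: 8 * 3600, mem_mb: Some(4_000), gpus: None };
    let short_big = Sched { walltime_secs: 2 * 3600, mem_mb: Some(64_000), gpus: None };
    let job = Req { runtime_secs: 6 * 3600, mem_mb: 32_000, gpus: 0 };
    // Each dimension is satisfied by *some* scheduler, but no single
    // scheduler satisfies both, so the unassigned job is rejected.
    assert!(!any_suitable(&[long_small, short_big], &job));
}
```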
* Fix Slurm compute nodes created with is_active=NULL causing false orphan detection
The Slurm job runner's create_compute_node() was not setting is_active=true,
unlike the local job runner in run_jobs_cmd.rs. This caused the watcher's
orphan detection to miss active Slurm compute nodes (SQL NULL != 1), falsely
marking running jobs as orphaned with return code -128.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
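The `SQL NULL != 1` remark comes down to three-valued logic: a comparison against NULL is neither true nor false, and a WHERE clause keeps a row only when its predicate is exactly true. Modeled here with `Option<bool>` as an illustration, not torc's actual query code:

```rust
// SQL comparison semantics: NULL (None) propagates through the comparison
// instead of yielding false.
fn sql_eq(col: Option<i64>, lit: i64) -> Option<bool> {
    col.map(|v| v == lit)
}

fn main() {
    // A WHERE clause keeps a row only when its predicate is exactly true.
    let row_matches = |pred: Option<bool>| pred == Some(true);

    // Node written by the local runner: is_active = 1, found as active.
    assert!(row_matches(sql_eq(Some(1), 1)));
    // Node written by the old Slurm runner: is_active = NULL. The watcher's
    // "active node" filter misses it, so its running jobs look orphaned.
    assert!(!row_matches(sql_eq(None, 1)));
}
```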
* Fix direct mode OOM-killing jobs with default resource requirements
In srun mode, jobs with the "default" resource requirement (no explicit
resource_requirements) skip --mem to avoid artificially constraining them.
Direct mode was not doing the same: it enforced the default RR's 1 MiB
memory limit, causing the resource monitor to SIGKILL any job that
allocates real memory. Skip memory limit enforcement for default RRs in
direct mode to match srun behavior.
Also re-applies the job_parallelism test to mode: direct now that this works.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
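The fixed behavior reduces to making the enforced limit optional. A minimal sketch, assuming a monitor that kills on limit overrun; the function names are illustrative, not torc's:

```rust
// Hypothetical: compute the limit the resource monitor should enforce.
fn enforced_mem_limit(is_default_rr: bool, rr_mem_bytes: u64) -> Option<u64> {
    if is_default_rr {
        // Default RR: srun mode skips --mem, so direct mode now skips the
        // limit too instead of enforcing the placeholder 1 MiB.
        None
    } else {
        Some(rr_mem_bytes)
    }
}

fn should_kill(used_bytes: u64, limit: Option<u64>) -> bool {
    limit.map_or(false, |l| used_bytes > l)
}

fn main() {
    // A job allocating 100 MiB under the default RR is no longer SIGKILLed.
    let default_limit = enforced_mem_limit(true, 1 << 20);
    assert!(!should_kill(100 << 20, default_limit));
    // Explicit resource requirements are still enforced.
    assert!(should_kill(100 << 20, enforced_mem_limit(false, 1 << 20)));
}
```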
* Add grace period before orphan detection in torc watch
Require 3 consecutive polls with no valid Slurm allocation before running
orphan cleanup. This prevents a race where a Slurm job is just starting up
and squeue hasn't reflected the new job yet, causing the watcher to
falsely mark running jobs as orphaned with return code -128.
With a 10-second poll interval, this gives ~30 seconds for new allocations
to become visible.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
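The grace period described above amounts to a consecutive-miss counter that any valid sighting resets. A minimal sketch under that assumption; the names are illustrative:

```rust
// Hypothetical gate: orphan cleanup runs only after `threshold` consecutive
// polls see no valid Slurm allocation.
struct OrphanGate {
    misses: u32,
    threshold: u32,
}

impl OrphanGate {
    // Returns true when orphan cleanup should run on this poll.
    fn observe(&mut self, has_valid_allocation: bool) -> bool {
        if has_valid_allocation {
            self.misses = 0; // any sighting resets the streak
            false
        } else {
            self.misses += 1;
            self.misses >= self.threshold
        }
    }
}

fn main() {
    let mut gate = OrphanGate { misses: 0, threshold: 3 };
    // Two empty polls right after submission (squeue lagging): no cleanup.
    assert!(!gate.observe(false));
    assert!(!gate.observe(false));
    // The allocation shows up on the third poll; the streak resets.
    assert!(!gate.observe(true));
    // Only three consecutive misses trigger cleanup (~30 s at 10 s polls).
    assert!(!gate.observe(false));
    assert!(!gate.observe(false));
    assert!(gate.observe(false));
}
```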
* Fix example walltime mismatch and remove bogus srun_termination_signal warning
- slurm_staged_pipeline: increase work_scheduler walltime from 04:00:00
to 08:00:00 to match work_resources runtime PT8H (YAML, JSON5, KDL)
- Remove the false warning comparing srun_termination_signal time to
sigkill_headroom_seconds; these are independent (the signal fires relative
to the step's --time, not the allocation end)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add interactive recovery
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>