
Commit 08f10b4

daniel-thom and claude authored
Add checks for invalid resource definitions (#243)
* Change default execution mode to direct

  The Slurm execution mode does not work in all HPC environments.
  - In one case the sacct command submitted after job completion always failed.
  - In another case the HPC admin stated a strong preference to use direct mode in order to
    reduce the load on the Slurm servers.

* Validate job runtime does not exceed slurm scheduler walltime (#238)

  Add validation during workflow creation that checks resource_requirements runtime against
  slurm scheduler walltime. For jobs with an explicit scheduler, runtime must not exceed that
  scheduler's walltime. For jobs without an explicit scheduler, at least one scheduler must have
  a sufficient walltime, since any scheduler can pick up unassigned jobs.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Expand scheduler resource validation to memory and GPUs with interactive prompts

  Replaces the hard-error runtime-vs-walltime check with a broader validate_scheduler_resources
  that checks runtime, memory, and GPUs against slurm scheduler allocations. Resource warnings
  are now shown interactively with a y/N prompt in CLI commands (create, run, submit).
  Non-interactive contexts (MCP server, piped stdin) hard-fail with a message suggesting
  --skip-checks. Scheduler fields that are not set (mem, gres) are skipped for that dimension.
  Unassigned jobs must have at least one scheduler suitable across all dimensions simultaneously.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix Slurm compute nodes created with is_active=NULL causing false orphan detection

  The Slurm job runner's create_compute_node() was not setting is_active=true, unlike the local
  job runner in run_jobs_cmd.rs. This caused the watcher's orphan detection to miss active Slurm
  compute nodes (SQL NULL != 1), falsely marking running jobs as orphaned with return code -128.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix direct mode OOM-killing jobs with default resource requirements

  In srun mode, jobs with the "default" resource requirement (no explicit resource_requirements)
  skip --mem to avoid artificially constraining them. Direct mode was not doing the same: it
  enforced the default RR's 1 MiB memory limit, causing the resource monitor to SIGKILL any job
  that allocates real memory. Skip memory limit enforcement for default RRs in direct mode to
  match srun behavior. Also re-applies the job_parallelism test to mode: direct now that this
  works.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add grace period before orphan detection in torc watch

  Require 3 consecutive polls with no valid Slurm allocation before running orphan cleanup. This
  prevents a race where a Slurm job is just starting up and squeue hasn't reflected the new job
  yet, causing the watcher to falsely mark running jobs as orphaned with return code -128. With
  a 10-second poll interval, this gives ~30 seconds for new allocations to become visible.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix example walltime mismatch and remove bogus srun_termination_signal warning

  - slurm_staged_pipeline: increase work_scheduler walltime from 04:00:00 to 08:00:00 to match
    work_resources runtime PT8H (YAML, JSON5, KDL)
  - Remove false warning comparing srun_termination_signal time to sigkill_headroom_seconds;
    these are independent (the signal fires relative to the step's --time, not the allocation
    end)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add interactive recovery

  Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
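The grace-period change described in the commit message amounts to a small consecutive-miss counter. A minimal sketch, assuming a detector polled on the watcher's interval; the type and method names are illustrative, not torc's actual code:

```rust
/// Polls with no valid Slurm allocation required before orphan cleanup runs.
/// At a 10-second poll interval this gives ~30 seconds for squeue to catch up.
const ORPHAN_GRACE_POLLS: u32 = 3;

struct OrphanDetector {
    misses: u32,
}

impl OrphanDetector {
    fn new() -> Self {
        Self { misses: 0 }
    }

    /// Returns true only after ORPHAN_GRACE_POLLS consecutive polls saw no
    /// valid allocation; any sighting of the allocation resets the counter.
    fn poll(&mut self, allocation_visible: bool) -> bool {
        if allocation_visible {
            self.misses = 0;
            return false;
        }
        self.misses += 1;
        self.misses >= ORPHAN_GRACE_POLLS
    }
}

fn main() {
    let mut d = OrphanDetector::new();
    // Job just submitted; squeue lags for two polls, then shows it.
    assert!(!d.poll(false));
    assert!(!d.poll(false));
    assert!(!d.poll(true)); // allocation appears, counter resets
    // Allocation genuinely gone: three straight misses trigger cleanup.
    assert!(!d.poll(false));
    assert!(!d.poll(false));
    assert!(d.poll(false));
    println!("orphan grace-period sketch ok");
}
```

The reset on any sighting is the important part: a single slow `squeue` response cannot accumulate toward a false orphan verdict.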
1 parent ca06651 commit 08f10b4

27 files changed: +1635 −399 lines

docs/src/core/reference/cli-cheatsheet.md (2 additions, 1 deletion)

@@ -58,7 +58,8 @@
 | `torc status <id>` | Workflow status and job summary |
 | `torc workflows check-resources <id>` | Check memory/CPU/time usage |
 | `torc results list <id> --include-logs` | Job results with log paths |
-| `torc recover <id>` | One-shot recovery (diagnose + fix + resubmit) |
+| `torc recover <id>` | Interactive recovery wizard (default) |
+| `torc recover <id> --no-prompts` | Automatic recovery (no prompts, for scripting) |
 | `torc watch <id> --recover --auto-schedule` | Full production recovery mode |
 | `torc workflows sync-status <id>` | Fix orphaned jobs (stuck in "running") |
 | `torc workflows correct-resources <id>` | Upscale violated + downsize over-allocated RRs |

docs/src/core/reference/cli.md (10 additions, 1 deletion)

@@ -279,6 +279,11 @@ Diagnoses job failures (OOM, timeout), adjusts resource requirements, and resubm
 a workflow has completed with failures. For continuous monitoring, use `torc watch --recover`
 instead.
 
+By default, runs an interactive wizard that displays failed jobs, lets you choose per-category
+actions (retry with adjusted resources or skip), select a Slurm scheduler, and confirm before
+executing. Use `--no-prompts` to skip the wizard and apply heuristics automatically. When stdin is
+not a terminal (e.g., piped or scripted), non-interactive mode is used automatically.
+
 **Usage:** `torc recover [OPTIONS] <WORKFLOW_ID>`
 
 ### Arguments
@@ -294,6 +299,7 @@ instead.
 - `--retry-unknown` — Also retry jobs with unknown failure causes
 - `--recovery-hook <RECOVERY_HOOK>` — Custom recovery script for unknown failures
 - `--dry-run` — Show what would be done without making any changes
+- `--no-prompts` — Skip interactive wizard and apply heuristics automatically
 
 ### When to Use
 
@@ -312,12 +318,15 @@ Use `torc watch --recover` instead for:
 ### Examples
 
 ```bash
-# Basic recovery
+# Interactive recovery (default)
 torc recover 123
 
 # Dry run to preview changes without modifying anything
 torc recover 123 --dry-run
 
+# Skip interactive prompts (for scripting)
+torc recover 123 --no-prompts
+
 # Custom resource multipliers
 torc recover 123 --memory-multiplier 2.0 --runtime-multiplier 1.5
 
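The TTY-sensitive behavior documented above (wizard in a terminal, automatic mode when piped) can be sketched with Rust's standard `IsTerminal` trait. The function and flag names here are assumptions for illustration, not torc's actual code:

```rust
use std::io::{stdin, IsTerminal};

/// Run the wizard only when the user did not pass --no-prompts AND stdin
/// is attached to a terminal; under a pipe, fall back to automatic mode.
fn interactive_mode(no_prompts_flag: bool) -> bool {
    !no_prompts_flag && stdin().is_terminal()
}

fn main() {
    // --no-prompts always forces non-interactive mode, TTY or not.
    assert!(!interactive_mode(true));
    // Without the flag, the answer depends on how stdin is attached:
    // true in a terminal session, false when piped or scripted.
    println!("interactive: {}", interactive_mode(false));
}
```

`IsTerminal` landed in Rust 1.70; the same gate is commonly written with `isatty(3)` in C tooling.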

docs/src/specialized/fault-tolerance/automatic-recovery.md (102 additions, 42 deletions)

@@ -92,14 +92,92 @@ This data is analyzed to determine failure causes:
 For one-shot recovery when a workflow has failed:
 
 ```bash
-# Preview what would be done (recommended first step)
+# Interactive recovery (default when running in a terminal)
+torc recover 42
+
+# Preview what would be done without making changes
 torc recover 42 --dry-run
 
-# Execute the recovery
-torc recover 42
+# Skip interactive prompts (for scripting)
+torc recover 42 --no-prompts
+```
+
+### Interactive Recovery Wizard
+
+By default, `torc recover` runs an interactive wizard that guides you through the recovery process
+step by step:
+
+1. **Diagnose failures** — Categorizes failed jobs into OOM, timeout, and unknown failures and
+   displays a summary table
+2. **Per-category decisions** — For each failure category, choose to retry with adjusted resources,
+   customize the multiplier, or skip
+3. **Scheduler selection** — Choose to auto-generate new Slurm schedulers or reuse an existing one
+   (with optional walltime override and allocation count)
+4. **Review and confirm** — Shows the full recovery plan and asks for confirmation before executing
+
+The wizard runs automatically when stdin is a terminal. When piped or scripted (non-TTY), the
+command falls back to automatic mode. Use `--no-prompts` to explicitly skip the wizard.
+
+#### Example Session
+
 ```
+=== Recovery Wizard ===
+
+Diagnosing failures for workflow 42...
+
+OOM Failures (3 jobs):
+ID   Name            RC   Memory  Peak Memory  Reason
+---  ----            ---  ------  -----------  ------
+107  train_model_7   137  8g      10.2 GB      sigkill_137
+112  train_model_12  137  8g      9.8 GB       memory_exceeded
+123  train_model_23  137  8g      11.1 GB      sigkill_137
+
+Timeout Failures (1 job):
+ID   Name         RC   Runtime  Exec (min)  Reason
+---  ----         ---  -------  ----------  ------
+145  postprocess  152  PT30M    29.8        sigxcpu_152
+
+OOM failures (3 jobs): [R]etry with 1.5x memory / [A]djust multiplier / [S]kip (default: R): r
+Timeout failures (1 job): [R]etry with 1.4x runtime / [A]djust multiplier / [S]kip (default: R): a
+Enter runtime multiplier [default: 1.4]: 2.0
+
+--- Recovery Plan ---
+
+Memory: 8g -> 12g (1.5x) for 3 jobs: train_model_7, train_model_12, train_model_23
+Runtime: PT30M -> PT1H (2x) for 1 job: postprocess
+
+Total: 4 jobs to retry
+
+--- Slurm Scheduler ---
 
-This command:
+Existing schedulers for this workflow:
+
+ID  Name           Account    Partition  Walltime  Nodes
+--- ----           -------    ---------  --------  -----
+5   gpu_scheduler  myproject  gpu        04:00:00  1
+
+Scheduler: [A]uto-generate new / [E]xisting (enter ID) (default: A): e
+Enter scheduler ID: 5
+Walltime [default: 04:00:00] (press Enter to keep): 06:00:00
+Creating new scheduler with walltime 06:00:00...
+Created scheduler 'gpu_scheduler_recovery' (ID 8) with walltime 06:00:00
+Number of allocations [default: 1]: 2
+
+Scheduler: gpu_scheduler_recovery (ID 8), 2 allocation(s)
+
+Proceed with recovery? (y/N): y
+```
+
+### Non-Interactive Mode
+
+Use `--no-prompts` to skip the wizard and apply recovery heuristics automatically. This is useful
+for scripting or when you want the default behavior without interaction:
+
+```bash
+torc recover 42 --no-prompts
+```
+
+In non-interactive mode, the command:
 
 1. Detects and cleans up orphaned jobs from terminated Slurm allocations
 2. Checks that the workflow is complete and no workers are active
@@ -109,9 +187,8 @@
 6. Resets failed jobs and regenerates Slurm schedulers
 7. Submits new allocations
 
-> **Note:** Step 1 (orphan cleanup) handles the case where Slurm terminated an allocation
-> unexpectedly, leaving jobs stuck in "running" status. This is done automatically before checking
-> preconditions.
+> **Note:** Orphan cleanup handles the case where Slurm terminated an allocation unexpectedly,
+> leaving jobs stuck in "running" status. This is done automatically before checking preconditions.
 
 ### Options
 
@@ -121,25 +198,8 @@ torc recover <workflow_id> \
   --runtime-multiplier 1.4 \       # Runtime increase factor for timeout (default: 1.4)
   --retry-unknown \                # Also retry jobs with unknown failure causes
   --recovery-hook "bash fix.sh" \  # Custom script for unknown failures
-  --dry-run                        # Preview without making changes
-```
-
-### Example Output
-
-```
-Diagnosing failures...
-Applying recovery heuristics...
-Job 107 (train_model): OOM detected, increasing memory 8g -> 12g
-Applied fixes: 1 OOM, 0 timeout
-Resetting 1 job(s) for retry...
-Reset 1 job(s)
-Reinitializing workflow...
-Regenerating Slurm schedulers...
-Submitted Slurm allocation with 1 job
-
-Recovery complete for workflow 42
-- 1 job(s) had memory increased
-Reset 1 job(s). Slurm schedulers regenerated and submitted.
+  --dry-run \                      # Preview without making changes
+  --no-prompts                     # Skip interactive wizard
 ```
 
 ## The `torc watch --recover` Command
@@ -263,13 +323,13 @@ With default settings:
 
 ## Choosing the Right Command
 
-| Use Case                          | Command                  |
-| --------------------------------- | ------------------------ |
-| One-shot recovery after failure   | `torc recover`           |
-| Continuous monitoring             | `torc watch -r`          |
-| Preview what recovery would do    | `torc recover --dry-run` |
-| Production long-running workflows | `torc watch -r`          |
-| Manual investigation, then retry  | `torc recover`           |
+| Use Case                           | Command                     |
+| ---------------------------------- | --------------------------- |
+| Interactive recovery after failure | `torc recover`              |
+| Automatic recovery (scripting)     | `torc recover --no-prompts` |
+| Continuous monitoring              | `torc watch -r`             |
+| Preview what recovery would do     | `torc recover --dry-run`    |
+| Production long-running workflows  | `torc watch -r`             |
 
 ## Complete Workflow Example
 
@@ -530,16 +590,16 @@ If jobs are requesting more resources than partitions allow:
 2. Use smaller multipliers
 3. Consider splitting jobs into smaller pieces
 
-## Comparison: Automatic vs Manual Recovery
+## Comparison: Interactive vs Automatic vs AI-Assisted Recovery
 
-| Feature                | Automatic            | Manual/AI-Assisted      |
-| ---------------------- | -------------------- | ----------------------- |
-| Human involvement      | None                 | Interactive             |
-| Speed                  | Fast                 | Depends on human        |
-| Handles OOM/timeout    | Yes                  | Yes                     |
-| Handles unknown errors | Retry only           | Full investigation      |
-| Cost optimization      | Basic                | Can be sophisticated    |
-| Use case               | Production workflows | Debugging, optimization |
+| Feature                | Interactive (`torc recover`) | Automatic (`--no-prompts`) | AI-Assisted   |
+| ---------------------- | ---------------------------- | -------------------------- | ------------- |
+| Human involvement      | Guided wizard                | None                       | AI + human    |
+| Speed                  | Minutes                      | Fast                       | Varies        |
+| Handles OOM/timeout    | Yes                          | Yes                        | Yes           |
+| Handles unknown errors | User chooses                 | Retry only                 | Investigation |
+| Scheduler control      | Choose or auto-generate      | Auto-generate              | Manual        |
+| Use case               | Most recovery scenarios      | Scripting, `torc watch`    | Complex bugs  |
 
 ## Implementation Details
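The `8g -> 12g (1.5x)` plan lines in the wizard session above come from applying a multiplier to the failed jobs' memory requests. A minimal sketch of that arithmetic, assuming the `<n>g` memory format used in the examples; this is illustrative, not torc's code:

```rust
/// Scale a memory string like "8g" by a multiplier, rounding up to whole
/// gigabytes so the retried job never gets less than the scaled amount.
fn scale_mem_gb(mem: &str, multiplier: f64) -> Option<String> {
    // Accept only the "<n>g" form used in the docs above.
    let gb: f64 = mem.strip_suffix('g')?.parse().ok()?;
    Some(format!("{}g", (gb * multiplier).ceil() as u64))
}

fn main() {
    // Matches the wizard's "Memory: 8g -> 12g (1.5x)" plan line.
    assert_eq!(scale_mem_gb("8g", 1.5).as_deref(), Some("12g"));
    // Fractional results round up: 8 * 1.4 = 11.2 -> 12g.
    assert_eq!(scale_mem_gb("8g", 1.4).as_deref(), Some("12g"));
    // Unrecognized formats yield None rather than a bogus value.
    assert_eq!(scale_mem_gb("bad", 1.5), None);
    println!("multiplier sketch ok");
}
```

Rounding up rather than to nearest is the safer default for OOM recovery: a truncated value could leave the retry just under the observed peak.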

examples/json/slurm_staged_pipeline.json5 (1 addition, 1 deletion)

@@ -336,7 +336,7 @@
 {
   "name": "work_scheduler",
   "account": "my_account",
-  "walltime": "04:00:00",
+  "walltime": "08:00:00",
   "nodes": 1
 },
 {

examples/kdl/slurm_staged_pipeline.kdl (1 addition, 1 deletion)

@@ -14,7 +14,7 @@ slurm_scheduler "setup_scheduler" {
 
 slurm_scheduler "work_scheduler" {
   account "my_account"
-  walltime "04:00:00"
+  walltime "08:00:00"
   nodes 1
 }
 
examples/yaml/slurm_staged_pipeline.yaml (1 addition, 1 deletion)

@@ -16,7 +16,7 @@ slurm_schedulers:
   nodes: 1
 - name: "work_scheduler"
   account: "my_account"
-  walltime: "04:00:00"
+  walltime: "08:00:00"
   nodes: 1
 - name: "postprocess_scheduler"
   account: "my_account"
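The walltime bump in these example files exists because the new validation compares each job's runtime (ISO-8601, e.g. `PT8H`) against the scheduler's `HH:MM:SS` walltime. A minimal sketch of such a check, with hand-rolled parsing that covers only the simple forms shown here; it is not torc's actual implementation:

```rust
/// Parse "HH:MM:SS" into seconds.
fn walltime_secs(w: &str) -> Option<u64> {
    let parts: Vec<u64> = w.split(':').map(|p| p.parse().ok()).collect::<Option<_>>()?;
    match parts.as_slice() {
        [h, m, s] => Some(h * 3600 + m * 60 + s),
        _ => None,
    }
}

/// Parse a simple ISO-8601 duration like "PT8H" or "PT30M" into seconds.
fn runtime_secs(r: &str) -> Option<u64> {
    let body = r.strip_prefix("PT")?;
    if let Some(h) = body.strip_suffix('H') {
        h.parse::<u64>().ok().map(|h| h * 3600)
    } else if let Some(m) = body.strip_suffix('M') {
        m.parse::<u64>().ok().map(|m| m * 60)
    } else {
        None
    }
}

/// A job fits a scheduler when its runtime does not exceed the walltime;
/// unparseable inputs conservatively fail the check.
fn fits(runtime: &str, walltime: &str) -> bool {
    match (runtime_secs(runtime), walltime_secs(walltime)) {
        (Some(r), Some(w)) => r <= w,
        _ => false,
    }
}

fn main() {
    // PT8H did not fit the old 04:00:00 walltime, but fits 08:00:00.
    assert!(!fits("PT8H", "04:00:00"));
    assert!(fits("PT8H", "08:00:00"));
    println!("walltime check sketch ok");
}
```

A production version would use a real duration parser (combined `PT1H30M` forms, Slurm's `days-hours` walltime syntax), but the comparison itself reduces to the same seconds arithmetic.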

slurm-tests/workflows/cancel_workflow.yaml (1 addition, 1 deletion)

@@ -8,7 +8,7 @@ name: cancel_workflow
 description: Workflow cancellation test — cancel while jobs are running
 project: slurm-tests
 execution_config:
-  mode: slurm
+  mode: direct
 
 resource_requirements:
 - name: sleep_resources

slurm-tests/workflows/failure_recovery.yaml (1 addition, 1 deletion)

@@ -10,7 +10,7 @@ description: Test workflow for Slurm job retry with failure handlers
 project: slurm-tests
 metadata: '{"test_type": "failure_recovery", "stages": 3}'
 execution_config:
-  mode: slurm
+  mode: direct
 
 failure_handlers:
 - name: retry_on_exit_42

slurm-tests/workflows/job_parallelism.yaml (4 additions, 4 deletions)

@@ -1,8 +1,8 @@
 # Test: Job-Based Parallelism
 #
-# 1-node allocation with NO resource_requirements on jobs.
-# Jobs get the auto-assigned "default" RR, so srun skips resource limit flags
-# and each job can use the full allocation's resources.
+# 1-node allocation with NO resource_requirements on jobs (direct mode).
+# Jobs get the auto-assigned "default" RR and run directly (no srun wrapper),
+# so each job can use the full allocation's resources.
 #
 # Concurrency is controlled by --max-parallel-jobs (passed via torc submit).
 # The test submits with --max-parallel-jobs 2, so 2 of the 4 jobs run at a time.
@@ -15,7 +15,7 @@ name: job_parallelism
 description: Job-based parallelism — no resource requirements, controlled by --max-parallel-jobs
 project: slurm-tests
 execution_config:
-  mode: slurm
+  mode: direct
 
 resource_monitor:
   enabled: true

slurm-tests/workflows/resource_monitoring.yaml (1 addition, 1 deletion)

@@ -8,7 +8,7 @@ name: resource_monitoring
 description: Resource monitoring validation — CPU and memory usage captured
 project: slurm-tests
 execution_config:
-  mode: slurm
+  mode: direct
 
 resource_monitor:
   enabled: true
