Commit b81eb44

Refactor recover command
1 parent e966d98 commit b81eb44

6 files changed, +140 -70 lines changed

docs/src/core/reference/cli-cheatsheet.md

Lines changed: 2 additions & 1 deletion
@@ -58,7 +58,8 @@
 | `torc status <id>` | Workflow status and job summary |
 | `torc workflows check-resources <id>` | Check memory/CPU/time usage |
 | `torc results list <id> --include-logs` | Job results with log paths |
-| `torc recover <id>` | One-shot recovery (diagnose + fix + resubmit) |
+| `torc recover <id>` | Interactive recovery wizard (default) |
+| `torc recover <id> --no-prompts` | Automatic recovery (no prompts, for scripting) |
 | `torc watch <id> --recover --auto-schedule` | Full production recovery mode |
 | `torc workflows sync-status <id>` | Fix orphaned jobs (stuck in "running") |
 | `torc workflows correct-resources <id>` | Upscale violated + downsize over-allocated RRs |

docs/src/core/reference/cli.md

Lines changed: 10 additions & 1 deletion
@@ -279,6 +279,11 @@ Diagnoses job failures (OOM, timeout), adjusts resource requirements, and resubm
 a workflow has completed with failures. For continuous monitoring, use `torc watch --recover`
 instead.

+By default, runs an interactive wizard that displays failed jobs, lets you choose per-category
+actions (retry with adjusted resources or skip), select a Slurm scheduler, and confirm before
+executing. Use `--no-prompts` to skip the wizard and apply heuristics automatically. When stdin is
+not a terminal (e.g., piped or scripted), non-interactive mode is used automatically.
+
 **Usage:** `torc recover [OPTIONS] <WORKFLOW_ID>`

 ### Arguments
@@ -294,6 +299,7 @@ instead.
 - `--retry-unknown` — Also retry jobs with unknown failure causes
 - `--recovery-hook <RECOVERY_HOOK>` — Custom recovery script for unknown failures
 - `--dry-run` — Show what would be done without making any changes
+- `--no-prompts` — Skip interactive wizard and apply heuristics automatically

 ### When to Use

@@ -312,12 +318,15 @@ Use `torc watch --recover` instead for:
 ### Examples

 ```bash
-# Basic recovery
+# Interactive recovery (default)
 torc recover 123

 # Dry run to preview changes without modifying anything
 torc recover 123 --dry-run

+# Skip interactive prompts (for scripting)
+torc recover 123 --no-prompts
+
 # Custom resource multipliers
 torc recover 123 --memory-multiplier 2.0 --runtime-multiplier 1.5

docs/src/specialized/fault-tolerance/automatic-recovery.md

Lines changed: 102 additions & 42 deletions
@@ -92,14 +92,92 @@ This data is analyzed to determine failure causes:
 For one-shot recovery when a workflow has failed:

 ```bash
-# Preview what would be done (recommended first step)
+# Interactive recovery (default when running in a terminal)
+torc recover 42
+
+# Preview what would be done without making changes
 torc recover 42 --dry-run

-# Execute the recovery
-torc recover 42
+# Skip interactive prompts (for scripting)
+torc recover 42 --no-prompts
+```
+
+### Interactive Recovery Wizard
+
+By default, `torc recover` runs an interactive wizard that guides you through the recovery process
+step by step:
+
+1. **Diagnose failures** — Categorizes failed jobs into OOM, timeout, and unknown failures and
+displays a summary table
+2. **Per-category decisions** — For each failure category, choose to retry with adjusted resources,
+customize the multiplier, or skip
+3. **Scheduler selection** — Choose to auto-generate new Slurm schedulers or reuse an existing one
+(with optional walltime override and allocation count)
+4. **Review and confirm** — Shows the full recovery plan and asks for confirmation before executing
+
+The wizard runs automatically when stdin is a terminal. When piped or scripted (non-TTY), the
+command falls back to automatic mode. Use `--no-prompts` to explicitly skip the wizard.
+
+#### Example Session
+
 ```
+=== Recovery Wizard ===
+
+Diagnosing failures for workflow 42...
+
+OOM Failures (3 jobs):
+ID   Name            RC   Memory  Peak Memory  Reason
+---  ----            ---  ------  -----------  ------
+107  train_model_7   137  8g      10.2 GB      sigkill_137
+112  train_model_12  137  8g      9.8 GB       memory_exceeded
+123  train_model_23  137  8g      11.1 GB      sigkill_137
+
+Timeout Failures (1 job):
+ID   Name         RC   Runtime  Exec (min)  Reason
+---  ----         ---  -------  ----------  ------
+145  postprocess  152  PT30M    29.8        sigxcpu_152
+
+OOM failures (3 jobs): [R]etry with 1.5x memory / [A]djust multiplier / [S]kip (default: R): r
+Timeout failures (1 job): [R]etry with 1.4x runtime / [A]djust multiplier / [S]kip (default: R): a
+Enter runtime multiplier [default: 1.4]: 2.0
+
+--- Recovery Plan ---
+
+Memory: 8g -> 12g (1.5x) for 3 jobs: train_model_7, train_model_12, train_model_23
+Runtime: PT30M -> PT1H (2x) for 1 job: postprocess
+
+Total: 4 jobs to retry
+
+--- Slurm Scheduler ---

-This command:
+Existing schedulers for this workflow:
+
+ID   Name           Account    Partition  Walltime  Nodes
+---  ----           -------    ---------  --------  -----
+5    gpu_scheduler  myproject  gpu        04:00:00  1
+
+Scheduler: [A]uto-generate new / [E]xisting (enter ID) (default: A): e
+Enter scheduler ID: 5
+Walltime [default: 04:00:00] (press Enter to keep): 06:00:00
+Creating new scheduler with walltime 06:00:00...
+Created scheduler 'gpu_scheduler_recovery' (ID 8) with walltime 06:00:00
+Number of allocations [default: 1]: 2
+
+Scheduler: gpu_scheduler_recovery (ID 8), 2 allocation(s)
+
+Proceed with recovery? (y/N): y
+```
+
+### Non-Interactive Mode
+
+Use `--no-prompts` to skip the wizard and apply recovery heuristics automatically. This is useful
+for scripting or when you want the default behavior without interaction:
+
+```bash
+torc recover 42 --no-prompts
+```
+
+In non-interactive mode, the command:

 1. Detects and cleans up orphaned jobs from terminated Slurm allocations
 2. Checks that the workflow is complete and no workers are active
@@ -109,9 +187,8 @@ This command:
 6. Resets failed jobs and regenerates Slurm schedulers
 7. Submits new allocations

-> **Note:** Step 1 (orphan cleanup) handles the case where Slurm terminated an allocation
-> unexpectedly, leaving jobs stuck in "running" status. This is done automatically before checking
-> preconditions.
+> **Note:** Orphan cleanup handles the case where Slurm terminated an allocation unexpectedly,
+> leaving jobs stuck in "running" status. This is done automatically before checking preconditions.

 ### Options

@@ -121,25 +198,8 @@ torc recover <workflow_id> \
   --runtime-multiplier 1.4 \ # Runtime increase factor for timeout (default: 1.4)
   --retry-unknown \ # Also retry jobs with unknown failure causes
   --recovery-hook "bash fix.sh" \ # Custom script for unknown failures
-  --dry-run # Preview without making changes
-```
-
-### Example Output
-
-```
-Diagnosing failures...
-Applying recovery heuristics...
-Job 107 (train_model): OOM detected, increasing memory 8g -> 12g
-Applied fixes: 1 OOM, 0 timeout
-Resetting 1 job(s) for retry...
-Reset 1 job(s)
-Reinitializing workflow...
-Regenerating Slurm schedulers...
-Submitted Slurm allocation with 1 job
-
-Recovery complete for workflow 42
-- 1 job(s) had memory increased
-Reset 1 job(s). Slurm schedulers regenerated and submitted.
+  --dry-run \ # Preview without making changes
+  --no-prompts # Skip interactive wizard
 ```

 ## The `torc watch --recover` Command
@@ -263,13 +323,13 @@ With default settings:

 ## Choosing the Right Command

-| Use Case | Command |
-| --------------------------------- | ------------------------ |
-| One-shot recovery after failure | `torc recover` |
-| Continuous monitoring | `torc watch -r` |
-| Preview what recovery would do | `torc recover --dry-run` |
-| Production long-running workflows | `torc watch -r` |
-| Manual investigation, then retry | `torc recover` |
+| Use Case | Command |
+| ---------------------------------- | --------------------------- |
+| Interactive recovery after failure | `torc recover` |
+| Automatic recovery (scripting) | `torc recover --no-prompts` |
+| Continuous monitoring | `torc watch -r` |
+| Preview what recovery would do | `torc recover --dry-run` |
+| Production long-running workflows | `torc watch -r` |

 ## Complete Workflow Example

@@ -530,16 +590,16 @@ If jobs are requesting more resources than partitions allow:
 2. Use smaller multipliers
 3. Consider splitting jobs into smaller pieces

-## Comparison: Automatic vs Manual Recovery
+## Comparison: Interactive vs Automatic vs AI-Assisted Recovery

-| Feature | Automatic | Manual/AI-Assisted |
-| ---------------------- | -------------------- | ----------------------- |
-| Human involvement | None | Interactive |
-| Speed | Fast | Depends on human |
-| Handles OOM/timeout | Yes | Yes |
-| Handles unknown errors | Retry only | Full investigation |
-| Cost optimization | Basic | Can be sophisticated |
-| Use case | Production workflows | Debugging, optimization |
+| Feature | Interactive (`torc recover`) | Automatic (`--no-prompts`) | AI-Assisted |
+| ---------------------- | ---------------------------- | -------------------------- | ------------- |
+| Human involvement | Guided wizard | None | AI + human |
+| Speed | Minutes | Fast | Varies |
+| Handles OOM/timeout | Yes | Yes | Yes |
+| Handles unknown errors | User chooses | Retry only | Investigation |
+| Scheduler control | Choose or auto-generate | Auto-generate | Manual |
+| Use case | Most recovery scenarios | Scripting, `torc watch` | Complex bugs |

 ## Implementation Details
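
Reviewer note: the recovery plan in the documented example session (8g -> 12g at 1.5x, PT30M -> PT1H at 2x) is plain multiplier arithmetic. The sketch below reproduces those numbers. The function names, the whole-gigabyte and whole-minute units, and the round-up behavior are assumptions made for illustration; they are not taken from the torc source.

```rust
/// Scale a memory request given in whole gigabytes.
/// Assumption: fractional results round up to the next whole GB.
fn scale_memory_gb(current_gb: u64, multiplier: f64) -> u64 {
    (current_gb as f64 * multiplier).ceil() as u64
}

/// Scale a runtime limit given in whole minutes.
/// Assumption: fractional results round up to the next minute.
fn scale_runtime_minutes(current_min: u64, multiplier: f64) -> u64 {
    (current_min as f64 * multiplier).ceil() as u64
}

/// Format a minute count as an ISO 8601 duration (PT30M, PT1H, PT1H30M).
fn format_iso8601_minutes(total_min: u64) -> String {
    let (hours, minutes) = (total_min / 60, total_min % 60);
    match (hours, minutes) {
        (0, m) => format!("PT{m}M"),
        (h, 0) => format!("PT{h}H"),
        (h, m) => format!("PT{h}H{m}M"),
    }
}
```

With these helpers, `scale_memory_gb(8, 1.5)` gives 12 and `format_iso8601_minutes(scale_runtime_minutes(30, 2.0))` gives `PT1H`, matching the plan shown in the session.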

src/cli.rs

Lines changed: 17 additions & 17 deletions
@@ -488,33 +488,34 @@ SEE ALSO:
 /// Diagnoses job failures (OOM, timeout), adjusts resource requirements,
 /// and resubmits jobs. Use after a workflow has completed with failures.
 ///
-/// This command:
+/// By default, runs an interactive wizard that walks you through:
 ///
-/// 1. Checks preconditions (workflow complete, no active workers)
+/// 1. Displaying failed jobs grouped by failure type (OOM, timeout, unknown)
 ///
-/// 2. Diagnoses failures using resource utilization data
+/// 2. Choosing per-category actions: retry with adjusted resources or skip
 ///
-/// 3. Applies recovery heuristics (increase memory/runtime)
+/// 3. Selecting Slurm scheduler (auto-generate or reuse existing)
 ///
-/// 4. Runs optional recovery hook for custom logic
+/// 4. Confirming and executing recovery
 ///
-/// 5. Resets failed jobs and regenerates Slurm schedulers
-///
-/// 6. Submits new allocations
+/// Use --no-prompts to skip the wizard and apply heuristics automatically.
 ///
 /// For continuous monitoring with automatic recovery, use `torc watch --recover`.
 #[command(
 hide = true,
 after_long_help = "\
 EXAMPLES:

-# Basic recovery
+# Interactive recovery (default)
 torc recover 123

 # Dry run to preview changes without modifying anything
 torc recover 123 --dry-run

-# Custom resource multipliers
+# Skip interactive prompts (for scripting)
+torc recover 123 --no-prompts
+
+# Custom resource multipliers (with or without prompts)
 torc recover 123 --memory-multiplier 2.0 --runtime-multiplier 1.5

 # Also retry unknown failures (not just OOM/timeout)
@@ -581,15 +582,14 @@ SEE ALSO:
 #[arg(long)]
 dry_run: bool,

-/// [EXPERIMENTAL] Enable interactive recovery wizard
+/// Skip interactive prompts and apply recovery automatically
 ///
-/// Walks you through a guided recovery process:
-/// 1. Display failed jobs with diagnosed failure reasons
-/// 2. For each failure category, choose: retry as-is / adjust resources / skip
-/// 3. Confirm resource adjustments (memory, runtime multipliers)
-/// 4. Confirm and execute recovery
+/// By default, `torc recover` runs an interactive wizard that lets you
+/// review failures, choose which jobs to retry, adjust resource multipliers,
+/// and select Slurm schedulers. Use --no-prompts to skip the wizard and
+/// apply recovery heuristics automatically (useful for scripting).
 #[arg(long)]
-interactive: bool,
+no_prompts: bool,

 /// [EXPERIMENTAL] Enable AI-assisted recovery for pending_failed jobs
 ///
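
Reviewer note: the help text above describes failed jobs being grouped by failure type (OOM, timeout, unknown). A minimal sketch of that grouping follows, assuming classification by return code alone: 137 (128 + SIGKILL) as OOM and 152 (128 + SIGXCPU on Linux) as timeout, the two codes that appear in the documented example session. The real command diagnoses failures from resource utilization data, so `FailureKind`, `classify_return_code`, and `group_failed_jobs` are hypothetical names for illustration only.

```rust
use std::collections::BTreeMap;

/// Hypothetical failure categories mirroring the wizard's three groups.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum FailureKind {
    Oom,
    Timeout,
    Unknown,
}

/// Assumption: classify purely by return code. The actual diagnosis also
/// looks at resource utilization, which this sketch ignores.
fn classify_return_code(return_code: i32) -> FailureKind {
    match return_code {
        137 => FailureKind::Oom,     // 128 + SIGKILL (9)
        152 => FailureKind::Timeout, // 128 + SIGXCPU (24 on Linux)
        _ => FailureKind::Unknown,
    }
}

/// Group (job_id, return_code) pairs by failure kind, preserving input order
/// within each group.
fn group_failed_jobs(jobs: &[(u64, i32)]) -> BTreeMap<FailureKind, Vec<u64>> {
    let mut groups: BTreeMap<FailureKind, Vec<u64>> = BTreeMap::new();
    for &(job_id, return_code) in jobs {
        groups
            .entry(classify_return_code(return_code))
            .or_default()
            .push(job_id);
    }
    groups
}
```

Feeding in the four jobs from the example session, `[(107, 137), (112, 137), (123, 137), (145, 152)]`, puts three job IDs under `Oom` and one under `Timeout`.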

src/client/commands/recover.rs

Lines changed: 4 additions & 7 deletions
@@ -45,7 +45,7 @@ pub struct RecoverArgs {
 pub retry_unknown: bool,
 pub recovery_hook: Option<String>,
 pub dry_run: bool,
-/// Enable interactive recovery wizard
+/// Run the interactive recovery wizard (default when stdin is a TTY)
 pub interactive: bool,
 /// [EXPERIMENTAL] Enable AI-assisted recovery for pending_failed jobs
 pub ai_recovery: bool,
@@ -1044,17 +1044,14 @@ fn prompt_multiplier(label: &str, default: f64) -> Result<f64, String> {
 }
 }

-/// Interactive recovery wizard — guides the user through failure diagnosis,
-/// resource adjustment selection, and confirmation before executing recovery.
+/// Interactive recovery wizard (default when stdin is a TTY). Guides the user
+/// through failure diagnosis, resource adjustment, and scheduler selection.
 fn recover_workflow_interactive(
 config: &Configuration,
 args: &RecoverArgs,
 ) -> Result<RecoveryResult, String> {
 // --- Diagnose failures ---------------------------------------------------
-eprintln!("\n=== Recovery Wizard [EXPERIMENTAL] ===\n");
-eprintln!(
-"Note: Interactive recovery is experimental. Use --dry-run to preview without changes.\n"
-);
+eprintln!("\n=== Recovery Wizard ===\n");
 eprintln!("Diagnosing failures for workflow {}...\n", args.workflow_id);

 let diagnosis = diagnose_failures(config, args.workflow_id)?;
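
Reviewer note: the wizard prompts documented in this commit take a one-letter choice (`[R]etry / [A]djust multiplier / [S]kip`, default R) and an optional multiplier with a default. The sketch below shows one way the input-parsing half could look, kept free of I/O so it is easy to test. `CategoryAction`, `parse_category_action`, and `parse_multiplier` are hypothetical names, and the validation rules (case-insensitive input, positive finite multipliers only) are assumptions; the body of the real `prompt_multiplier` is not shown in this diff.

```rust
/// Hypothetical per-category choices offered by the wizard prompt.
#[derive(Debug, PartialEq)]
enum CategoryAction {
    Retry,
    Adjust,
    Skip,
}

/// Parse one line of user input. An empty line selects the default (Retry).
/// Returns None for unrecognized input so the caller can re-prompt.
fn parse_category_action(input: &str) -> Option<CategoryAction> {
    match input.trim().to_ascii_lowercase().as_str() {
        "" | "r" | "retry" => Some(CategoryAction::Retry),
        "a" | "adjust" => Some(CategoryAction::Adjust),
        "s" | "skip" => Some(CategoryAction::Skip),
        _ => None,
    }
}

/// Parse a multiplier, falling back to `default` on an empty line.
/// Assumption: only positive, finite values are accepted.
fn parse_multiplier(input: &str, default: f64) -> Result<f64, String> {
    let trimmed = input.trim();
    if trimmed.is_empty() {
        return Ok(default);
    }
    match trimmed.parse::<f64>() {
        Ok(value) if value.is_finite() && value > 0.0 => Ok(value),
        Ok(value) => Err(format!("multiplier must be positive, got {value}")),
        Err(err) => Err(format!("invalid multiplier '{trimmed}': {err}")),
    }
}
```

In the example session, the input `a` followed by `2.0` would map to `CategoryAction::Adjust` and a multiplier of 2.0, while pressing Enter at both prompts would keep the defaults.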

src/main.rs

Lines changed: 5 additions & 2 deletions
@@ -1,3 +1,5 @@
+use std::io::IsTerminal;
+
 use clap::{CommandFactory, Parser};

 use torc::cli::{Cli, Commands};
@@ -467,10 +469,11 @@ fn main() {
 retry_unknown,
 recovery_hook,
 dry_run,
-interactive,
+no_prompts,
 ai_recovery,
 ai_agent,
 } => {
+let interactive = !no_prompts && std::io::stdin().is_terminal();
 let args = RecoverArgs {
 workflow_id: *workflow_id,
 output_dir: output_dir.clone(),
@@ -479,7 +482,7 @@ fn main() {
 retry_unknown: *retry_unknown,
 recovery_hook: recovery_hook.clone(),
 dry_run: *dry_run,
-interactive: *interactive,
+interactive,
 ai_recovery: *ai_recovery,
 ai_agent: ai_agent.clone(),
 };
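
Reviewer note: the mode selection added in `main.rs` is a two-input decision (`--no-prompts` and whether stdin is a terminal). Pulling it into a pure function would make all four combinations testable without a real TTY. This is a suggested sketch only; `resolve_interactive` and `interactive_from_stdin` are hypothetical names and not part of this commit.

```rust
use std::io::IsTerminal;

/// The wizard runs only when prompts are allowed AND stdin is a terminal.
/// Same expression as the one added in main.rs, with the TTY check injected.
fn resolve_interactive(no_prompts: bool, stdin_is_terminal: bool) -> bool {
    !no_prompts && stdin_is_terminal
}

/// Thin wrapper that queries the real stdin (std::io::IsTerminal, Rust 1.70+).
fn interactive_from_stdin(no_prompts: bool) -> bool {
    resolve_interactive(no_prompts, std::io::stdin().is_terminal())
}
```

The resulting truth table matches the documented behavior: the wizard runs only for a terminal without `--no-prompts`; piped input or an explicit `--no-prompts` both select automatic mode.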
