@@ -92,14 +92,92 @@ This data is analyzed to determine failure causes:
9292For one-shot recovery when a workflow has failed:
9393
9494``` bash
95- # Preview what would be done (recommended first step)
95+ # Interactive recovery (default when running in a terminal)
96+ torc recover 42
97+
98+ # Preview what would be done without making changes
9699torc recover 42 --dry-run
97100
98- # Execute the recovery
99- torc recover 42
101+ # Skip interactive prompts (for scripting)
102+ torc recover 42 --no-prompts
103+ ```
104+
105+ ### Interactive Recovery Wizard
106+
107+ By default, `torc recover` runs an interactive wizard that guides you through the recovery process
108+ step by step:
109+
110+ 1. **Diagnose failures**: Categorizes failed jobs into OOM, timeout, and unknown failures and
111+ displays a summary table
112+ 2. **Per-category decisions**: For each failure category, choose to retry with adjusted resources,
113+ customize the multiplier, or skip
114+ 3. **Scheduler selection**: Choose to auto-generate new Slurm schedulers or reuse an existing one
115+ (with optional walltime override and allocation count)
116+ 4. **Review and confirm**: Shows the full recovery plan and asks for confirmation before executing
117+
118+ The wizard runs automatically when stdin is a terminal. When piped or scripted (non-TTY), the
119+ command falls back to automatic mode. Use `--no-prompts` to explicitly skip the wizard.
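
The mode selection described above can be sketched in a few lines of shell. This is only an illustration of the documented behavior, not torc's actual implementation; `recover_mode` is a hypothetical helper name, and the only flag it looks at is `--no-prompts`.

```bash
#!/usr/bin/env bash
# Sketch only: mirrors the documented rule for choosing a recovery mode.
# recover_mode is a hypothetical helper, not part of the torc CLI.
recover_mode() {
    local arg
    # An explicit --no-prompts always selects automatic mode.
    for arg in "$@"; do
        if [ "$arg" = "--no-prompts" ]; then
            echo "automatic"
            return 0
        fi
    done
    # [ -t 0 ] is true only when stdin (file descriptor 0) is a terminal.
    if [ -t 0 ]; then
        echo "interactive"
    else
        echo "automatic"
    fi
}
```

Calling `recover_mode 42` from a terminal would print `interactive`, while the same call with stdin redirected or piped would print `automatic`.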
120+
121+ #### Example Session
122+
100123```
124+ === Recovery Wizard ===
125+
126+ Diagnosing failures for workflow 42...
127+
128+ OOM Failures (3 jobs):
129+ ID Name RC Memory Peak Memory Reason
130+ --- ---- --- ------ ----------- ------
131+ 107 train_model_7 137 8g 10.2 GB sigkill_137
132+ 112 train_model_12 137 8g 9.8 GB memory_exceeded
133+ 123 train_model_23 137 8g 11.1 GB sigkill_137
134+
135+ Timeout Failures (1 job):
136+ ID Name RC Runtime Exec (min) Reason
137+ --- ---- --- ------- ---------- ------
138+ 145 postprocess 152 PT30M 29.8 sigxcpu_152
139+
140+ OOM failures (3 jobs): [R]etry with 1.5x memory / [A]djust multiplier / [S]kip (default: R): r
141+ Timeout failures (1 job): [R]etry with 1.4x runtime / [A]djust multiplier / [S]kip (default: R): a
142+ Enter runtime multiplier [default: 1.4]: 2.0
143+
144+ --- Recovery Plan ---
145+
146+ Memory: 8g -> 12g (1.5x) for 3 jobs: train_model_7, train_model_12, train_model_23
147+ Runtime: PT30M -> PT1H (2x) for 1 job: postprocess
148+
149+ Total: 4 jobs to retry
150+
151+ --- Slurm Scheduler ---
101152
102- This command:
153+ Existing schedulers for this workflow:
154+
155+ ID Name Account Partition Walltime Nodes
156+ --- ---- ------- --------- -------- -----
157+ 5 gpu_scheduler myproject gpu 04:00:00 1
158+
159+ Scheduler: [A]uto-generate new / [E]xisting (enter ID) (default: A): e
160+ Enter scheduler ID: 5
161+ Walltime [default: 04:00:00] (press Enter to keep): 06:00:00
162+ Creating new scheduler with walltime 06:00:00...
163+ Created scheduler 'gpu_scheduler_recovery' (ID 8) with walltime 06:00:00
164+ Number of allocations [default: 1]: 2
165+
166+ Scheduler: gpu_scheduler_recovery (ID 8), 2 allocation(s)
167+
168+ Proceed with recovery? (y/N): y
169+ ```
170+
171+ ### Non-Interactive Mode
172+
173+ Use `--no-prompts` to skip the wizard and apply recovery heuristics automatically. This is useful
174+ for scripting or when you want the default behavior without interaction:
175+
176+ ``` bash
177+ torc recover 42 --no-prompts
178+ ```
179+
180+ In non-interactive mode, the command:
103181
1041821. Detects and cleans up orphaned jobs from terminated Slurm allocations
1051832. Checks that the workflow is complete and no workers are active
@@ -109,9 +187,8 @@ This command:
1091876. Resets failed jobs and regenerates Slurm schedulers
1101887. Submits new allocations
111189
112- > **Note:** Step 1 (orphan cleanup) handles the case where Slurm terminated an allocation
113- > unexpectedly, leaving jobs stuck in "running" status. This is done automatically before checking
114- > preconditions.
190+ > **Note:** Orphan cleanup handles the case where Slurm terminated an allocation unexpectedly,
191+ > leaving jobs stuck in "running" status. This is done automatically before checking preconditions.
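
For unattended use (cron, CI), the non-interactive call can be wrapped in a small function. The sketch below uses only the `--no-prompts` flag shown in this document. The function name `recover_unattended` is hypothetical, and the error handling assumes that `torc recover` exits non-zero on failure, which this document does not state.

```bash
#!/usr/bin/env bash
# Sketch only: unattended recovery wrapper for scripts.
# Assumption: `torc recover` returns a non-zero exit status on failure.
set -u

recover_unattended() {
    local workflow_id="$1"
    # --no-prompts makes the intent explicit, even though a non-TTY stdin
    # already falls back to automatic mode.
    if torc recover "$workflow_id" --no-prompts; then
        echo "recovery submitted for workflow $workflow_id"
    else
        echo "recovery failed for workflow $workflow_id" >&2
        return 1
    fi
}
```

A script would call it as `recover_unattended 42` and branch on the return status.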
115192
116193### Options
117194
@@ -121,25 +198,8 @@ torc recover <workflow_id> \
121198 --runtime-multiplier 1.4 \ # Runtime increase factor for timeout (default: 1.4)
122199 --retry-unknown \ # Also retry jobs with unknown failure causes
123200 --recovery-hook "bash fix.sh" \ # Custom script for unknown failures
124- --dry-run # Preview without making changes
125- ```
126-
127- ### Example Output
128-
129- ```
130- Diagnosing failures...
131- Applying recovery heuristics...
132- Job 107 (train_model): OOM detected, increasing memory 8g -> 12g
133- Applied fixes: 1 OOM, 0 timeout
134- Resetting 1 job(s) for retry...
135- Reset 1 job(s)
136- Reinitializing workflow...
137- Regenerating Slurm schedulers...
138- Submitted Slurm allocation with 1 job
139-
140- Recovery complete for workflow 42
141- - 1 job(s) had memory increased
142- Reset 1 job(s). Slurm schedulers regenerated and submitted.
201+ --dry-run \ # Preview without making changes
202+ --no-prompts # Skip interactive wizard
143203```
144204
145205## The `torc watch --recover` Command
@@ -263,13 +323,13 @@ With default settings:
263323
264324## Choosing the Right Command
265325
266- | Use Case                          | Command                  |
267- | --------------------------------- | ------------------------ |
268- | One-shot recovery after failure   | `torc recover`           |
269- | Continuous monitoring             | `torc watch -r`          |
270- | Preview what recovery would do    | `torc recover --dry-run` |
271- | Production long-running workflows | `torc watch -r`          |
272- | Manual investigation, then retry  | `torc recover`           |
326+ | Use Case                           | Command                     |
327+ | ---------------------------------- | --------------------------- |
328+ | Interactive recovery after failure | `torc recover`              |
329+ | Automatic recovery (scripting)     | `torc recover --no-prompts` |
330+ | Continuous monitoring              | `torc watch -r`             |
331+ | Preview what recovery would do     | `torc recover --dry-run`    |
332+ | Production long-running workflows  | `torc watch -r`             |
273333
274334## Complete Workflow Example
275335
@@ -530,16 +590,16 @@ If jobs are requesting more resources than partitions allow:
5305902. Use smaller multipliers
5315913. Consider splitting jobs into smaller pieces
532592
533- ## Comparison: Automatic vs Manual Recovery
593+ ## Comparison: Interactive vs Automatic vs AI-Assisted Recovery
534594
535- | Feature                | Automatic            | Manual/AI-Assisted      |
536- | ---------------------- | -------------------- | ----------------------- |
537- | Human involvement      | None                 | Interactive             |
538- | Speed                  | Fast                 | Depends on human        |
539- | Handles OOM/timeout    | Yes                  | Yes                     |
540- | Handles unknown errors | Retry only           | Full investigation      |
541- | Cost optimization      | Basic                | Can be sophisticated    |
542- | Use case               | Production workflows | Debugging, optimization |
595+ | Feature                | Interactive (`torc recover`) | Automatic (`--no-prompts`) | AI-Assisted   |
596+ | ---------------------- | ---------------------------- | -------------------------- | ------------- |
597+ | Human involvement      | Guided wizard                | None                       | AI + human    |
598+ | Speed                  | Minutes                      | Fast                       | Varies        |
599+ | Handles OOM/timeout    | Yes                          | Yes                        | Yes           |
600+ | Handles unknown errors | User chooses                 | Retry only                 | Investigation |
601+ | Scheduler control      | Choose or auto-generate      | Auto-generate              | Manual        |
602+ | Use case               | Most recovery scenarios      | Scripting, `torc watch`    | Complex bugs  |
543603
544604## Implementation Details
545605