Commit cd2d1bb

daniel-thom and claude authored
Enhance the watcher (#222)
* Support walltime and partition inputs to the watcher

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a2dcd1a commit cd2d1bb

File tree: 10 files changed, +744 −120 lines

docs/src/core/reference/cli.md

Lines changed: 20 additions & 4 deletions
````diff
@@ -180,7 +180,7 @@ resource requirements, and resubmits jobs.
 ```
 
 Automatically diagnoses OOM/timeout failures, adjusts resources, and retries. Runs until all jobs
-complete or max retries exceeded.
+complete. Use `--max-retries` to limit recovery attempts.
 
 3. **With auto-scheduling** (`--auto-schedule`):
 
@@ -207,7 +207,7 @@ resource requirements, and resubmits jobs.
 **Recovery:**
 
 - `-r`, `--recover` — Enable automatic failure recovery
-- `-m`, `--max-retries <MAX_RETRIES>` — Maximum number of recovery attempts. Default: `3`
+- `-m`, `--max-retries <MAX_RETRIES>` — Maximum number of recovery attempts. Default: unlimited
 - `--memory-multiplier <MEMORY_MULTIPLIER>` — Memory multiplier for OOM failures. Default: `1.5`
 - `--runtime-multiplier <RUNTIME_MULTIPLIER>` — Runtime multiplier for timeout failures. Default:
 `1.5`
@@ -225,6 +225,14 @@ resource requirements, and resubmits jobs.
 - `--auto-schedule-stranded-timeout <SECONDS>` — Schedule stranded jobs after this timeout even if
 below threshold. Default: `7200` (2 hrs). Set to `0` to disable.
 
+**Scheduler overrides:**
+
+- `--partition <PARTITION>` — Fixed Slurm partition for regenerated schedulers. Bypasses automatic
+partition selection. Node count is still calculated dynamically.
+- `--walltime <WALLTIME>` — Fixed Slurm walltime for regenerated schedulers (format: `HH:MM:SS` or
+`D-HH:MM:SS`). Bypasses automatic walltime calculation. Node count is still calculated
+dynamically.
+
 ### Auto-Scheduling Behavior
 
 When `--auto-schedule` is enabled:
@@ -261,6 +269,10 @@ torc watch 123 --auto-schedule \
 --auto-schedule-threshold 10 \
 --auto-schedule-cooldown 3600 \
 --auto-schedule-stranded-timeout 14400
+
+# Fixed partition and walltime (dynamic node count only)
+# Useful for long-running checkpointable jobs
+torc watch 123 --auto-schedule --partition standard --walltime 04:00:00
 ```
 
 ### See Also
@@ -1701,6 +1713,10 @@ regenerate schedulers to submit new allocations.
 
 - `--account <ACCOUNT>` — Slurm account to use (defaults to account from existing schedulers)
 - `--profile <PROFILE>` — HPC profile to use (if not specified, tries to detect current system)
+- `--partition <PARTITION>` — Fixed Slurm partition (bypasses automatic partition selection). Node
+count is still calculated dynamically.
+- `--walltime <WALLTIME>` — Fixed Slurm walltime (format: `HH:MM:SS` or `D-HH:MM:SS`). Bypasses
+automatic walltime calculation. Node count is still calculated dynamically.
 - `--single-allocation` — Bundle all nodes into a single Slurm allocation per scheduler
 - `--submit` — Submit the generated allocations immediately
 - `-o`, `--output-dir <OUTPUT_DIR>` — Output directory for job output files (used when submitting).
@@ -1710,9 +1726,9 @@ regenerate schedulers to submit new allocations.
 - `--group-by <GROUP_BY>` — Strategy for grouping jobs into schedulers. Possible values:
 `resource-requirements` (default), `partition`
 - `--walltime-strategy <STRATEGY>` — Strategy for determining Slurm job walltime. Possible values:
-`max-job-runtime` (default), `max-partition-time`
+`max-job-runtime` (default), `max-partition-time`. Ignored when `--walltime` is set.
 - `--walltime-multiplier <MULTIPLIER>` — Multiplier for job runtime when using
-`--walltime-strategy=max-job-runtime`. Default: `1.5`
+`--walltime-strategy=max-job-runtime`. Default: `1.5`. Ignored when `--walltime` is set.
 - `--dry-run` — Show what would be created without making changes
 - `--include-job-ids <JOB_IDS>` — Include specific job IDs in planning regardless of their status
 (useful for recovery dry-run to include failed jobs)
````

docs/src/specialized/fault-tolerance/automatic-recovery.md

Lines changed: 17 additions & 12 deletions
````diff
@@ -157,14 +157,14 @@ This will:
 3. Adjust resource requirements based on heuristics
 4. Reset failed jobs and submit new Slurm allocations
 5. Resume monitoring
-6. Repeat until success or max retries exceeded
+6. Repeat until success (or max retries exceeded, if `--max-retries` is set)
 
 ### Options
 
 ```bash
 torc watch <workflow_id> \
 -r \ # Enable automatic recovery (--recover)
--m 3 \ # Maximum recovery attempts (--max-retries)
+-m 5 \ # Optional: limit recovery attempts (--max-retries)
 --memory-multiplier 1.5 \ # Memory increase factor for OOM
 --runtime-multiplier 1.5 \ # Runtime increase factor for timeout
 --retry-unknown \ # Also retry jobs with unknown failures
@@ -175,7 +175,9 @@ torc watch <workflow_id> \
 --auto-schedule \ # Automatically schedule nodes for stranded jobs
 --auto-schedule-threshold 5 \ # Min retry jobs before scheduling (default: 5)
 --auto-schedule-cooldown 1800 \ # Seconds between auto-schedule attempts (default: 1800)
---auto-schedule-stranded-timeout 7200 # Schedule stranded jobs after this time (default: 7200)
+--auto-schedule-stranded-timeout 7200 \ # Schedule stranded jobs after this time (default: 7200)
+--partition standard \ # Fixed Slurm partition (bypass auto-detection)
+--walltime 04:00:00 # Fixed walltime (bypass auto-calculation)
 ```
 
 ### Custom Recovery Hooks
@@ -287,7 +289,7 @@ Submitted to Slurm with 10 allocations
 ### 2. Start Watching with Auto-Recovery
 
 ```bash
-torc watch 42 --recover --max-retries 3 --show-job-counts
+torc watch 42 --recover --show-job-counts
 ```
 
 > **Note:** The `--show-job-counts` flag is optional. Without it, the command polls silently until
@@ -296,7 +298,7 @@ torc watch 42 --recover --max-retries 3 --show-job-counts
 Output:
 
 ```
-Watching workflow 42 (poll interval: 60s, recover enabled, max retries: 3, job counts enabled)
+Watching workflow 42 (poll interval: 60s, recover enabled, unlimited retries, job counts enabled)
 completed=0, running=10, pending=0, failed=0, blocked=90
 completed=25, running=10, pending=0, failed=0, blocked=65
 ...
@@ -309,7 +311,7 @@ Workflow completed with failures:
 - Terminated: 0
 - Completed: 95
 
-Attempting automatic recovery (attempt 1/3)
+Attempting automatic recovery (attempt 1)
 
 Diagnosing failures...
 Applying recovery heuristics...
@@ -325,7 +327,7 @@ Regenerating Slurm schedulers and submitting...
 
 Recovery initiated. Resuming monitoring...
 
-Watching workflow 42 (poll interval: 60s, recover enabled, max retries: 3, job counts enabled)
+Watching workflow 42 (poll interval: 60s, recover enabled, unlimited retries, job counts enabled)
 completed=95, running=5, pending=0, failed=0, blocked=0
 ...
 Workflow 42 is complete
@@ -350,7 +352,7 @@ This prevents wasting allocation time on jobs that likely have script or data bugs.
 
 ### 4. If Max Retries Exceeded
 
-If failures persist after max retries:
+If `--max-retries` is set and failures persist after that many attempts:
 
 ```
 Max retries (3) exceeded. Manual intervention required.
@@ -417,13 +419,16 @@ Set initial resource requests lower and let auto-recovery increase them:
 - Only failing jobs get increased resources
 - Avoids wasting HPC resources on over-provisioned jobs
 
-### 2. Set Reasonable Max Retries
+### 2. Set Max Retries When Appropriate
+
+By default, `torc watch` retries indefinitely until the workflow succeeds. Use `--max-retries` to
+limit recovery attempts if needed:
 
 ```bash
---max-retries 3 # Good for most workflows
+--max-retries 5 # Limit to 5 recovery attempts
 ```
 
-Too many retries can waste allocation time on jobs that will never succeed.
+This can prevent wasting allocation time on jobs that will never succeed.
 
 ### 3. Use Appropriate Multipliers
 
@@ -551,7 +556,7 @@ If jobs are requesting more resources than partitions allow:
 - Run `torc slurm regenerate --submit`
 - Increment retry counter
 - Resume polling
-5. Exit 0 on success, exit 1 on max retries exceeded
+5. Exit 0 on success, exit 1 on max retries exceeded (if `--max-retries` is set)
 
 ### The Regenerate Command Flow
 
````
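The recovery options documented above describe a simple heuristic: an OOM failure scales the job's memory request by `--memory-multiplier`, and a timeout scales its runtime by `--runtime-multiplier`. A minimal standalone sketch of that adjustment (the type and function names here are hypothetical, not torc's actual internals):

```rust
// Hypothetical sketch of the recovery heuristic described in the docs.
// `FailureKind`, `Resources`, and `adjust` are illustrative names only.
#[derive(Debug, PartialEq)]
enum FailureKind {
    OutOfMemory,
    Timeout,
}

struct Resources {
    memory_gb: f64,
    runtime_secs: f64,
}

fn adjust(res: &Resources, failure: FailureKind, mem_mult: f64, rt_mult: f64) -> Resources {
    match failure {
        // OOM: request more memory on the next attempt; runtime unchanged.
        FailureKind::OutOfMemory => Resources {
            memory_gb: res.memory_gb * mem_mult,
            runtime_secs: res.runtime_secs,
        },
        // Timeout: request more runtime on the next attempt; memory unchanged.
        FailureKind::Timeout => Resources {
            memory_gb: res.memory_gb,
            runtime_secs: res.runtime_secs * rt_mult,
        },
    }
}

fn main() {
    let r = Resources { memory_gb: 8.0, runtime_secs: 3600.0 };
    let oom = adjust(&r, FailureKind::OutOfMemory, 1.5, 1.5);
    println!("after OOM: {} GB, {} s", oom.memory_gb, oom.runtime_secs);
}
```

With the default multipliers of `1.5`, each recovery pass grows the failing dimension by 50% while leaving the other untouched, which matches the "only failing jobs get increased resources" best practice above.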

src/cli.rs

Lines changed: 20 additions & 3 deletions
````diff
@@ -400,9 +400,9 @@ SEE ALSO:
     #[arg(short, long)]
     recover: bool,
 
-    /// Maximum number of recovery attempts
-    #[arg(short, long, default_value = "3")]
-    max_retries: u32,
+    /// Maximum number of recovery attempts (unlimited if not set)
+    #[arg(short, long)]
+    max_retries: Option<u32>,
 
     /// Memory multiplier for OOM failures (default: 1.5 = 50% increase)
     #[arg(long, default_value = "1.5")]
@@ -511,6 +511,23 @@ SEE ALSO:
     /// claude - Claude Code CLI (default)
     #[arg(long, default_value = "claude", verbatim_doc_comment)]
     ai_agent: String,
+
+    /// Fixed Slurm partition for regenerated schedulers
+    ///
+    /// When set, all regenerated schedulers (from --auto-schedule or --recover)
+    /// use this partition instead of auto-detecting the best partition from job
+    /// resource requirements. The number of compute nodes is still calculated
+    /// dynamically based on pending jobs.
+    #[arg(long)]
+    partition: Option<String>,
+
+    /// Fixed Slurm walltime for regenerated schedulers (format: HH:MM:SS or D-HH:MM:SS)
+    ///
+    /// When set, all regenerated schedulers (from --auto-schedule or --recover)
+    /// use this walltime instead of calculating it from job runtimes. The number
+    /// of compute nodes is still calculated dynamically.
+    #[arg(long)]
+    walltime: Option<String>,
 },
 /// Recover a Slurm workflow from failures
 ///
````

src/client/commands/recover.rs

Lines changed: 24 additions & 10 deletions
````diff
@@ -369,7 +369,7 @@ pub fn recover_workflow(
 
     // Step 7: Regenerate Slurm schedulers and submit
     info!("Schedulers regenerating workflow_id={}", args.workflow_id);
-    regenerate_and_submit(args.workflow_id, &args.output_dir)?;
+    regenerate_and_submit(args.workflow_id, &args.output_dir, None, None)?;
 
     Ok(result)
 }
@@ -942,16 +942,30 @@ pub fn run_recovery_hook(workflow_id: i64, hook_command: &str) -> Result<(), String> {
 }
 
 /// Regenerate Slurm schedulers and submit allocations
-pub fn regenerate_and_submit(workflow_id: i64, output_dir: &Path) -> Result<(), String> {
+pub fn regenerate_and_submit(
+    workflow_id: i64,
+    output_dir: &Path,
+    partition: Option<&str>,
+    walltime: Option<&str>,
+) -> Result<(), String> {
+    let mut args = vec![
+        "slurm".to_string(),
+        "regenerate".to_string(),
+        workflow_id.to_string(),
+        "--submit".to_string(),
+        "-o".to_string(),
+        output_dir.to_str().unwrap_or("torc_output").to_string(),
+    ];
+    if let Some(p) = partition {
+        args.push("--partition".to_string());
+        args.push(p.to_string());
+    }
+    if let Some(w) = walltime {
+        args.push("--walltime".to_string());
+        args.push(w.to_string());
+    }
     let output = Command::new("torc")
-        .args([
-            "slurm",
-            "regenerate",
-            &workflow_id.to_string(),
-            "--submit",
-            "-o",
-            output_dir.to_str().unwrap_or("torc_output"),
-        ])
+        .args(&args)
         .output()
         .map_err(|e| format!("Failed to run slurm regenerate: {}", e))?;
````
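The updated `regenerate_and_submit` appends `--partition` and `--walltime` flags only when the corresponding `Option` is `Some`. Extracted as a pure function, the assembly logic can be sanity-checked in isolation (a sketch mirroring the diff; `build_regen_args` is an illustrative name, not part of torc):

```rust
// Sketch of regenerate_and_submit's argument assembly, pulled out as a
// pure function so the flag ordering can be checked without running torc.
fn build_regen_args(
    workflow_id: i64,
    output_dir: &str,
    partition: Option<&str>,
    walltime: Option<&str>,
) -> Vec<String> {
    let mut args = vec![
        "slurm".to_string(),
        "regenerate".to_string(),
        workflow_id.to_string(),
        "--submit".to_string(),
        "-o".to_string(),
        output_dir.to_string(),
    ];
    // Optional overrides are appended as flag/value pairs only when present.
    if let Some(p) = partition {
        args.push("--partition".to_string());
        args.push(p.to_string());
    }
    if let Some(w) = walltime {
        args.push("--walltime".to_string());
        args.push(w.to_string());
    }
    args
}

fn main() {
    let a = build_regen_args(42, "out", Some("standard"), Some("04:00:00"));
    println!("{:?}", a);
}
```

Passing `None, None` (as the existing `recover_workflow` call site does) reproduces the original fixed argument list, so current callers keep their old behavior.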

src/client/commands/slurm.rs

Lines changed: 43 additions & 2 deletions
````diff
@@ -607,6 +607,22 @@ EXAMPLES:
     #[arg(long)]
     profile: Option<String>,
 
+    /// Fixed Slurm partition (bypasses automatic partition selection)
+    ///
+    /// When set, all regenerated schedulers use this partition instead of
+    /// auto-detecting the best partition from job resource requirements.
+    /// Node count is still calculated dynamically.
+    #[arg(long)]
+    partition: Option<String>,
+
+    /// Fixed Slurm walltime (bypasses automatic walltime calculation)
+    ///
+    /// When set, all regenerated schedulers use this walltime instead of
+    /// calculating it from job runtimes. Format: HH:MM:SS or D-HH:MM:SS.
+    /// Node count is still calculated dynamically.
+    #[arg(long)]
+    walltime: Option<String>,
+
     /// Bundle all nodes into a single Slurm allocation per scheduler
     #[arg(long)]
     single_allocation: bool,
@@ -627,6 +643,8 @@ EXAMPLES:
     /// Longer walltime allows more sequential jobs per allocation, reducing
     /// the total number of allocations. However, longer walltime requests
     /// may receive lower queue priority from the scheduler.
+    ///
+    /// Ignored when --walltime is set.
     #[arg(long, value_enum, default_value_t = WalltimeStrategy::MaxJobRuntime)]
     walltime_strategy: WalltimeStrategy,
 
@@ -635,6 +653,8 @@ EXAMPLES:
     /// The maximum job runtime is multiplied by this value to provide a safety
     /// margin. For example, 1.5 means requesting 50% more time than the longest
     /// job estimate.
+    ///
+    /// Ignored when --walltime is set.
    #[arg(long, default_value = "1.5")]
     walltime_multiplier: f64,
 
@@ -732,7 +752,9 @@ pub fn generate_schedulers_for_workflow(
         spec.actions = None;
     }
 
-    use crate::client::scheduler_plan::{apply_plan_to_spec, generate_scheduler_plan};
+    use crate::client::scheduler_plan::{
+        SchedulerOverrides, apply_plan_to_spec, generate_scheduler_plan,
+    };
 
     // Save original jobs and files before expansion so we can restore them later
     let original_jobs = spec.jobs.clone();
@@ -782,6 +804,7 @@ pub fn generate_schedulers_for_workflow(
         add_actions,
         None, // No suffix for regular generation (uses "_scheduler")
         false, // Not a recovery scenario
+        &SchedulerOverrides::default(),
     );
 
     // Combine warnings
@@ -1382,6 +1405,8 @@ pub fn handle_slurm_commands(config: &Configuration, command: &SlurmCommands, fo
     workflow_id,
     account,
     profile: profile_name,
+    partition,
+    walltime,
     single_allocation,
     group_by,
     walltime_strategy,
@@ -1402,6 +1427,8 @@ pub fn handle_slurm_commands(config: &Configuration, command: &SlurmCommands, fo
     *workflow_id,
     account.as_deref(),
     profile_name.as_deref(),
+    partition.as_deref(),
+    walltime.as_deref(),
     *single_allocation,
     *group_by,
     *walltime_strategy,
@@ -3376,6 +3403,8 @@ fn handle_regenerate(
     workflow_id: i64,
     account: Option<&str>,
     profile_name: Option<&str>,
+    partition: Option<&str>,
+    walltime: Option<&str>,
     single_allocation: bool,
     group_by: GroupByStrategy,
     walltime_strategy: WalltimeStrategy,
@@ -3585,7 +3614,18 @@ fn handle_regenerate(
         std::process::exit(1);
     });
 
-    use crate::client::scheduler_plan::generate_scheduler_plan;
+    use crate::client::scheduler_plan::{SchedulerOverrides, generate_scheduler_plan};
+
+    // Build overrides from partition/walltime arguments
+    let overrides = SchedulerOverrides {
+        partition: partition.map(|s| s.to_string()),
+        walltime_secs: walltime.map(|w| {
+            parse_walltime_secs(w).unwrap_or_else(|e| {
+                eprintln!("Error: invalid --walltime '{}': {}", w, e);
+                std::process::exit(1);
+            })
+        }),
+    };
 
     // Build WorkflowGraph from pending jobs for proper dependency-aware grouping
     // This aligns with create-slurm's behavior of separating jobs by (rr, has_dependencies)
@@ -3628,6 +3668,7 @@ fn handle_regenerate(
         true, // add_actions (we'll create them as recovery actions)
         Some(&format!("regen_{}", timestamp)),
         true, // is_recovery
+        &overrides,
     );
 
     // Combine warnings from planning
````
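`handle_regenerate` validates `--walltime` with a `parse_walltime_secs` helper whose definition is not shown in this commit. A plausible sketch, assuming only the documented `HH:MM:SS` and `D-HH:MM:SS` formats (the real helper in torc may well differ):

```rust
// Hypothetical walltime parser for HH:MM:SS and D-HH:MM:SS strings.
// Not torc's actual parse_walltime_secs; a sketch of the documented contract.
fn parse_walltime_secs(s: &str) -> Result<u64, String> {
    // Optional leading "D-" day component.
    let (days, rest) = match s.split_once('-') {
        Some((d, rest)) => (
            d.parse::<u64>().map_err(|e| format!("bad day count: {}", e))?,
            rest,
        ),
        None => (0, s),
    };
    let parts: Vec<&str> = rest.split(':').collect();
    if parts.len() != 3 {
        return Err(format!("expected HH:MM:SS, got '{}'", rest));
    }
    let nums: Vec<u64> = parts
        .iter()
        .map(|p| p.parse::<u64>())
        .collect::<Result<_, _>>()
        .map_err(|e| format!("non-numeric component: {}", e))?;
    let (h, m, sec) = (nums[0], nums[1], nums[2]);
    if m > 59 || sec > 59 {
        return Err("minutes and seconds must be 0-59".to_string());
    }
    Ok(days * 86_400 + h * 3_600 + m * 60 + sec)
}

fn main() {
    println!("{:?}", parse_walltime_secs("1-12:30:00"));
}
```

Failing fast on a malformed value (as the diff does via `unwrap_or_else` + `exit`) keeps a bad `--walltime` from reaching `sbatch`, where the error would surface much later.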
