
Commit ca06651

Change default execution mode to direct (#242)
* Change default execution mode to direct

The Slurm execution mode does not work in all HPC environments:

- In one case, the sacct command submitted after job completion always failed.
- In another case, the HPC admin stated a strong preference to use direct mode in order to reduce the load on the Slurm servers.
1 parent 01c2a11 commit ca06651

23 files changed: +169 -75 lines

docs/src/core/concepts/execution-modes.md

Lines changed: 51 additions & 33 deletions

@@ -6,11 +6,11 @@ Slurm.
 
 ## Overview
 
-| Mode     | Description                                                                     |
-| -------- | ------------------------------------------------------------------------------- |
-| `direct` | Torc manages job execution directly without Slurm step wrapping                 |
-| `slurm`  | Jobs are wrapped with `srun`, letting Slurm manage resources and termination    |
-| `auto`   | Automatically selects `slurm` if `SLURM_JOB_ID` is set, otherwise uses `direct` |
+| Mode     | Description                                                                   |
+| -------- | ----------------------------------------------------------------------------- |
+| `direct` | Torc manages job execution directly without Slurm step wrapping (default)     |
+| `slurm`  | Jobs are wrapped with `srun`, letting Slurm manage resources and termination  |
+| `auto`   | Selects `slurm` if `SLURM_JOB_ID` is set, otherwise `direct`                  |
 
 Configure the execution mode in your workflow specification:
 

@@ -19,33 +19,45 @@ execution_config:
   mode: direct  # or "slurm" or "auto"
 ```
 
+> **Warning**: Use `auto` with caution. If your workflow runs inside a Slurm allocation (where
+> `SLURM_JOB_ID` is set), `auto` will silently select slurm mode, which wraps every job with `srun`.
+> This may not be what you want if your workflow is designed for direct execution. Prefer setting
+> the mode explicitly to avoid surprises.
+
 ## When to Use Each Mode
 
-### Direct Mode
+### Direct Mode (Default)
 
-Use direct mode when:
+Direct mode is the default and works everywhere: local machines, cloud VMs, containers, and inside
+Slurm allocations. Use direct mode when:
 
 - Running jobs **outside of Slurm** (local machine, cloud VMs, containers)
-- Running inside Slurm but **srun is unreliable** or has compatibility issues
-- You want Torc to **enforce memory limits** via OOM detection
-- You need **custom termination signals** (e.g., SIGINT for graceful shutdown)
-
-### Slurm Mode
+- Running inside Slurm but **srun has compatibility issues** with your environment
+- You want the **simplest, most portable** configuration
+- You want to run jobs **without resource limits** (`limit_resources: false`) to explore resource
+  requirements for new workloads
 
-Use slurm mode when:
+Direct mode is recommended as the starting point for most workflows. It avoids the overhead of
+creating Slurm job steps and works consistently across different HPC sites with varying Slurm
+configurations.
 
-- Running inside a **Slurm allocation** and want full Slurm integration
-- You want Slurm's **cgroup-based resource enforcement**
-- You need **sacct accounting** for job steps
-- HPC admins need **visibility into job steps** via Slurm tools
+### Slurm Mode
 
-### Auto Mode (Default)
+Use slurm mode when you need features that only Slurm can provide:
 
-Auto mode is the default and works well for most use cases:
+- **Hardware-level resource control**: Slurm's cgroup enforcement can be more precise than Torc's
+  process-level monitoring, especially for GPU isolation and CPU binding on newer hardware
+- **Per-job accounting**: Each job appears as a separate step in `sacct`, giving detailed resource
+  usage breakdowns per job rather than a single entry for the whole Torc worker allocation
+- **Admin visibility**: HPC admins can see and manage individual job steps via Slurm tools
+  (`squeue`, `sacct`, `scontrol`), which is useful for debugging and auditing
+- **Cgroup-based memory enforcement**: Slurm's cgroup limits provide hard memory boundaries with no
+  sampling delay, compared to Torc's periodic polling in direct mode
+- **CPU binding**: `srun` can bind tasks to specific CPU cores (`enable_cpu_bind: true`), which may
+  improve cache locality for CPU-intensive workloads
 
-- Detects Slurm by checking for `SLURM_JOB_ID` environment variable
-- Uses `slurm` mode inside allocations, `direct` mode outside
-- No configuration needed for portable workflows
+> **Note**: Some HPC sites may prefer one mode over the other. Check with your site admins if you
+> are uncertain which mode to use.
 
 ## Direct Mode
 

@@ -190,15 +202,21 @@ execution_config:
 
 The `sigkill_headroom_seconds` setting creates a buffer between step timeouts and allocation end:
 
-```
-Allocation start                                    Allocation end
-|                                                               |
-|  [-------- Job step runs --------]                            |
-|                                  ↑                            |
-|                             Step timeout                      |
-|                  (--time=remaining - headroom)                |
-|                                                               |
-|<---------------- sigkill_headroom_seconds ------------>|
+```mermaid
+gantt
+    title Step Timeout vs Allocation End
+    dateFormat X
+    axisFormat %s
+
+    section Allocation
+    Full allocation :active, 0, 100
+
+    section Job Step
+    Job step runs :done, 0, 70
+
+    section Timing
+    sigkill_headroom_seconds :crit, 70, 100
+    Step timeout (--time=remaining - headroom) :milestone, 70, 70
 ```
 
 This ensures:
@@ -254,11 +272,11 @@ execution_config:
   limit_resources: false  # Don't enforce limits during development
 ```
 
-### Production HPC
+### Production HPC (with Slurm integration)
 
 ```yaml
 execution_config:
-  mode: auto  # Use slurm inside allocations
+  mode: slurm
   srun_termination_signal: TERM@120
   sigkill_headroom_seconds: 300
 ```
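The `auto` rule described in these docs (pick `slurm` when `SLURM_JOB_ID` is set, `direct` otherwise) can be sketched in shell. This is an illustration of the documented selection rule only, not Torc's actual implementation:

```shell
# Mirror of the documented `auto` selection rule: slurm inside an allocation
# (SLURM_JOB_ID set), direct everywhere else.
detect_mode() {
  if [ -n "${SLURM_JOB_ID:-}" ]; then
    echo "slurm"
  else
    echo "direct"
  fi
}

unset SLURM_JOB_ID
detect_mode            # prints "direct" outside an allocation

SLURM_JOB_ID=12345
detect_mode            # prints "slurm" inside an allocation
```

This also shows why the commit's warning matters: the selection is driven entirely by an environment variable, so a workflow designed for direct execution silently flips to slurm mode when launched from inside an allocation.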

docs/src/core/reference/workflow-spec.md

Lines changed: 21 additions & 16 deletions

@@ -179,18 +179,19 @@ creation time.
 
 ### Shared fields (both modes)
 
-| Name                       | Type                        | Default   | Description                                            |
-| -------------------------- | --------------------------- | --------- | ------------------------------------------------------ |
-| `mode`                     | string                      | `"auto"`  | Execution mode: `"direct"`, `"slurm"`, or `"auto"`     |
-| `sigkill_headroom_seconds` | integer                     | `60`      | Seconds before end_time for SIGKILL or srun --time     |
-| `timeout_exit_code`        | integer                     | `152`     | Exit code for timed-out jobs (matches Slurm TIMEOUT)   |
-| `staggered_start`          | boolean                     | `true`    | Stagger job runner startup to mitigate thundering herd |
-| `stdio`                    | [StdioConfig](#stdioconfig) | see below | Workflow-level default for stdout/stderr capture       |
+| Name                       | Type                        | Default    | Description                                            |
+| -------------------------- | --------------------------- | ---------- | ------------------------------------------------------ |
+| `mode`                     | string                      | `"direct"` | Execution mode: `"direct"`, `"slurm"`, or `"auto"`     |
+| `sigkill_headroom_seconds` | integer                     | `60`       | Seconds before end_time for SIGKILL or srun --time     |
+| `timeout_exit_code`        | integer                     | `152`      | Exit code for timed-out jobs (matches Slurm TIMEOUT)   |
+| `staggered_start`          | boolean                     | `true`     | Stagger job runner startup to mitigate thundering herd |
+| `stdio`                    | [StdioConfig](#stdioconfig) | see below  | Workflow-level default for stdout/stderr capture       |
 
 ### Direct mode fields
 
-These fields only apply when the effective mode is `direct`. Setting them with `mode: slurm` (or
-`mode: auto` with `slurm_schedulers`) produces a validation error.
+These fields only apply when the effective mode is `direct`. Setting them with `mode: slurm`
+produces a validation error. When `mode: auto`, validation checks the effective mode based on
+whether Slurm schedulers are present in the spec.
 
 | Name                   | Type    | Default     | Description                                           |
 | ---------------------- | ------- | ----------- | ----------------------------------------------------- |
@@ -201,8 +202,9 @@ These fields only apply when the effective mode is `direct`. Setting them with `
 
 ### Slurm mode fields
 
-These fields only apply when the effective mode is `slurm`. Setting them with `mode: direct` (or
-`mode: auto` without `slurm_schedulers`) produces a validation error.
+These fields only apply when the effective mode is `slurm`. Setting them with `mode: direct`
+produces a validation error. When `mode: auto`, validation checks the effective mode based on
+whether Slurm schedulers are present in the spec.
 
 | Name                      | Type    | Default | Description                             |
 | ------------------------- | ------- | ------- | --------------------------------------- |
@@ -260,11 +262,14 @@ jobs:
 
 ### Execution Modes
 
-| Mode     | Description                                                                    |
-| -------- | ------------------------------------------------------------------------------ |
-| `direct` | Torc manages job execution directly. Use outside Slurm or when srun unreliable |
-| `slurm`  | Jobs wrapped with `srun`. Slurm manages resource limits and termination        |
-| `auto`   | Uses `slurm` if `SLURM_JOB_ID` is set, otherwise `direct` (default)            |
+| Mode     | Description                                                              |
+| -------- | ------------------------------------------------------------------------ |
+| `direct` | Torc manages job execution directly (default). Works everywhere          |
+| `slurm`  | Jobs wrapped with `srun`. Slurm manages resource limits and termination  |
+| `auto`   | Selects `slurm` if `SLURM_JOB_ID` is set, otherwise `direct`             |
+
+> **Warning**: `auto` will silently select slurm mode when running inside a Slurm allocation. Prefer
+> setting the mode explicitly to avoid unexpected behavior.
 
 ### Direct Mode Example
 
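The shared fields documented in this file, collected into one spec fragment. Values match the defaults listed in the table; the comments are illustrative:

```yaml
execution_config:
  mode: direct                  # new default; "slurm" and "auto" also accepted
  sigkill_headroom_seconds: 60  # seconds before end_time for SIGKILL / srun --time
  timeout_exit_code: 152        # matches Slurm's TIMEOUT exit code
  staggered_start: true         # stagger runner startup to mitigate thundering herd
```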

docs/src/specialized/admin/configuration-files.md

Lines changed: 1 addition & 1 deletion

@@ -161,7 +161,7 @@ export TORC_CLIENT__API_URL="${CI_TORC_SERVER}"
 export TORC_CLIENT__FORMAT="json"
 
 torc run workflow.yaml
-result=$(torc status $WORKFLOW_ID | jq -r '.status')
+result=$(torc status $WORKFLOW_ID | jq -r '.is_complete')
 ```
 
 ### HPC Cluster
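The CI example in this file switches the jq filter from `.status` to `.is_complete`. A runnable sketch of the updated check, with the `torc status` output stubbed as a JSON literal — the surrounding field names are assumptions; only `is_complete` comes from the diff:

```shell
# Stub of `torc status $WORKFLOW_ID` JSON output (hypothetical fields,
# except `is_complete`, which the CI script now reads).
status_json='{"workflow_id": "wf-001", "is_complete": true}'

result=$(printf '%s' "$status_json" | jq -r '.is_complete')

if [ "$result" = "true" ]; then
  echo "workflow finished"
else
  echo "still running"
fi
```

Note that `jq -r` emits the bare boolean (`true`/`false`), so the string comparison against `"true"` is the simplest completion check.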

docs/src/specialized/hpc/custom-hpc-profile.md

Lines changed: 4 additions & 2 deletions

@@ -287,10 +287,12 @@ Now you can submit workflows using your custom profile:
 
 ```bash
 # Auto-detect the profile (if on the cluster)
-torc slurm generate --account my_project workflow.yaml && torc submit workflow.yaml
+torc slurm generate --account my_project -o workflow_slurm.yaml workflow.yaml
+torc submit workflow_slurm.yaml
 
 # Or explicitly specify the profile
-torc slurm generate --account my_project --hpc-profile && torc submit --hpc-profile research workflow.yaml
+torc slurm generate --account my_project --profile research -o workflow_slurm.yaml workflow.yaml
+torc submit workflow_slurm.yaml
 ```
 
 ## Advanced Configuration

docs/src/specialized/hpc/hpc-profiles.md

Lines changed: 0 additions & 1 deletion

@@ -158,7 +158,6 @@ HPC profiles are used by Slurm-related commands to automatically generate schedu
 See [Advanced Slurm Configuration](./slurm.md) for details on:
 
 - `torc slurm generate` + `torc submit` - Submit workflows with auto-generated schedulers
-- `torc create-slurm` - Create workflows with auto-generated schedulers
 
 ## See Also
 

docs/src/specialized/hpc/slurm-exit-codes.md

Lines changed: 2 additions & 2 deletions

@@ -32,8 +32,8 @@ $ torc slurm sacct $WORKFLOW_ID
 ╰──────────────────────┴───────────────┴──────────╯
 ```
 
-**Fix:** Increase `memory` in resource requirements, or use
-`torc workflows check-resources --correct` to auto-adjust based on peak usage.
+**Fix:** Increase `memory` in resource requirements, or use `torc workflows correct-resources` to
+auto-adjust based on peak usage.
 
 ### Timeout with Graceful Shutdown (exit code 0)
 

slurm-tests/workflows/cancel_workflow.yaml

Lines changed: 2 additions & 0 deletions

@@ -7,6 +7,8 @@
 name: cancel_workflow
 description: Workflow cancellation test — cancel while jobs are running
 project: slurm-tests
+execution_config:
+  mode: slurm
 
 resource_requirements:
 - name: sleep_resources

slurm-tests/workflows/failure_recovery.yaml

Lines changed: 2 additions & 0 deletions

@@ -9,6 +9,8 @@ name: failure_recovery
 description: Test workflow for Slurm job retry with failure handlers
 project: slurm-tests
 metadata: '{"test_type": "failure_recovery", "stages": 3}'
+execution_config:
+  mode: slurm
 
 failure_handlers:
 - name: retry_on_exit_42

slurm-tests/workflows/job_parallelism.yaml

Lines changed: 2 additions & 0 deletions

@@ -14,6 +14,8 @@
 name: job_parallelism
 description: Job-based parallelism — no resource requirements, controlled by --max-parallel-jobs
 project: slurm-tests
+execution_config:
+  mode: slurm
 
 resource_monitor:
   enabled: true

slurm-tests/workflows/multi_node_mpi_step.yaml

Lines changed: 2 additions & 0 deletions

@@ -7,6 +7,8 @@
 name: multi_node_mpi_step
 description: Test srun step spanning two nodes (num_nodes=2)
 project: slurm-tests
+execution_config:
+  mode: slurm
 
 resource_requirements:
 - name: mpi_resources
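The four test-workflow diffs above all apply the same change: pinning `mode: slurm` now that the default is `direct`. The pattern, as one spec fragment (the `name` value here is a hypothetical placeholder):

```yaml
name: my_slurm_test        # hypothetical workflow name
project: slurm-tests
execution_config:
  mode: slurm              # explicit, since these tests exercise srun step wrapping
```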
