 
 ## Overview
 
-| Mode     | Description                                                                     |
-| -------- | ------------------------------------------------------------------------------- |
-| `direct` | Torc manages job execution directly without Slurm step wrapping                 |
-| `slurm`  | Jobs are wrapped with `srun`, letting Slurm manage resources and termination    |
-| `auto`   | Automatically selects `slurm` if `SLURM_JOB_ID` is set, otherwise uses `direct` |
+| Mode     | Description                                                                  |
+| -------- | ---------------------------------------------------------------------------- |
+| `direct` | Torc manages job execution directly without Slurm step wrapping (default)    |
+| `slurm`  | Jobs are wrapped with `srun`, letting Slurm manage resources and termination |
+| `auto`   | Selects `slurm` if `SLURM_JOB_ID` is set, otherwise `direct`                 |
 
 Configure the execution mode in your workflow specification:
 
@@ -19,33 +19,45 @@ execution_config:
   mode: direct  # or "slurm" or "auto"
 ```
 
+> **Warning**: Use `auto` with caution. If your workflow runs inside a Slurm allocation (where
+> `SLURM_JOB_ID` is set), `auto` will silently select slurm mode, which wraps every job with `srun`.
+> This may not be what you want if your workflow is designed for direct execution. Prefer setting
+> the mode explicitly to avoid surprises.
+
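The `auto` selection rule described in the warning above can be sketched in a few lines of Python. This is a minimal illustration of the documented behavior, not Torc's actual implementation, and `resolve_mode` is a hypothetical name:

```python
import os

def resolve_mode(configured: str) -> str:
    """Resolve the effective execution mode (hypothetical helper, for illustration).

    Mirrors the documented rule: "auto" picks slurm only when the
    SLURM_JOB_ID environment variable indicates a Slurm allocation.
    """
    if configured != "auto":
        return configured
    return "slurm" if os.environ.get("SLURM_JOB_ID") else "direct"
```

Setting the mode explicitly, as the warning suggests, bypasses the environment check entirely.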
 ## When to Use Each Mode
 
-### Direct Mode
+### Direct Mode (Default)
 
-Use direct mode when:
+Direct mode is the default and works everywhere: local machines, cloud VMs, containers, and inside
+Slurm allocations. Use direct mode when:
 
 - Running jobs **outside of Slurm** (local machine, cloud VMs, containers)
-- Running inside Slurm but **srun is unreliable** or has compatibility issues
-- You want Torc to **enforce memory limits** via OOM detection
-- You need **custom termination signals** (e.g., SIGINT for graceful shutdown)
-
-### Slurm Mode
+- Running inside Slurm but **srun has compatibility issues** with your environment
+- You want the **simplest, most portable** configuration
+- You want to run jobs **without resource limits** (`limit_resources: false`) to explore resource
+  requirements for new workloads
 
-Use slurm mode when:
+Direct mode is recommended as the starting point for most workflows. It avoids the overhead of
+creating Slurm job steps and works consistently across different HPC sites with varying Slurm
+configurations.
 
-- Running inside a **Slurm allocation** and want full Slurm integration
-- You want Slurm's **cgroup-based resource enforcement**
-- You need **sacct accounting** for job steps
-- HPC admins need **visibility into job steps** via Slurm tools
+### Slurm Mode
 
-### Auto Mode (Default)
+Use slurm mode when you need features that only Slurm can provide:
 
-Auto mode is the default and works well for most use cases:
+- **Hardware-level resource control**: Slurm's cgroup enforcement can be more precise than Torc's
+  process-level monitoring, especially for GPU isolation and CPU binding on newer hardware
+- **Per-job accounting**: Each job appears as a separate step in `sacct`, giving detailed resource
+  usage breakdowns per job rather than a single entry for the whole Torc worker allocation
+- **Admin visibility**: HPC admins can see and manage individual job steps via Slurm tools
+  (`squeue`, `sacct`, `scontrol`), which is useful for debugging and auditing
+- **Cgroup-based memory enforcement**: Slurm's cgroup limits provide hard memory boundaries with no
+  sampling delay, compared to Torc's periodic polling in direct mode
+- **CPU binding**: `srun` can bind tasks to specific CPU cores (`enable_cpu_bind: true`), which may
+  improve cache locality for CPU-intensive workloads
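To make the srun wrapping concrete, here is a rough sketch of the kind of command line a job step might get in slurm mode. The exact flags Torc passes are an assumption on my part; `--ntasks`, `--cpus-per-task`, `--mem`, and `--signal` are standard srun options:

```python
def srun_wrap(command: str, cpus: int, mem_mb: int,
              termination_signal: str = "TERM@120") -> str:
    """Build an illustrative srun invocation for a single job step.

    The flag selection here is a guess at what a wrapper might pass;
    consult Torc's source or logs for the real command line.
    """
    return (
        f"srun --ntasks=1 --cpus-per-task={cpus} --mem={mem_mb}M "
        f"--signal={termination_signal} {command}"
    )

print(srun_wrap("python train.py", cpus=4, mem_mb=8192))
# → srun --ntasks=1 --cpus-per-task=4 --mem=8192M --signal=TERM@120 python train.py
```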
 
-- Detects Slurm by checking for `SLURM_JOB_ID` environment variable
-- Uses `slurm` mode inside allocations, `direct` mode outside
-- No configuration needed for portable workflows
+> **Note**: Some HPC sites may prefer one mode over the other. Check with your site admins if you
+> are uncertain which mode to use.
 
 ## Direct Mode
 
@@ -190,15 +202,21 @@ execution_config:
 
 The `sigkill_headroom_seconds` setting creates a buffer between step timeouts and allocation end:
 
-```
-Allocation start                                      Allocation end
-|                                                     |
-|  [-------- Job step runs --------]                  |
-|                                   ↑                 |
-|                            Step timeout             |
-|                  (--time=remaining - headroom)      |
-|                                                     |
-|<---------------- sigkill_headroom_seconds --------->|
+```mermaid
+gantt
+    title Step Timeout vs Allocation End
+    dateFormat X
+    axisFormat %s
+
+    section Allocation
+    Full allocation :active, 0, 100
+
+    section Job Step
+    Job step runs :done, 0, 70
+
+    section Timing
+    sigkill_headroom_seconds :crit, 70, 100
+    Step timeout (--time=remaining - headroom) :milestone, 70, 70
 ```
 
 This ensures:
@@ -254,11 +272,11 @@ execution_config:
   limit_resources: false  # Don't enforce limits during development
 ```
 
-### Production HPC
+### Production HPC (with Slurm integration)
 
 ```yaml
 execution_config:
-  mode: auto  # Use slurm inside allocations
+  mode: slurm
   srun_termination_signal: TERM@120
   sigkill_headroom_seconds: 300
 ```
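The relationship between the headroom setting and the step time limit in this configuration is simple arithmetic. A sketch, assuming the step limit is derived from the remaining allocation time as the timeline diagram earlier suggests (illustrative only, not Torc's code):

```python
def step_time_limit(remaining_alloc_seconds: int,
                    sigkill_headroom_seconds: int = 300) -> int:
    """Illustrative only: the srun step's time limit is the remaining
    allocation time minus the configured headroom, floored at zero."""
    return max(0, remaining_alloc_seconds - sigkill_headroom_seconds)

# With one hour left in the allocation and the 300-second headroom above:
print(step_time_limit(3600, 300))  # → 3300
```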