
Commit ca06651

Change default execution mode to direct (#242)
* Change default execution mode to direct

The Slurm execution mode does not work in all HPC environments:

- In one case, the sacct command submitted after job completion always failed.
- In another case, the HPC admin stated a strong preference to use direct mode in order to reduce the load on the Slurm servers.
1 parent 01c2a11 commit ca06651

23 files changed: +169 -75 lines

docs/src/core/concepts/execution-modes.md

Lines changed: 51 additions & 33 deletions

@@ -6,11 +6,11 @@ Slurm.
 
 ## Overview
 
-| Mode     | Description                                                                     |
-| -------- | ------------------------------------------------------------------------------- |
-| `direct` | Torc manages job execution directly without Slurm step wrapping                 |
-| `slurm`  | Jobs are wrapped with `srun`, letting Slurm manage resources and termination    |
-| `auto`   | Automatically selects `slurm` if `SLURM_JOB_ID` is set, otherwise uses `direct` |
+| Mode     | Description                                                                   |
+| -------- | ----------------------------------------------------------------------------- |
+| `direct` | Torc manages job execution directly without Slurm step wrapping (default)     |
+| `slurm`  | Jobs are wrapped with `srun`, letting Slurm manage resources and termination  |
+| `auto`   | Selects `slurm` if `SLURM_JOB_ID` is set, otherwise `direct`                  |
 
 Configure the execution mode in your workflow specification:
 

@@ -19,33 +19,45 @@ execution_config:
   mode: direct  # or "slurm" or "auto"
 ```
 
+> **Warning**: Use `auto` with caution. If your workflow runs inside a Slurm allocation (where
+> `SLURM_JOB_ID` is set), `auto` will silently select slurm mode, which wraps every job with `srun`.
+> This may not be what you want if your workflow is designed for direct execution. Prefer setting
+> the mode explicitly to avoid surprises.
+
 ## When to Use Each Mode
 
-### Direct Mode
+### Direct Mode (Default)
 
-Use direct mode when:
+Direct mode is the default and works everywhere: local machines, cloud VMs, containers, and inside
+Slurm allocations. Use direct mode when:
 
 - Running jobs **outside of Slurm** (local machine, cloud VMs, containers)
-- Running inside Slurm but **srun is unreliable** or has compatibility issues
-- You want Torc to **enforce memory limits** via OOM detection
-- You need **custom termination signals** (e.g., SIGINT for graceful shutdown)
-
-### Slurm Mode
+- Running inside Slurm but **srun has compatibility issues** with your environment
+- You want the **simplest, most portable** configuration
+- You want to run jobs **without resource limits** (`limit_resources: false`) to explore resource
+  requirements for new workloads
 
-Use slurm mode when:
+Direct mode is recommended as the starting point for most workflows. It avoids the overhead of
+creating Slurm job steps and works consistently across different HPC sites with varying Slurm
+configurations.
 
-- Running inside a **Slurm allocation** and want full Slurm integration
-- You want Slurm's **cgroup-based resource enforcement**
-- You need **sacct accounting** for job steps
-- HPC admins need **visibility into job steps** via Slurm tools
+### Slurm Mode
 
-### Auto Mode (Default)
+Use slurm mode when you need features that only Slurm can provide:
 
-Auto mode is the default and works well for most use cases:
+- **Hardware-level resource control**: Slurm's cgroup enforcement can be more precise than Torc's
+  process-level monitoring, especially for GPU isolation and CPU binding on newer hardware
+- **Per-job accounting**: Each job appears as a separate step in `sacct`, giving detailed resource
+  usage breakdowns per job rather than a single entry for the whole Torc worker allocation
+- **Admin visibility**: HPC admins can see and manage individual job steps via Slurm tools
+  (`squeue`, `sacct`, `scontrol`), which is useful for debugging and auditing
+- **Cgroup-based memory enforcement**: Slurm's cgroup limits provide hard memory boundaries with no
+  sampling delay, compared to Torc's periodic polling in direct mode
+- **CPU binding**: `srun` can bind tasks to specific CPU cores (`enable_cpu_bind: true`), which may
+  improve cache locality for CPU-intensive workloads
 
-- Detects Slurm by checking for `SLURM_JOB_ID` environment variable
-- Uses `slurm` mode inside allocations, `direct` mode outside
-- No configuration needed for portable workflows
+> **Note**: Some HPC sites may prefer one mode over the other. Check with your site admins if you
+> are uncertain which mode to use.
 
 ## Direct Mode
 

@@ -190,15 +202,21 @@ execution_config:
 
 The `sigkill_headroom_seconds` setting creates a buffer between step timeouts and allocation end:
 
-```
-Allocation start                                    Allocation end
-|                                                               |
-|  [-------- Job step runs --------]                            |
-|                                  ↑                            |
-|                             Step timeout                      |
-|                  (--time=remaining - headroom)                |
-|                                                               |
-|<---------------- sigkill_headroom_seconds ------------>|
+```mermaid
+gantt
+    title Step Timeout vs Allocation End
+    dateFormat X
+    axisFormat %s
+
+    section Allocation
+    Full allocation :active, 0, 100
+
+    section Job Step
+    Job step runs :done, 0, 70
+
+    section Timing
+    sigkill_headroom_seconds :crit, 70, 100
+    Step timeout (--time=remaining - headroom) :milestone, 70, 70
 ```
 
 This ensures:
@@ -254,11 +272,11 @@ execution_config:
   limit_resources: false  # Don't enforce limits during development
 ```
 
-### Production HPC
+### Production HPC (with Slurm integration)
 
 ```yaml
 execution_config:
-  mode: auto  # Use slurm inside allocations
+  mode: slurm
   srun_termination_signal: TERM@120
   sigkill_headroom_seconds: 300
 ```
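The `auto` rule described in these docs (pick `slurm` when `SLURM_JOB_ID` is set, `direct` otherwise) can be sketched in shell. This is an illustration of the documented selection rule only, not Torc's actual implementation:

```shell
# Mirror of the documented `auto` selection rule: slurm inside an allocation
# (SLURM_JOB_ID set), direct everywhere else.
detect_mode() {
  if [ -n "${SLURM_JOB_ID:-}" ]; then
    echo "slurm"
  else
    echo "direct"
  fi
}

unset SLURM_JOB_ID
detect_mode            # prints "direct" outside an allocation

SLURM_JOB_ID=12345
detect_mode            # prints "slurm" inside an allocation
```

This also shows why the commit's warning matters: the selection is driven entirely by an environment variable, so a workflow designed for direct execution silently flips to slurm mode when launched from inside an allocation.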

docs/src/core/reference/workflow-spec.md

Lines changed: 21 additions & 16 deletions

@@ -179,18 +179,19 @@ creation time.
 
 ### Shared fields (both modes)
 
-| Name                       | Type                        | Default   | Description                                            |
-| -------------------------- | --------------------------- | --------- | ------------------------------------------------------ |
-| `mode`                     | string                      | `"auto"`  | Execution mode: `"direct"`, `"slurm"`, or `"auto"`     |
-| `sigkill_headroom_seconds` | integer                     | `60`      | Seconds before end_time for SIGKILL or srun --time     |
-| `timeout_exit_code`        | integer                     | `152`     | Exit code for timed-out jobs (matches Slurm TIMEOUT)   |
-| `staggered_start`          | boolean                     | `true`    | Stagger job runner startup to mitigate thundering herd |
-| `stdio`                    | [StdioConfig](#stdioconfig) | see below | Workflow-level default for stdout/stderr capture       |
+| Name                       | Type                        | Default    | Description                                            |
+| -------------------------- | --------------------------- | ---------- | ------------------------------------------------------ |
+| `mode`                     | string                      | `"direct"` | Execution mode: `"direct"`, `"slurm"`, or `"auto"`     |
+| `sigkill_headroom_seconds` | integer                     | `60`       | Seconds before end_time for SIGKILL or srun --time     |
+| `timeout_exit_code`        | integer                     | `152`      | Exit code for timed-out jobs (matches Slurm TIMEOUT)   |
+| `staggered_start`          | boolean                     | `true`     | Stagger job runner startup to mitigate thundering herd |
+| `stdio`                    | [StdioConfig](#stdioconfig) | see below  | Workflow-level default for stdout/stderr capture       |
 
 ### Direct mode fields
 
-These fields only apply when the effective mode is `direct`. Setting them with `mode: slurm` (or
-`mode: auto` with `slurm_schedulers`) produces a validation error.
+These fields only apply when the effective mode is `direct`. Setting them with `mode: slurm`
+produces a validation error. When `mode: auto`, validation checks the effective mode based on
+whether Slurm schedulers are present in the spec.
 
 | Name                   | Type    | Default     | Description                                           |
 | ---------------------- | ------- | ----------- | ----------------------------------------------------- |
@@ -201,8 +202,9 @@ These fields only apply when the effective mode is `direct`. Setting them with `
 
 ### Slurm mode fields
 
-These fields only apply when the effective mode is `slurm`. Setting them with `mode: direct` (or
-`mode: auto` without `slurm_schedulers`) produces a validation error.
+These fields only apply when the effective mode is `slurm`. Setting them with `mode: direct`
+produces a validation error. When `mode: auto`, validation checks the effective mode based on
+whether Slurm schedulers are present in the spec.
 
 | Name                      | Type    | Default | Description                             |
 | ------------------------- | ------- | ------- | --------------------------------------- |
@@ -260,11 +262,14 @@ jobs:
 
 ### Execution Modes
 
-| Mode     | Description                                                                    |
-| -------- | ------------------------------------------------------------------------------ |
-| `direct` | Torc manages job execution directly. Use outside Slurm or when srun unreliable |
-| `slurm`  | Jobs wrapped with `srun`. Slurm manages resource limits and termination        |
-| `auto`   | Uses `slurm` if `SLURM_JOB_ID` is set, otherwise `direct` (default)            |
+| Mode     | Description                                                              |
+| -------- | ------------------------------------------------------------------------ |
+| `direct` | Torc manages job execution directly (default). Works everywhere          |
+| `slurm`  | Jobs wrapped with `srun`. Slurm manages resource limits and termination  |
+| `auto`   | Selects `slurm` if `SLURM_JOB_ID` is set, otherwise `direct`             |
+
+> **Warning**: `auto` will silently select slurm mode when running inside a Slurm allocation. Prefer
+> setting the mode explicitly to avoid unexpected behavior.
 
 ### Direct Mode Example
 
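The shared fields documented in this file, collected into one spec fragment. Values match the defaults listed in the table; the comments are illustrative:

```yaml
execution_config:
  mode: direct                  # new default; "slurm" and "auto" also accepted
  sigkill_headroom_seconds: 60  # seconds before end_time for SIGKILL / srun --time
  timeout_exit_code: 152        # matches Slurm's TIMEOUT exit code
  staggered_start: true         # stagger runner startup to mitigate thundering herd
```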

docs/src/specialized/admin/configuration-files.md

Lines changed: 1 addition & 1 deletion

@@ -161,7 +161,7 @@ export TORC_CLIENT__API_URL="${CI_TORC_SERVER}"
 export TORC_CLIENT__FORMAT="json"
 
 torc run workflow.yaml
-result=$(torc status $WORKFLOW_ID | jq -r '.status')
+result=$(torc status $WORKFLOW_ID | jq -r '.is_complete')
 ```
 
 ### HPC Cluster
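The CI example in this file switches the jq filter from `.status` to `.is_complete`. A runnable sketch of the updated check, with the `torc status` output stubbed as a JSON literal — the surrounding field names are assumptions; only `is_complete` comes from the diff:

```shell
# Stub of `torc status $WORKFLOW_ID` JSON output (hypothetical fields,
# except `is_complete`, which the CI script now reads).
status_json='{"workflow_id": "wf-001", "is_complete": true}'

result=$(printf '%s' "$status_json" | jq -r '.is_complete')

if [ "$result" = "true" ]; then
  echo "workflow finished"
else
  echo "still running"
fi
```

Note that `jq -r` emits the bare boolean (`true`/`false`), so the string comparison against `"true"` is the simplest completion check.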

docs/src/specialized/hpc/custom-hpc-profile.md

Lines changed: 4 additions & 2 deletions

@@ -287,10 +287,12 @@ Now you can submit workflows using your custom profile:
 
 ```bash
 # Auto-detect the profile (if on the cluster)
-torc slurm generate --account my_project workflow.yaml && torc submit workflow.yaml
+torc slurm generate --account my_project -o workflow_slurm.yaml workflow.yaml
+torc submit workflow_slurm.yaml
 
 # Or explicitly specify the profile
-torc slurm generate --account my_project --hpc-profile && torc submit --hpc-profile research workflow.yaml
+torc slurm generate --account my_project --profile research -o workflow_slurm.yaml workflow.yaml
+torc submit workflow_slurm.yaml
 ```
 
 ## Advanced Configuration

docs/src/specialized/hpc/hpc-profiles.md

Lines changed: 0 additions & 1 deletion

@@ -158,7 +158,6 @@ HPC profiles are used by Slurm-related commands to automatically generate schedu
 See [Advanced Slurm Configuration](./slurm.md) for details on:
 
 - `torc slurm generate` + `torc submit` - Submit workflows with auto-generated schedulers
-- `torc create-slurm` - Create workflows with auto-generated schedulers
 
 ## See Also
 

docs/src/specialized/hpc/slurm-exit-codes.md

Lines changed: 2 additions & 2 deletions

@@ -32,8 +32,8 @@ $ torc slurm sacct $WORKFLOW_ID
 ╰──────────────────────┴───────────────┴──────────╯
 ```
 
-**Fix:** Increase `memory` in resource requirements, or use
-`torc workflows check-resources --correct` to auto-adjust based on peak usage.
+**Fix:** Increase `memory` in resource requirements, or use `torc workflows correct-resources` to
+auto-adjust based on peak usage.
 
 ### Timeout with Graceful Shutdown (exit code 0)
 

slurm-tests/workflows/cancel_workflow.yaml

Lines changed: 2 additions & 0 deletions

@@ -7,6 +7,8 @@
 name: cancel_workflow
 description: Workflow cancellation test — cancel while jobs are running
 project: slurm-tests
+execution_config:
+  mode: slurm
 
 resource_requirements:
 - name: sleep_resources

slurm-tests/workflows/failure_recovery.yaml

Lines changed: 2 additions & 0 deletions

@@ -9,6 +9,8 @@ name: failure_recovery
 description: Test workflow for Slurm job retry with failure handlers
 project: slurm-tests
 metadata: '{"test_type": "failure_recovery", "stages": 3}'
+execution_config:
+  mode: slurm
 
 failure_handlers:
 - name: retry_on_exit_42

slurm-tests/workflows/job_parallelism.yaml

Lines changed: 2 additions & 0 deletions

@@ -14,6 +14,8 @@
 name: job_parallelism
 description: Job-based parallelism — no resource requirements, controlled by --max-parallel-jobs
 project: slurm-tests
+execution_config:
+  mode: slurm
 
 resource_monitor:
   enabled: true

slurm-tests/workflows/multi_node_mpi_step.yaml

Lines changed: 2 additions & 0 deletions

@@ -7,6 +7,8 @@
 name: multi_node_mpi_step
 description: Test srun step spanning two nodes (num_nodes=2)
 project: slurm-tests
+execution_config:
+  mode: slurm
 
 resource_requirements:
 - name: mpi_resources
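The four test-workflow diffs above all apply the same change: pinning `mode: slurm` now that the default is `direct`. The pattern, as one spec fragment (the `name` value here is a hypothetical placeholder):

```yaml
name: my_slurm_test        # hypothetical workflow name
project: slurm-tests
execution_config:
  mode: slurm              # explicit, since these tests exercise srun step wrapping
```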
