Commit 44bf397

daniel-thom and claude authored
Prohibit limit_resources: false in Slurm mode and validate execution_config fields (#228)
* Omit --exact from srun when limit_resources is false

  When limit_resources=false, srun was still passed --exact, which causes it to default --cpus-per-task to 1. This silently restricted multi-threaded jobs to a single CPU core via cgroups, causing significant slowdowns (45%+ observed) compared to direct execution mode.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* SQLite synchronous and increased cache_size

* Increase max record transfer size from 10,000 to 100,000

  Centralizes the limit as MAX_RECORD_TRANSFER_COUNT in lib.rs and updates all server pagination, client batch creation, and API defaults to use it. Also increases the default max request body size to 200 MiB to accommodate larger payloads.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Prohibit limit_resources: false in Slurm mode

  Slurm mode always requires resource limits for correct srun behavior. Setting limit_resources: false with mode: slurm now returns a validation error directing users to use direct mode instead. Removes the limit_resources=false code paths from srun argument building and updates documentation and tests accordingly.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Validate execution_config fields against effective mode

  Reject mode-incompatible fields at workflow creation time: direct-only fields (termination_signal, sigterm_lead_seconds, oom_exit_code) error in slurm mode, and slurm-only fields (srun_termination_signal, enable_cpu_bind) error in direct mode. Auto mode infers from slurm_schedulers presence. Restructures docs and struct comments to group fields by shared/direct/slurm.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 884a13d commit 44bf397

29 files changed, 648 additions and 332 deletions

api/openapi.yaml

Lines changed: 3 additions & 3 deletions
@@ -3051,7 +3051,7 @@ paths:
 name: limit
 required: false
 schema:
-default: 10000
+default: 100000
 type: integer
 style: form
 responses:
@@ -4169,7 +4169,7 @@ paths:
 name: limit
 required: false
 schema:
-default: 10000
+default: 100000
 type: integer
 style: form
 responses:
@@ -6388,7 +6388,7 @@ paths:
 name: limit
 required: false
 schema:
-default: 10000
+default: 100000
 type: integer
 style: form
 responses:

docs/src/core/concepts/execution-modes.md

Lines changed: 19 additions & 24 deletions
@@ -169,24 +169,18 @@ Key flags:

 ### Resource Enforcement in Slurm Mode

-When `limit_resources: true`:
+Slurm mode always passes `--cpus-per-task`, `--mem`, and `--gpus` to srun. Slurm's cgroups enforce
+these limits: jobs exceeding memory are killed by Slurm with exit code 137.

-- `--cpus-per-task`, `--mem`, and `--gpus` are passed to srun
-- Slurm's cgroups enforce these limits
-- Jobs exceeding memory are killed by Slurm with exit code 137
-
-When `limit_resources: false`:
-
-- CPU and memory flags are omitted
-- Jobs can use any available resources in the allocation
-- GPU flags are still passed (required for GPU access)
+> **Note**: `limit_resources: false` is not supported in Slurm mode. If you need to run jobs without
+> resource enforcement inside a Slurm allocation, use `mode: direct` instead. See
+> [Disabling Resource Limits](#disabling-resource-limits) below.

 ### Slurm Mode Configuration

 ```yaml
 execution_config:
   mode: slurm
-  limit_resources: true # Pass resource limits to srun
   srun_termination_signal: TERM@120 # Send SIGTERM 120s before step timeout
   sigkill_headroom_seconds: 180 # End steps 3 minutes before allocation ends
   enable_cpu_bind: false # Set to true to enable Slurm CPU binding
@@ -215,26 +209,28 @@ This ensures:

 ## Disabling Resource Limits

-Set `limit_resources: false` to disable resource enforcement:
+Set `limit_resources: false` to disable resource enforcement in direct mode:

 ```yaml
 execution_config:
-  mode: direct # or slurm
+  mode: direct
   limit_resources: false
 ```

-Effects:
+This is useful when exploring resource requirements for new jobs or during local development. Jobs
+can use any available system resources without being killed for exceeding their declared limits.

-| Feature | limit_resources: true | limit_resources: false |
-| ---------------------- | ------------------------- | ----------------------- |
-| Memory limits (direct) | OOM detection and SIGKILL | No enforcement |
-| Memory limits (slurm) | srun --mem passed | --mem omitted |
-| CPU limits (slurm) | srun --cpus-per-task | --cpus-per-task omitted |
-| GPU allocation | Always passed to srun | Always passed to srun |
-| Timeout termination | Enforced | Enforced |
+Effects in direct mode:

-Note: GPU allocation is always requested regardless of `limit_resources` because jobs need explicit
-GPU access in Slurm.
+| Feature | limit_resources: true | limit_resources: false |
+| ------------------- | ------------------------- | ---------------------- |
+| Memory limits | OOM detection and SIGKILL | No enforcement |
+| Timeout termination | Enforced | Enforced |
+
+> **Important**: `limit_resources: false` is only supported in direct mode. Setting it with
+> `mode: slurm` will produce an error at workflow creation time. Slurm mode relies on `srun` with
+> `--exact`, `--cpus-per-task`, and `--mem` for correct concurrent job execution — omitting these
+> flags causes jobs to run sequentially instead of in parallel.

 ## Exit Codes

@@ -263,7 +259,6 @@ execution_config:
 ```yaml
 execution_config:
   mode: auto # Use slurm inside allocations
-  limit_resources: true
   srun_termination_signal: TERM@120
   sigkill_headroom_seconds: 300
 ```
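
As a rough illustration of the Slurm-mode behavior documented in execution-modes.md, the sketch below assembles an srun argument list that always carries `--exact` and the resource flags. The helper name `build_srun_args`, its parameters, the megabyte suffix on `--mem`, and the choice to emit `--gpus` only when a job requests GPUs are assumptions made for this example; this is not Torc's actual argument builder.

```python
from typing import List, Optional


def build_srun_args(
    command: List[str],
    cpus: int,
    memory_mb: int,
    gpus: int = 0,
    srun_termination_signal: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: build an srun command line for one job step."""
    if cpus < 1 or memory_mb < 1:
        raise ValueError("cpus and memory_mb must be positive")
    # Resource flags are always present; there is no "unlimited" code path.
    args = [
        "srun",
        "--exact",
        f"--cpus-per-task={cpus}",
        f"--mem={memory_mb}M",
    ]
    # Assumption: only request GPUs when the job declares at least one.
    if gpus > 0:
        args.append(f"--gpus={gpus}")
    # Optional signal spec such as "TERM@120", passed through unchanged.
    if srun_termination_signal is not None:
        args.append(f"--signal={srun_termination_signal}")
    return args + list(command)
```

Because `--exact` makes srun default `--cpus-per-task` to 1 when the flag is omitted, keeping both flags together in one place avoids the single-core slowdown described in the commit message.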

docs/src/core/reference/cli.md

Lines changed: 12 additions & 12 deletions
@@ -450,7 +450,7 @@ List workflows

 - `-u`, `--user <USER>` — User to filter by (defaults to USER environment variable)
 - `--all-users` — List workflows for all users (overrides --user)
-- `-l`, `--limit <LIMIT>` — Maximum number of workflows to return. Default: `10000`
+- `-l`, `--limit <LIMIT>` — Maximum number of workflows to return. Default: `100000`
 - `--offset <OFFSET>` — Offset for pagination (0-based). Default: `0`
 - `--sort-by <SORT_BY>` — Field to sort by
 - `--reverse-sort` — Reverse sort order
@@ -849,7 +849,7 @@ List compute nodes for a workflow

 ###### **Options:**

-- `-l`, `--limit <LIMIT>` — Maximum number of compute nodes to return. Default: `10000`
+- `-l`, `--limit <LIMIT>` — Maximum number of compute nodes to return. Default: `100000`
 - `-o`, `--offset <OFFSET>` — Offset for pagination (0-based). Default: `0`
 - `-s`, `--sort-by <SORT_BY>` — Field to sort by
 - `-r`, `--reverse-sort` — Reverse sort order. Default: `false`
@@ -898,7 +898,7 @@ List files
 ###### **Options:**

 - `--produced-by-job-id <PRODUCED_BY_JOB_ID>` — Filter by job ID that produced the files
-- `-l`, `--limit <LIMIT>` — Maximum number of files to return. Default: `10000`
+- `-l`, `--limit <LIMIT>` — Maximum number of files to return. Default: `100000`
 - `--offset <OFFSET>` — Offset for pagination (0-based). Default: `0`
 - `--sort-by <SORT_BY>` — Field to sort by
 - `--reverse-sort` — Reverse sort order
@@ -1030,7 +1030,7 @@ List jobs

 - `-s`, `--status <STATUS>` — Filter by job status
 - `--upstream-job-id <UPSTREAM_JOB_ID>` — Filter by upstream job ID (jobs that depend on this job)
-- `-l`, `--limit <LIMIT>` — Maximum number of jobs to return. Default: `10000`
+- `-l`, `--limit <LIMIT>` — Maximum number of jobs to return. Default: `100000`
 - `--offset <OFFSET>` — Offset for pagination (0-based). Default: `0`
 - `--sort-by <SORT_BY>` — Field to sort by
 - `--reverse-sort` — Reverse sort order
@@ -1120,7 +1120,7 @@ List job-to-job dependencies for a workflow

 ###### **Options:**

-- `-l`, `--limit <LIMIT>` — Maximum number of dependencies to return. Default: `10000`
+- `-l`, `--limit <LIMIT>` — Maximum number of dependencies to return. Default: `100000`
 - `--offset <OFFSET>` — Offset for pagination (0-based). Default: `0`

 ## `torc job-dependencies job-file`
@@ -1135,7 +1135,7 @@ List job-file relationships for a workflow

 ###### **Options:**

-- `-l`, `--limit <LIMIT>` — Maximum number of relationships to return. Default: `10000`
+- `-l`, `--limit <LIMIT>` — Maximum number of relationships to return. Default: `100000`
 - `--offset <OFFSET>` — Offset for pagination (0-based). Default: `0`

 ## `torc job-dependencies job-user-data`
@@ -1150,7 +1150,7 @@ List job-user_data relationships for a workflow

 ###### **Options:**

-- `-l`, `--limit <LIMIT>` — Maximum number of relationships to return. Default: `10000`
+- `-l`, `--limit <LIMIT>` — Maximum number of relationships to return. Default: `100000`
 - `--offset <OFFSET>` — Offset for pagination (0-based). Default: `0`

 ## `torc resource-requirements`
@@ -1200,7 +1200,7 @@ List resource requirements

 ###### **Options:**

-- `-l`, `--limit <LIMIT>` — Maximum number of resource requirements to return. Default: `10000`
+- `-l`, `--limit <LIMIT>` — Maximum number of resource requirements to return. Default: `100000`
 - `--offset <OFFSET>` — Offset for pagination (0-based). Default: `0`
 - `--sort-by <SORT_BY>` — Field to sort by
 - `--reverse-sort` — Reverse sort order
@@ -1285,7 +1285,7 @@ List events for a workflow
 ###### **Options:**

 - `-c`, `--category <CATEGORY>` — Filter events by category
-- `-l`, `--limit <LIMIT>` — Maximum number of events to return. Default: `10000`
+- `-l`, `--limit <LIMIT>` — Maximum number of events to return. Default: `100000`
 - `-o`, `--offset <OFFSET>` — Offset for pagination (0-based). Default: `0`
 - `-s`, `--sort-by <SORT_BY>` — Field to sort by
 - `-r`, `--reverse-sort` — Reverse sort order. Default: `false`
@@ -1357,7 +1357,7 @@ List results
 - `--failed` — Show only failed jobs (non-zero return code)
 - `-s`, `--status <STATUS>` — Filter by job status (uninitialized, blocked, canceled, terminated,
 done, ready, scheduled, running, pending, disabled)
-- `-l`, `--limit <LIMIT>` — Maximum number of results to return. Default: `10000`
+- `-l`, `--limit <LIMIT>` — Maximum number of results to return. Default: `100000`
 - `--offset <OFFSET>` — Offset for pagination (0-based). Default: `0`
 - `--sort-by <SORT_BY>` — Field to sort by
 - `--reverse-sort` — Reverse sort order
@@ -1574,7 +1574,7 @@ Show the current Slurm configs in the database

 ###### **Options:**

-- `-l`, `--limit <LIMIT>` — Maximum number of configs to return. Default: `10000`
+- `-l`, `--limit <LIMIT>` — Maximum number of configs to return. Default: `100000`
 - `--offset <OFFSET>` — Offset for pagination (0-based). Default: `0`

 ## `torc slurm get`
@@ -1930,7 +1930,7 @@ List scheduled compute nodes for a workflow

 ###### **Options:**

-- `-l`, `--limit <LIMIT>` — Maximum number of scheduled compute nodes to return. Default: `10000`
+- `-l`, `--limit <LIMIT>` — Maximum number of scheduled compute nodes to return. Default: `100000`
 - `-o`, `--offset <OFFSET>` — Offset for pagination (0-based). Default: `0`
 - `-s`, `--sort-by <SORT_BY>` — Field to sort by
 - `-r`, `--reverse-sort` — Reverse sort order. Default: `false`

docs/src/core/reference/workflow-spec.md

Lines changed: 35 additions & 15 deletions
@@ -173,21 +173,41 @@ Defines a Slurm HPC job scheduler configuration.

 ## ExecutionConfig

-Controls how jobs are executed and terminated. Supports three modes for different execution
-environments.
-
-| Name | Type | Default | Description |
-| -------------------------- | --------------------------- | ----------- | ------------------------------------------------------------- |
-| `mode` | string | `"auto"` | Execution mode: `"direct"`, `"slurm"`, or `"auto"` |
-| `limit_resources` | boolean | `true` | Enforce memory/CPU limits |
-| `termination_signal` | string | `"SIGTERM"` | Signal to send before SIGKILL (direct mode) |
-| `sigterm_lead_seconds` | integer | `30` | Seconds before SIGKILL to send termination signal |
-| `sigkill_headroom_seconds` | integer | `60` | Seconds before end_time for SIGKILL or srun --time adjustment |
-| `timeout_exit_code` | integer | `152` | Exit code for timed-out jobs (matches Slurm TIMEOUT) |
-| `oom_exit_code` | integer | `137` | Exit code for OOM-killed jobs (128 + SIGKILL) |
-| `srun_termination_signal` | string | none | Slurm signal spec for `srun --signal=<value>` |
-| `enable_cpu_bind` | boolean | `false` | Allow Slurm CPU binding |
-| `stdio` | [StdioConfig](#stdioconfig) | see below | Workflow-level default for stdout/stderr capture |
+Controls how jobs are executed and terminated. Fields are grouped by which execution mode they apply
+to. Setting a field that doesn't match the effective mode produces a validation error at workflow
+creation time.
+
+### Shared fields (both modes)
+
+| Name | Type | Default | Description |
+| -------------------------- | --------------------------- | --------- | ------------------------------------------------------ |
+| `mode` | string | `"auto"` | Execution mode: `"direct"`, `"slurm"`, or `"auto"` |
+| `sigkill_headroom_seconds` | integer | `60` | Seconds before end_time for SIGKILL or srun --time |
+| `timeout_exit_code` | integer | `152` | Exit code for timed-out jobs (matches Slurm TIMEOUT) |
+| `staggered_start` | boolean | `true` | Stagger job runner startup to mitigate thundering herd |
+| `stdio` | [StdioConfig](#stdioconfig) | see below | Workflow-level default for stdout/stderr capture |
+
+### Direct mode fields
+
+These fields only apply when the effective mode is `direct`. Setting them with `mode: slurm` (or
+`mode: auto` with `slurm_schedulers`) produces a validation error.
+
+| Name | Type | Default | Description |
+| ---------------------- | ------- | ----------- | ----------------------------------------------------- |
+| `limit_resources` | boolean | `true` | Monitor memory/CPU and kill jobs that exceed limits |
+| `termination_signal` | string | `"SIGTERM"` | Signal to send before SIGKILL for graceful shutdown |
+| `sigterm_lead_seconds` | integer | `30` | Seconds before SIGKILL to send the termination signal |
+| `oom_exit_code` | integer | `137` | Exit code for OOM-killed jobs (128 + SIGKILL) |
+
+### Slurm mode fields
+
+These fields only apply when the effective mode is `slurm`. Setting them with `mode: direct` (or
+`mode: auto` without `slurm_schedulers`) produces a validation error.
+
+| Name | Type | Default | Description |
+| ------------------------- | ------- | ------- | --------------------------------------- |
+| `srun_termination_signal` | string | none | Signal spec for `srun --signal=<value>` |
+| `enable_cpu_bind` | boolean | `false` | Allow Slurm CPU binding (`--cpu-bind`) |

 ### StdioConfig

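
The validation rule described in this hunk (fields must match the effective mode, with `auto` resolved from the presence of `slurm_schedulers`) might be sketched as follows. The function names, the dictionary-based config, and the error wording are assumptions for illustration; the field groupings come from the commit message. Treating only `limit_resources: false` (not `true`) as an error in Slurm mode is also an assumption.

```python
from typing import Any, Dict, List

# Field groupings taken from the commit message.
DIRECT_ONLY_FIELDS = ("termination_signal", "sigterm_lead_seconds", "oom_exit_code")
SLURM_ONLY_FIELDS = ("srun_termination_signal", "enable_cpu_bind")


def effective_mode(config: Dict[str, Any], has_slurm_schedulers: bool) -> str:
    """Resolve `auto` to `slurm` or `direct` (hypothetical helper)."""
    mode = config.get("mode", "auto")
    if mode == "auto":
        return "slurm" if has_slurm_schedulers else "direct"
    if mode not in ("direct", "slurm"):
        raise ValueError(f"unknown execution mode: {mode!r}")
    return mode


def validate_execution_config(
    config: Dict[str, Any], has_slurm_schedulers: bool
) -> List[str]:
    """Return a list of validation errors; an empty list means the config is valid."""
    mode = effective_mode(config, has_slurm_schedulers)
    errors: List[str] = []
    if mode == "slurm":
        for field in DIRECT_ONLY_FIELDS:
            if field in config:
                errors.append(f"{field} is only valid in direct mode")
        # Assumption: only an explicit `false` is rejected in Slurm mode.
        if config.get("limit_resources") is False:
            errors.append(
                "limit_resources: false is not supported in slurm mode; use mode: direct"
            )
    else:
        for field in SLURM_ONLY_FIELDS:
            if field in config:
                errors.append(f"{field} is only valid in slurm mode")
    return errors
```

Collecting every error before reporting, rather than failing on the first, lets a user fix a workflow spec in one pass. Whether Torc reports one error or several is not stated in the source.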
docs/src/core/workflows/workflow-formats.md

Lines changed: 14 additions & 11 deletions
@@ -250,7 +250,7 @@ modes:
 ```yaml
 execution_config:
   mode: direct # Options: direct, slurm, auto
-  limit_resources: true # Enforce memory/CPU limits (default: true)
+  limit_resources: true # Enforce memory limits in direct mode (default: true)

   # Direct mode settings
   termination_signal: SIGTERM # Signal before SIGKILL (default: SIGTERM)
@@ -293,7 +293,7 @@ execution_config {
 | Field | Type | Default | Description |
 | -------------------------- | ------ | --------- | ------------------------------------------------- |
 | `mode` | string | `auto` | Execution mode: `direct`, `slurm`, or `auto` |
-| `limit_resources` | bool | `true` | Enforce memory/CPU limits |
+| `limit_resources` | bool | `true` | Enforce memory limits in direct mode only |
 | `termination_signal` | string | `SIGTERM` | Signal to send before SIGKILL (direct mode) |
 | `sigterm_lead_seconds` | int | `30` | Seconds before SIGKILL to send termination signal |
 | `sigkill_headroom_seconds` | int | `60` | Seconds before end_time to send SIGKILL |
@@ -358,21 +358,24 @@ limit_resources: true
 # New style - use execution_config
 execution_config:
   mode: slurm # Replaces use_srun: true
-  limit_resources: true
   srun_termination_signal: "TERM@120"
   sigkill_headroom_seconds: 180 # New: controls srun --time headroom
 ```

 **Migration mapping:**

-| Old Field | New Field in execution_config |
-| ------------------------- | ----------------------------- |
-| `use_srun: true` | `mode: slurm` |
-| `use_srun: false` | `mode: direct` |
-| (not set) | `mode: auto` (default) |
-| `limit_resources` | `limit_resources` |
-| `srun_termination_signal` | `srun_termination_signal` |
-| `enable_cpu_bind` | `enable_cpu_bind` |
+| Old Field | New Field in execution_config |
+| ------------------------- | -------------------------------------------- |
+| `use_srun: true` | `mode: slurm` |
+| `use_srun: false` | `mode: direct` |
+| (not set) | `mode: auto` (default) |
+| `limit_resources: false` | `mode: direct` with `limit_resources: false` |
+| `srun_termination_signal` | `srun_termination_signal` |
+| `enable_cpu_bind` | `enable_cpu_bind` |
+
+> **Note**: `limit_resources: false` is only supported with `mode: direct`. If you previously used
+> `limit_resources: false` with srun, switch to `mode: direct` to get the same behavior (jobs run
+> without resource enforcement).

 ## Common Features Across All Formats

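
The migration mapping in this hunk can be read as a small transformation from the legacy top-level fields to an `execution_config` block. The sketch below is one possible reading under stated assumptions: `migrate_legacy_fields` is a hypothetical helper (Torc may not ship any such tool), and dropping the Slurm-only fields when the result is `mode: direct` is an assumption made so that the output would pass the mode validation described in the commit message.

```python
from typing import Any, Dict


def migrate_legacy_fields(legacy: Dict[str, Any]) -> Dict[str, Any]:
    """Map legacy workflow fields to an execution_config dict (illustrative only)."""
    use_srun = legacy.get("use_srun")
    if use_srun is True:
        mode = "slurm"
    elif use_srun is False:
        mode = "direct"
    else:
        mode = "auto"  # field not set: fall back to the default mode

    config: Dict[str, Any] = {"mode": mode}

    # limit_resources: false is only supported in direct mode, so it forces the mode.
    if legacy.get("limit_resources") is False:
        config["mode"] = "direct"
        config["limit_resources"] = False

    # Assumption: Slurm-only fields are carried over unless the result is direct mode,
    # where they would be rejected at workflow creation time.
    if config["mode"] != "direct":
        for field in ("srun_termination_signal", "enable_cpu_bind"):
            if field in legacy:
                config[field] = legacy[field]
    return config
```

For example, a legacy spec with `use_srun: true` and `limit_resources: false` maps to `mode: direct` with `limit_resources: false`, which matches the note above about switching to direct mode to keep jobs running without resource enforcement.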