
Commit 3e109ea

Add Common Pain Points page for OOM, accounts, and wall time
Covers three recurring migration issues: debugging SLURM OOM errors (hard vs soft memory limits), understanding SLURM accounts and fair-share scheduling, and strict wall-time enforcement differences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a7604bb commit 3e109ea

File tree

2 files changed: +240 −0 lines changed

docs/pain-points.md

Lines changed: 239 additions & 0 deletions
# Common Pain Points

This page covers recurring issues that Bodhi users encounter when migrating from LSF to SLURM. These aren't simple directive swaps — they're behavioral differences that catch people off guard.

---

## Debugging OOM (Out-of-Memory) errors

### How OOM kills look in SLURM

When a job exceeds its memory allocation, SLURM kills it immediately. The job state is set to `OUT_OF_MEMORY`:

```bash
$ sacct -j 12345 --format=JobID,JobName,State,ExitCode,MaxRSS
JobID           JobName      State ExitCode     MaxRSS
------------ ---------- ---------- -------- ----------
12345          analysis OUT_OF_ME+    0:125
12345.batch       batch OUT_OF_ME+    0:125      15.8G
```
You can also see this with `seff`:

```bash
$ seff 12345
Job ID: 12345
State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 15.80 GB
Memory Efficiency: 98.75% of 16.00 GB
```

!!! warning "This is different from LSF"

    On Bodhi's LSF, memory limits were often **soft limits** — jobs could exceed their requested memory without being killed (as long as the node had memory available). In SLURM, `--mem` is a **hard limit** enforced by cgroups. If your job exceeds it, even briefly, it will be killed.
### Diagnosing memory usage

**For completed jobs**, use `sacct`:

```bash
# Check peak memory usage
sacct -j <jobid> --format=JobID,JobName,MaxRSS,MaxVMSize,State

# For array jobs, check all tasks
sacct -j <jobid> --format=JobID%20,JobName,MaxRSS,State
```
**For running jobs**, use `sstat`:

```bash
# Monitor memory of a running job
sstat -j <jobid> --format=JobID,MaxRSS,MaxVMSize
```

!!! tip "Use `seff` for quick checks"

    `seff <jobid>` gives a one-line summary of memory efficiency for completed jobs. It's the fastest way to check if your job was close to its memory limit.
### Fixing OOM errors

1. **Check what your job actually used** — run `seff <jobid>` on a similar completed job to see actual peak memory.

2. **Request more memory with headroom** — add 20–30% buffer above the observed peak:

    ```bash
    #SBATCH --mem=20G   # if your job peaked at ~15 GB
    ```

3. **Use `--mem-per-cpu` for multi-threaded jobs** if your job scales memory with cores:

    ```bash
    #SBATCH --cpus-per-task=8
    #SBATCH --mem-per-cpu=4G   # 32 GB total
    ```
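The headroom rule in step 2 can be computed rather than eyeballed. A minimal sketch in shell (the `mem_with_headroom` helper is hypothetical, not a Bodhi tool), assuming `MaxRSS` values in the `G`/`M` suffix form that `sacct` prints:

```bash
# Hypothetical helper: turn an observed MaxRSS (e.g. "15.8G" or "812M")
# into an sbatch --mem request with ~25% headroom, rounded up.
mem_with_headroom() {
    local maxrss=$1
    echo "$maxrss" | awk '
        /G$/ { printf "%dG\n", int($0 * 1.25) + 1 }   # awk reads the numeric prefix
        /M$/ { printf "%dM\n", int($0 * 1.25) + 1 }
    '
}

mem_with_headroom 15.8G   # prints 20G
```

Feed it the `MaxRSS` of a representative completed job and use the result for `--mem`.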
<!-- TODO: verify default --mem value on Bodhi if no --mem is specified -->

!!! note "Don't just request the maximum"

    Requesting far more memory than you need reduces scheduling priority and wastes cluster resources. Right-size your requests based on actual usage from `seff`.

---
## Understanding SLURM accounts

### What is `--account`?

In SLURM, the `--account` flag associates your job with a resource allocation account. This is used for:

- **Fair-share scheduling** — accounts that have used fewer resources recently get higher priority
- **Resource tracking** — PIs and admins can see how allocations are consumed
- **Access control** — some partitions may be restricted to certain accounts

!!! warning "Why this matters on Bodhi"

    On LSF, the `-P` project flag was often optional or had a simple default. On SLURM, submitting with the wrong account (or no account) can result in job rejection or lower scheduling priority.
93+
### Finding your account(s)
94+
95+
```bash
96+
# List your SLURM associations (accounts and partitions you can use)
97+
sacctmgr show associations user=$USER format=Account,Partition,QOS
98+
99+
# Shorter version — just account names
100+
sacctmgr show associations user=$USER format=Account --noheader | sort -u
101+
```
102+
103+
<!-- TODO: verify what Bodhi accounts look like — are they PI-based (e.g., "hesselj"), lab-based (e.g., "rbi"), or project-based? -->
### Setting a default account

Rather than adding `--account` to every script, set a default. Note that `sacctmgr modify` typically requires administrator or coordinator privileges — if the command is rejected, ask cluster support to set it for you:

```bash
# Set your default account (persists across sessions)
sacctmgr modify user $USER set DefaultAccount=<your_account>
```

You can also add it to your `~/.bashrc` or a SLURM defaults file:

```bash
# In ~/.bashrc
export SBATCH_ACCOUNT=<your_account>   # read by sbatch
export SLURM_ACCOUNT=<your_account>    # read by srun
```

!!! tip "Check your default"

    ```bash
    sacctmgr show user $USER format=DefaultAccount
    ```
### In your job scripts

```bash
#SBATCH --account=<your_account>
```

<!-- TODO: verify if --account is required on Bodhi or if there's a cluster-wide default -->

---

## Paying attention to wall time

### SLURM enforces `--time` strictly

In SLURM, the `--time` (wall time) limit is a **hard cutoff**. When your job hits the limit:

1. SLURM sends `SIGTERM` to your job (giving it a chance to clean up)
2. After a short grace period<!-- TODO: verify grace period on Bodhi — typically 30-60 seconds -->, SLURM sends `SIGKILL`
3. The job state is set to `TIMEOUT`
```bash
$ sacct -j 12345 --format=JobID,JobName,Elapsed,Timelimit,State
JobID           JobName    Elapsed  Timelimit      State
------------ ---------- ---------- ---------- ----------
12345           longrun   02:00:00   02:00:00    TIMEOUT
```

!!! warning "This is different from LSF"

    On Bodhi's LSF, wall-time limits were often loosely enforced — jobs could sometimes run past their `-W` limit. In SLURM, when your time is up, your job is killed. Period.
### Checking remaining time

**From outside the job:**

```bash
# See time limit and elapsed time (%.10M = elapsed, %.10l = limit)
squeue -u $USER -o "%.10i %.20j %.10M %.10l %.6D %R"

# Detailed view
scontrol show job <jobid> | grep -E "RunTime|TimeLimit"
```

**From inside the job** (in your script):

```bash
# Remaining wall time, printed as D-HH:MM:SS — useful for checkpointing
squeue -j $SLURM_JOB_ID -h -o "%L"
```
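Because `%L` prints a time string rather than raw seconds, a script that wants to compare the remaining time against a threshold has to convert it first. A sketch in bash (the `timeleft_seconds` name is made up for this example), assuming the `D-HH:MM:SS`, `HH:MM:SS`, or `MM:SS` forms that squeue emits:

```bash
# Hypothetical helper: convert squeue's %L output into seconds so a
# script can decide whether to start another work unit or checkpoint.
timeleft_seconds() {
    local t=$1 days=0
    case $t in
        *-*) days=${t%%-*}; t=${t#*-} ;;   # split off the "D-" prefix if present
    esac
    local IFS=:
    set -- $t                               # split remaining fields on ":"
    local secs=0 part
    for part in "$@"; do
        secs=$(( secs * 60 + 10#$part ))    # 10# guards against octal ("08")
    done
    echo $(( days * 86400 + secs ))
}

timeleft_seconds 1-02:03:04   # prints 93784
```

A job script could then, for example, skip starting a new iteration when fewer than a few hundred seconds remain.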
### Consequences of TIMEOUT

- Your job output may be incomplete or corrupted
- Any files being written at kill time may be truncated
- Temporary files won't be cleaned up
!!! tip "Add cleanup traps"

    If your job writes large intermediate files, add a trap to handle `SIGTERM`:

    ```bash
    cleanup() {
        echo "Job hit time limit — cleaning up"
        # save checkpoint, remove temp files, etc.
    }
    trap cleanup SIGTERM
    ```
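You can rehearse such a trap locally without SLURM. A sketch using coreutils `timeout` to stand in for the scheduler's `SIGTERM`, with a 2-second limit playing the role of `--time`:

```bash
# Simulate SLURM's SIGTERM at the wall-time limit with coreutils `timeout`.
timeout --signal=TERM 2 bash -c '
    cleanup() { echo "caught SIGTERM, saving checkpoint"; exit 0; }
    trap cleanup TERM
    sleep 30 >/dev/null 2>&1 &   # stand-in for real work; fds detached
    wait                         # interrupted by the trapped signal
'
```

When the signal arrives, `wait` returns, the trap runs, and the script exits cleanly instead of being killed mid-write.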
### Bodhi partition time limits

<!-- TODO: verify these partition limits — values below are placeholders -->

| Partition | Max wall time | Default wall time | Notes |
|---|---|---|---|
| `short` | 4 hours | 1 hour | Quick jobs, higher priority |
| `normal` | 7 days | 1 hour | General-purpose |
| `long` | 30 days | 1 hour | Extended runs |
| `gpu` | 7 days | 1 hour | GPU jobs |
| `interactive` | 12 hours | 1 hour | Interactive sessions |
!!! note "Check current limits"

    Partition limits can change. Verify the current limits with:

    ```bash
    sinfo -o "%12P %10l %10L %6D %8c %10m"
    # Name        TimeLimit  DefTime    Nodes  CPUs     Memory
    ```
### Tips for setting wall time

1. **Start with a generous estimate**, then refine based on actual runtimes using `seff` or `sacct`.

2. **Shorter jobs schedule faster** — SLURM's backfill scheduler can fit shorter jobs into gaps. Requesting 2 hours instead of 7 days can dramatically reduce queue wait time.

3. **Use `sacct` to check past runtimes:**

    ```bash
    sacct -u $USER --format=JobID,JobName,Elapsed,State -S 2024-01-01 | grep COMPLETED
    ```
4. **SLURM format for `--time`:**

    | Format | Meaning |
    |---|---|
    | `MM` | Minutes |
    | `HH:MM:SS` | Hours, minutes, seconds |
    | `D-HH:MM:SS` | Days, hours, minutes, seconds |
    | `D-HH` | Days and hours |

    ```bash
    #SBATCH --time=04:00:00    # 4 hours
    #SBATCH --time=1-00:00:00  # 1 day
    #SBATCH --time=7-00:00:00  # 7 days
    ```
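When a pipeline computes its own runtime estimate, the `--time` string can be generated rather than hand-written. A sketch (the `slurm_time` helper is hypothetical), converting a number of minutes into the `D-HH:MM:SS` form from the table above:

```bash
# Hypothetical helper: format minutes as a SLURM D-HH:MM:SS string
# suitable for --time.
slurm_time() {
    local mins=$1
    printf '%d-%02d:%02d:00\n' \
        $(( mins / 1440 )) $(( mins % 1440 / 60 )) $(( mins % 60 ))
}

slurm_time 90      # prints 0-01:30:00
slurm_time 1500    # prints 1-01:00:00
```

This is handy when a wrapper script scales the request from a measured `Elapsed` value plus a safety margin.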

mkdocs.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -35,6 +35,7 @@ nav:
   - Commands: commands.md
   - Environment Variables: environment-variables.md
   - Job Arrays: job-arrays.md
+  - Common Pain Points: pain-points.md
   - Examples:
   - Scripts: example-scripts.md
   - Converter: conversion-script.md
```
