Commit a3b3b67

Refactor debugging docs

1 parent 2f04c17 commit a3b3b67

3 files changed: +323 −241 lines changed
docs/src/SUMMARY.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -37,6 +37,7 @@
 - [Terminal UI (TUI)](./how-to/tui.md)
 - [Web Dashboard](./how-to/dashboard.md)
 - [Debugging Workflows](./how-to/debugging.md)
+- [Debugging Slurm Workflows](./how-to/debugging-slurm.md)
 - [Authentication](./how-to/authentication.md)
 - [Shell Completions](./how-to/shell-completions.md)
 - [Server Deployment](./how-to/server-deployment.md)
```

docs/src/how-to/debugging-slurm.md

Lines changed: 316 additions & 0 deletions
# Debugging Slurm Workflows

When running workflows on Slurm clusters, Torc provides additional debugging tools designed specifically for Slurm environments. This guide covers Slurm-specific debugging techniques and tools.

For general debugging concepts and tools that apply to all workflows, see [Debugging Workflows](debugging.md).

## Overview

Slurm workflows generate additional log files beyond the standard job logs:

- **Slurm stdout/stderr**: Output from Slurm's perspective (job allocation, environment setup)
- **Slurm environment logs**: All SLURM environment variables, captured at job runner startup
- **dmesg logs**: The kernel message buffer, captured when the Slurm job runner exits

These logs help diagnose issues specific to the cluster environment, such as resource allocation failures, node problems, and system-level errors.

## Slurm Log File Structure

For jobs executed via the Slurm scheduler (`compute_node_type: "slurm"`), the debug report includes these additional log paths:

```json
{
  "job_stdout": "output/job_stdio/job_456.o",
  "job_stderr": "output/job_stdio/job_456.e",
  "job_runner_log": "output/job_runner_slurm_12345_node01_67890.log",
  "slurm_stdout": "output/slurm_output_12345.o",
  "slurm_stderr": "output/slurm_output_12345.e",
  "slurm_env_log": "output/slurm_env_12345_node01_67890.log",
  "dmesg_log": "output/dmesg_slurm_12345_node01_67890.log"
}
```

### Log File Descriptions

1. **slurm_stdout** (`output/slurm_output_<slurm_job_id>.o`):
   - Standard output from Slurm's perspective
   - Includes Slurm environment setup and job allocation info
   - **Use for**: Debugging Slurm job submission issues

2. **slurm_stderr** (`output/slurm_output_<slurm_job_id>.e`):
   - Standard error from Slurm's perspective
   - Contains Slurm-specific errors (allocation failures, node issues)
   - **Use for**: Investigating Slurm scheduler problems

3. **slurm_env_log** (`output/slurm_env_<slurm_job_id>_<node_id>_<task_pid>.log`):
   - All SLURM environment variables captured at job runner startup
   - Contains job allocation details, resource limits, and node assignments
   - **Use for**: Verifying Slurm job configuration and debugging resource allocation issues

4. **dmesg_log** (`output/dmesg_slurm_<slurm_job_id>_<node_id>_<task_pid>.log`):
   - Kernel message buffer captured when the Slurm job runner exits
   - Contains system-level events: OOM killer activity, hardware errors, kernel panics
   - **Use for**: Investigating job failures caused by system-level issues (e.g., out-of-memory kills, hardware failures)

**Note**: Slurm job runner logs include the Slurm job ID, node ID, and task PID in the filename for correlation with Slurm's own logs.
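The filename convention makes that correlation mechanical. As an illustration (this helper is not part of the torc CLI; the regex simply mirrors the documented patterns and assumes node IDs contain no underscores):

```python
import re

# Filename patterns documented above (illustrative helper, not shipped with torc):
#   job_runner_slurm_<slurm_job_id>_<node_id>_<task_pid>.log
#   slurm_env_<slurm_job_id>_<node_id>_<task_pid>.log
#   dmesg_slurm_<slurm_job_id>_<node_id>_<task_pid>.log
LOG_NAME = re.compile(
    r"^(?P<kind>job_runner_slurm|slurm_env|dmesg_slurm)"
    r"_(?P<slurm_job_id>\d+)_(?P<node_id>[^_]+)_(?P<task_pid>\d+)\.log$"
)

def parse_log_name(name: str):
    """Extract the Slurm job ID, node ID, and task PID from a log filename."""
    match = LOG_NAME.match(name)
    return match.groupdict() if match else None
```

For example, `parse_log_name("dmesg_slurm_12345_node01_67890.log")` yields Slurm job ID `12345`, node `node01`, and task PID `67890`, and the job ID can be passed straight to `sacct -j 12345`.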
## Parsing Slurm Log Files for Errors

The `torc slurm parse-logs` command scans Slurm stdout/stderr log files for known error patterns and correlates them with the affected Torc jobs:

```bash
# Parse logs for a specific workflow
torc slurm parse-logs <workflow_id>

# Specify a custom output directory
torc slurm parse-logs <workflow_id> --output-dir /path/to/output

# Output as JSON for programmatic processing
torc slurm parse-logs <workflow_id> --format json
```

### Detected Error Patterns

The command detects common Slurm failure patterns, including:

**Memory Errors:**
- `out of memory`, `oom-kill`, `cannot allocate memory`
- `memory cgroup out of memory`, `Exceeded job memory limit`
- `task/cgroup: .*: Killed`
- `std::bad_alloc` (C++), `MemoryError` (Python)

**Slurm-Specific Errors:**
- `slurmstepd: error:`, `srun: error:`
- `DUE TO TIME LIMIT`, `DUE TO PREEMPTION`
- `NODE_FAIL`, `FAILED`, `CANCELLED`
- `Exceeded.*step.*limit`

**GPU/CUDA Errors:**
- `CUDA out of memory`, `CUDA error`, `GPU memory.*exceeded`

**Signal/Crash Errors:**
- `Segmentation fault`, `SIGSEGV`
- `Bus error`, `SIGBUS`
- `killed by signal`, `core dumped`

**Python Errors:**
- `Traceback (most recent call last)`
- `ModuleNotFoundError`, `ImportError`

**File System Errors:**
- `No space left on device`, `Disk quota exceeded`
- `Read-only file system`, `Permission denied`

**Network Errors:**
- `Connection refused`, `Connection timed out`, `Network is unreachable`
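The table-driven scan behind this command can be approximated in a few lines. The sketch below uses only a small subset of the patterns listed above, and its severity labels are illustrative rather than torc's actual classification:

```python
import re

# Illustrative subset of the documented patterns; the real parse-logs
# pattern table and severity assignments may differ.
ERROR_PATTERNS = [
    ("Out of Memory (OOM) Kill", re.compile(r"oom-kill|out of memory", re.I), "critical"),
    ("CUDA out of memory", re.compile(r"CUDA out of memory"), "error"),
    ("Segmentation fault", re.compile(r"Segmentation fault|SIGSEGV"), "error"),
    ("Slurm time limit", re.compile(r"DUE TO TIME LIMIT"), "error"),
]

def scan_log(text: str) -> list:
    """Return one record per matching line: line number, pattern name, severity."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, regex, severity in ERROR_PATTERNS:
            if regex.search(line):
                hits.append({"line": lineno, "pattern": name, "severity": severity})
    return hits
```

Running something like `scan_log(open("output/slurm_output_12345.e").read())` over each stderr file reproduces the core of the analysis, minus the correlation back to Torc job IDs.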
### Example Output

**Table format:**

```
Slurm Log Analysis Results
==========================

Found 2 error(s) in log files:

╭─────────────────────────────┬──────────────┬──────┬─────────────────────────────┬──────────┬──────────────────────────────╮
│ File                        │ Slurm Job ID │ Line │ Pattern                     │ Severity │ Affected Torc Jobs           │
├─────────────────────────────┼──────────────┼──────┼─────────────────────────────┼──────────┼──────────────────────────────┤
│ slurm_output_12345.e        │ 12345        │ 42   │ Out of Memory (OOM) Kill    │ critical │ process_data (ID: 456)       │
│ slurm_output_12346.e        │ 12346        │ 15   │ CUDA out of memory          │ error    │ train_model (ID: 789)        │
╰─────────────────────────────┴──────────────┴──────┴─────────────────────────────┴──────────┴──────────────────────────────╯
```

## Viewing Slurm Accounting Data

The `torc slurm sacct` command displays a summary of Slurm job accounting data for all scheduled compute nodes in a workflow:

```bash
# Display the sacct summary table for a workflow
torc slurm sacct <workflow_id>

# Also save full JSON files for detailed analysis
torc slurm sacct <workflow_id> --save-json --output-dir /path/to/output

# Output as JSON for programmatic processing
torc slurm sacct <workflow_id> --format json
```

### Summary Table Fields

The command displays a summary table with these key metrics:

- **Slurm Job**: The Slurm job ID
- **Job Step**: Name of the job step (e.g., "worker_1", "batch")
- **State**: Job state (COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY, etc.)
- **Exit Code**: Exit code of the job step
- **Elapsed**: Wall-clock time for the job step
- **Max RSS**: Maximum resident set size (memory usage)
- **CPU Time**: Total CPU time consumed
- **Nodes**: Compute nodes used

### Example Output

```
Slurm Accounting Summary for Workflow 123

╭────────────┬───────────┬───────────┬───────────┬─────────┬─────────┬──────────┬─────────╮
│ Slurm Job  │ Job Step  │ State     │ Exit Code │ Elapsed │ Max RSS │ CPU Time │ Nodes   │
├────────────┼───────────┼───────────┼───────────┼─────────┼─────────┼──────────┼─────────┤
│ 12345      │ worker_1  │ COMPLETED │ 0         │ 2h 15m  │ 4.5GB   │ 4h 30m   │ node01  │
│ 12345      │ batch     │ COMPLETED │ 0         │ 2h 16m  │ 128.0MB │ 1m 30s   │ node01  │
│ 12346      │ worker_1  │ FAILED    │ 1         │ 45m 30s │ 8.2GB   │ 1h 30m   │ node02  │
╰────────────┴───────────┴───────────┴───────────┴─────────┴─────────┴──────────┴─────────╯

Total: 3 job steps
```

### Saving Full JSON Output

Use `--save-json` to save the full sacct JSON output to files for detailed analysis:

```bash
torc slurm sacct 123 --save-json --output-dir output
# Creates: output/sacct_12345.json, output/sacct_12346.json, etc.
```
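The Max RSS column shows human-readable sizes, so comparing usage against a memory request takes a small conversion step first. A hypothetical helper (not part of torc; binary 1024-based units assumed, matching Slurm's convention):

```python
# Binary unit factors; order matters so "MB" is tested before the bare-bytes fallback.
UNITS = {"TB": 1024**4, "GB": 1024**3, "MB": 1024**2, "KB": 1024}

def rss_to_bytes(value: str) -> int:
    """Convert a Max RSS string such as '4.5GB' or '128.0MB' to bytes."""
    for suffix, factor in UNITS.items():
        if value.upper().endswith(suffix):
            return int(float(value[: -len(suffix)]) * factor)
    return int(value)  # no suffix: assume the value is already in bytes
```

With this, a script reading either the summary table or the saved `sacct_*.json` files can flag any step whose `rss_to_bytes(max_rss)` exceeds, say, 90% of its memory request.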
## Viewing Slurm Logs in torc-dash

The torc-dash web interface provides two ways to view Slurm logs:

### Debugging Tab - Slurm Log Analysis

The Debugging tab includes a "Slurm Log Analysis" section:

1. Navigate to the **Debugging** tab
2. Find the **Slurm Log Analysis** section
3. Enter the output directory path (default: `output`)
4. Click **Analyze Slurm Logs**

The results show all detected errors with their Slurm job IDs, line numbers, error patterns, severity levels, and affected Torc jobs.

### Debugging Tab - Slurm Accounting Data

The Debugging tab also includes a "Slurm Accounting Data" section:

1. Navigate to the **Debugging** tab
2. Find the **Slurm Accounting Data** section
3. Click **Collect sacct Data**

This displays a summary table showing the job state, exit codes, elapsed time, memory usage (Max RSS), CPU time, and nodes for all Slurm job steps. The table helps you quickly identify failed jobs and resource usage patterns.

### Scheduled Nodes Tab - View Slurm Logs

You can view individual Slurm job logs directly from the Details view:

1. Select a workflow
2. Go to the **Details** tab
3. Switch to the **Scheduled Nodes** sub-tab
4. Find a Slurm scheduled node in the table
5. Click the **View Logs** button in the Logs column

This opens a modal with tabs for viewing the Slurm job's stdout and stderr files.

## Viewing Slurm Logs in the TUI

The `torc tui` terminal interface also supports Slurm log viewing:

1. Launch the TUI: `torc tui`
2. Select a workflow and press Enter to load its details
3. Press Tab to switch to the **Scheduled Nodes** tab
4. Navigate to a Slurm scheduled node using the arrow keys
5. Press `l` to view the Slurm job's logs

The log viewer shows:
- **stdout tab**: Slurm job standard output (`slurm_output_<id>.o`)
- **stderr tab**: Slurm job standard error (`slurm_output_<id>.e`)

Use Tab to switch between stdout/stderr, the arrow keys to scroll, `/` to search, and `q` to close.
## Debugging Slurm Job Failures

When a Slurm job fails, follow this debugging workflow:

1. **Parse logs for known errors:**

   ```bash
   torc slurm parse-logs <workflow_id>
   ```

2. **If OOM or resource issues are detected, collect sacct data:**

   ```bash
   torc slurm sacct <workflow_id>
   cat output/sacct_<slurm_job_id>.json | jq '.jobs[].steps[].tres.requested'
   ```

3. **View the specific Slurm log files:**
   - Use torc-dash: Details → Scheduled Nodes → View Logs
   - Or use the TUI: Scheduled Nodes tab → press `l`
   - Or read the file directly: `cat output/slurm_output_<slurm_job_id>.e`

4. **Check the job's own stderr for application errors:**

   ```bash
   torc reports results <workflow_id> > report.json
   jq -r '.results[] | select(.return_code != 0) | .job_stderr' report.json | xargs cat
   ```

5. **Review dmesg logs for system-level issues:**

   ```bash
   cat output/dmesg_slurm_<slurm_job_id>_*.log
   ```
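The dmesg review in the last step usually amounts to filtering the captured kernel messages for OOM-killer entries. A minimal sketch (the marker strings are common kernel phrasings, and the sample line below is illustrative):

```python
def oom_events(dmesg_text: str) -> list:
    """Return dmesg lines that indicate OOM-killer activity."""
    markers = ("oom-kill", "Out of memory", "Killed process")
    return [
        line
        for line in dmesg_text.splitlines()
        if any(marker in line for marker in markers)
    ]
```

Applied to each `output/dmesg_slurm_*.log`, this surfaces only the lines worth reading, and the PID in a `Killed process` entry can be matched against the task PID in the log filename.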
## Common Slurm Issues and Solutions

### Out of Memory (OOM) Kills

**Symptoms:**
- `torc slurm parse-logs` reports "Out of Memory (OOM) Kill"
- The job exits with signal 9 (SIGKILL)
- The dmesg log contains "oom-kill" entries

**Solutions:**
- Increase the memory request in the job specification
- Check `torc slurm sacct` output for actual memory usage (Max RSS)
- Consider splitting the job into smaller chunks

### Time Limit Exceeded

**Symptoms:**
- `torc slurm parse-logs` reports "DUE TO TIME LIMIT"
- The job state in sacct is "TIMEOUT"

**Solutions:**
- Increase the runtime in the job specification
- Check whether the job is stuck (review stdout for progress)
- Consider optimizing the job or splitting it into phases

### Node Failures

**Symptoms:**
- `torc slurm parse-logs` reports "NODE_FAIL"
- The job may have completed partially

**Solutions:**
- Reinitialize the workflow to retry failed jobs
- Check cluster status with `sinfo`
- Review dmesg logs for hardware issues

### GPU/CUDA Errors

**Symptoms:**
- `torc slurm parse-logs` reports "CUDA out of memory" or "CUDA error"

**Solutions:**
- Reduce the batch size or model size
- Check GPU memory with `nvidia-smi` in the job script
- Ensure the correct CUDA version is loaded

## Related Commands

- **`torc slurm parse-logs`**: Parse Slurm logs for known error patterns
- **`torc slurm sacct`**: Collect Slurm accounting data for workflow jobs
- **`torc reports results`**: Generate a debug report with all log file paths
- **`torc results list`**: View a summary of job results in table format
- **`torc-dash`**: Launch the web interface with Slurm log viewing
- **`torc tui`**: Launch the terminal UI with Slurm log viewing

For general debugging tools and workflows, see [Debugging Workflows](debugging.md).
