Commit 7552765

Create log bundles for debug and analysis (#65)
* Add log bundle
* Improve CLI help output
1 parent 9c68b61 commit 7552765

18 files changed: +1617 -99 lines changed

Cargo.toml

Lines changed: 8 additions & 2 deletions
@@ -36,8 +36,8 @@ hdrhistogram = "7.5"
 signal-hook = "0.3"
 libc = "0.2"
 file-rotate = "0.7"
-clap = { version = "4.4", features = ["derive", "env", "color"] }
-clap_complete = "4.4"
+clap = { version = "4.5", features = ["derive", "env", "color"] }
+clap_complete = "4.5"
 tokio = { version = "1.47", features = ["rt-multi-thread", "macros", "net", "signal"] }
 url = "2.5"
 validator = { version = "0.16", features = ["derive"] }
@@ -86,6 +86,8 @@ kdl = "6.5"
 rusqlite = { version = "0.32", features = ["bundled"] }
 nvml-wrapper = "0.10"
 shlex = "1.3"
+flate2 = "1.0"
+tar = "0.4"

 # Service management
 service-manager = "0.7"
@@ -168,6 +170,8 @@ client = [
 "dep:percent-encoding",
 "dep:sha2",
 "dep:shlex",
+"dep:flate2",
+"dep:tar",
 "config",
 ]
 tui = [
@@ -241,6 +245,8 @@ signal-hook = { workspace = true, optional = true }
 libc = { workspace = true, optional = true }
 nvml-wrapper = { workspace = true, optional = true }
 shlex = { workspace = true, optional = true }
+flate2 = { workspace = true, optional = true }
+tar = { workspace = true, optional = true }

 # TUI dependencies (optional)
 ratatui = { workspace = true, optional = true }

docs/src/how-to/debugging-slurm.md

Lines changed: 80 additions & 20 deletions
@@ -24,41 +24,49 @@ these additional log paths:

 ```json
 {
-"job_stdout": "output/job_stdio/job_456.o",
-"job_stderr": "output/job_stdio/job_456.e",
-"job_runner_log": "output/job_runner_slurm_12345_node01_67890.log",
-"slurm_stdout": "output/slurm_output_12345.o",
-"slurm_stderr": "output/slurm_output_12345.e",
-"slurm_env_log": "output/slurm_env_12345_node01_67890.log",
-"dmesg_log": "output/dmesg_slurm_12345_node01_67890.log"
+"job_stdout": "output/job_stdio/job_wf1_j456_r1.o",
+"job_stderr": "output/job_stdio/job_wf1_j456_r1.e",
+"job_runner_log": "output/job_runner_slurm_wf1_sl12345_n0_pid67890.log",
+"slurm_stdout": "output/slurm_output_wf1_sl12345.o",
+"slurm_stderr": "output/slurm_output_wf1_sl12345.e",
+"slurm_env_log": "output/slurm_env_wf1_sl12345_n0_pid67890.log",
+"dmesg_log": "output/dmesg_slurm_wf1_sl12345_n0_pid67890.log"
 }
 ```

+All Slurm log files include the workflow ID (`wf<id>`) prefix, making it easy to identify and
+collect logs for a specific workflow.
+
 ### Log File Descriptions

-1. **slurm_stdout** (`output/slurm_output_<slurm_job_id>.o`):
+1. **slurm_stdout** (`output/slurm_output_wf<workflow_id>_sl<slurm_job_id>.o`):
 - Standard output from Slurm's perspective
 - Includes Slurm environment setup, job allocation info
 - **Use for**: Debugging Slurm job submission issues

-2. **slurm_stderr** (`output/slurm_output_<slurm_job_id>.e`):
+2. **slurm_stderr** (`output/slurm_output_wf<workflow_id>_sl<slurm_job_id>.e`):
 - Standard error from Slurm's perspective
 - Contains Slurm-specific errors (allocation failures, node issues)
 - **Use for**: Investigating Slurm scheduler problems

-3. **slurm_env_log** (`output/slurm_env_<slurm_job_id>_<node_id>_<task_pid>.log`):
+3. **job_runner_log** (`output/job_runner_slurm_wf<id>_sl<slurm_job_id>_n<node>_pid<pid>.log`):
+- Log output from the Torc Slurm job runner process
+- Contains job execution details, status updates, and runner-level errors
+- **Use for**: Debugging job runner issues, understanding job execution flow
+
+4. **slurm_env_log** (`output/slurm_env_wf<id>_sl<slurm_job_id>_n<node_id>_pid<task_pid>.log`):
 - All SLURM environment variables captured at job runner startup
 - Contains job allocation details, resource limits, node assignments
 - **Use for**: Verifying Slurm job configuration, debugging resource allocation issues

-4. **dmesg log** (`output/dmesg_slurm_<slurm_job_id>_<node_id>_<task_pid>.log`):
-- Kernel message buffer captured when the Slurm job runner exits
+5. **dmesg_log** (`output/dmesg_slurm_wf<id>_sl<slurm_job_id>_n<node_id>_pid<task_pid>.log`):
+- Kernel message buffer captured when the Slurm job runner exits (only on failure)
 - Contains system-level events: OOM killer activity, hardware errors, kernel panics
 - **Use for**: Investigating job failures caused by system-level issues (e.g., out-of-memory
 kills, hardware failures)

-**Note**: Slurm job runner logs include the Slurm job ID, node ID, and task PID in the filename for
-correlation with Slurm's own logs.
+**Note**: All Slurm log files include the workflow ID, Slurm job ID, node ID, and task PID in the
+filename for easy filtering and correlation with Slurm's own logs.
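
The filename templates in this hunk are regular enough to parse mechanically. The following is a minimal Python sketch, not part of Torc, that splits a Slurm log filename into its components. It assumes that the workflow ID, Slurm job ID, node, and PID are all decimal digits, as in the example paths (`wf1`, `sl12345`, `n0`, `pid67890`); the function name and the dictionary keys are hypothetical.

```python
import re
from typing import Optional

# Shared fragments of the documented templates. Assumption: every ID is a
# run of decimal digits, as in the example paths shown in the diff.
_IDS = r"wf(?P<workflow_id>\d+)_sl(?P<slurm_job_id>\d+)"
_NODE_PID = r"_n(?P<node>\d+)_pid(?P<pid>\d+)"

# One pattern per documented file kind; the kind labels are illustrative.
SLURM_LOG_PATTERNS = {
    "slurm_output": re.compile(rf"^slurm_output_{_IDS}\.(?P<stream>[oe])$"),
    "job_runner": re.compile(rf"^job_runner_slurm_{_IDS}{_NODE_PID}\.log$"),
    "slurm_env": re.compile(rf"^slurm_env_{_IDS}{_NODE_PID}\.log$"),
    "dmesg": re.compile(rf"^dmesg_slurm_{_IDS}{_NODE_PID}\.log$"),
}


def parse_slurm_log_name(name: str) -> Optional[dict]:
    """Return the kind and IDs encoded in a log filename, or None if no template matches."""
    for kind, pattern in SLURM_LOG_PATTERNS.items():
        match = pattern.match(name)
        if match:
            return {"kind": kind, **match.groupdict()}
    return None
```

Filenames in the old scheme (for example `slurm_output_12345.e`) do not match any template and yield `None`, which gives a quick way to spot logs written before this change.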

 ## Parsing Slurm Log Files for Errors

@@ -131,8 +139,8 @@ Found 2 error(s) in log files:
 ╭─────────────────────────────┬──────────────┬──────┬─────────────────────────────┬──────────┬──────────────────────────────╮
 │ File │ Slurm Job ID │ Line │ Pattern │ Severity │ Affected Torc Jobs │
 ├─────────────────────────────┼──────────────┼──────┼─────────────────────────────┼──────────┼──────────────────────────────┤
-│ slurm_output_12345.e │ 12345 │ 42 │ Out of Memory (OOM) Kill │ critical │ process_data (ID: 456) │
-│ slurm_output_12346.e │ 12346 │ 15 │ CUDA out of memory │ error │ train_model (ID: 789) │
+│ slurm_output_sl12345.e │ 12345 │ 42 │ Out of Memory (OOM) Kill │ critical │ process_data (ID: 456) │
+│ slurm_output_sl12346.e │ 12346 │ 15 │ CUDA out of memory │ error │ train_model (ID: 789) │
 ╰─────────────────────────────┴──────────────┴──────┴─────────────────────────────┴──────────┴──────────────────────────────╯
 ```

@@ -242,8 +250,8 @@ The `torc tui` terminal interface also supports Slurm log viewing:

 The log viewer shows:

-- **stdout tab**: Slurm job standard output (`slurm_output_<id>.o`)
-- **stderr tab**: Slurm job standard error (`slurm_output_<id>.e`)
+- **stdout tab**: Slurm job standard output (`slurm_output_wf<id>_sl<slurm_job_id>.o`)
+- **stderr tab**: Slurm job standard error (`slurm_output_wf<id>_sl<slurm_job_id>.e`)

 Use Tab to switch between stdout/stderr, arrow keys to scroll, `/` to search, and `q` to close.

@@ -265,7 +273,7 @@ When a Slurm job fails, follow this debugging workflow:
 3. **View the specific Slurm log files:**
 - Use torc-dash: Details → Scheduled Nodes → View Logs
 - Or use TUI: Scheduled Nodes tab → press `l`
-- Or directly: `cat output/slurm_output_<slurm_job_id>.e`
+- Or directly: `cat output/slurm_output_wf<workflow_id>_sl<slurm_job_id>.e`

 4. **Check the job's own stderr for application errors:**
 ```bash
@@ -275,7 +283,7 @@ When a Slurm job fails, follow this debugging workflow:

 5. **Review dmesg logs for system-level issues:**
 ```bash
-cat output/dmesg_slurm_<slurm_job_id>_*.log
+cat output/dmesg_slurm_wf<workflow_id>_sl<slurm_job_id>_*.log
 ```

 ## Common Slurm Issues and Solutions
@@ -332,8 +340,60 @@ When a Slurm job fails, follow this debugging workflow:
 - Check GPU memory with `nvidia-smi` in job script
 - Ensure correct CUDA version is loaded

+## Log Bundles
+
+For sharing logs with others or archiving for later analysis, use log bundles:
+
+### Bundling Logs
+
+```bash
+# Bundle all logs for a workflow into a compressed tarball
+torc logs bundle <workflow_id>
+
+# Specify custom output directory
+torc logs bundle <workflow_id> --output-dir /path/to/output
+
+# Save bundle to a specific directory
+torc logs bundle <workflow_id> --bundle-dir /path/to/bundles
+```
+
+This creates a `wf<id>.tar.gz` file containing:
+
+- All job stdout/stderr files (`job_wf*_j*_r*.o/e`)
+- Job runner logs (`job_runner_*.log`)
+- Slurm output files (`slurm_output_wf*_sl*.o/e`)
+- Slurm environment logs (`slurm_env_wf*_sl*.log`)
+- dmesg logs (`dmesg_slurm_wf*_sl*.log`)
+- Bundle metadata (workflow info, collection timestamp)
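
Because the bundle is an ordinary gzip-compressed tarball, it can be inspected with standard tooling before it is shared. The sketch below is one possible way to count bundle members by category using Python's `tarfile` module. It assumes only the filename prefixes listed above; the internal directory layout of the archive is not documented in this diff, so the code looks at member basenames only. The category labels and function names are hypothetical.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath

# Basename prefixes taken from the documented bundle contents.
# "job_runner_" is listed before "job_wf" only for readability; the two
# prefixes cannot match the same name.
PREFIXES = [
    ("job_runner_", "job runner logs"),
    ("job_wf", "job stdout/stderr"),
    ("slurm_output_", "Slurm output"),
    ("slurm_env_", "Slurm environment logs"),
    ("dmesg_slurm_", "dmesg logs"),
]


def categorize(member_name: str) -> str:
    """Map an archive member to a category based on its basename prefix."""
    base = PurePosixPath(member_name).name
    for prefix, label in PREFIXES:
        if base.startswith(prefix):
            return label
    return "other"  # e.g. bundle metadata


def summarize_bundle(path: str) -> Counter:
    """Count the regular files in a .tar.gz bundle per category."""
    counts: Counter = Counter()
    with tarfile.open(path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                counts[categorize(member.name)] += 1
    return counts
```

Calling `summarize_bundle("wf123.tar.gz")` returns a `Counter`, so a category with a count of zero (for example no dmesg logs, which are captured only on failure) is easy to spot.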
+
+### Analyzing Logs
+
+```bash
+# Analyze a log bundle tarball
+torc logs analyze wf123.tar.gz
+
+# Analyze a log directory directly (auto-detects workflow if only one present)
+torc logs analyze output/
+
+# Analyze a directory with multiple workflows (specify which one)
+torc logs analyze output/ --workflow-id 123
+```
+
+The analyzer scans all log files for known error patterns (OOM kills, timeouts, segfaults, Slurm
+errors, Python exceptions, etc.) and reports:
+
+- Files with detected errors
+- Error type and severity
+- Line numbers and content
+- Summary of error types found
+
+**Note**: Environment variable files (`slurm_env_*.log`) are excluded from error analysis since they
+contain configuration data, not error logs.
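
To make the scanning behavior concrete, here is a rough Python sketch of a pattern-based log scan that skips `slurm_env_*` files. It is an illustration only and not Torc's implementation: the regular expressions, labels, and severities below are assumptions chosen to resemble the error categories named above, and the real analyzer's pattern list may differ.

```python
import re
from pathlib import Path

# Hypothetical pattern table: (label, severity, regex). These are not the
# patterns shipped with torc; they only illustrate the approach.
ERROR_PATTERNS = [
    ("Out of Memory (OOM) Kill", "critical", re.compile(r"oom[-_ ]kill|out of memory", re.IGNORECASE)),
    ("Segmentation fault", "critical", re.compile(r"segmentation fault|segfault", re.IGNORECASE)),
    ("Timeout", "error", re.compile(r"time limit|timed out", re.IGNORECASE)),
    ("Python exception", "error", re.compile(r"^Traceback \(most recent call last\)")),
]


def scan_logs(log_dir: str) -> list:
    """Return (file name, line number, label, severity, line text) for each matching line."""
    findings = []
    for path in sorted(Path(log_dir).rglob("*")):
        # Environment dumps hold configuration data, so they are skipped.
        if not path.is_file() or path.name.startswith("slurm_env_"):
            continue
        with path.open(errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                for label, severity, pattern in ERROR_PATTERNS:
                    if pattern.search(line):
                        findings.append((path.name, line_no, label, severity, line.rstrip()))
                        break  # report at most one pattern per line
    return findings
```

Skipping the environment files by name avoids false positives from variable values that happen to contain words such as "error" or "timeout".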
+
 ## Related Commands

+- **`torc logs bundle`**: Bundle workflow logs into a compressed tarball
+- **`torc logs analyze`**: Analyze logs for known error patterns
 - **`torc slurm parse-logs`**: Parse Slurm logs for known error patterns
 - **`torc slurm sacct`**: Collect Slurm accounting data for workflow jobs
 - **`torc reports results`**: Generate debug report with all log file paths
