@@ -24,41 +24,49 @@ these additional log paths:
2424
2525``` json
2626{
27-   "job_stdout": "output/job_stdio/job_456.o",
28-   "job_stderr": "output/job_stdio/job_456.e",
29-   "job_runner_log": "output/job_runner_slurm_12345_node01_67890.log",
30-   "slurm_stdout": "output/slurm_output_12345.o",
31-   "slurm_stderr": "output/slurm_output_12345.e",
32-   "slurm_env_log": "output/slurm_env_12345_node01_67890.log",
33-   "dmesg_log": "output/dmesg_slurm_12345_node01_67890.log"
27+   "job_stdout": "output/job_stdio/job_wf1_j456_r1.o",
28+   "job_stderr": "output/job_stdio/job_wf1_j456_r1.e",
29+   "job_runner_log": "output/job_runner_slurm_wf1_sl12345_n0_pid67890.log",
30+   "slurm_stdout": "output/slurm_output_wf1_sl12345.o",
31+   "slurm_stderr": "output/slurm_output_wf1_sl12345.e",
32+   "slurm_env_log": "output/slurm_env_wf1_sl12345_n0_pid67890.log",
33+   "dmesg_log": "output/dmesg_slurm_wf1_sl12345_n0_pid67890.log"
3434}
3535```
3636
37+ All Slurm log files include the workflow ID (`wf<id>`) in their names, making it easy to identify and
38+ collect logs for a specific workflow.
39+
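Because every file name carries `wf<id>`, a plain `find` is enough to gather one workflow's logs by hand. The sketch below assumes the naming shown in the JSON example above; `list_workflow_logs` is a hypothetical helper for illustration, not a Torc command.

```bash
# List every log file under a directory that belongs to one workflow.
# Assumption: file names contain "_wf<id>_" as in the examples in this document.
# list_workflow_logs is a hypothetical helper, not part of Torc.
list_workflow_logs() {
    dir=$1
    wf=$2
    # "_wf1_" matches workflow 1 but not workflow 12, because of the trailing "_".
    find "$dir" -type f -name "*_wf${wf}_*" | sort
}

# Usage (assumes logs were written to output/):
#   list_workflow_logs output 1
```

Matching on `_wf<id>_` with both underscores keeps workflow 1 from also picking up workflows 10, 11, and so on.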
3740### Log File Descriptions
3841
39- 1. **slurm_stdout** (`output/slurm_output_<slurm_job_id>.o`):
42+ 1. **slurm_stdout** (`output/slurm_output_wf<workflow_id>_sl<slurm_job_id>.o`):
4043 - Standard output from Slurm's perspective
4144 - Includes Slurm environment setup, job allocation info
4245 - ** Use for** : Debugging Slurm job submission issues
4346
44- 2. **slurm_stderr** (`output/slurm_output_<slurm_job_id>.e`):
47+ 2. **slurm_stderr** (`output/slurm_output_wf<workflow_id>_sl<slurm_job_id>.e`):
4548 - Standard error from Slurm's perspective
4649 - Contains Slurm-specific errors (allocation failures, node issues)
4750 - ** Use for** : Investigating Slurm scheduler problems
4851
49- 3. **slurm_env_log** (`output/slurm_env_<slurm_job_id>_<node_id>_<task_pid>.log`):
52+ 3. **job_runner_log** (`output/job_runner_slurm_wf<workflow_id>_sl<slurm_job_id>_n<node_id>_pid<task_pid>.log`):
53+ - Log output from the Torc Slurm job runner process
54+ - Contains job execution details, status updates, and runner-level errors
55+ - ** Use for** : Debugging job runner issues, understanding job execution flow
56+
57+ 4. **slurm_env_log** (`output/slurm_env_wf<workflow_id>_sl<slurm_job_id>_n<node_id>_pid<task_pid>.log`):
5058 - All SLURM environment variables captured at job runner startup
5159 - Contains job allocation details, resource limits, node assignments
5260 - ** Use for** : Verifying Slurm job configuration, debugging resource allocation issues
5361
54- 4. **dmesg log** (`output/dmesg_slurm_<slurm_job_id>_<node_id>_<task_pid>.log`):
55- - Kernel message buffer captured when the Slurm job runner exits
62+ 5. **dmesg_log** (`output/dmesg_slurm_wf<workflow_id>_sl<slurm_job_id>_n<node_id>_pid<task_pid>.log`):
63+ - Kernel message buffer captured when the Slurm job runner exits (only on failure)
5664 - Contains system-level events: OOM killer activity, hardware errors, kernel panics
5765 - ** Use for** : Investigating job failures caused by system-level issues (e.g., out-of-memory
5866 kills, hardware failures)
5967
60- **Note**: Slurm job runner logs include the Slurm job ID, node ID, and task PID in the filename for
61- correlation with Slurm's own logs.
68+ **Note**: All Slurm log files include the workflow ID and Slurm job ID in the filename; the job runner,
69+ environment, and dmesg logs also include the node ID and task PID, for easy filtering and correlation with Slurm's own logs.
6270
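To correlate a log file with `sacct` output or another log, it can help to split the file name back into its IDs. The following is a minimal sketch that assumes the layout shown in the examples above (numeric IDs, `.o`/`.e`/`.log` extensions); `parse_slurm_log_name` is a hypothetical helper, not something Torc provides.

```bash
# Split a Slurm log file name into its ID fields.
# Assumption: names follow wf<workflow_id>_sl<slurm_job_id>[_n<node_id>_pid<task_pid>]
# with numeric IDs, as in the examples in this document.
# parse_slurm_log_name is a hypothetical helper, not part of Torc.
parse_slurm_log_name() {
    name=$(basename "$1")
    # First expression: job runner, env, and dmesg logs (node and PID present).
    # Second expression: slurm_output files (workflow and Slurm job ID only).
    printf '%s\n' "$name" | sed -n -E \
        -e 's/^.*_wf([0-9]+)_sl([0-9]+)_n([0-9]+)_pid([0-9]+)\.log$/workflow=\1 slurm_job=\2 node=\3 pid=\4/p' \
        -e 's/^.*_wf([0-9]+)_sl([0-9]+)\.(o|e)$/workflow=\1 slurm_job=\2/p'
}

parse_slurm_log_name "output/dmesg_slurm_wf1_sl12345_n0_pid67890.log"
# prints: workflow=1 slurm_job=12345 node=0 pid=67890
parse_slurm_log_name "output/slurm_output_wf1_sl12345.e"
# prints: workflow=1 slurm_job=12345
```

Names that do not match either layout (for example the per-job files under `job_stdio/`) produce no output.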
6371## Parsing Slurm Log Files for Errors
6472
@@ -131,8 +139,8 @@ Found 2 error(s) in log files:
131139╭─────────────────────────────┬──────────────┬──────┬─────────────────────────────┬──────────┬──────────────────────────────╮
132140│ File │ Slurm Job ID │ Line │ Pattern │ Severity │ Affected Torc Jobs │
133141├─────────────────────────────┼──────────────┼──────┼─────────────────────────────┼──────────┼──────────────────────────────┤
134- │ slurm_output_12345.e │ 12345 │ 42 │ Out of Memory (OOM) Kill │ critical │ process_data (ID: 456) │
135- │ slurm_output_12346.e │ 12346 │ 15 │ CUDA out of memory │ error │ train_model (ID: 789) │
142+ │ slurm_output_wf1_sl12345.e │ 12345 │ 42 │ Out of Memory (OOM) Kill │ critical │ process_data (ID: 456) │
143+ │ slurm_output_wf1_sl12346.e │ 12346 │ 15 │ CUDA out of memory │ error │ train_model (ID: 789) │
136144╰─────────────────────────────┴──────────────┴──────┴─────────────────────────────┴──────────┴──────────────────────────────╯
137145```
138146
@@ -242,8 +250,8 @@ The `torc tui` terminal interface also supports Slurm log viewing:
242250
243251The log viewer shows:
244252
245- - **stdout tab**: Slurm job standard output (`slurm_output_<id>.o`)
246- - **stderr tab**: Slurm job standard error (`slurm_output_<id>.e`)
253+ - **stdout tab**: Slurm job standard output (`slurm_output_wf<id>_sl<slurm_job_id>.o`)
254+ - **stderr tab**: Slurm job standard error (`slurm_output_wf<id>_sl<slurm_job_id>.e`)
247255
248256Use Tab to switch between stdout/stderr, arrow keys to scroll, `/` to search, and `q` to close.
249257
@@ -265,7 +273,7 @@ When a Slurm job fails, follow this debugging workflow:
2652733 . ** View the specific Slurm log files:**
266274 - Use torc-dash: Details → Scheduled Nodes → View Logs
267275 - Or use TUI: Scheduled Nodes tab → press `l`
268- - Or directly: `cat output/slurm_output_<slurm_job_id>.e`
276+ - Or directly: `cat output/slurm_output_wf<workflow_id>_sl<slurm_job_id>.e`
269277
2702784 . ** Check the job's own stderr for application errors:**
271279 ``` bash
@@ -275,7 +283,7 @@ When a Slurm job fails, follow this debugging workflow:
275283
2762845 . ** Review dmesg logs for system-level issues:**
277285 ``` bash
278- cat output/dmesg_slurm_<slurm_job_id>_*.log
286+ cat output/dmesg_slurm_wf<workflow_id>_sl<slurm_job_id>_*.log
279287 ```
280288
281289## Common Slurm Issues and Solutions
@@ -332,8 +340,60 @@ When a Slurm job fails, follow this debugging workflow:
332340- Check GPU memory with `nvidia-smi` in job script
333341- Ensure correct CUDA version is loaded
334342
343+ ## Log Bundles
344+
345+ For sharing logs with others or archiving for later analysis, use log bundles:
346+
347+ ### Bundling Logs
348+
349+ ``` bash
350+ # Bundle all logs for a workflow into a compressed tarball
351+ torc logs bundle <workflow_id>
352+
353+ # Specify custom output directory
354+ torc logs bundle <workflow_id> --output-dir /path/to/output
355+
356+ # Save bundle to a specific directory
357+ torc logs bundle <workflow_id> --bundle-dir /path/to/bundles
358+ ```
359+
360+ This creates a `wf<id>.tar.gz` file containing:
361+
362+ - All job stdout/stderr files (`job_wf*_j*_r*.o/e`)
363+ - Job runner logs (`job_runner_*.log`)
364+ - Slurm output files (`slurm_output_wf*_sl*.o/e`)
365+ - Slurm environment logs (`slurm_env_wf*_sl*.log`)
366+ - dmesg logs (`dmesg_slurm_wf*_sl*.log`)
367+ - Bundle metadata (workflow info, collection timestamp)
368+
369+ ### Analyzing Logs
370+
371+ ``` bash
372+ # Analyze a log bundle tarball
373+ torc logs analyze wf123.tar.gz
374+
375+ # Analyze a log directory directly (auto-detects workflow if only one present)
376+ torc logs analyze output/
377+
378+ # Analyze a directory with multiple workflows (specify which one)
379+ torc logs analyze output/ --workflow-id 123
380+ ```
381+
382+ The analyzer scans all log files for known error patterns (OOM kills, timeouts, segfaults, Slurm
383+ errors, Python exceptions, etc.) and reports:
384+
385+ - Files with detected errors
386+ - Error type and severity
387+ - Line numbers and content
388+ - Summary of error types found
389+
390+ **Note**: Environment variable files (`slurm_env_*.log`) are excluded from error analysis since they
391+ contain configuration data, not error logs.
392+
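For a quick manual check when the analyzer is not at hand, a similar scan can be approximated with `find` and `grep`. This is only a rough sketch: the pattern list is illustrative and is not Torc's actual rule set, and `scan_logs` is a hypothetical helper. It skips `slurm_env_*.log` files for the same reason given in the note above.

```bash
# Illustrative patterns only; this is NOT the rule set used by `torc logs analyze`.
PATTERNS='oom[-_]kill|Out of memory|Segmentation fault|Traceback \(most recent call last\)|DUE TO TIME LIMIT'

# scan_logs is a hypothetical helper, not part of Torc.
# Prints file:line:content for each match under the given directory,
# skipping environment dumps (slurm_env_*.log).
scan_logs() {
    find "$1" -type f ! -name 'slurm_env_*.log' \
        -exec grep -n -H -E "$PATTERNS" {} + | sort
}

# Usage (assumes logs were written to output/):
#   scan_logs output
```

Unlike the analyzer, this reports raw matches only; it does not assign severities or summarize error types.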
335393## Related Commands
336394
395+ - **`torc logs bundle`**: Bundle workflow logs into a compressed tarball
396+ - **`torc logs analyze`**: Analyze logs for known error patterns
337397- **`torc slurm parse-logs`**: Parse Slurm logs for known error patterns
338398- **`torc slurm sacct`**: Collect Slurm accounting data for workflow jobs
339399- **`torc reports results`**: Generate debug report with all log file paths