Commit f6ed786 (parents 04dcbf4 + fc87f20)

Merge pull request #53 from NREL/feat/slurm-debugging: Add slurm parse-logs and sacct CLI commands

File tree: 15 files changed (+2400, −10 lines)
docs/src/how-to/debugging.md

Lines changed: 202 additions & 0 deletions
find . -name "job_*.o" -o -name "job_runner_*.log"
torc reports results <workflow_id> --output-dir <correct_path>
```

## Slurm-Specific Debugging Tools

When running workflows on Slurm clusters, Torc provides additional debugging tools tailored to Slurm environments.

### Parsing Slurm Log Files for Errors

The `torc slurm parse-logs` command scans Slurm stdout/stderr log files for known error patterns and correlates them with the affected Torc jobs:
```bash
# Parse logs for a specific workflow
torc slurm parse-logs <workflow_id>

# Specify a custom output directory
torc slurm parse-logs <workflow_id> --output-dir /path/to/output

# Output as JSON for programmatic processing
torc slurm parse-logs <workflow_id> --format json
```
The command detects common Slurm failure patterns, including:

**Memory Errors:**
- `out of memory`, `oom-kill`, `cannot allocate memory`
- `memory cgroup out of memory`, `Exceeded job memory limit`
- `task/cgroup: .*: Killed`
- `std::bad_alloc` (C++), `MemoryError` (Python)

**Slurm-Specific Errors:**
- `slurmstepd: error:`, `srun: error:`
- `DUE TO TIME LIMIT`, `DUE TO PREEMPTION`
- `NODE_FAIL`, `FAILED`, `CANCELLED`
- `Exceeded.*step.*limit`

**GPU/CUDA Errors:**
- `CUDA out of memory`, `CUDA error`, `GPU memory.*exceeded`

**Signal/Crash Errors:**
- `Segmentation fault`, `SIGSEGV`
- `Bus error`, `SIGBUS`
- `killed by signal`, `core dumped`

**Python Errors:**
- `Traceback (most recent call last)`
- `ModuleNotFoundError`, `ImportError`

**File System Errors:**
- `No space left on device`, `Disk quota exceeded`
- `Read-only file system`, `Permission denied`

**Network Errors:**
- `Connection refused`, `Connection timed out`, `Network is unreachable`

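When `torc slurm parse-logs` isn't available (for example, when inspecting logs copied to another machine), a rough manual scan with `grep` covers a subset of the same patterns. This is only a sketch: the `output/slurm_output_*.e` naming follows the log file convention described later in this section, and a sample log is created here purely for illustration:

```shell
# Manual scan for a subset of the error patterns listed above.
# A sample log file is created for illustration; on a real cluster,
# point the glob at your workflow's actual output directory instead.
mkdir -p output
printf 'step started\nslurmstepd: error: Detected 1 oom-kill event\n' \
    > output/slurm_output_12345.e

# -n prints the matching line number; -E enables the alternation pattern.
grep -nE 'out of memory|oom-kill|DUE TO TIME LIMIT|NODE_FAIL|Segmentation fault|CUDA out of memory' \
    output/slurm_output_*.e
```

Unlike `parse-logs`, this cannot correlate matches back to the affected Torc jobs; it is only a quick first pass.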
**Example output (table format):**

```
Slurm Log Analysis Results
==========================

Found 2 error(s) in log files:

╭──────────────────────┬──────────────┬──────┬───────────────────────────┬──────────┬─────────────────────────╮
│ File                 │ Slurm Job ID │ Line │ Pattern                   │ Severity │ Affected Torc Jobs      │
├──────────────────────┼──────────────┼──────┼───────────────────────────┼──────────┼─────────────────────────┤
│ slurm_output_12345.e │ 12345        │ 42   │ Out of Memory (OOM) Kill  │ critical │ process_data (ID: 456)  │
│ slurm_output_12346.e │ 12346        │ 15   │ CUDA out of memory        │ error    │ train_model (ID: 789)   │
╰──────────────────────┴──────────────┴──────┴───────────────────────────┴──────────┴─────────────────────────╯
```

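When using `--format json`, the results can be post-processed with `jq`. The field names below (`severity`, `pattern`) merely mirror the table columns above; they are assumptions, not a documented schema, so inspect the real JSON output first. A simulated input stands in for the command here:

```shell
# Hypothetical sketch: keep only critical-severity findings.
# The field names mirror the table columns and are assumptions, not a
# documented schema. On a cluster, pipe the output of
# `torc slurm parse-logs <workflow_id> --format json` instead of echo.
echo '[{"pattern": "Out of Memory (OOM) Kill", "severity": "critical"},
      {"pattern": "CUDA out of memory", "severity": "error"}]' \
  | jq -c '[.[] | select(.severity == "critical")]'
```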
### Viewing Slurm Accounting Data

The `torc slurm sacct` command displays a summary of Slurm job accounting data for all scheduled compute nodes in a workflow:

```bash
# Display the sacct summary table for a workflow
torc slurm sacct <workflow_id>

# Also save full JSON files for detailed analysis
torc slurm sacct <workflow_id> --save-json --output-dir /path/to/output

# Output as JSON for programmatic processing
torc slurm sacct <workflow_id> --format json
```
The command displays a summary table with key metrics:

- **Slurm Job**: The Slurm job ID
- **Job Step**: Name of the job step (e.g., "worker_1", "batch")
- **State**: Job state (COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY, etc.)
- **Exit Code**: Exit code of the job step
- **Elapsed**: Wall-clock time for the job step
- **Max RSS**: Maximum resident set size (peak memory usage)
- **CPU Time**: Total CPU time consumed
- **Nodes**: Compute nodes used

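These columns correspond to standard fields of Slurm's own `sacct` command, so a single job can also be cross-checked directly without going through Torc. This assumes a node where Slurm client tools are installed; `torc slurm sacct` aggregates similar data across the whole workflow:

```shell
# Cross-check one Slurm job with sacct directly (replace 12345 with a
# real job ID). Falls back to a notice on machines without Slurm.
if command -v sacct >/dev/null 2>&1; then
    sacct -j 12345 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS,TotalCPU,NodeList
else
    echo "sacct not found: run this on a Slurm login node"
fi
```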
**Example output:**

```
Slurm Accounting Summary for Workflow 123

╭───────────┬──────────┬───────────┬───────────┬─────────┬─────────┬──────────┬────────╮
│ Slurm Job │ Job Step │ State     │ Exit Code │ Elapsed │ Max RSS │ CPU Time │ Nodes  │
├───────────┼──────────┼───────────┼───────────┼─────────┼─────────┼──────────┼────────┤
│ 12345     │ worker_1 │ COMPLETED │ 0         │ 2h 15m  │ 4.5GB   │ 4h 30m   │ node01 │
│ 12345     │ batch    │ COMPLETED │ 0         │ 2h 16m  │ 128.0MB │ 1m 30s   │ node01 │
│ 12346     │ worker_1 │ FAILED    │ 1         │ 45m 30s │ 8.2GB   │ 1h 30m   │ node02 │
╰───────────┴──────────┴───────────┴───────────┴─────────┴─────────┴──────────┴────────╯

Total: 3 job steps
```

Use `--save-json` to also save the full sacct JSON output to files for detailed analysis:

```bash
torc slurm sacct 123 --save-json --output-dir output
# Creates: output/sacct_12345.json, output/sacct_12346.json, etc.
```

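The saved files use sacct's JSON layout, where job steps are nested under `.jobs[].steps[]`. A toy file is created below purely to show the `jq` path; inner field names can vary between Slurm versions, so verify them against your actual files:

```shell
# Toy sacct-style JSON illustrating the .jobs[].steps[] layout used by
# sacct's JSON output; real files carry many more fields per step.
mkdir -p output
cat > output/sacct_12345.json <<'EOF'
{"jobs": [{"steps": [{"tres": {"requested": [{"type": "mem", "count": 8192}]}}]}]}
EOF

# Pull out the requested TRES (trackable resources) for every step.
jq -c '.jobs[].steps[].tres.requested' output/sacct_12345.json
```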
### Viewing Slurm Logs in torc-dash

The torc-dash web interface provides two ways to view Slurm logs:

#### Debugging Tab - Slurm Log Analysis

The Debugging tab includes a "Slurm Log Analysis" section:

1. Navigate to the **Debugging** tab
2. Find the **Slurm Log Analysis** section
3. Enter the output directory path (default: `output`)
4. Click **Analyze Slurm Logs**

The results show all detected errors with their Slurm job IDs, line numbers, error patterns, severity levels, and affected Torc jobs.

#### Debugging Tab - Slurm Accounting Data

The Debugging tab also includes a "Slurm Accounting Data" section:

1. Navigate to the **Debugging** tab
2. Find the **Slurm Accounting Data** section
3. Click **Collect sacct Data**

This displays a summary table showing job state, exit codes, elapsed time, memory usage (Max RSS), CPU time, and nodes for all Slurm job steps. The table helps you quickly identify failed jobs and resource usage patterns.

#### Scheduled Nodes Tab - View Slurm Logs

You can view individual Slurm job logs directly from the Details view:

1. Select a workflow
2. Go to the **Details** tab
3. Switch to the **Scheduled Nodes** sub-tab
4. Find a Slurm scheduled node in the table
5. Click the **View Logs** button in the Logs column

This opens a modal with tabs for viewing the Slurm job's stdout and stderr files.

### Viewing Slurm Logs in the TUI

The `torc tui` terminal interface also supports Slurm log viewing:

1. Launch the TUI: `torc tui`
2. Select a workflow and press Enter to load details
3. Press Tab to switch to the **Scheduled Nodes** tab
4. Navigate to a Slurm scheduled node using the arrow keys
5. Press `l` to view the Slurm job's logs

The log viewer shows:

- **stdout tab**: Slurm job standard output (`slurm_output_<id>.o`)
- **stderr tab**: Slurm job standard error (`slurm_output_<id>.e`)

Use Tab to switch between stdout/stderr, the arrow keys to scroll, `/` to search, and `q` to close.

### Debugging Slurm Job Failures

When a Slurm job fails, follow this debugging workflow:

1. **Parse logs for known errors:**

   ```bash
   torc slurm parse-logs <workflow_id>
   ```

2. **If OOM or resource issues are detected, collect sacct data:**

   ```bash
   torc slurm sacct <workflow_id>
   jq '.jobs[].steps[].tres.requested' output/sacct_<slurm_job_id>.json
   ```

3. **View the specific Slurm log files:**
   - Use torc-dash: Details → Scheduled Nodes → View Logs
   - Or use the TUI: Scheduled Nodes tab → press `l`
   - Or directly: `cat output/slurm_output_<slurm_job_id>.e`

4. **Check the job's own stderr for application errors:**

   ```bash
   torc reports results <workflow_id> > report.json
   jq -r '.results[] | select(.return_code != 0) | .job_stderr' report.json | xargs cat
   ```

5. **Review dmesg logs for system-level issues:**

   ```bash
   cat output/dmesg_slurm_<slurm_job_id>_*.log
   ```

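The first two steps above can be wrapped into a small helper so that first-pass triage is a single command. This is only a sketch: it assumes `torc` is on `PATH` and prints a notice when it is not, and the helper name is illustrative, not part of Torc:

```shell
# Sketch of a first-pass triage helper combining steps 1 and 2 above.
# Assumes torc is on PATH; prints a notice on machines where it is not.
triage_slurm_workflow() {
    local workflow_id="$1"
    if ! command -v torc >/dev/null 2>&1; then
        echo "torc not found on PATH; run this on the cluster"
        return 0
    fi
    echo "== Known error patterns in Slurm logs =="
    torc slurm parse-logs "$workflow_id"
    echo "== Slurm accounting summary =="
    torc slurm sacct "$workflow_id"
}

triage_slurm_workflow 123
```

If the parse step reports OOM or resource errors, continue with steps 3–5 to inspect the specific log files.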
## Related Commands

- **`torc results list`**: View a summary of job results in table format
- **`torc workflows status`**: Check overall workflow status
- **`torc reports check-resource-utilization`**: Analyze resource usage and find over-utilized jobs
- **`torc jobs list`**: View all jobs and their current status
- **`torc slurm parse-logs`**: Parse Slurm logs for known error patterns
- **`torc slurm sacct`**: Collect Slurm accounting data for workflow jobs
- **`torc-dash`**: Launch the web interface with an interactive Debugging tab
- **`torc tui`**: Launch the terminal UI with Slurm log viewing

The `reports results` command and the torc-dash Debugging tab complement these by providing complete log file paths and content viewing for in-depth debugging when the high-level views aren't sufficient.
