Commit 1bb35c2

Merge pull request #59 from NREL/feat/mcp-server
Add a Torc MCP server
2 parents 8c644b7 + 96686c6 commit 1bb35c2

29 files changed

+5522
-12
lines changed

Cargo.toml

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@ members = [
44
"torc-server",
55
"torc-slurm-job-runner",
66
"torc-dash",
7+
"torc-mcp-server",
78
]
89
resolver = "2"
910

@@ -96,6 +97,10 @@ hyper-tls = "0.5"
9697
hyper-openssl = "0.9"
9798
openssl = "0.10"
9899

100+
# MCP server
101+
rmcp = { version = "0.1", features = ["server", "macros", "transport-io"] }
102+
schemars = "1.0"
103+
99104
[package]
100105
name = "torc"
101106
version.workspace = true
@@ -159,6 +164,7 @@ client = [
159164
"dep:signal-hook",
160165
"dep:libc",
161166
"dep:nvml-wrapper",
167+
"dep:sha2",
162168
"config",
163169
]
164170
tui = [

docs/src/SUMMARY.md

Lines changed: 3 additions & 0 deletions
@@ -24,6 +24,7 @@
2424
- [Parallelization Strategies](./explanation/parallelization.md)
2525
- [Workflow Actions](./explanation/workflow-actions.md)
2626
- [Slurm Workflows](./explanation/slurm-workflows.md)
27+
- [Automatic Failure Recovery](./explanation/automatic-recovery.md)
2728
- [Design](./explanation/design/README.md)
2829
- [Server API Handler](./explanation/design/server.md)
2930
- [Central Database](./explanation/design/database.md)
@@ -75,6 +76,8 @@
7576
- [Map Python functions across workers](./tutorials/map_python_function_across_workers.md)
7677
- [Filtering CLI Output with Nushell](./tutorials/filtering-with-nushell.md)
7778
- [Custom HPC Profile](./tutorials/custom-hpc-profile.md)
79+
- [MCP Server with Claude Code](./tutorials/mcp-server.md)
80+
- [Automatic Failure Recovery](./tutorials/automatic-recovery.md)
7881

7982
---
8083

docs/src/explanation/README.md

Lines changed: 1 addition & 0 deletions
@@ -16,3 +16,4 @@ This section provides understanding-oriented discussions of Torc's key concepts
1616
- Ready queue optimization for large workflows
1717
- Parallelization strategies and job allocation approaches
1818
- Workflow actions for automation and dynamic resource allocation
19+
- AI-assisted recovery for diagnosing and fixing job failures
Lines changed: 196 additions & 0 deletions
@@ -0,0 +1,196 @@
1+
# Automatic Failure Recovery
2+
3+
This document explains how Torc's automatic failure recovery system works, the design principles behind it, and when to use automatic versus manual recovery.
4+
5+
## Overview
6+
7+
Torc provides **automatic failure recovery** through the `torc watch --auto-recover` command. When jobs fail, the system:
8+
9+
1. Diagnoses the failure cause (OOM, timeout, or unknown)
10+
2. Applies heuristics to adjust resource requirements
11+
3. Resets failed jobs and submits new Slurm allocations
12+
4. Resumes monitoring until completion or max retries
13+
14+
This deterministic approach handles the majority of HPC failures without human intervention.
15+
16+
## Design Principles
17+
18+
### Why Deterministic Recovery?
19+
20+
Most HPC job failures fall into predictable categories:
21+
22+
| Failure Type | Frequency | Solution |
|--------------|-----------|----------|
| Out of Memory | ~60% | Increase memory allocation |
| Timeout | ~25% | Increase runtime limit |
| Transient errors | ~10% | Simple retry |
| Code bugs | ~5% | Manual intervention |
28+
29+
For the roughly 85% of failures caused by OOM or timeout, the solution is mechanical: increase resources and retry. Transient errors (about 10%) need only a plain retry. Neither case requires AI judgment; simple heuristics work well.
30+
31+
### Recovery Architecture
32+
33+
```mermaid
flowchart LR
    A[torc watch<br/>polling] --> B{Workflow<br/>complete?}
    B -->|No| A
    B -->|Yes, with failures| C[Diagnose failures<br/>check resources]
    C --> D[Apply heuristics<br/>adjust resources]
    D --> E[Submit new<br/>allocations]
    E --> A
    B -->|Yes, success| F[Exit 0]
```
43+
44+
### Failure Detection
45+
46+
Torc tracks resource usage during job execution:
47+
- Memory usage (RSS and peak)
48+
- CPU utilization
49+
- Execution time
50+
51+
This data is analyzed to determine failure causes:
52+
53+
**OOM Detection:**
54+
- Peak memory exceeds specified limit
55+
- Exit code 137 (128 + 9: the job was terminated by SIGKILL, typically sent by the OOM killer)
56+
- Flag: `likely_oom: true`
57+
58+
**Timeout Detection:**
59+
- Execution time within 10% of runtime limit
60+
- Job was killed (not graceful exit)
61+
- Flag: `likely_timeout: true`
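
The detection rules above can be sketched as a small classifier. This is an illustration only, not Torc's implementation: the `JobResult` record and its field names are hypothetical, and two points the text leaves open are filled in as assumptions (either OOM signal is sufficient on its own, and "within 10%" means at least 90% of the runtime limit was used).

```python
from dataclasses import dataclass


@dataclass
class JobResult:
    """Hypothetical record of one finished job (field names are illustrative)."""
    exit_code: int
    peak_memory_bytes: int
    memory_limit_bytes: int
    exec_seconds: float
    runtime_limit_seconds: float
    was_killed: bool


def diagnose(result: JobResult) -> dict:
    """Return likely_oom / likely_timeout flags for a failed job."""
    # Assumption: either OOM signal is enough to set the flag.
    likely_oom = (
        result.peak_memory_bytes > result.memory_limit_bytes
        or result.exit_code == 137  # 128 + SIGKILL (9)
    )
    # Assumption: "within 10%" means at least 90% of the limit was consumed,
    # and the job must also have been killed rather than exiting on its own.
    near_limit = result.exec_seconds >= 0.9 * result.runtime_limit_seconds
    likely_timeout = near_limit and result.was_killed
    return {"likely_oom": likely_oom, "likely_timeout": likely_timeout}
```

A job that sets neither flag falls into the "unknown" category and is the kind of failure that the manual workflow described later is meant for.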
62+
63+
### Recovery Heuristics
64+
65+
Default multipliers applied to failed jobs:
66+
67+
| Failure | Default Multiplier | Configurable |
|---------|-------------------|--------------|
| OOM | 1.5x memory | `--memory-multiplier` |
| Timeout | 1.5x runtime | `--runtime-multiplier` |
71+
72+
Example: A job with 8g memory that fails with OOM gets 12g on retry.
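
The arithmetic behind that example is shown below as a minimal sketch. The helper names `scale_memory` and `scale_runtime_minutes` are hypothetical, the `8g`-style string format is taken from the example above rather than from Torc's parser, and rounding up to a whole unit is an assumption made here so that a retry never receives less than the multiplied amount.

```python
import re
from decimal import Decimal, ROUND_CEILING


def _scale(value: int, multiplier: float) -> int:
    """Multiply and round up, using Decimal to avoid float artifacts."""
    product = Decimal(value) * Decimal(str(multiplier))
    return int(product.to_integral_value(rounding=ROUND_CEILING))


def scale_memory(spec: str, multiplier: float = 1.5) -> str:
    """Scale a memory string such as '8g', keeping its unit suffix."""
    match = re.fullmatch(r"(\d+)([kmgt])", spec.strip().lower())
    if match is None:
        raise ValueError(f"unsupported memory spec: {spec!r}")
    value, unit = int(match.group(1)), match.group(2)
    return f"{_scale(value, multiplier)}{unit}"


def scale_runtime_minutes(minutes: int, multiplier: float = 1.5) -> int:
    """Scale a runtime limit expressed in whole minutes."""
    return _scale(minutes, multiplier)


print(scale_memory("8g"))         # 12g
print(scale_runtime_minutes(60))  # 90
```

Because the multiplier is applied on every retry, three retries at 1.5x grow the original request by a factor of about 3.4, which is worth keeping in mind when choosing `--max-retries`.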
73+
74+
### Slurm Scheduler Regeneration
75+
76+
After adjusting resources, the system regenerates Slurm schedulers:
77+
78+
1. Finds all pending jobs (uninitialized, ready, blocked)
79+
2. Groups by resource requirements
80+
3. Calculates minimum allocations needed
81+
4. Creates new schedulers with appropriate walltimes
82+
5. Submits allocations to Slurm
83+
84+
This is handled by `torc slurm regenerate --submit`.
85+
86+
## Configuration
87+
88+
### Command-Line Options
89+
90+
```bash
# A comment cannot follow a line-continuation backslash, so the flags
# are collected in an array, where inline comments are allowed.
watch_args=(
    --auto-recover             # Enable automatic recovery
    --max-retries 3            # Maximum recovery attempts
    --memory-multiplier 1.5    # Memory increase factor for OOM
    --runtime-multiplier 1.5   # Runtime increase factor for timeout
    --poll-interval 60         # Seconds between status checks
    --output-dir output        # Directory for job output files
    --show-job-counts          # Display job counts during polling (optional)
)
torc watch <workflow_id> "${watch_args[@]}"
```
100+
101+
### Retry Limits
102+
103+
The `--max-retries` option prevents infinite retry loops. After exceeding this limit, the system exits with an error, indicating manual intervention is needed.
104+
105+
Default: 3 retries
106+
107+
## When to Use Manual Recovery
108+
109+
Automatic recovery works well for resource-related failures, but some situations require manual intervention:
110+
111+
### Use Manual Recovery When:
112+
113+
1. **Jobs keep failing after max retries**
114+
- The heuristics aren't solving the problem
115+
- Need to investigate root cause
116+
117+
2. **Unknown failure modes**
118+
- Exit codes that don't indicate OOM/timeout
119+
- Application-specific errors
120+
121+
3. **Code bugs**
122+
- Jobs fail consistently with same error
123+
- No resource issue detected
124+
125+
4. **Cost optimization**
126+
- Want to analyze actual usage before increasing
127+
- Need to decide whether job is worth more resources
128+
129+
### MCP Server for Manual Recovery
130+
131+
The Torc MCP server provides tools for AI-assisted investigation:
132+
133+
| Tool | Purpose |
|------|---------|
| `get_workflow_status` | Get overall workflow status |
| `list_failed_jobs` | List failed jobs with error info |
| `get_job_logs` | Read stdout/stderr logs |
| `check_resource_utilization` | Detailed resource analysis |
| `update_job_resources` | Manually adjust resources |
| `restart_jobs` | Reset and restart jobs |
| `resubmit_workflow` | Regenerate Slurm schedulers |
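
For orientation, an MCP client invokes one of these tools by sending a JSON-RPC 2.0 `tools/call` request. The sketch below only builds such a message; the `workflow_id` argument name is a guess for illustration, since the tools' actual input schemas are not shown here.

```python
import json


def tool_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 'tools/call' request as used by MCP clients."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# 'workflow_id' is a hypothetical argument name, not a documented one.
print(tool_call_request(1, "list_failed_jobs", {"workflow_id": 42}))
```

In practice an MCP-aware client such as Claude Code performs the initialization handshake and sends these requests itself, so messages like this rarely need to be written by hand.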
142+
143+
## Comparison
144+
145+
| Feature | Automatic | Manual/AI-Assisted |
|---------|-----------|-------------------|
| Human involvement | None | Interactive |
| Speed | Fast | Depends on human |
| Handles OOM/timeout | Yes | Yes |
| Handles unknown errors | Retry only | Full investigation |
| Cost optimization | Basic | Can be sophisticated |
| Use case | Production workflows | Debugging, optimization |
153+
154+
## Implementation Details
155+
156+
### The Watch Command
157+
158+
```bash
torc watch <workflow_id> --auto-recover
```
161+
162+
Main loop:
163+
1. Poll `is_workflow_complete` API
164+
2. Print status updates
165+
3. On completion, check for failures
166+
4. If failures and auto-recover enabled:
167+
- Run `torc reports check-resource-utilization --include-failed`
168+
- Parse results for `likely_oom` and `likely_timeout` flags
169+
- Update resource requirements via API
170+
- Run `torc workflows reset-status --failed-only --restart`
171+
- Run `torc slurm regenerate --submit`
172+
- Increment retry counter
173+
- Resume polling
174+
5. Exit 0 on success, exit 1 on max retries exceeded
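
The control flow of the main loop can be sketched as follows. This is a simplified model and not the `torc watch` source: `is_complete`, `has_failures`, and `recover` are hypothetical callables standing in for the API poll, the failure check, and the diagnose/update/reset/regenerate sequence in step 4.

```python
import time
from typing import Callable


def watch(
    is_complete: Callable[[], bool],
    has_failures: Callable[[], bool],
    recover: Callable[[], None],
    max_retries: int = 3,
    poll_interval: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the workflow finishes; return a process exit code."""
    retries = 0
    while True:
        if not is_complete():
            sleep(poll_interval)  # still running: wait and poll again
            continue
        if not has_failures():
            return 0  # finished cleanly
        if retries >= max_retries:
            return 1  # retry budget exhausted: manual intervention needed
        recover()  # adjust resources, reset failed jobs, resubmit
        retries += 1
```

Passing `sleep` as a parameter is a choice made for this sketch so the loop can be exercised in tests without waiting for real poll intervals.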
175+
176+
### The Regenerate Command
177+
178+
```bash
torc slurm regenerate <workflow_id> --submit
```
181+
182+
1. Query jobs with status uninitialized/ready/blocked
183+
2. Group by resource requirements
184+
3. For each group:
185+
- Find best partition using HPC profile
186+
- Calculate jobs per node
187+
- Determine number of allocations needed
188+
- Create scheduler config
189+
4. Update jobs with new scheduler reference
190+
5. Submit allocations via sbatch
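
Steps 2 and 3 amount to a grouping followed by a ceiling division. The sketch below illustrates that calculation under stated assumptions: the job dictionaries and their keys are hypothetical, every allocation is a single node with the given CPU and memory capacity, and each allocation runs one batch of jobs concurrently. Partition selection from the HPC profile is not modeled.

```python
import math
from collections import defaultdict


def plan_allocations(jobs: list, node_cpus: int, node_memory_gb: int) -> list:
    """Group pending jobs by resource requirements and size each group."""
    groups = defaultdict(list)
    for job in jobs:
        key = (job["cpus"], job["memory_gb"], job["runtime_minutes"])
        groups[key].append(job["name"])

    plans = []
    for (cpus, memory_gb, runtime_minutes), names in groups.items():
        # A node fits as many jobs as its scarcer resource allows.
        jobs_per_node = min(node_cpus // cpus, node_memory_gb // memory_gb)
        if jobs_per_node < 1:
            raise ValueError(f"jobs {names} do not fit on one node")
        plans.append({
            "jobs": names,
            "jobs_per_node": jobs_per_node,
            "allocations": math.ceil(len(names) / jobs_per_node),
            "walltime_minutes": runtime_minutes,
        })
    return plans
```

For example, 20 jobs that each need 1 CPU and 48 GB on nodes with 36 CPUs and 96 GB are memory-bound at 2 jobs per node, so this sketch would request 10 allocations.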
191+
192+
## See Also
193+
194+
- [Automatic Failure Recovery Tutorial](../tutorials/automatic-recovery.md) - Step-by-step guide
195+
- [MCP Server Tutorial](../tutorials/mcp-server.md) - Setting up AI-assisted tools
196+
- [Resource Monitoring](../how-to/resource-monitoring.md) - Understanding resource tracking

docs/src/tutorials/README.md

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,8 @@ This section contains learning-oriented lessons to help you get started with Tor
1616
10. [Map Python Functions](./map_python_function_across_workers.md) - Distribute Python functions across workers
1717
11. [Filtering CLI Output with Nushell](./filtering-with-nushell.md) - Filter jobs, results, and user data with readable queries
1818
12. [Custom HPC Profile](./custom-hpc-profile.md) - Create an HPC profile for unsupported clusters
19+
13. [MCP Server with Claude Code](./mcp-server.md) - Enable Claude to interact with your workflows
20+
14. [Automatic Failure Recovery](./automatic-recovery.md) - Autonomous workflow monitoring with `torc watch`
1921

2022
Start with the Configuration Files tutorial to set up your environment, then try the Dashboard Deployment tutorial if you want to use the web interface.
2123

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
# Automatic Failure Recovery
