# Automatic Failure Recovery

This document explains how Torc's automatic failure recovery system works, its design principles, and when to use automatic vs. manual recovery.

## Overview

Torc provides **automatic failure recovery** through the `torc watch --auto-recover` command. When jobs fail, the system:

1. Diagnoses the failure cause (OOM, timeout, or unknown)
2. Applies heuristics to adjust resource requirements
3. Resets failed jobs and submits new Slurm allocations
4. Resumes monitoring until completion or max retries

This deterministic approach handles the majority of HPC failures without human intervention.

## Design Principles

### Why Deterministic Recovery?

Most HPC job failures fall into predictable categories:

| Failure Type | Frequency | Solution |
|--------------|-----------|----------|
| Out of Memory | ~60% | Increase memory allocation |
| Timeout | ~25% | Increase runtime limit |
| Transient errors | ~10% | Simple retry |
| Code bugs | ~5% | Manual intervention |

For the roughly 85% of failures caused by OOM or timeout, the solution is mechanical: increase resources and retry. This doesn't require AI judgment; simple heuristics work well.
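The table above amounts to a trivial dispatch. A minimal Python sketch (the function and category names are illustrative, not Torc's internals):

```python
# Illustrative sketch: map a diagnosed failure category to a
# mechanical recovery action. Names are assumptions, not Torc code.

def plan_recovery(failure: str) -> str:
    """Return the recovery action for a diagnosed failure category."""
    actions = {
        "oom": "increase memory allocation and retry",
        "timeout": "increase runtime limit and retry",
        "transient": "retry unchanged",
    }
    # Anything else (e.g. a code bug) needs a human.
    return actions.get(failure, "manual intervention")

print(plan_recovery("oom"))      # increase memory allocation and retry
print(plan_recovery("unknown"))  # manual intervention
```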

### Recovery Architecture

```mermaid
flowchart LR
    A[torc watch<br/>polling] --> B{Workflow<br/>complete?}
    B -->|No| A
    B -->|Yes, with failures| C[Diagnose failures<br/>check resources]
    C --> D[Apply heuristics<br/>adjust resources]
    D --> E[Submit new<br/>allocations]
    E --> A
    B -->|Yes, success| F[Exit 0]
```

### Failure Detection

Torc tracks resource usage during job execution:

- Memory usage (RSS and peak)
- CPU utilization
- Execution time

This data is analyzed to determine failure causes:

**OOM Detection:**
- Peak memory exceeds the specified limit
- Exit code 137 (SIGKILL, typically from the OOM killer)
- Flag: `likely_oom: true`

**Timeout Detection:**
- Execution time within 10% of the runtime limit
- Job was killed rather than exiting gracefully
- Flag: `likely_timeout: true`
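The detection rules above can be sketched as a small predicate function. The field names here are assumptions for illustration, not Torc's actual stats schema:

```python
# Illustrative OOM/timeout diagnosis following the rules described
# above. Field names are assumptions, not Torc's actual schema.

def diagnose(stats: dict) -> dict:
    likely_oom = (
        stats["peak_memory_gb"] >= stats["memory_limit_gb"]
        or stats["exit_code"] == 137  # 128 + SIGKILL(9)
    )
    likely_timeout = (
        stats["killed"]
        and stats["runtime_s"] >= 0.9 * stats["runtime_limit_s"]
    )
    return {"likely_oom": likely_oom, "likely_timeout": likely_timeout}

print(diagnose({
    "peak_memory_gb": 8.2, "memory_limit_gb": 8.0, "exit_code": 137,
    "killed": True, "runtime_s": 120, "runtime_limit_s": 3600,
}))  # {'likely_oom': True, 'likely_timeout': False}
```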

### Recovery Heuristics

Default multipliers applied to failed jobs:

| Failure | Default Multiplier | Configurable |
|---------|--------------------|--------------|
| OOM | 1.5x memory | `--memory-multiplier` |
| Timeout | 1.5x runtime | `--runtime-multiplier` |

Example: a job with 8g memory that fails with OOM gets 12g on retry.
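The multiplier step is plain arithmetic; here is the 8g example worked through (`apply_multiplier` is a hypothetical helper, not a Torc function):

```python
# Minimal sketch of the multiplier heuristic from the table above.
# apply_multiplier is a hypothetical helper, not part of Torc.

def apply_multiplier(value: float, multiplier: float = 1.5) -> float:
    """Scale a resource requirement (memory in GB, or runtime in seconds)."""
    return value * multiplier

print(apply_multiplier(8.0))     # 12.0  -> the 8g -> 12g OOM example
print(apply_multiplier(3600.0))  # 5400.0 -> a 1h runtime becomes 1.5h
```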

### Slurm Scheduler Regeneration

After adjusting resources, the system regenerates Slurm schedulers:

1. Finds all pending jobs (uninitialized, ready, or blocked)
2. Groups them by resource requirements
3. Calculates the minimum number of allocations needed
4. Creates new schedulers with appropriate walltimes
5. Submits the allocations to Slurm

This is handled by `torc slurm regenerate --submit`.

## Configuration

### Command-Line Options

```bash
torc watch <workflow_id> \
    --auto-recover \
    --max-retries 3 \
    --memory-multiplier 1.5 \
    --runtime-multiplier 1.5 \
    --poll-interval 60 \
    --output-dir output \
    --show-job-counts
```

- `--auto-recover`: enable automatic recovery
- `--max-retries`: maximum recovery attempts (default: 3)
- `--memory-multiplier`: memory increase factor for OOM failures
- `--runtime-multiplier`: runtime increase factor for timeouts
- `--poll-interval`: seconds between status checks
- `--output-dir`: directory for job output files
- `--show-job-counts`: display job counts during polling (optional)

### Retry Limits

The `--max-retries` option prevents infinite retry loops. Once the limit is exceeded, the system exits with an error, indicating that manual intervention is needed.

Default: 3 retries
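The retry budget behaves like a bounded loop: one initial run plus up to `max_retries` recovery attempts. A sketch of that contract (`run_recovery_cycle` is a hypothetical callback, not Torc's API):

```python
# Bounded-retry sketch of the --max-retries contract. The callback
# run_recovery_cycle is hypothetical; it returns True on success.

def watch(run_recovery_cycle, max_retries: int = 3) -> int:
    """Return exit code 0 on success, 1 once the retry budget is spent."""
    for _attempt in range(max_retries + 1):  # initial run + retries
        if run_recovery_cycle():
            return 0
    return 1
```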

## When to Use Manual Recovery

Automatic recovery works well for resource-related failures, but some situations require manual intervention:

### Use Manual Recovery When:

1. **Jobs keep failing after max retries**
   - The heuristics aren't solving the problem
   - The root cause needs investigation

2. **Unknown failure modes**
   - Exit codes that don't indicate OOM/timeout
   - Application-specific errors

3. **Code bugs**
   - Jobs fail consistently with the same error
   - No resource issue detected

4. **Cost optimization**
   - You want to analyze actual usage before increasing resources
   - You need to decide whether the job is worth more resources

### MCP Server for Manual Recovery

The Torc MCP server provides tools for AI-assisted investigation:

| Tool | Purpose |
|------|---------|
| `get_workflow_status` | Get overall workflow status |
| `list_failed_jobs` | List failed jobs with error info |
| `get_job_logs` | Read stdout/stderr logs |
| `check_resource_utilization` | Detailed resource analysis |
| `update_job_resources` | Manually adjust resources |
| `restart_jobs` | Reset and restart jobs |
| `resubmit_workflow` | Regenerate Slurm schedulers |

## Comparison

| Feature | Automatic | Manual/AI-Assisted |
|---------|-----------|--------------------|
| Human involvement | None | Interactive |
| Speed | Fast | Depends on the human |
| Handles OOM/timeout | Yes | Yes |
| Handles unknown errors | Retry only | Full investigation |
| Cost optimization | Basic | Can be sophisticated |
| Use case | Production workflows | Debugging, optimization |

## Implementation Details

### The Watch Command

```bash
torc watch <workflow_id> --auto-recover
```

Main loop:

1. Poll the `is_workflow_complete` API
2. Print status updates
3. On completion, check for failures
4. If there are failures and auto-recovery is enabled:
   - Run `torc reports check-resource-utilization --include-failed`
   - Parse the results for `likely_oom` and `likely_timeout` flags
   - Update resource requirements via the API
   - Run `torc workflows reset-status --failed-only --restart`
   - Run `torc slurm regenerate --submit`
   - Increment the retry counter
   - Resume polling
5. Exit 0 on success, exit 1 once max retries are exceeded
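The recovery branch of the loop above boils down to running three documented subcommands in order. A sketch that just assembles them (`torc watch` drives this internally; the workflow id placement is an assumption):

```python
# Sketch of the recovery branch of the watch loop: the documented
# subcommands, in order. In practice `torc watch` runs these itself.

def recovery_commands(workflow_id: str) -> list:
    return [
        ["torc", "reports", "check-resource-utilization", "--include-failed"],
        ["torc", "workflows", "reset-status", "--failed-only", "--restart"],
        ["torc", "slurm", "regenerate", workflow_id, "--submit"],
    ]

for cmd in recovery_commands("wf-123"):
    print(" ".join(cmd))
```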

### The Regenerate Command

```bash
torc slurm regenerate <workflow_id> --submit
```

1. Query jobs with status uninitialized/ready/blocked
2. Group them by resource requirements
3. For each group:
   - Find the best partition using the HPC profile
   - Calculate jobs per node
   - Determine the number of allocations needed
   - Create a scheduler config
4. Update jobs with the new scheduler reference
5. Submit allocations via `sbatch`
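Steps 2 and 3 above reduce to grouping and a ceiling division. A sketch under assumed data shapes (the dict keys and `jobs_per_node` input are illustrative, not Torc's schema):

```python
# Illustrative allocation math: group pending jobs by their resource
# requirements, then compute allocations per group. Data shapes are
# assumptions, not Torc's actual schema.
import math
from collections import defaultdict

def plan_allocations(jobs, jobs_per_node: int) -> dict:
    groups = defaultdict(list)
    for job in jobs:
        groups[(job["memory_gb"], job["runtime_s"])].append(job)
    # Minimum allocations per group: ceil(group size / jobs per node).
    return {
        req: math.ceil(len(members) / jobs_per_node)
        for req, members in groups.items()
    }

jobs = [{"memory_gb": 12, "runtime_s": 3600} for _ in range(10)]
print(plan_allocations(jobs, jobs_per_node=4))  # {(12, 3600): 3}
```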

## See Also

- [Automatic Failure Recovery Tutorial](../tutorials/automatic-recovery.md) - Step-by-step guide
- [MCP Server Tutorial](../tutorials/mcp-server.md) - Setting up AI-assisted tools
- [Resource Monitoring](../how-to/resource-monitoring.md) - Understanding resource tracking