
Commit 62abb53

Prototype: AI-assisted error recovery (#96)

1 parent: b982db3

40 files changed: +2992 −431 lines

CLAUDE.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -251,7 +251,7 @@ The Rust client provides a **unified CLI and library interface** with these key
 
 ### Job Status as Integer
 
-Job status values are stored as INTEGER (0-9) in the database, not strings:
+Job status values are stored as INTEGER (0-10) in the database, not strings:
 
 - 0 = uninitialized
 - 1 = blocked
@@ -263,6 +263,7 @@ Job status values are stored as INTEGER (0-9) in the database, not strings:
 - 7 = canceled
 - 8 = terminated
 - 9 = disabled
+- 10 = pending_failed (awaiting AI classification)
 
 ### Resource Formats
 
```
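The extended status mapping above can be condensed into a small enum for illustration. This is a sketch, not Torc's actual type (which lives in the Rust client); codes 2-5 fall between the two diff hunks and are therefore omitted, and the name `JobStatus` and helper `awaiting_classification` are invented here.

```python
from enum import IntEnum

# Illustrative mirror of the integer status codes shown in this diff; Torc's
# real type is in the Rust client and may use different names. Codes 2-5 are
# elided by the diff hunks, so they are intentionally omitted.
class JobStatus(IntEnum):
    UNINITIALIZED = 0
    BLOCKED = 1
    FAILED = 6
    CANCELED = 7
    TERMINATED = 8
    DISABLED = 9
    PENDING_FAILED = 10  # new in this commit: awaiting AI classification

def awaiting_classification(status: int) -> bool:
    """True when a job is parked pending AI-assisted failure classification."""
    return status == JobStatus.PENDING_FAILED

print(awaiting_classification(10))  # True
```

Storing the status as an integer keeps database comparisons cheap; the new code 10 simply extends the existing range rather than changing any prior value.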
api/openapi.yaml

Lines changed: 10 additions & 0 deletions

```diff
@@ -6030,6 +6030,16 @@ components:
           type: integer
         jobs_sort_method:
           $ref: "#/components/schemas/jobs_sort_method"
+        resource_monitor_config:
+          description: Resource monitoring configuration as JSON string
+          type: string
+        slurm_defaults:
+          description: Default Slurm parameters to apply to all schedulers as JSON string
+          type: string
+        use_pending_failed:
+          default: false
+          description: Use PendingFailed status for failed jobs (enables AI-assisted recovery)
+          type: boolean
         status_id:
           type: integer
       required:
```
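A client-side sketch of the three new schema fields follows. Note that `resource_monitor_config` and `slurm_defaults` are typed as JSON *strings*, so nested configuration must be serialized before sending; the payload shape and the config keys used below are assumptions, not part of the schema shown.

```python
import json

# Hypothetical workflow payload exercising the fields this commit adds to the
# OpenAPI schema; the surrounding object structure and config keys are invented.
workflow_fields = {
    "use_pending_failed": True,  # opt in to AI-assisted recovery (default: false)
    # These two are typed as strings in the schema, so nested config must be
    # JSON-encoded rather than sent as objects:
    "resource_monitor_config": json.dumps({"interval_seconds": 10}),
    "slurm_defaults": json.dumps({"account": "my-account", "qos": "normal"}),
}

# The receiving side decodes them back into structured config:
defaults = json.loads(workflow_fields["slurm_defaults"])
print(defaults["account"])  # my-account
```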

docs/src/SUMMARY.md

Lines changed: 11 additions & 4 deletions

```diff
@@ -7,6 +7,8 @@
 - [Overview](./getting-started/getting-started.md)
 - [Installation](./getting-started/installation.md)
 - [Quick Start (Local)](./getting-started/quick-start-local.md)
+- [Quick Start (HPC/Slurm)](./getting-started/quick-start-hpc.md)
+- [Quick Start (Remote Workers)](./getting-started/quick-start-remote.md)
 
 # Core Documentation
 
@@ -25,7 +27,6 @@
 - [Exporting and Importing Workflows](./core/workflows/export-import-workflows.md)
 - [Archiving Workflows](./core/workflows/archiving.md)
 - [How-Tos](./core/how-to/index.md)
-- [Submit a Workflow to Slurm](./core/how-to/submit-slurm-workflow.md)
 - [Track Workflow Status](./core/how-to/track-workflow-status.md)
 - [Cancel a Workflow](./core/how-to/cancel-workflow.md)
 - [View Job Logs](./core/how-to/view-job-logs.md)
@@ -58,10 +59,10 @@
 
 ---
 
-# Specialized Topics
+# Execution Modes
 
 - [HPC & Slurm](./specialized/hpc/index.md)
-- [Quick Start (HPC)](./specialized/hpc/quick-start-hpc.md)
+- [Submit a Workflow to Slurm](./specialized/hpc/submit-slurm-workflow.md)
 - [Slurm Workflows](./specialized/hpc/slurm-workflows.md)
 - [Debugging Slurm Workflows](./specialized/hpc/debugging-slurm.md)
 - [Working with Slurm](./specialized/hpc/slurm.md)
@@ -70,11 +71,16 @@
 - [HPC Deployment](./specialized/hpc/hpc-deployment.md)
 - [Custom HPC Profile](./specialized/hpc/custom-hpc-profile.md)
 - [Remote Workers](./specialized/remote/index.md)
-- [Quick Start (Remote Workers)](./specialized/remote/quick-start-remote.md)
 - [Setting Up Remote Workers](./specialized/remote/remote-workers.md)
+
+---
+
+# Advanced Topics
+
 - [Fault Tolerance & Recovery](./specialized/fault-tolerance/index.md)
 - [Automatic Failure Recovery](./specialized/fault-tolerance/automatic-recovery.md)
 - [Configurable Failure Handlers](./specialized/fault-tolerance/failure-handlers.md)
+- [AI-Assisted Recovery](./specialized/fault-tolerance/ai-assisted-recovery.md)
 - [Job Checkpointing](./specialized/fault-tolerance/checkpointing.md)
 - [Administration & Security](./specialized/admin/index.md)
 - [Server Deployment](./specialized/admin/server-deployment.md)
@@ -100,6 +106,7 @@
 - [Central Database](./specialized/design/database.md)
 - [Workflow Recovery Design](./specialized/design/recovery.md)
 - [Failure Handler Design](./specialized/design/failure-handlers.md)
+- [AI-Assisted Recovery Design](./specialized/design/ai-assisted-recovery.md)
 - [Workflow Graph](./specialized/design/workflow-graph.md)
 - [Interface Architecture](./specialized/design/interfaces.md)
 
```
docs/src/core/concepts/job-states.md

Lines changed: 16 additions & 2 deletions

````diff
@@ -7,34 +7,44 @@ stateDiagram-v2
     [*] --> uninitialized
     uninitialized --> ready: initialize_jobs
     uninitialized --> blocked: has dependencies
+    uninitialized --> disabled: job disabled
 
     blocked --> ready: dependencies met
     ready --> pending: runner claims
     pending --> running: execution starts
 
     running --> completed: exit 0
-    running --> failed: exit != 0
+    running --> failed: exit != 0 (handler match + max retries)
+    running --> pending_failed: exit != 0 (no handler match)
+    running --> ready: exit != 0 (failure handler retry)
     running --> canceled: user cancels
     running --> terminated: system terminates
 
+    pending_failed --> failed: AI classifies as permanent
+    pending_failed --> ready: AI classifies as transient
+    pending_failed --> uninitialized: reset-status
+
     completed --> [*]
     failed --> [*]
     canceled --> [*]
     terminated --> [*]
+    disabled --> [*]
 
     classDef waiting fill:#6c757d,color:#fff
     classDef ready fill:#17a2b8,color:#fff
     classDef active fill:#ffc107,color:#000
     classDef success fill:#28a745,color:#fff
     classDef error fill:#dc3545,color:#fff
     classDef stopped fill:#6f42c1,color:#fff
+    classDef classification fill:#fd7e14,color:#fff
 
     class uninitialized,blocked waiting
     class ready ready
     class pending,running active
     class completed success
     class failed error
-    class canceled,terminated stopped
+    class canceled,terminated,disabled stopped
+    class pending_failed classification
 ```
 
 ## State Descriptions
@@ -48,3 +58,7 @@ stateDiagram-v2
 - **failed** (6) - Finished with error (exit code != 0)
 - **canceled** (7) - Explicitly canceled by user or torc. Never executed.
 - **terminated** (8) - Explicitly terminated by system, such as at wall-time timeout
+- **disabled** (9) - Job is disabled and will not run
+- **pending_failed** (10) - Job failed without a matching failure handler. Awaiting AI-assisted
+  classification to determine if the error is transient (retry) or permanent (fail). See
+  [AI-Assisted Recovery](../specialized/fault-tolerance/ai-assisted-recovery.md).
````
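The failure branches in the updated state diagram can be condensed into one routing function. This is a sketch of the implied logic with invented names (`route_failure` and its parameters), not Torc's runner implementation.

```python
# Sketch of the failure routing implied by the updated state diagram.
# Function and parameter names are illustrative; Torc's runner may differ.
def route_failure(handler_matched: bool, retries_left: int,
                  use_pending_failed: bool) -> str:
    if handler_matched:
        # failure handler retry until max retries are exhausted
        return "ready" if retries_left > 0 else "failed"
    # no matching handler: park for AI classification only when opted in
    return "pending_failed" if use_pending_failed else "failed"

print(route_failure(False, 0, True))  # pending_failed
```

Note that without `use_pending_failed`, unmatched failures behave exactly as before this commit: they go straight to `failed`.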

docs/src/core/concepts/workflow-definition.md

Lines changed: 34 additions & 0 deletions

````diff
@@ -125,6 +125,40 @@ This creates:
 - 10 parallel `process_*` jobs
 - 1 `aggregate` job that waits for all 10 to complete
 
+## Failure Recovery Options
+
+Control how Torc handles job failures:
+
+### Default Behavior
+
+By default, jobs that fail without a matching failure handler use `Failed` status:
+
+```yaml
+name: my_workflow
+jobs:
+  - name: task
+    command: ./run.sh  # If this fails, status = Failed
+```
+
+### AI-Assisted Recovery (Opt-in)
+
+Enable intelligent classification of ambiguous failures:
+
+```yaml
+name: ml_training
+use_pending_failed: true  # Enable AI-assisted recovery
+
+jobs:
+  - name: train_model
+    command: python train.py
+```
+
+With `use_pending_failed: true`:
+
+- Jobs without matching failure handlers get `PendingFailed` status
+- AI agent can analyze stderr and decide whether to retry or fail
+- See [AI-Assisted Recovery](../../specialized/fault-tolerance/ai-assisted-recovery.md) for details
+
 ## See Also
 
 - [Workflow Specification Formats](../workflows/workflow-formats.md) — Complete syntax reference
````
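The "AI agent can analyze stderr" step can be pictured as a classifier that returns the job's next state. The rule-based stand-in below is purely illustrative: the pattern list and function name are invented, and the real agent described in this commit's ai-assisted-recovery docs is not a fixed-pattern matcher.

```python
import re

# Invented stand-in for the AI classification step: flag obviously transient
# errors for retry, everything else as permanent. The actual agent analyzes
# stderr and makes this decision itself; these patterns are only examples.
TRANSIENT_PATTERNS = [r"connection reset", r"timed? ?out", r"temporarily unavailable"]

def classify_failure(stderr: str) -> str:
    """Return the next job state: 'ready' (transient, retry) or 'failed' (permanent)."""
    text = stderr.lower()
    if any(re.search(p, text) for p in TRANSIENT_PATTERNS):
        return "ready"
    return "failed"

print(classify_failure("ERROR: connection reset by peer"))  # ready
```

Whatever the classifier decides, the outcome maps onto the two `pending_failed` exits in the job-state diagram: transient errors requeue the job, permanent ones finalize it as failed.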

docs/src/core/how-to/index.md

Lines changed: 0 additions & 1 deletion

```diff
@@ -2,7 +2,6 @@
 
 Step-by-step guides for common tasks.
 
-- [Submit a Workflow to Slurm](./submit-slurm-workflow.md) - Running workflows on HPC clusters
 - [Track Workflow Status](./track-workflow-status.md) - Monitoring workflow progress
 - [Cancel a Workflow](./cancel-workflow.md) - Stopping running workflows
 - [View Job Logs](./view-job-logs.md) - Accessing job output
```
Lines changed: 1 addition & 57 deletions

````diff
@@ -1,57 +1 @@
-# How to Submit a Workflow to Slurm
-
-Submit a workflow specification to a Slurm-based HPC system with automatic scheduler generation.
-
-## Quick Start
-
-```bash
-torc submit-slurm --account <your-account> workflow.yaml
-```
-
-Torc will:
-
-1. Detect your HPC system (e.g., NREL Kestrel, Eagle)
-2. Match job requirements to appropriate partitions
-3. Generate Slurm scheduler configurations
-4. Submit everything for execution
-
-## Preview Before Submitting
-
-Always preview the generated configuration first:
-
-```bash
-torc slurm generate --account <your-account> workflow.yaml
-```
-
-This shows the Slurm schedulers and workflow actions that would be created without submitting.
-
-## Requirements
-
-Your workflow must define resource requirements for jobs:
-
-```yaml
-name: my_workflow
-
-resource_requirements:
-  - name: standard
-    num_cpus: 4
-    memory: 8g
-    runtime: PT1H
-
-jobs:
-  - name: process_data
-    command: python process.py
-    resource_requirements: standard
-```
-
-## Options
-
-```bash
-# See all options
-torc submit-slurm --help
-```
-
-## See Also
-
-- [Slurm Workflows](../../specialized/hpc/slurm-workflows.md) — Full Slurm integration guide
-- [HPC Profiles](../../specialized/hpc/hpc-profiles.md) — Available HPC system configurations
+# Submit a Workflow to Slurm
````

docs/src/getting-started/getting-started.md

Lines changed: 40 additions & 1 deletion

```diff
@@ -77,6 +77,45 @@ The repository includes ready-to-run workflow specifications in YAML, JSON5, and
 
 See the [examples README](https://github.com/NREL/torc/tree/main/examples) for the complete list.
 
-## Next: Quick Start
+## Choose Your Execution Mode
+
+Torc supports three fundamentally different execution environments. Choose the one that matches your
+use case:
+
+### Local Execution
+
+**Best for:** Development, testing, small-scale workflows on your workstation or a single server
+
+- Jobs run directly on the machine where you start the job runner
+- No scheduler needed — simple setup with `torc run`
+- Resource management via local CPU/memory/GPU tracking
+- **[Quick Start (Local)](quick-start-local.md)**
+
+### HPC/Slurm
+
+**Best for:** Large-scale computations on institutional HPC clusters
+
+- Jobs submitted to Slurm scheduler for compute node allocation
+- Automatic resource matching to partitions/QOS
+- Built-in profiles for common HPC systems
+- **[Quick Start (HPC/Slurm)](quick-start-hpc.md)**
+
+### Remote Workers
+
+**Best for:** Distributed execution across multiple machines you control via SSH
+
+- Jobs distributed to remote workers over SSH
+- No HPC scheduler required — you manage the machines
+- Flexible heterogeneous resources (mix of CPU/GPU machines)
+- **[Quick Start (Remote Workers)](quick-start-remote.md)**
+
+---
+
+**All three modes:**
+
+- Share the same workflow specification format
+- Use the same server API for coordination
+- Support the same monitoring tools (CLI, TUI, Dashboard)
+- Can be used together (e.g., develop locally, deploy to HPC)
 
 Continue to the [Quick Start](./quick-start.md) guide to run your first workflow.
```

docs/src/getting-started/installation.md

Lines changed: 28 additions & 1 deletion

````diff
@@ -2,7 +2,34 @@
 
 ## Precompiled Binaries (Recommended)
 
-Download precompiled binaries from the [releases page](https://github.com/NREL/torc/releases).
+1. Download the appropriate archive for your platform from the
+   [releases page](https://github.com/NREL/torc/releases):
+   - **Linux**: `torc-<version>-x86_64-unknown-linux-gnu.tar.gz`
+   - **macOS (Intel)**: `torc-<version>-x86_64-apple-darwin.tar.gz`
+   - **macOS (Apple Silicon)**: `torc-<version>-aarch64-apple-darwin.tar.gz`
+
+2. Extract the archive:
+
+   ```bash
+   # For .tar.gz files
+   tar -xzf torc-<version>-<platform>.tar.gz
+
+   # For .zip files
+   unzip torc-<version>-<platform>.zip
+   ```
+
+3. Add the binaries to a directory in your system PATH:
+
+   ```bash
+   # Option 1: Copy to an existing PATH directory
+   cp torc* ~/.local/bin/
+
+   # Option 2: Add the extracted directory to your PATH
+   export PATH="/path/to/extracted/torc:$PATH"
+   ```
+
+To make the PATH change permanent, add the export line to your shell configuration file
+(`~/.bashrc`, `~/.zshrc`, etc.).
 
 **macOS users**: The precompiled binaries are not signed with an Apple Developer certificate. macOS
 Gatekeeper will block them by default. To allow the binaries to run, remove the quarantine attribute
````
