Commit d95b036

Merge pull request #30 from NREL/shared-parameters
Allow shared parameters in spec files
2 parents: 8357a34 + 8097bb4

36 files changed: +1353 −425 lines

CLAUDE.md

Lines changed: 0 additions & 2 deletions
@@ -41,8 +41,6 @@ torc/
 │   ├── lib.rs # Library root
 │   └── models.rs # Shared data models
 ├── torc-server/ # Standalone server binary
-├── torc-tui/ # Standalone TUI binary
-├── torc-plot-resources/ # Standalone plotting binary
 ├── torc-slurm-job-runner/ # Slurm job runner binary
 ├── python_client/ # Python CLI client and library
 │   ├── src/torc/ # Python package

docs/README.md

Lines changed: 0 additions & 1 deletion
@@ -76,7 +76,6 @@ src/
 │   ├── job-states.md
 │   ├── reinitialization.md
 │   ├── dependencies.md
-│   └── ready-queue.md
 
 ├── how-to/ # Problem-oriented
 │   ├── README.md

docs/src/SUMMARY.md

Lines changed: 0 additions & 1 deletion
@@ -24,7 +24,6 @@
 - [Design](./explanation/design/README.md)
   - [Server API Handler](./explanation/design/server.md)
   - [Central Database](./explanation/design/database.md)
-  - [Ready Queue](./explanation/design/ready-queue.md)
 
 # How-To Guides
 
docs/src/explanation/dependencies.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ jobs:
   - name: analyze
     command: analyze.sh
     blocked_by:
-      - job1
+      - preprocess
 ```
 
 ## 2. Implicit Dependencies
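The explicit `blocked_by` mechanism corrected in this diff can be illustrated with a small readiness check: a job becomes ready once every job it is blocked by has completed. This is a hedged sketch, not Torc's implementation; the function name and job-dictionary shape are assumptions.

```python
# Illustrative readiness check for explicit dependencies (not Torc's code):
# a job is ready when it has not yet completed and every entry in its
# blocked_by list has completed.
def ready_jobs(jobs, completed):
    """Return names of jobs whose blockers have all completed."""
    ready = []
    for job in jobs:
        blockers = job.get("blocked_by", [])
        if job["name"] not in completed and all(b in completed for b in blockers):
            ready.append(job["name"])
    return ready


jobs = [
    {"name": "preprocess"},
    {"name": "analyze", "blocked_by": ["preprocess"]},
]
print(ready_jobs(jobs, completed=set()))           # ['preprocess']
print(ready_jobs(jobs, completed={"preprocess"}))  # ['analyze']
```

This mirrors the corrected example: `analyze` only becomes ready after `preprocess` finishes.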

docs/src/explanation/environment-variables.md

Lines changed: 0 additions & 14 deletions
@@ -53,10 +53,6 @@ curl -X POST "${TORC_API_URL}/files" \
   }"
 ```
 
-## Implementation Details
-
-These environment variables are set by the job runner when spawning job processes. The implementation can be found in `src/client/async_cli_command.rs` in the `start()` method.
-
 ## Complete Example
 
 Here's a complete example of a job that uses all three environment variables:
@@ -83,16 +79,6 @@ jobs:
       # Do some work
       echo "Processing data..." > "${OUTPUT_DIR}/status.txt"
       date >> "${OUTPUT_DIR}/status.txt"
-
-      # Register the output file with Torc
-      curl -X POST "${TORC_API_URL}/files" \
-        -H "Content-Type: application/json" \
-        -d "{
-          \"workflow_id\": ${TORC_WORKFLOW_ID},
-          \"name\": \"job_${TORC_JOB_ID}_output\",
-          \"path\": \"${OUTPUT_DIR}/status.txt\"
-        }"
-
       echo "Job completed successfully!"
 ```
 
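As a minimal illustration of how a job process can consume the `TORC_API_URL`, `TORC_WORKFLOW_ID`, and `TORC_JOB_ID` variables this file documents, here is a hedged Python sketch; only the variable names come from the docs, the helper itself is hypothetical.

```python
# Hypothetical job-side helper: read the environment variables that the
# Torc job runner sets when spawning a job process.
import os


def job_context():
    """Collect Torc-provided environment variables; unset ones come back as None."""
    return {
        "api_url": os.environ.get("TORC_API_URL"),
        "workflow_id": os.environ.get("TORC_WORKFLOW_ID"),
        "job_id": os.environ.get("TORC_JOB_ID"),
    }


if __name__ == "__main__":
    ctx = job_context()
    print(f"workflow={ctx['workflow_id']} job={ctx['job_id']} server={ctx['api_url']}")
```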

docs/src/explanation/job-runners.md

Lines changed: 33 additions & 16 deletions
@@ -16,16 +16,28 @@ The job runner supports two different strategies for retrieving and executing jo
 **Used when**: `--max-parallel-jobs` is NOT specified
 
 **Behavior**:
-- Retrieves jobs from the server via `GET /workflows/{id}/claim_jobs_based_on_resources`
+- Retrieves jobs from the server via the command `claim_jobs_based_on_resources`
 - Server filters jobs based on available compute node resources (CPU, memory, GPU)
 - Only returns jobs that fit within the current resource capacity
 - Prevents resource over-subscription and ensures jobs have required resources
-- Defaults to requiring one CPU for each job.
+- Defaults to requiring one CPU and 1 MB of memory for each job.
 
-**Use case**: When you have heterogeneous jobs with different resource requirements and want
+**Use cases**:
+- When you want parallelization based on one CPU per job.
+- When you have heterogeneous jobs with different resource requirements and want
   intelligent resource management.
 
-**Example**:
+**Example 1: Run jobs at a queue depth of num_cpus**:
+```yaml
+parameters:
+  i: "1..100"
+jobs:
+  - name: "work_{i}"
+    command: bash my_script.sh {i}
+    use_parameters: {i}
+```
+
+**Example 2: Resource-based parallelization**:
 ```yaml
 resource_requirements:
   - name: "work_resources"
@@ -34,24 +46,28 @@ resource_requirements:
     runtime: "PT4H"
     num_nodes: 1
 
+parameters:
+  i: "1..100"
 jobs:
-  - name: "work1"
-    command: bash my_script.sh
+  - name: "work_{i}"
+    command: bash my_script.sh {i}
     resource_requirements: work_resources
+    use_parameters: {i}
 ```
 
 ### Simple Queue-Based Allocation
 
 **Used when**: `--max-parallel-jobs` is specified
 
 **Behavior**:
-- Retrieves jobs from the server via `GET /workflows/{id}/claim_next_jobs`
+- Retrieves jobs from the server via the command `claim_next_jobs`
 - Server returns the next N ready jobs from the queue (up to the specified limit)
 - Ignores job resource requirements completely
 - Simply limits the number of concurrent jobs
 
-**Use case**: When all jobs have similar resource needs or when the resource bottleneck is not
-tracked by Torc, such as network or storage I/O.
+**Use cases**: When all jobs have similar resource needs or when the resource bottleneck is not
+tracked by Torc, such as network or storage I/O. This is the only way to run jobs at a queue
+depth higher than the number of CPUs in the worker.
 
 **Example**:
 ```bash
@@ -66,16 +82,17 @@ The job runner executes a continuous loop with these steps:
 
 1. **Check workflow status** - Poll server to check if workflow is complete or canceled
 2. **Monitor running jobs** - Check status of currently executing jobs
-3. **Execute workflow actions** - Check for and execute any pending workflow actions
+3. **Execute workflow actions** - Check for and execute any pending workflow actions, such as
+   scheduling new Slurm allocations.
 4. **Claim new jobs** - Request ready jobs from server based on allocation strategy:
-   - Resource-based: `GET /workflows/{id}/claim_jobs_based_on_resources`
-   - Queue-based: `GET /workflows/{id}/claim_next_jobs`
+   - Resource-based: `claim_jobs_based_on_resources`
+   - Queue-based: `claim_next_jobs`
 5. **Start jobs** - For each claimed job:
-   - Call `POST /jobs/{id}/start_job` to mark job as started in database
-   - Execute job command using `AsyncCliCommand` (non-blocking subprocess)
-   - Track stdout/stderr output to files
+   - Call `start_job` to mark job as started in database
+   - Execute job command in a non-blocking subprocess
+   - Record stdout/stderr output to files
 6. **Complete jobs** - When running jobs finish:
-   - Call `POST /jobs/{id}/complete_job` with exit code and result
+   - Call `complete_job` with exit code and result
    - Server updates job status and automatically marks dependent jobs as ready
 7. **Sleep and repeat** - Wait for job completion poll interval, then repeat loop
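The seven-step loop documented above can be sketched structurally in Python. Everything here is a stand-in: `StubServer` fakes the server's `claim_next_jobs` and `complete_job` commands with an in-memory queue, and workflow actions, resource-based claiming, and stdout/stderr capture are omitted.

```python
# Structural sketch of the job-runner loop (queue-based allocation only);
# the server interaction is faked and error handling is omitted.
import subprocess
import time


class StubServer:
    """Stand-in for the Torc server: hands out shell commands as 'jobs'."""

    def __init__(self, commands):
        self._queue = list(commands)
        self.results = {}

    def claim_next_jobs(self, limit):
        # Step 4 (queue-based): return up to `limit` ready jobs.
        claimed, self._queue = self._queue[:limit], self._queue[limit:]
        return claimed

    def complete_job(self, command, returncode):
        # Step 6: record the exit code for a finished job.
        self.results[command] = returncode

    def workflow_complete(self):
        # Step 1: the workflow is done when no jobs remain in the queue.
        return not self._queue


def run_loop(server, max_parallel_jobs=2, poll_interval=0.01):
    running = []  # (command, Popen) pairs
    while not server.workflow_complete() or running:
        # Steps 2 and 6: monitor running jobs and complete any that finished.
        still_running = []
        for command, proc in running:
            if proc.poll() is None:
                still_running.append((command, proc))
            else:
                server.complete_job(command, proc.returncode)
        running = still_running
        # Steps 4 and 5: claim new jobs and start them as subprocesses.
        for command in server.claim_next_jobs(max_parallel_jobs - len(running)):
            running.append((command, subprocess.Popen(command, shell=True)))
        # Step 7: sleep, then repeat the loop.
        time.sleep(poll_interval)


server = StubServer(["true", "false"])
run_loop(server)
print(server.results)
```

The sketch assumes a POSIX shell where the `true` and `false` commands exist; after the loop, `server.results` maps each command to its exit code (0 and 1 respectively).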

docs/src/explanation/job-states.md

Lines changed: 2 additions & 9 deletions
@@ -33,13 +33,6 @@ stateDiagram-v2
 - **completed** (5) - Finished successfully (exit code 0)
 - **failed** (6) - Finished with error (exit code != 0)
 - **canceled** (7) - Explicitly canceled by user or system
-- **terminated** (8) - Explicitly terminated by user or system
+- **terminated** (8) - Explicitly terminated by system, such as for checkpointing before
+  wall-time timeout
 - **disabled** (9) - Explicitly disabled by user
-
-## Critical State Transitions
-
-1. **initialize_jobs** - Evaluates all dependencies and sets jobs to `ready` or `blocked`
-2. **manage_status_change** - Updates job status and triggers cascade effects:
-   - When a job completes, checks if blocked jobs become ready
-   - Updates workflow status when all jobs complete
-   - Handles `cancel_on_blocking_job_failure` flag
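The numeric codes in the state list above read like the final job states. Purely as an illustration (codes 1-4 are not shown in this excerpt and are deliberately omitted, and the helper name is hypothetical), they can be captured in a small lookup:

```python
# Hypothetical lookup of the final-state codes documented above (5-9);
# earlier codes are not shown in this diff, so they are not included.
FINAL_STATES = {
    5: "completed",
    6: "failed",
    7: "canceled",
    8: "terminated",
    9: "disabled",
}


def is_final(code):
    """True if the code matches one of the documented final states."""
    return code in FINAL_STATES


print(is_final(8), is_final(2))  # True False
```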

docs/src/explanation/reinitialization.md

Lines changed: 2 additions & 1 deletion
@@ -15,7 +15,8 @@ Reinitialization allows workflows to be rerun when inputs change.
 
 The `process_changed_job_inputs` endpoint implements hash-based change detection:
 
-1. For each job, compute SHA256 hash of all inputs (files + user_data).
+1. For each job, compute SHA256 hash of all input parameters. **Note**: files are tracked by
+   modification times, not hashes. User data records are hashed.
 2. Compare to stored hash in the database.
 3. If hash differs, mark job as `uninitialized`.
 4. All updates happen in a single database transaction (all-or-none).
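The change-detection steps above can be sketched as follows. This is an illustration under stated assumptions, not Torc's code: the function and field names are hypothetical, user data is canonicalized as sorted JSON before hashing, and (per the note in the diff) files contribute modification times rather than content hashes.

```python
# Illustrative hash-based change detection: hash the user data, mix in
# each file's path and mtime, and compare against the stored fingerprint.
import hashlib
import json
import os


def input_fingerprint(user_data, file_paths):
    """SHA256 over canonicalized user_data plus each file's path and mtime."""
    h = hashlib.sha256()
    h.update(json.dumps(user_data, sort_keys=True).encode())
    for path in sorted(file_paths):
        mtime = os.stat(path).st_mtime_ns if os.path.exists(path) else None
        h.update(f"{path}:{mtime}".encode())
    return h.hexdigest()


def needs_reinitialization(stored_hash, user_data, file_paths):
    """Steps 2-3: a job whose fingerprint changed goes back to uninitialized."""
    return input_fingerprint(user_data, file_paths) != stored_hash


stored = input_fingerprint({"threshold": 0.5}, [])
print(needs_reinitialization(stored, {"threshold": 0.5}, []))  # False
print(needs_reinitialization(stored, {"threshold": 0.9}, []))  # True
```

Step 4, the single all-or-none database transaction, is not modeled here.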
