@@ -16,16 +16,28 @@ The job runner supports two different strategies for retrieving and executing jo
1616**Used when**: `--max-parallel-jobs` is NOT specified
1717
1818**Behavior**:
19- - Retrieves jobs from the server via `GET /workflows/{id}/claim_jobs_based_on_resources`
19+ - Retrieves jobs from the server via the command `claim_jobs_based_on_resources`
2020- Server filters jobs based on available compute node resources (CPU, memory, GPU)
2121- Only returns jobs that fit within the current resource capacity
2222- Prevents resource over-subscription and ensures jobs have required resources
23- - Defaults to requiring one CPU for each job.
23+ - Defaults to requiring one CPU and 1 MB of memory for each job.
2424
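The filtering described above can be sketched as a greedy first-fit pass over the ready jobs. This is a minimal illustration, not Torc's server code: the `Resources` and `Job` classes and the `claim_jobs_that_fit` function are hypothetical names, and the defaults of one CPU and 1 MB simply mirror the bullet above.

```python
from dataclasses import dataclass, field


@dataclass
class Resources:
    # Hypothetical container; defaults follow the documented one CPU and 1 MB.
    cpus: int = 1
    memory_mb: int = 1
    gpus: int = 0


@dataclass
class Job:
    name: str
    needs: Resources = field(default_factory=Resources)


def claim_jobs_that_fit(ready_jobs, available):
    """Return the ready jobs that fit, in order, without over-subscribing."""
    cpus, memory_mb, gpus = available.cpus, available.memory_mb, available.gpus
    claimed = []
    for job in ready_jobs:
        n = job.needs
        if n.cpus <= cpus and n.memory_mb <= memory_mb and n.gpus <= gpus:
            claimed.append(job)
            # Reserve the resources so later jobs see the reduced capacity.
            cpus -= n.cpus
            memory_mb -= n.memory_mb
            gpus -= n.gpus
    return claimed
```

A job that does not fit is skipped rather than blocking the jobs behind it; whether the real server skips or stops at the first misfit is not stated in the source, so treat that choice as an assumption.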
25- **Use case**: When you have heterogeneous jobs with different resource requirements and want
25+ **Use cases**:
26+ - When you want parallelization based on one CPU per job.
27+ - When you have heterogeneous jobs with different resource requirements and want
2628intelligent resource management.
2729
28- **Example**:
30+ **Example 1: Run jobs at queue depth of num_cpus**:
31+ ```yaml
32+ parameters:
33+   i: "1..100"
34+ jobs:
35+   - name: "work_{i}"
36+     command: bash my_script.sh {i}
37+     use_parameters: {i}
38+ ```
39+
40+ **Example 2: Resource-based parallelization**:
2941```yaml
3042resource_requirements:
3143  - name: "work_resources"
@@ -34,24 +46,28 @@ resource_requirements:
3446    runtime: "PT4H"
3547    num_nodes: 1
3648
49+ parameters:
50+   i: "1..100"
3751jobs:
38-   - name: "work1"
39-     command: bash my_script.sh
52+   - name: "work_{i}"
53+     command: bash my_script.sh {i}
4054    resource_requirements: work_resources
55+     use_parameters: {i}
4156```
4257
4358### Simple Queue-Based Allocation
4459
4560**Used when**: `--max-parallel-jobs` is specified
4661
4762**Behavior**:
48- - Retrieves jobs from the server via `GET /workflows/{id}/claim_next_jobs`
63+ - Retrieves jobs from the server via the command `claim_next_jobs`
4964- Server returns the next N ready jobs from the queue (up to the specified limit)
5065- Ignores job resource requirements completely
5166- Simply limits the number of concurrent jobs
5267
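By contrast with the resource-based strategy, the queue-based behavior above reduces to taking the next free slots from the front of the queue. A minimal sketch, assuming a hypothetical `take_next_jobs` helper and a plain list as the ready queue (neither is part of Torc):

```python
def take_next_jobs(ready_queue, running_count, max_parallel_jobs):
    """Return the next ready jobs, up to the number of free slots.

    Resource requirements are deliberately not consulted here; only the
    count of concurrently running jobs limits what is handed out.
    """
    free_slots = max(0, max_parallel_jobs - running_count)
    return ready_queue[:free_slots]
```

With one job running and a limit of three, this hands out the first two ready jobs regardless of how many CPUs each one needs.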
53- **Use case**: When all jobs have similar resource needs or when the resource bottleneck is not
54- tracked by Torc, such as network or storage I/O.
68+ **Use cases**: When all jobs have similar resource needs or when the resource bottleneck is not
69+ tracked by Torc, such as network or storage I/O. This is the only way to run jobs at a queue
70+ depth higher than the number of CPUs on the worker.
5571
5672**Example**:
5773```bash
@@ -66,16 +82,17 @@ The job runner executes a continuous loop with these steps:
6682
67831. **Check workflow status** - Poll server to check if workflow is complete or canceled
68842. **Monitor running jobs** - Check status of currently executing jobs
69- 3. **Execute workflow actions** - Check for and execute any pending workflow actions
85+ 3. **Execute workflow actions** - Check for and execute any pending workflow actions, such as
86+ scheduling new Slurm allocations.
70874. **Claim new jobs** - Request ready jobs from server based on allocation strategy:
71- - Resource-based: `GET /workflows/{id}/claim_jobs_based_on_resources`
72- - Queue-based: `GET /workflows/{id}/claim_next_jobs`
88+ - Resource-based: `claim_jobs_based_on_resources`
89+ - Queue-based: `claim_next_jobs`
73905. **Start jobs** - For each claimed job:
74- - Call `POST /jobs/{id}/start_job` to mark job as started in database
75- - Execute job command using `AsyncCliCommand` (non-blocking subprocess)
76- - Track stdout/stderr output to files
91+ - Call `start_job` to mark job as started in database
92+ - Execute job command in a non-blocking subprocess
93+ - Record stdout/stderr output to files
77946. **Complete jobs** - When running jobs finish:
78- - Call `POST /jobs/{id}/complete_job` with exit code and result
95+ - Call `complete_job` with exit code and result
7996 - Server updates job status and automatically marks dependent jobs as ready
80977. **Sleep and repeat** - Wait for job completion poll interval, then repeat loop
8198
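The loop above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the runner's implementation: `FakeServer` is a hypothetical in-memory stand-in for the server, its method names are invented for the example, and workflow actions and stdout/stderr capture are omitted for brevity.

```python
import subprocess
import sys
import time


class FakeServer:
    """Hypothetical in-memory stand-in for the server."""

    def __init__(self, commands):
        self.ready = dict(commands)  # job name -> argv list
        self.results = {}            # job name -> exit code

    def is_complete(self, running):
        return not self.ready and not running

    def claim_jobs(self, limit):
        names = list(self.ready)[:limit]
        return [(name, self.ready.pop(name)) for name in names]

    def complete_job(self, name, exit_code):
        self.results[name] = exit_code


def run_loop(server, max_parallel=2, poll_interval=0.05):
    running = {}  # job name -> Popen handle
    while not server.is_complete(running):        # 1. check workflow status
        for name, proc in list(running.items()):  # 2. monitor running jobs
            code = proc.poll()                    # None while still running
            if code is not None:
                server.complete_job(name, code)   # 6. report completion
                del running[name]
        # 3. workflow actions would be handled here (omitted in this sketch)
        free = max_parallel - len(running)
        for name, argv in server.claim_jobs(free):  # 4. claim new jobs
            running[name] = subprocess.Popen(argv)  # 5. start without blocking
        time.sleep(poll_interval)                 # 7. sleep and repeat
    return server.results
```

`subprocess.Popen` returns immediately and `poll()` never blocks, so a single thread can start, monitor, and complete several jobs per pass through the loop.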