GREASY is not supported at CSCS anymore. We recommend using HyperQueue instead.

[HyperQueue](https://it4innovations.github.io/hyperqueue/stable/) is a meta-scheduler designed for high-throughput computing on high-performance computing (HPC) clusters.
It addresses the inefficiency of using traditional schedulers like Slurm for a large number of small, short-lived tasks by allowing you to bundle them into a single, larger Slurm job.
This approach minimizes scheduling overhead and improves resource utilization.

By using a meta-scheduler like HyperQueue, you get fine-grained control over your tasks within the allocated resources of a single batch job.
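For instance, instead of queueing hundreds of separate Slurm jobs, you submit the tasks to HyperQueue inside a single allocation. A minimal sketch, assuming a per-task script `./task.sh` (the task count and resource request are illustrative):

```bash
# One Slurm allocation, many HyperQueue tasks: each element of the array
# becomes a task that HyperQueue packs onto the allocated nodes
hq submit --array=1-500 --cpus=1 ./task.sh
```
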
[...]

```bash
# ... (earlier lines of the task script are elided in this excerpt)
echo "$(date): end task ${HQ_TASK_ID}: $(hostname) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
```

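HyperQueue exports the task's index to each task as the `HQ_TASK_ID` environment variable, which is what the logging line above prints. A task script can use it to select per-task work; a sketch with a hypothetical input layout and program name:

```bash
# Pick this task's input file by its HyperQueue task ID (illustrative names)
./process "input-${HQ_TASK_ID}.dat"
```
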
[](){#ref-hyperqueue-example-script-simple}
### Simple Slurm batch job script
Next, create a Slurm batch script that will launch the HyperQueue server and workers, submit your tasks, wait for the tasks to finish, and then shut everything down.

```bash title="job.sh"
#!/usr/local/bin/bash
# ... (the rest of the script is elided in this excerpt; a sketch follows below)
```
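Since the body is elided here, a minimal sketch of the usual pattern, following the Slurm deployment example in the HyperQueue documentation (node counts, time limit, and the task array are illustrative):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:30:00

# Start the HyperQueue server and wait until it responds
hq server start &
until hq job list &> /dev/null; do sleep 1; done

# Start one worker per allocated node; each worker connects back to the server
srun --overlap hq worker start &
hq worker wait "${SLURM_NTASKS}"

# Submit the tasks; HyperQueue schedules them onto the workers
hq submit --array=1-100 ./task.sh

# Block until all submitted jobs have finished, then shut everything down
hq job wait all
hq worker stop all
hq server stop
```

Note that by default the server keeps its state under `~/.hq-server`, so two such jobs running at once would clash; the more robust variant below avoids this by giving each Slurm job its own `HQ_SERVER_DIR`.
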
[...]

```bash
$ sbatch job.sh
```

[](){#ref-hyperqueue-example-script-advanced}
### More robust Slurm batch job script
A powerful feature of HyperQueue is the ability to resume a job that was interrupted, for example by reaching a time limit or a node failure.
You can achieve this with a journal file that saves the state of your tasks: HyperQueue tracks which tasks were completed and which are still pending, so a resubmitted job only runs the remaining work.
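The mechanics, sketched with an assumed journal path (the `--journal` flag of `hq server start` is documented in the HyperQueue manual): a server started against an existing journal restores the recorded state instead of starting empty, so tasks that already finished are not executed again.

```bash
# First allocation: task state is appended to the journal as tasks complete
hq server start --journal ~/.hq-journal-demo &

# After an interruption, a server pointed at the same journal restores the
# completed-task state; only unfinished tasks are scheduled again
hq server start --journal ~/.hq-journal-demo &
```
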
[...]

```bash
# ... (earlier lines elided: when the Slurm job ID of a previous run is passed
# as an argument, the script reuses that run's journal instead)
export JOURNAL=~/.hq-journal-${SLURM_JOBID}
fi

# Ensure each Slurm job has its own HyperQueue server directory
export HQ_SERVER_DIR=~/.hq-server-${SLURM_JOBID}

# Start the HyperQueue server with the journal file
# ... (the rest of the script is elided in this excerpt)
```
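The start command itself falls on an elided line; given the comment above, it is presumably along these lines:

```bash
hq server start --journal "${JOURNAL}" &
```
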
[...]

To submit a new job, use `sbatch`:

```bash
$ sbatch job.sh
```

If the job fails for any reason, you can resubmit it and tell HyperQueue to pick up where it left off by passing the original Slurm job ID as an argument:

```bash
$ sbatch job.sh <job-id>
```
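After resubmitting, you can check that previously finished work was restored rather than re-executed; `hq job list` with the `--all` flag (flag name from the HyperQueue CLI, worth confirming with `hq job list --help`) also lists finished jobs:

```bash
hq job list --all
```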