
Commit 785f8ce

Add HyperQueue (#226)
1 parent e83220f

3 files changed: +170 -0

docs/running/hyperqueue.md

Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
[](){#ref-hyperqueue}
# HyperQueue

!!! info "GREASY"
    GREASY is no longer supported at CSCS.
    We recommend using HyperQueue instead.
[HyperQueue](https://it4innovations.github.io/hyperqueue/stable/) is a meta-scheduler designed for high-throughput computing on high-performance computing (HPC) clusters.
It addresses the inefficiency of using traditional schedulers like Slurm for a large number of small, short-lived tasks by allowing you to bundle them into a single, larger Slurm job.
This approach minimizes scheduling overhead and improves resource utilization.

By using a meta-scheduler like HyperQueue, you get fine-grained control over your tasks within the allocated resources of a single batch job.
It's especially useful for workflows that involve numerous tasks, each requiring minimal resources (e.g., a single CPU core or GPU) or a short runtime.
[](){#ref-hyperqueue-setup}
## Setup

Before you can use HyperQueue, you'll need to download it.
No installation is needed, as it is a statically linked binary with no external dependencies.
You can download the latest version from the [official site](https://it4innovations.github.io/hyperqueue/stable/installation/).
Because there are different architectures on Alps (ARM and x86_64), we recommend unpacking the binary in `$HOME/.local/<arch>/bin`, as described [here][ref-guides-terminal-arch].
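For example, downloading and unpacking could look like the following sketch, where `$(uname -m)` stands in for `<arch>`; the release tag and archive name below are placeholders, so check the [installation page](https://it4innovations.github.io/hyperqueue/stable/installation/) for the correct URL for your architecture:

```bash
# Sketch only: replace <version> and <archive> with the release tag and
# asset name for your architecture from the HyperQueue releases page.
mkdir -p $HOME/.local/$(uname -m)/bin
cd $HOME/.local/$(uname -m)/bin
wget https://github.com/It4innovations/hyperqueue/releases/download/<version>/<archive>.tar.gz
tar -xzf <archive>.tar.gz
rm <archive>.tar.gz
```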
[](){#ref-hyperqueue-example}
## Example workflow

This example demonstrates a basic HyperQueue workflow by running a large number of "hello world" tasks, some on a CPU and others on a GPU.
[](){#ref-hyperqueue-example-script-task}
### The task script

First, create a simple script that represents the individual tasks you want to run.
This script will be executed by HyperQueue workers.

```bash title="task.sh"
#!/bin/bash

# This script is a single task that will be run by HyperQueue.
# HQ_TASK_ID is an environment variable set by HyperQueue for each task.
# See the HyperQueue documentation for other variables set by HyperQueue.

echo "$(date): start task ${HQ_TASK_ID}: $(hostname) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"

# Simulate some work
sleep 30

echo "$(date): end task ${HQ_TASK_ID}: $(hostname) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
```
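HyperQueue workers execute this script directly, so make sure it is executable before submitting it:

```bash
chmod +x task.sh
```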
[](){#ref-hyperqueue-example-script-simple}
### Simple Slurm batch job script

Next, create a Slurm batch script that will launch the HyperQueue server and workers, submit your tasks, wait for the tasks to finish, and then shut everything down.

```bash title="job.sh"
#!/bin/bash

#SBATCH --nodes 2
#SBATCH --ntasks-per-node 1
#SBATCH --time 00:10:00
#SBATCH --partition normal
#SBATCH --account <account>

# Start the HyperQueue server
hq server start &

# Wait for the server to be ready
hq server wait

# Start HyperQueue workers (one per node)
srun hq worker start &

# Submit tasks (300 CPU tasks and 16 GPU tasks)
hq submit --resource "cpus=1" --array 1-300 ./task.sh
hq submit --resource "gpus/nvidia=1" --array 1-16 ./task.sh

# Wait for all jobs to finish
hq job wait all

# Stop the HyperQueue server and workers
hq server stop

echo
echo "Everything done!"
```
To submit this job, use `sbatch`:

```bash
sbatch job.sh
```
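Each `hq submit` call creates one HyperQueue job (here, an array of tasks).
By default (at the time of writing), HyperQueue writes each task's output to `job-<hq job id>/<task id>.stdout` and `job-<hq job id>/<task id>.stderr` in the submission directory; check the HyperQueue documentation for the defaults of your version.
For example, to follow the first CPU task:

```bash
# Assumes the default output paths; adjust if your HyperQueue version differs
tail -f job-1/1.stdout
```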
[](){#ref-hyperqueue-example-script-advanced}
### More robust Slurm batch job script

A powerful feature of HyperQueue is the ability to resume a job that was interrupted, for example by reaching a time limit or by a node failure.
You can achieve this by using a journal file to save the state of your tasks.
By adding a journal file, HyperQueue can track which tasks were completed and which are still pending.
When you restart the job, it will only run the unfinished tasks.

Another useful feature is running multiple servers simultaneously.
This can be achieved by starting each server with a unique directory set in the `HQ_SERVER_DIR` environment variable.

Here's an improved version of the batch script that incorporates these features:
```bash title="job.sh"
#!/bin/bash

#SBATCH --nodes 2
#SBATCH --ntasks-per-node 1
#SBATCH --time 00:10:00
#SBATCH --partition normal
#SBATCH --account <account>

# Set up the journal file for state tracking.
# If an argument is provided, use it to restore a previous job;
# otherwise, create a new journal file for the current job.
RESTORE_JOB=$1
if [ -n "$RESTORE_JOB" ]; then
    export JOURNAL=~/.hq-journal-${RESTORE_JOB}
else
    export JOURNAL=~/.hq-journal-${SLURM_JOBID}
fi

# Ensure each Slurm job has its own HyperQueue server directory
export HQ_SERVER_DIR=~/.hq-server-${SLURM_JOBID}

# Start the HyperQueue server with the journal file
hq server start --journal=${JOURNAL} &

# Wait for the server to be ready
hq server wait --timeout=120
if [ "$?" -ne 0 ]; then
    echo "Server did not start, exiting ..."
    exit 1
fi

# Start HyperQueue workers (one per node)
srun hq worker start &

# Submit tasks only if we are not restoring a previous job
# (300 CPU tasks and 16 GPU tasks)
if [ -z "$RESTORE_JOB" ]; then
    hq submit --resource "cpus=1" --array 1-300 ./task.sh
    hq submit --resource "gpus/nvidia=1" --array 1-16 ./task.sh
fi

# Wait for all jobs to finish
hq job wait all

# Stop the HyperQueue server and workers
hq server stop

# Clean up the server directory and journal file
rm -rf ${HQ_SERVER_DIR}
rm -rf ${JOURNAL}

echo
echo "Everything done!"
```
To submit a new job, use `sbatch`:

```bash
sbatch job.sh
```

If the job fails for any reason, you can resubmit it and tell HyperQueue to pick up where it left off by passing the original Slurm job ID as an argument:

```bash
sbatch job.sh <job-id>
```
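If you no longer have the original Slurm job ID at hand, you can look it up with `sacct`:

```bash
# List your recent Slurm jobs and their states to find the interrupted run
sacct -X --format=JobID,JobName,State,Elapsed
```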
The script will detect the argument, load the journal file from the previous run, and only execute the tasks that haven't been completed.

!!! info "External references"
    You can find other features and examples in the HyperQueue [documentation](https://it4innovations.github.io/hyperqueue/stable/).

docs/running/index.md

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
# Running Jobs

[Slurm][ref-slurm] is used on CSCS systems to schedule jobs.
For scheduling many small jobs (e.g., jobs that need only a single core or a short runtime), we recommend [HyperQueue][ref-hyperqueue].
The [job report tool][ref-jobreport] can be used in Slurm jobs to collect reports on how well an application uses the system.

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -115,6 +115,7 @@ nav:
- 'Running Jobs':
    - running/index.md
    - 'Slurm': running/slurm.md
    - 'HyperQueue': running/hyperqueue.md
    - 'Job report': running/jobreport.md
    - 'Known issues': running/known-issues.md
- 'Data Management and Storage':
