175 changes: 175 additions & 0 deletions docs/running/hyperqueue.md
@@ -0,0 +1,175 @@
[](){#ref-hyperqueue}
# HyperQueue
[HyperQueue](https://it4innovations.github.io/hyperqueue/stable/) is a meta-scheduler designed for high-throughput computing on high-performance computing (HPC) clusters.
It addresses the inefficiency of using traditional schedulers like SLURM for a large number of small, short-lived tasks by allowing you to bundle them into a single, larger SLURM job.
This approach minimizes scheduling overhead and improves resource utilization.

By using a meta-scheduler like HyperQueue, you get fine-grained control over your tasks within the allocated resources of a single batch job.
It's especially useful for workflows that involve numerous tasks, each requiring minimal resources (e.g., a single CPU core or GPU) or a short runtime.

[](){#ref-hyperqueue-setup}
## Setup
Before you can use HyperQueue, you need to download it. No installation is required, because it is a statically linked binary with no external dependencies. The commands below place the Linux arm64 build of version 0.23.0 in `~/bin` in your home directory:

```bash
$ mkdir -p ~/bin
$ cd ~/bin
$ wget https://github.com/It4innovations/hyperqueue/releases/download/v0.23.0/hq-v0.23.0-linux-arm64-linux.tar.gz
$ tar -zxf hq-v0.23.0-linux-arm64-linux.tar.gz
$ rm hq-v0.23.0-linux-arm64-linux.tar.gz
```

To make the `hq` command available in your current session, add it to your `PATH` environment variable:

```bash
$ export PATH=~/bin:$PATH
```
You can also add this line to your `~/.bashrc` or `~/.bash_profile` to make the change permanent.
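
If the `export` line lives in `~/.bashrc`, every nested shell that sources the file prepends `~/bin` again. A minimal sketch of a guard that avoids duplicate entries; the helper name `prepend_path` is hypothetical and not part of HyperQueue:

```bash
# Hypothetical helper: prepend a directory to PATH only if it is not already there.
prepend_path() {
    case ":${PATH}:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:${PATH}" ;;
    esac
}

prepend_path "${HOME}/bin"
export PATH

# Report where the shell finds hq, if it is installed.
command -v hq || echo "hq not found on PATH"
```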

[](){#ref-hyperqueue-example}
## Example workflow
This example demonstrates a basic HyperQueue workflow by running a large number of "hello world" tasks, some on a CPU and others on a GPU.

[](){#ref-hyperqueue-example-script-task}
### The task script
First, create a simple script that represents the individual tasks you want to run.
This script will be executed by HyperQueue workers.

```bash title="task.sh"
#!/usr/local/bin/bash

# This script is a single task that will be run by HyperQueue.
# HQ_TASK_ID is an environment variable set by HyperQueue for each task.
# See the HyperQueue documentation for the other variables it sets.

echo "$(date): start task ${HQ_TASK_ID}: $(hostname) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"

# Simulate some work
sleep 30

echo "$(date): end task ${HQ_TASK_ID}: $(hostname) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
```
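
In practice, each task in an array usually uses `HQ_TASK_ID` to select its own piece of work. A minimal sketch, assuming a hypothetical input naming scheme (`input_001.dat`, `input_002.dat`, and so on) that is not part of the example above; the fallback to `1` is only there so the snippet can be tried outside HyperQueue:

```bash
# Hypothetical helper: map a task ID to an input file name.
input_for_task() {
    printf 'input_%03d.dat\n' "$1"
}

# HQ_TASK_ID is set by HyperQueue; default to 1 when run by hand.
task_id="${HQ_TASK_ID:-1}"
echo "task ${task_id} would process $(input_for_task "${task_id}")"
```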

[](){#ref-hyperqueue-example-script-simple}
### Simple SLURM batch job script
Next, create a SLURM batch script that will launch the HyperQueue server and workers, submit your tasks, wait for the tasks to finish, and then shut everything down.

```bash title="job.sh"
#!/usr/local/bin/bash

#SBATCH --nodes 2
#SBATCH --ntasks-per-node 1
#SBATCH --time 00:10:00
#SBATCH --partition normal
#SBATCH --account <account>

# Start HyperQueue server
hq server start &

# Wait for the server to be ready
hq server wait

# Start HyperQueue workers
srun hq worker start &

# Submit tasks (300 CPU tasks and 16 GPU tasks)
hq submit --resource "cpus=1" --array 1-300 ./task.sh
hq submit --resource "gpus/nvidia=1" --array 1-16 ./task.sh

# Wait for all jobs to finish
hq job wait all

# Stop HyperQueue server and workers
hq server stop

echo
echo "Everything done!"
```

To submit this job, use `sbatch`:
```bash
$ sbatch job.sh
```
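
When choosing `--time`, a rough estimate of how long the tasks need can help. The sketch below assumes that every task takes the same time, that tasks run in full waves, and that scheduling overhead is negligible. The slot counts (576 CPU cores and 8 GPUs across the two nodes) are hypothetical and should be replaced with the values for your system:

```bash
# Estimate the runtime of a task array: ceil(tasks / slots) * seconds_per_task.
# Arguments: $1 = number of tasks, $2 = parallel slots, $3 = seconds per task.
estimate_seconds() {
    echo $(( ($1 + $2 - 1) / $2 * $3 ))
}

# Slot counts below are assumptions, not values taken from this page.
echo "CPU array: $(estimate_seconds 300 576 30) s"
echo "GPU array: $(estimate_seconds 16 8 30) s"
```

Under these assumptions both arrays fit comfortably within the 10-minute limit, which leaves headroom for starting the server and workers.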

[](){#ref-hyperqueue-example-script-advanced}
### More robust SLURM batch job script
A powerful feature of HyperQueue is the ability to resume a job that was interrupted, for example by reaching the time limit or by a node failure.
To enable this, start the server with a journal file, in which HyperQueue records which tasks have completed and which are still pending.
When you restart the job with the same journal file, only the unfinished tasks are run.

Another useful feature is running multiple servers simultaneously.
To do so, start each server with a unique server directory, set through the `HQ_SERVER_DIR` environment variable.

Here's an improved version of the batch script that incorporates these features:

```bash title="job.sh"
#!/usr/local/bin/bash

#SBATCH --nodes 2
#SBATCH --ntasks-per-node 1
#SBATCH --time 00:10:00
#SBATCH --partition normal
#SBATCH --account <account>

# Set up the journal file for state tracking
# If an argument is provided, use it to restore a previous job
# Otherwise, create a new journal file for the current job
RESTORE_JOB=$1
if [ -n "$RESTORE_JOB" ]; then
    export JOURNAL=~/.hq-journal-${RESTORE_JOB}
else
    export JOURNAL=~/.hq-journal-${SLURM_JOBID}
fi

# Ensure each SLURM job has its own HyperQueue server directory
export HQ_SERVER_DIR=~/.hq-server-${SLURM_JOBID}

# Start the HyperQueue server with the journal file
hq server start --journal=${JOURNAL} &

# Wait for the server to be ready
if ! hq server wait --timeout=120; then
    echo "Server did not start, exiting ..."
    exit 1
fi

# Start HyperQueue workers
srun hq worker start &

# Submit tasks only if we are not restoring a previous job
# (300 CPU tasks and 16 GPU tasks)
if [ -z "$RESTORE_JOB" ]; then
    hq submit --resource "cpus=1" --array 1-300 ./task.sh
    hq submit --resource "gpus/nvidia=1" --array 1-16 ./task.sh
fi

# Wait for all jobs to finish
hq job wait all

# Stop HyperQueue server and workers
hq server stop

# Clean up server directory and journal file
rm -rf "${HQ_SERVER_DIR}"
rm -rf "${JOURNAL}"

echo
echo "Everything done!"
```

To submit a new job, use `sbatch`:
```bash
$ sbatch job.sh
```

If the job fails for any reason, you can resubmit it and tell HyperQueue to pick up where it left off by passing the original SLURM job ID as an argument:

```bash
$ sbatch job.sh <job-id>
```

The script will detect the argument, load the journal file from the previous run, and only execute the tasks that haven't been completed.
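
Because the script deletes the journal after a successful run, a leftover journal file indicates a run that did not finish. A minimal sketch of a pre-check before resubmitting, following the `~/.hq-journal-<job-id>` naming used in the script above; the helper name `can_restore` and the job ID `123456` are hypothetical:

```bash
# Hypothetical helper: succeed only if a journal from the given SLURM job exists.
can_restore() {
    [ -e "${HOME}/.hq-journal-$1" ]
}

job_id=123456   # placeholder: replace with the ID of the interrupted job
if can_restore "${job_id}"; then
    echo "journal found, resubmit with: sbatch job.sh ${job_id}"
else
    echo "no journal for job ${job_id}, submit a fresh job with: sbatch job.sh"
fi
```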

!!! info "External references"
    You can find other features and examples in the HyperQueue [documentation](https://it4innovations.github.io/hyperqueue/stable/).
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -112,6 +112,7 @@ nav:
  - 'Running Jobs':
    - running/index.md
    - 'Slurm': running/slurm.md
    - 'HyperQueue': running/hyperqueue.md
    - 'Job report': running/jobreport.md
    - 'Known issues': running/known-issues.md
  - 'Data Management and Storage':