Add HyperQueue #226
Merged
Commits (8, by rjanalik):

- 2076b71 Add HyperQueue
- 87ae5a5 Move HyperQueue from services to running
- 597a2e0 Add info about GREASY: improve SEO
- 8106217 Update installation
- 577b2d1 SLURM -> Slurm
- 07643c4 Formating
- e7b8728 Code listings
- e23dbdc Update index
[](){#ref-hyperqueue}
# HyperQueue

[HyperQueue](https://it4innovations.github.io/hyperqueue/stable/) is a meta-scheduler designed for high-throughput computing on high-performance computing (HPC) clusters.
It addresses the inefficiency of using traditional schedulers like Slurm for a large number of small, short-lived tasks by letting you bundle them into a single, larger Slurm job.
This approach minimizes scheduling overhead and improves resource utilization.

By using a meta-scheduler like HyperQueue, you get fine-grained control over your tasks within the allocated resources of a single batch job.
It is especially useful for workflows that involve numerous tasks, each requiring minimal resources (e.g., a single CPU core or GPU) or a short runtime.

[](){#ref-hyperqueue-setup}
## Setup

Before you can use HyperQueue, you need to download it. No installation is required, as it is a statically linked binary with no external dependencies. Here is how to set it up in your home directory:
```bash
$ mkdir -p ~/bin && cd ~/bin
$ wget https://github.com/It4innovations/hyperqueue/releases/download/v0.23.0/hq-v0.23.0-linux-arm64-linux.tar.gz
$ tar -zxf hq-v0.23.0-linux-arm64-linux.tar.gz
$ rm hq-v0.23.0-linux-arm64-linux.tar.gz
```
To make the `hq` command available in your current session, add it to your `PATH` environment variable:

```bash
$ export PATH=~/bin:$PATH
```

You can also add this line to your `~/.bashrc` or `~/.bash_profile` to make the change permanent.
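If you script this setup, it helps to make the append idempotent so repeated runs do not duplicate the line. A minimal sketch, using a stand-in file instead of your real `~/.bashrc`:

```shell
# Append the PATH export to a shell startup file only if it is not
# already present (bashrc_example is a stand-in for ~/.bashrc here).
RC_FILE=bashrc_example
LINE='export PATH=~/bin:$PATH'
touch "$RC_FILE"
grep -qxF "$LINE" "$RC_FILE" || echo "$LINE" >> "$RC_FILE"
```

Running this any number of times leaves exactly one copy of the line in the file.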
[](){#ref-hyperqueue-example}
## Example workflow

This example demonstrates a basic HyperQueue workflow by running a large number of "hello world" tasks, some on a CPU and others on a GPU.

[](){#ref-hyperqueue-example-script-task}
### The task script

First, create a simple script that represents the individual tasks you want to run.
This script will be executed by HyperQueue workers.
```bash title="task.sh"
#!/usr/local/bin/bash

# This script is a single task that will be run by HyperQueue.
# HQ_TASK_ID is an environment variable set by HyperQueue for each task.
# See the HyperQueue documentation for the other variables it sets.

echo "$(date): start task ${HQ_TASK_ID}: $(hostname) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"

# Simulate some work
sleep 30

echo "$(date): end task ${HQ_TASK_ID}: $(hostname) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
```
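Because `HQ_TASK_ID` is just an environment variable, a task script can use it to select per-task work, for example a numbered input file. A small sketch (the default value and the file naming are illustrative; HyperQueue sets the variable for real tasks):

```shell
# Map the task ID to a per-task input file. With --array 1-300 the IDs
# run from 1 to 300, which lines up with numbered inputs.
HQ_TASK_ID=${HQ_TASK_ID:-1}    # set by HyperQueue; default only for this sketch
INPUT="input_${HQ_TASK_ID}.dat"
echo "task ${HQ_TASK_ID} would process ${INPUT}"
```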
[](){#ref-hyperqueue-example-script-simple}
### Simple Slurm batch job script

Next, create a Slurm batch script that launches the HyperQueue server and workers, submits your tasks, waits for them to finish, and then shuts everything down.
```bash title="job.sh"
#!/usr/local/bin/bash

#SBATCH --nodes 2
#SBATCH --ntasks-per-node 1
#SBATCH --time 00:10:00
#SBATCH --partition normal
#SBATCH --account <account>

# Start the HyperQueue server
hq server start &

# Wait for the server to be ready
hq server wait

# Start HyperQueue workers (one per node)
srun hq worker start &

# Submit tasks (300 CPU tasks and 16 GPU tasks)
hq submit --resource "cpus=1" --array 1-300 ./task.sh
hq submit --resource "gpus/nvidia=1" --array 1-16 ./task.sh

# Wait for all jobs to finish
hq job wait all

# Stop the HyperQueue server and workers
hq server stop

echo
echo "Everything done!"
```

To submit this job, use `sbatch`:

```bash
$ sbatch job.sh
```
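As a rough sanity check on sizing, you can estimate how the 300 single-core tasks fill the two nodes. The core count per node below is an assumption for illustration, not something the job script specifies:

```shell
# Back-of-the-envelope throughput for the CPU tasks above
# (CORES_PER_NODE is an assumption; adjust for your machine).
NODES=2
CORES_PER_NODE=64
TASKS=300
TASK_SECONDS=30    # the sleep in task.sh

CONCURRENT=$((NODES * CORES_PER_NODE))              # tasks running at once
WAVES=$(( (TASKS + CONCURRENT - 1) / CONCURRENT ))  # ceiling division
echo "concurrent=${CONCURRENT} waves=${WAVES} est_seconds=$((WAVES * TASK_SECONDS))"
```

With these assumed numbers, 128 tasks run at once and the array drains in three waves, i.e. about 90 seconds of task time plus scheduling overhead.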
[](){#ref-hyperqueue-example-script-advanced}
### More robust Slurm batch job script

A powerful feature of HyperQueue is the ability to resume a job that was interrupted, for example by reaching a time limit or by a node failure.
You can achieve this by using a journal file to save the state of your tasks.
With a journal file, HyperQueue can track which tasks were completed and which are still pending.
When you restart the job, it will only run the unfinished tasks.

Another useful feature is running multiple servers simultaneously.
You can achieve this by starting each server with a unique directory set in the `HQ_SERVER_DIR` environment variable.
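The restore mechanism hinges on choosing the right journal file: reuse the journal of the interrupted job when its ID is passed in, otherwise derive a fresh one from the current job ID. In isolation the logic looks like this (the function name and job IDs are illustrative):

```shell
# Choose a journal path: reuse a previous job's journal when restoring,
# otherwise create a new one named after the current Slurm job ID.
pick_journal() {
    local restore_job=$1 current_job=$2
    if [ -n "$restore_job" ]; then
        echo "$HOME/.hq-journal-${restore_job}"
    else
        echo "$HOME/.hq-journal-${current_job}"
    fi
}

NEW_JOB_JOURNAL=$(pick_journal "" 1001)      # fresh run under job 1001
RESTORED_JOURNAL=$(pick_journal 1001 1002)   # job 1002 restoring job 1001
```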
Here's an improved version of the batch script that incorporates these features:

```bash title="job.sh"
#!/usr/local/bin/bash

#SBATCH --nodes 2
#SBATCH --ntasks-per-node 1
#SBATCH --time 00:10:00
#SBATCH --partition normal
#SBATCH --account <account>

# Set up the journal file for state tracking.
# If an argument is provided, use it to restore a previous job;
# otherwise, create a new journal file for the current job.
RESTORE_JOB=$1
if [ -n "$RESTORE_JOB" ]; then
    export JOURNAL=~/.hq-journal-${RESTORE_JOB}
else
    export JOURNAL=~/.hq-journal-${SLURM_JOBID}
fi

# Ensure each Slurm job has its own HyperQueue server directory
export HQ_SERVER_DIR=~/.hq-server-${SLURM_JOBID}

# Start the HyperQueue server with the journal file
hq server start --journal=${JOURNAL} &

# Wait for the server to be ready
hq server wait --timeout=120
if [ "$?" -ne 0 ]; then
    echo "Server did not start, exiting ..."
    exit 1
fi

# Start HyperQueue workers (one per node)
srun hq worker start &

# Submit tasks only if we are not restoring a previous job
# (300 CPU tasks and 16 GPU tasks)
if [ -z "$RESTORE_JOB" ]; then
    hq submit --resource "cpus=1" --array 1-300 ./task.sh
    hq submit --resource "gpus/nvidia=1" --array 1-16 ./task.sh
fi

# Wait for all jobs to finish
hq job wait all

# Stop the HyperQueue server and workers
hq server stop

# Clean up the server directory and journal file
rm -rf ${HQ_SERVER_DIR}
rm -f ${JOURNAL}

echo
echo "Everything done!"
```
To submit a new job, use `sbatch`:

```bash
$ sbatch job.sh
```

If the job fails for any reason, you can resubmit it and tell HyperQueue to pick up where it left off by passing the original Slurm job ID as an argument:

```bash
$ sbatch job.sh <job-id>
```
The script will detect the argument, load the journal file from the previous run, and only execute the tasks that haven't been completed.

!!! info "External references"
    You can find other features and examples in the HyperQueue [documentation](https://it4innovations.github.io/hyperqueue/stable/).