Combining HyperQueue with MPI #619

mbercx · 2023-09-10T19:20:57Z

mbercx
Sep 10, 2023

First of all: I'm loving HyperQueue and its features. 🙏 Since integrating it with our workflow manager AiiDA it's been an invaluable tool to partially use nodes on clusters that have an exclusive node-job policy, and avoiding queueing for very small jobs in my workflows that need to run on the compute nodes.

I've been having a bit of trouble combining HyperQueue with MPI, however. Below I outline my current approach, would be great to get some feedback and suggestions!

Running on a single node

For the use cases described above, I've so far been using HQ with an allocation that only uses a single node. Since the calculations I'm running are vastly more efficient with MPI, I typically run a HQ auto-allocation such as:

hq alloc add slurm --time-limit 30m --backlog 1 --workers-per-alloc 1 --max-worker-count 1 --name single --cpus 128 -- -p debug -A project_465000106 --mem=227318 --ntasks-per-node=128

And then submit HQ jobs similar to:

#!/bin/bash
#HQ --name=QE-run
#HQ --cpus="128 compact"
#HQ --pin taskset
#HQ --stdout stdout
#HQ --stderr stderr

# Load modules

srun -n 128 --mem-per-cpu=500 /scratch/project_465000106/src/qe-7.2/bin/pw.x -in scf.in  > scf.out

This can also be used for submitting multiple jobs in parallel on a single node, but then I typically have to tweak the --oversubscribe, --overlap and --cpu-bind (in combination with $HQ_CPUS) to make it work. This seems to be cluster-dependent and I can't always get it to work. Is there a better approach I'm missing?

Running on multiple nodes

Another use case is when I want to run a multi-node Slurm job, and run multiple single-node HQ jobs in this Slurm job. This can be useful when I have a lot of small jobs to run but Slurm is configured to only allow a certain number of jobs in the queue for a partition.

To test, this I was looking at the documentation for manual submission:

https://it4innovations.github.io/hyperqueue/stable/deployment/worker/#deploying-a-worker-using-pbsslurm

And also trying an auto-allocation with multiple nodes:

hq alloc add slurm --time-limit 30m --backlog 1 --workers-per-alloc 4 --max-worker-count 4 --name 4nodes --cpus 128 -- -p debug -A project_465000106 --mem=227318

Both suggest using mpirun/srun to run the worker start command. Below is the submission script generated by HQ for the auto-allocation:

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --job-name=hq-alloc-1
#SBATCH --output=/users/mabercx/.hq-server/003/autoalloc/1-test/002/stdout
#SBATCH --error=/users/mabercx/.hq-server/003/autoalloc/1-test/002/stderr
#SBATCH --time=00:30:00
#SBATCH -p debug -A project_465000106 --mem=227318

srun --overlap /pfs/lustrep3/users/mabercx/bin/hq worker start --idle-timeout "5m" --manager "slurm" --server-dir "/users/mabercx/.hq-server/003" --cpus "128" --on-server-lost "finish-running" --time-limit "30m"

When trying this with the HQ job script above, I obviously get into trouble. Calling srun within an srun step seems dubious, and I'm asking for 128 tasks within a job step which only has 4, so I get an error

srun: error: Unable to create step for job 4487945: More processors requested than permitted

Removing srun doesn't have the desired effect either though. HQ is running 4 workers once the allocation starts, but the calculations don't run in parallel, and the one running is only using 4 mpi tasks.

My current "solution"

After quite a bit of trial and error (in lieu of understanding and experience), I've come up with a solution that almost works. Basically, I run the same HQ job script as above for the calculations, but do a manual Slurm submission starting four HQ workers in the background:

/pfs/lustrep3/users/mabercx/bin/hq worker start --no-hyper-threading &
/pfs/lustrep3/users/mabercx/bin/hq worker start --no-hyper-threading &
/pfs/lustrep3/users/mabercx/bin/hq worker start --no-hyper-threading &
/pfs/lustrep3/users/mabercx/bin/hq worker start --no-hyper-threading &

wait

Full Slurm submission script

#!/bin/bash
#SBATCH --no-requeue
#SBATCH --job-name="HQ-debug"
#SBATCH --get-user-env
#SBATCH --output=_scheduler-stdout.txt
#SBATCH --error=_scheduler-stderr.txt
#SBATCH --partition=debug
#SBATCH --account=project_465000106
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00
#SBATCH --mem=200000

echo "Job ID: $SLURM_JOB_ID"
echo "Node List: $SLURM_JOB_NODELIST"
echo "Number of Nodes: $SLURM_JOB_NUM_NODES"
echo "Total Number of tasks: $SLURM_NTASKS"
echo "CPUs on this Node: $SLURM_CPUS_ON_NODE"

/pfs/lustrep3/users/mabercx/bin/hq worker start --no-hyper-threading &
/pfs/lustrep3/users/mabercx/bin/hq worker start --no-hyper-threading &
/pfs/lustrep3/users/mabercx/bin/hq worker start --no-hyper-threading &
/pfs/lustrep3/users/mabercx/bin/hq worker start --no-hyper-threading &

wait

When running 4 jobs with 128 tasks, the first three run just fine in parallel, with a similar performance as a single run directly submitted to Slurm. However the fourth fails with the following in the stderr:

srun: error: task 0 launch failed: Error configuring interconnect
srun: error: task 1 launch failed: Error configuring interconnect
srun: error: task 2 launch failed: Error configuring interconnect
...

Interestingly, if I run on 3 nodes with 3 HQ workers, I don't get this issue at all. I'm still looking into running with more nodes, but am queueing quite a bit atm.

EDIT: managed to do tests on more nodes, and it seems 3 nodes is a hard limit for some reason. If I try to use my approach above on anything more, I get the interconnect errors. If you're curious about my test setup, you can find it here:
https://github.com/mbercx/hq-tests

Again, any suggestions or tips would be most appreciated!

Kobzol · 2023-09-11T07:32:03Z

Kobzol
Sep 11, 2023
Maintainer

This can also be used for submitting multiple jobs in parallel on a single node, but then I typically have to tweak the --oversubscribe, --overlap and --cpu-bind (in combination with $HQ_CPUS) to make it work. This seems to be cluster-dependent and I can't always get it to work. Is there a better approach I'm missing?

That's a good question that we would like to know the answer to :) Sadly, we don't have a lot of experience with SLURM and we're not sure how best to configure this.

Regarding the multinode situation, my understanding is that srun should be used within a Slurm job to start a process on all available nodes. If there is some other/better way (that should work in general!), please let us know. We really don't have a lot of experience with Slurm and until recently we didn't even have proper access to a Slurm cluster to test this. HQ was mostly tested with PBS.

1 reply

mbercx Sep 23, 2023
Author

Thanks for the response @Kobzol! And sorry for getting back to you so slowly, I seem to have forgotten to press the Reply button when doing so. 😅

I'm also still learning a lot about Slurm, but will definitely be testing what HQ can do on Slurm machines quite extensively and report in. ^^ Will also see if I can find a Slurm expert in my network that can help out.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Combining HyperQueue with MPI #619

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Combining HyperQueue with MPI #619

Uh oh!

Uh oh!

mbercx Sep 10, 2023

Running on a single node

Running on multiple nodes

My current "solution"

Replies: 1 comment · 1 reply

Uh oh!

Uh oh!

Kobzol Sep 11, 2023 Maintainer

Uh oh!

mbercx Sep 23, 2023 Author

mbercx
Sep 10, 2023

Replies: 1 comment 1 reply

Kobzol
Sep 11, 2023
Maintainer

mbercx Sep 23, 2023
Author