Futures on HPC SLURM cluster #468
I'm searching for advice on how to efficiently parallelize code based on the future framework on a Slurm HPC cluster. The obvious suggestion seems to be to use […]. I submitted the job using this sbatch script: […]. The call to […]. For the […], the elapsed times are reasonable: a lot of overhead for […]. In all, this picture is missing […]. Any idea, experience or suggestion?
So, this job will get assigned (up to) four compute hosts. In order for R to parallelize on those, you'll have to use `plan(cluster, workers = <hostnames>)`. None of the other future backends mentioned can scale out to multiple machines; they'll only run on the current machine.
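A minimal sketch of what that looks like in practice (assuming the future, future.apply, and parallelly packages are installed; outside a Slurm job, `availableWorkers()` simply falls back to `"localhost"`, so this also runs on a laptop):

```r
library(future)
library(future.apply)

## Inside a Slurm job, availableWorkers() expands the nodelist allotted
## to the job (from SLURM_JOB_NODELIST); outside Slurm it returns "localhost"
hosts <- parallelly::availableWorkers()

## Launch one R worker per allotted host
plan(cluster, workers = hosts)

## Each element is evaluated on one of those remote workers
nodes <- future_lapply(seq_along(hosts), function(i) Sys.info()[["nodename"]])
```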
The sequential/cluster ratio at 10.6/2.8 = 3.8 suggests that the code parallelizes nicely out to 4 workers running on 4 different machines. The other ratios - sequential/multisession = 0.05 and sequential/multicore = 0.3 - clearly indicate that something is not working as you expected. I'm still fairly new to Slurm (I've now got access to such a cluster, so I'm planning to catch up soon), but I suspect that you're ending up over-parallelizing here. It might be due to a bug in […]. Can you add the following to your job script:

```sh
env | grep SLURM
Rscript -e "parallelly::availableCores(which = 'all')"
```

and let me know what it gives? It should help reveal what's going on.
Nice, I didn't know about `availableWorkers()` - so you can do `plan(cluster, workers = availableWorkers())` without having to pass the expanded nodelist as an argument. Even better, the default for `plan(cluster)` is `workers = availableWorkers()`, so `plan(cluster)` alone is enough, and it'll automatically detect that you're running on Slurm and what hostnames you've got allotted to work with.
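In other words, these two calls should behave the same (a sketch relying on the documented default of the cluster backend; run outside Slurm it just uses the local machine):

```r
library(future)

## Equivalent: the cluster backend defaults to
## workers = parallelly::availableWorkers(), which parses the Slurm
## nodelist when running under Slurm and returns "localhost" otherwise
plan(cluster, workers = parallelly::availableWorkers())
plan(cluster)

## Resolve one future on whichever worker is picked
f <- future(Sys.info()[["nodename"]])
host <- value(f)
```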
So, first of all, that Mandelbrot demo is not the best example for benchmarking this, especially since it does not do any map-reduce internally. Here, tools such as […]. Instead, think about your HPC scheduler as a batch system with high throughput at the cost of high latency. That is, you can process lots of tasks over time, but the turnaround time per task is much higher than on a single machine. You wrote:
Yes, this is a known limitation of future.batchtools, or rather a lack of smarts in the core future framework, which cannot automagically predict what you want to do and merge chunks of futures into single ones. That's actually on the roadmap, but quite far ahead. Until then, you want to rely on map-reduce APIs such as future.apply and furrr to chunk up your data and distribute it in larger futures (= more elements per job). There's also a concept of nested parallelism that you can make use of. I use that myself when processing human sequencing data: I use future.batchtools to submit one job per person, and then a second layer where I process the 25 chromosomes in parallel on 25 cores using […].
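A hedged sketch of that nested topology (needs a Slurm cluster to actually run; the `resources` names depend on your batchtools `*.tmpl` file, and `samples` and `process()` are hypothetical placeholders for your own inputs and per-chromosome worker function):

```r
library(future)
library(future.apply)
library(future.batchtools)  # provides the batchtools_slurm backend

## Outer layer: one Slurm job per person.
## Inner layer: up to 25 cores within each job for per-chromosome work.
## NOTE: the 'resources' entries (e.g. ncpus) must match your Slurm template.
plan(list(
  tweak(batchtools_slurm, resources = list(ncpus = 25)),
  multicore
))

results <- future_lapply(samples, function(s) {       # one Slurm job per sample
  future_lapply(1:25, function(chr) process(s, chr))  # 25 chromosomes in parallel
})
```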