I recently came across this package now that ClusterManagers is throwing a deprecation warning for Slurm. The new package uses sbatch functionality, which lets Slurm handle the resource allocation, on top of which Julia then spawns the worker processes (i.e., sbatch -> srun). While this method works well, the workflow is different from ClusterManagers, and I don't think it is well suited to Julia's workflow, especially for prototyping and interactivity. Let me illustrate.

In the old method, my workflow was like this:
```julia
using ClusterManagers
using Distributed
using Revise

addprocs(SlurmManager(10); kwargs...) # ask for 10 tasks

@everywhere includet("model.jl") # includet for Revise; contains functions for long-running scripts

function run_simulations(params) # defined on the head node
    results = pmap(1:100) do x
        run_long_simulation(params) # runs on the worker processes
    end
    return results # an array of outputs from run_long_simulation
end

function process_simulations() # defined/runs on the head node
    # process/plot simulation results
end
```
This workflow was great. After the initial allocation and loading of the script with Revise, I could call run_simulations() and process_simulations() (which run on the head node) and generate summary statistics, plots, and so on. If I needed to change parameters, I could simply run run_simulations(params) with a different set, which means I take advantage of all the compiled code on the worker processes. Using Revise also means I could go into run_long_simulation(), make my changes, and run_simulations() would pick those changes up (across all workers).
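Concretely, the interactive loop at the REPL looks something like this (the parameter names and values here are made up for illustration; run_simulations and process_simulations are the functions defined above):

```julia
julia> results = run_simulations((beta = 0.1,)) # first call compiles on the workers

julia> process_simulations() # summarize/plot on the head node

julia> results = run_simulations((beta = 0.2,)) # new parameters, reuses compiled code

julia> process_simulations() # compare against the previous run
```

The key point is that the worker processes stay alive between calls, so only the first invocation pays the compilation cost.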
The sbatch method does not give you this flexibility. The main issues are:
- It has to compile the code every time you run sbatch script.jl.
- You lose interactive flexibility - you have to save data to files, keep another instance of Julia open for analysis, etc.
- Issues with the project directory, as sbatch runs from a different working directory (yes, there is an environment variable set with the submit directory, so it's manageable).
- The initial execution of the script runs on the allocated resources instead of the head/login node.
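For contrast, the sbatch-based workflow the issues above refer to looks roughly like this (the script and file names are hypothetical; this is a sketch, not the package's prescribed template):

```bash
#!/bin/bash
#SBATCH --ntasks=10
#SBATCH --time=01:00:00

# Every submission starts a fresh Julia process, so all code recompiles,
# and the script itself runs on the allocated node, not the login node.
# SLURM_SUBMIT_DIR points back at the directory sbatch was run from.
julia --project="$SLURM_SUBMIT_DIR" script.jl
```

Each parameter change means another `sbatch job.sh`, another compile, and results written to disk for a separate Julia session to analyze.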
This really hurts productivity and workflow, and it doesn't feel very "Julia"-like.
Alternative / Going back to the old method
I found a workaround that replicates the above behaviour without using sbatch. From the terminal:

```
(base) odinuser02@podin:~$ salloc -N 2 --ntasks-per-node=10 bash
salloc: Granted job allocation 495
(base) odinuser02@podin:~$
```
This throws me into an interactive session (on the head node). Now I launch Julia:

```julia
julia> using Distributed, SlurmClusterManager

julia> addprocs(SlurmManager(), exeflags="--project=$(ENV["SLURM_SUBMIT_DIR"])")
20-element Vector{Int64}:
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
```
This lets me work interactively, working directories/projects are easy (i.e., julia --project=. sets the correct project), I can use Revise, and I can prototype my model. Once my pmap returns data, I can use plotting libraries (which on my cluster are only available on the head node).
```julia
julia> @everywhere println("hello from $(myid()):$(gethostname())")
hello from 1:podin
hello from 4:ops03
hello from 9:ops03
hello from 14:opsc01
hello from 6:ops03
hello from 2:ops03
hello from 10:ops03
hello from 7:ops03
hello from 21:ops03
hello from 5:ops03
hello from 11:opsc01
hello from 8:ops03
```
I am mainly filing this issue for awareness and to share a method that replicates the old workflow. I think having an example of using salloc in the README might be useful for a lot of folks.