
Using sbatch does not work well with Julia JIT #77

@affans

Description

I recently came across this package now that ClusterManagers is throwing a deprecation warning for Slurm. The new package uses sbatch, letting SLURM handle the resource allocation, on top of which Julia then spawns the worker processes (i.e., sbatch -> srun). While this method works well, the workflow differs from ClusterManagers, and I don't think it is well suited to Julia's workflow, especially for prototyping and interactivity. Let me illustrate.

In the old method, my workflow was like this:

using ClusterManagers
using Distributed 
using Revise
addprocs(SlurmManager(10), kwargs) # ask for 10 tasks

@everywhere includet("model.jl")  # includet for Revise, contains functions for long-running scripts 

function run_simulations(params)  # defined on the head node
   results = pmap(1:100) do x 
      run_long_simulation(params) # runs on the worker processes
   end 
   return results  # an array of outputs from run_long_simulation
end

function process_simulations()   # defined/runs on the head node
# process/plot simulation results
end

This workflow was great. After the initial allocation and loading of the scripts via Revise, I could call run_simulations() and process_simulations() (which run on the head node) and generate summary statistics, plots, and so on. If I needed to change parameters, I could simply call run_simulations(params) with a different set, taking advantage of all the compiled code on the worker processes. Using Revise also meant I could go into run_long_simulation(), make my changes, and run_simulations() would pick those changes up across all workers.

The sbatch method does not give you this flexibility. The main issues are:

  • It has to compile the code every time you run sbatch script.jl.
  • You lose interactive flexibility - you have to save data to files, keep another Julia instance open for analysis, etc.
  • There are issues with the project directory, since sbatch runs from a different working directory (yes, there is an env variable set with the working dir, so it's manageable).
  • The initial execution of the script runs on the allocated resources instead of the head/login node.

This really hurts productivity and workflow, and it does not feel very "Julia"-like.
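For contrast, the sbatch-based workflow roughly looks like the following. This is a hypothetical submit script (the filename, flags, and script.jl are assumptions; details vary by cluster):

```shell
#!/bin/bash
#SBATCH --ntasks=10
#SBATCH --time=01:00:00

# Each submission starts a fresh Julia process, so all JIT compilation
# is repeated, and the job runs from SLURM's working directory.
# SLURM_SUBMIT_DIR points back at the directory you submitted from.
julia --project=$SLURM_SUBMIT_DIR script.jl
```

Every change to parameters or code means another round trip through the queue and another cold Julia start.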

Alternative / Going back to the old method
I found a workaround that replicates the old behaviour without using sbatch. From the terminal:

(base) odinuser02@podin:~$ salloc -N 2 --ntasks-per-node=10 bash
salloc: Granted job allocation 495
(base) odinuser02@podin:~$

This throws me into an interactive session (on the head node). Now I launch Julia:

julia> using Distributed, SlurmClusterManager

julia> addprocs(SlurmManager(),
                exeflags="--project=$(ENV["SLURM_SUBMIT_DIR"])")
20-element Vector{Int64}:
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21

This lets me work interactively; working directories/projects are easy (i.e., julia --project=. sets the correct project), I can use Revise, and I can prototype my model. Once my pmap returns data, I can use plotting libraries (which on my cluster are only installed on the head node).
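With the workers attached this way, the interactive loop from the first example carries over unchanged. A sketch, reusing the hypothetical model.jl, run_long_simulation, and params from above (it assumes a live Slurm allocation and Revise available on the workers):

```julia
using Distributed, Revise

# includet (from Revise) tracks the file, so edits to
# run_long_simulation are picked up on every worker.
@everywhere includet("model.jl")

# Re-running this after changing params (or editing model.jl)
# reuses the already-compiled code on the worker processes.
results = pmap(1:100) do i
    run_long_simulation(params)
end
```

No resubmission, no recompilation: the allocation and the worker processes stay alive between runs.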

julia> @everywhere println("hello from $(myid()):$(gethostname())")
hello from 1:podin
hello from 4:ops03
hello from 9:ops03
hello from 14:opsc01
hello from 6:ops03
hello from 2:ops03
hello from 10:ops03
hello from 7:ops03
hello from 21:ops03
hello from 5:ops03
hello from 11:opsc01
hello from 8:ops03

I am mainly filing this issue for awareness and to share a method that replicates the old workflow. I think having an example of using salloc in the README might be useful for a lot of folks.
