Description
What would you like to see added?
Slurm has unexpected (to us) behavior when using resource request flags in varying combinations.
While working on Parabricks, Prema discovered that the default behavior of --ntasks without --nodes is to allocate one node per task, which can be quite expensive.
A bit of diving through the sbatch docs unveiled some semi-implied, intended pathways for allocation strategies, and there are a lot of very different behaviors for controlling how allocations are made.
I can't go into the details just yet because we've only scratched the surface, but here are some examples...
--nodes without --ntasks: Allocates the specified number of nodes; --ntasks defaults to 1, and the request is then treated as in the next case.
--nodes with --ntasks: Allocates the specified number of nodes, and assumes the specified number of tasks per node.
--ntasks without --nodes: Behaves as though a flag --nodes-per-task=1 were specified. No such flag exists. The allocation is made with a number of nodes equal to --ntasks.
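The three cases above can be summarized in a small sketch. The function name and the rules it encodes are our reading of the observed behavior, not anything documented by Slurm:

```python
def effective_nodes(nodes=None, ntasks=None):
    """Model of the node count Slurm appears to allocate for a given
    combination of --nodes and --ntasks (observed behavior, not documented)."""
    if nodes is not None:
        # --nodes given: that many nodes; --ntasks defaults to 1 if absent.
        return nodes
    if ntasks is not None:
        # --ntasks without --nodes: behaves like a (nonexistent)
        # --nodes-per-task=1, i.e. one node per task.
        return ntasks
    return 1  # neither flag: a single node

print(effective_nodes(ntasks=2))  # → 2, one node per task
```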
There is also some inconsistency, either in how resources are named or in how requested vs. allocated resources are reported.
The following job was made with --ntasks=2 with no specification of --nodes.
AllocTRES ReqTRES
------------------------------------------------ ------------------------------------------------
billing=24,cpu=24,gres/gpu=2,mem=200G,node=2 billing=24,cpu=24,gres/gpu=1,mem=100G,node=1
Note that node is doubled relative to the req. The req records the exact input to --nodes (which defaulted to 1 because only --ntasks was specified), and likewise for the other resources. It does NOT record ntasks!
The alloc, however, records the total that was allocated, but only for node, mem, and gres/gpu; cpu is not doubled! It isn't fully clear whether the cpu allocation is recorded per node, or whether the allocated cpus were divided among the nodes. From past experience, I believe it is per node. But then why is mem not per node?
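A quick way to see exactly which fields differ is to parse and compare the two TRES strings (a small sketch; the strings are copied from the sacct output above):

```python
def parse_tres(s):
    """Parse a Slurm TRES string like 'cpu=24,mem=200G' into a dict."""
    return dict(item.split("=", 1) for item in s.split(","))

alloc = parse_tres("billing=24,cpu=24,gres/gpu=2,mem=200G,node=2")
req = parse_tres("billing=24,cpu=24,gres/gpu=1,mem=100G,node=1")

# Fields whose allocated value differs from the requested value:
diff = {k: (req[k], alloc[k]) for k in req if alloc[k] != req[k]}
print(diff)  # gres/gpu, mem, and node doubled; billing and cpu did not
```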
It may be that AllocTRES is not useful here, and that -o ReqMem should be used instead, which yields values like 100Gn or 10Gc that distinguish per-node from per-core memory requests.
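In that older-style sacct output, the trailing letter carries the scope ('n' = per node, 'c' = per core). A minimal sketch of reading such a value, assuming that two-character amount/unit/scope layout:

```python
def parse_reqmem(s):
    """Split a ReqMem value like '100Gn' into (amount, unit, scope).
    Trailing 'n' means per node, 'c' means per core (sacct's older format)."""
    scope = {"n": "per-node", "c": "per-core"}[s[-1]]
    amount, unit = s[:-2], s[-2]
    return int(amount), unit, scope

print(parse_reqmem("100Gn"))  # → (100, 'G', 'per-node')
```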