
Commit aab6486

Fix issues in slurm and torque so that multiple nodes can be added with a cold start
a. Add the ASG max number of dummy-compute nodes to the potentially available resources for both slurm and torque.
b. Make stack_name available to the nodewatcher so that the daemon that controls scaledown has access to the stack's status. The scaledown process only starts once the stack status is complete.

Signed-off-by: Balaji Sridharan <[email protected]>
1 parent a61a93f commit aab6486

3 files changed: +15 -3 lines changed


templates/default/nodewatcher.cfg.erb

Lines changed: 2 additions & 1 deletion
@@ -2,4 +2,5 @@
 region = <%= node['cfncluster']['cfn_region'] %>
 scheduler = <%= node['cfncluster']['cfn_scheduler'] %>
 proxy = <%= node['cfncluster']['cfn_proxy'] %>
-scaledown_idletime = <%= node['cfncluster']['cfn_scaledown_idletime'] %>
+scaledown_idletime = <%= node['cfncluster']['cfn_scaledown_idletime'] %>
+stack_name = <%= node['cfncluster']['stack_name'] %>
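Review note: the commit message says scaledown should only begin once the stack status is complete, and this template change is what hands the stack name to the nodewatcher. The nodewatcher itself is not part of this commit, so the following is only a minimal sketch of that gating rule. The helper name `scaledown_allowed?` and the choice of `CREATE_COMPLETE` and `UPDATE_COMPLETE` as the "complete" CloudFormation statuses are assumptions, not taken from the source.

```ruby
# Hypothetical helper illustrating the gating rule from the commit message.
# Assumption: "complete" means one of these CloudFormation stack statuses.
COMPLETE_STATUSES = %w[CREATE_COMPLETE UPDATE_COMPLETE].freeze

# Returns true only when the given stack status counts as complete,
# i.e. when the scaledown logic would be allowed to start.
def scaledown_allowed?(stack_status)
  COMPLETE_STATUSES.include?(stack_status.to_s)
end
```

With this rule, a stack that is still in `CREATE_IN_PROGRESS` during a cold start would not have its freshly launched nodes considered for scaledown.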

templates/default/slurm.conf.erb

Lines changed: 11 additions & 2 deletions
@@ -102,6 +102,15 @@ JobCompType=jobcomp/none
 # when manually editing! Default node is there to allow
 # submission of jobs to an empty cluster.
 #PARTITION:compute
-NodeName=dummy-compute Procs=2048 State=UNKNOWN
+<%
+def append_dummy
+if node['cfncluster']['cfn_max_queue_size'].to_i > 1
+"dummy-compute[1-" + node['cfncluster']['cfn_max_queue_size'] + "]"
+else
+"dummy-compute"
+end
+end
+%>
+NodeName=<%= append_dummy %> Procs=2048 State=UNKNOWN
 #NodeName= Procs=1 State=UNKNOWN
-PartitionName=compute Nodes=dummy-compute Default=YES MaxTime=INFINITE State=UP
+PartitionName=compute Nodes=<%= append_dummy %> Default=YES MaxTime=INFINITE State=UP
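Review note: `append_dummy` builds a Slurm hostlist expression such as `dummy-compute[1-10]` from `cfn_max_queue_size`. The string concatenation in the template works only if the attribute is a String, because in Ruby `String + Integer` raises a `TypeError`. Whether the attribute is always a String is not visible in this diff. Below is a minimal standalone sketch of the same logic that tolerates either type by using interpolation. The function name `dummy_node_list` is hypothetical.

```ruby
# Hypothetical standalone version of the template's append_dummy logic.
# Accepts the max queue size as a String, Integer, or nil.
def dummy_node_list(max_queue_size)
  count = max_queue_size.to_i # nil and non-numeric strings become 0
  # A single placeholder node keeps the original name.
  return "dummy-compute" unless count > 1

  # Interpolation avoids the TypeError that String + Integer would raise.
  "dummy-compute[1-#{count}]"
end
```

For example, `dummy_node_list("10")` and `dummy_node_list(10)` both yield `dummy-compute[1-10]`, while a size of 1 or less falls back to the single `dummy-compute` entry.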

templates/default/torque.setup.erb

Lines changed: 2 additions & 0 deletions
@@ -80,6 +80,7 @@ echo set server managers += $USER | qmgr
 qmgr -c 'set server scheduling = true'
 qmgr -c 'set server keep_completed = 300'
 qmgr -c 'set server mom_job_sync = true'
+qmgr -c 'set server resources_available.nodect = <%= node['cfncluster']['cfn_max_queue_size'] %>'

 # create default queue

@@ -89,6 +90,7 @@ qmgr -c 'set queue batch started = true'
 qmgr -c 'set queue batch enabled = true'
 qmgr -c 'set queue batch resources_default.walltime = 1:00:00'
 qmgr -c 'set queue batch resources_default.nodes = 1'
+qmgr -c 'set queue batch resources_available.nodect = <%= node['cfncluster']['cfn_max_queue_size'] %>'

 qmgr -c 'set server default_queue = batch'
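Review note: both added torque lines set `resources_available.nodect` from the same `cfn_max_queue_size` attribute. To check what the server-level line renders to, the sketch below evaluates it with Ruby's standard `ERB` class outside of Chef. The `node` hash is a stand-in for the Chef node object, the value passed in is illustrative only, and `render_nodect_line` is a hypothetical helper name.

```ruby
require 'erb'

# The server-level line added by this commit, copied from the diff.
TEMPLATE_LINE = "qmgr -c 'set server resources_available.nodect = " \
                "<%= node['cfncluster']['cfn_max_queue_size'] %>'"

# Renders the line with a plain Hash standing in for Chef's node object.
# Requires Ruby 2.5 or newer for ERB#result_with_hash.
def render_nodect_line(max_queue_size)
  node = { 'cfncluster' => { 'cfn_max_queue_size' => max_queue_size } }
  ERB.new(TEMPLATE_LINE).result_with_hash(node: node)
end

puts render_nodect_line('10')
```

Rendering the queue-level line works the same way with `set queue batch` in place of `set server`.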
