
Commit a0d6f53

Merge pull request #120 from stackhpc/feature/autoscale
Add support for autoscaling
2 parents 42db2fa + 9b04782

File tree: 15 files changed, +203 -54 lines

.github/workflows/ci.yml

Lines changed: 6 additions & 0 deletions

@@ -41,6 +41,8 @@ jobs:
           - test10
           - test11
           - test12
+          - test13
+          - test14

         exclude:
           - image: 'centos:7'
@@ -59,6 +61,10 @@ jobs:
             scenario: test11
           - image: 'centos:7'
             scenario: test12
+          - image: 'centos:7'
+            scenario: test13
+          - image: 'centos:7'
+            scenario: test14

     steps:
       - name: Check out the codebase.

README.md

Lines changed: 11 additions & 8 deletions

@@ -39,32 +39,35 @@ package in the image.

 `openhpc_module_system_install`: Optional, default true. Whether or not to install an environment module system. If true, lmod will be installed. If false, you can either supply your own module system or go without one.

-`openhpc_ram_multiplier`: Optional, default `0.95`. Multiplier used in the calculation: `total_memory * openhpc_ram_multiplier` when setting `RealMemory` for the partition in slurm.conf. Can be overriden on a per partition basis using `openhpc_slurm_partitions.ram_multiplier`. Has no effect if `openhpc_slurm_partitions.ram_mb` is set.
-
 ### slurm.conf

 `openhpc_slurm_partitions`: list of one or more slurm partitions. Each partition may contain the following values:
 * `groups`: If there are multiple node groups that make up the partition, a list of group objects can be defined here.
   Otherwise, `groups` can be omitted and the following attributes can be defined in the partition object:
   * `name`: The name of the nodes within this group.
   * `cluster_name`: Optional. An override for the top-level definition `openhpc_cluster_name`.
+  * `extra_nodes`: Optional. A list of additional node definitions, e.g. for nodes in this group/partition not controlled by this role. Each item should be a dict, with keys/values as per the ["NODE CONFIGURATION"](https://slurm.schedmd.com/slurm.conf.html#lbAE) docs for slurm.conf. Note the key `NodeName` must be first.
   * `ram_mb`: Optional. The physical RAM available in each server of this group ([slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `RealMemory`) in MiB. This is set using ansible facts if not defined, equivalent to `free --mebi` total * `openhpc_ram_multiplier`.
-
-For each group (if used) or partition there must be an ansible inventory group `<cluster_name>_<group_name>`, with all nodes in this inventory group added to the group/partition. Note that:
-- Nodes may have arbitrary hostnames but these should be lowercase to avoid a mismatch between inventory and actual hostname.
-- Nodes in a group are assumed to be homogenous in terms of processor and memory.
-- An inventory group may be empty, but if it is not then the play must contain at least one node from it (used to set processor information).
-  * `ram_multiplier`: Optional. An override for the top-level definition `openhpc_ram_multiplier`. Has no effect if `ram_mb` is set.
+  * `ram_multiplier`: Optional. An override for the top-level definition `openhpc_ram_multiplier`. Has no effect if `ram_mb` is set.
 * `default`: Optional. A boolean flag for whether this partition is the default. Valid settings are `YES` and `NO`.
 * `maxtime`: Optional. A partition-specific time limit in hours, minutes and seconds ([slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `MaxTime`). The default value is
 given by `openhpc_job_maxtime`.

+For each group (if used) or partition, any nodes in an ansible inventory group `<cluster_name>_<group_name>` will be added to the group/partition. Note that:
+- Nodes may have arbitrary hostnames, but these should be lowercase to avoid a mismatch between inventory and actual hostname.
+- Nodes in a group are assumed to be homogeneous in terms of processor and memory.
+- An inventory group may be empty, but if it is not then the play must contain at least one node from it (used to set processor information).
+- Nodes may not appear in more than one group.
+- A group/partition definition which has neither a corresponding inventory group nor `extra_nodes` will raise an error (see the sketch after this diff).
+
 `openhpc_job_maxtime`: A maximum time job limit in hours, minutes and seconds. The default is `24:00:00`.

 `openhpc_cluster_name`: name of the cluster

 `openhpc_config`: Mapping of additional parameters and values for `slurm.conf`. Note these will override any included in `templates/slurm.conf.j2`.

+`openhpc_ram_multiplier`: Optional, default `0.95`. Multiplier used in the calculation: `total_memory * openhpc_ram_multiplier` when setting `RealMemory` for the partition in slurm.conf. Can be overridden on a per-partition basis using `openhpc_slurm_partitions.ram_multiplier`. Has no effect if `openhpc_slurm_partitions.ram_mb` is set.
+
 `openhpc_state_save_location`: Optional. Absolute path for Slurm controller state (`slurm.conf` parameter [StateSaveLocation](https://slurm.schedmd.com/slurm.conf.html#OPT_StateSaveLocation))

 #### Accounting
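To make the new `extra_nodes` documentation concrete, here is a minimal sketch of a partition definition using it. The cluster name, node names and sizes are illustrative assumptions, not taken from this commit; see molecule/test14 below for the tested configuration:

```yaml
# group_vars sketch -- hypothetical values
openhpc_cluster_name: mycluster
openhpc_slurm_partitions:
  - name: compute            # role-managed nodes come from inventory group mycluster_compute
    extra_nodes:
      # Additional slurm.conf node entries not managed by this role,
      # e.g. nodes created on demand by an autoscaler.
      - NodeName: burst-[0-3]  # NodeName must be the first key
        State: CLOUD
        CPUs: 8
        RealMemory: 30000
```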

defaults/main.yml

Lines changed: 2 additions & 1 deletion

@@ -12,6 +12,8 @@ openhpc_resume_timeout: 300
 openhpc_retry_delay: 10
 openhpc_job_maxtime: 24:00:00
 openhpc_config: "{{ openhpc_extra_config | default({}) }}"
+openhpc_slurm_configless: "{{ 'enable_configless' in openhpc_config.get('SlurmctldParameters', []) }}"
+
 openhpc_state_save_location: /var/spool/slurm

 # Accounting
@@ -49,7 +51,6 @@ ohpc_slurm_services:
 ohpc_release_repos:
   "7": "https://github.com/openhpc/ohpc/releases/download/v1.3.GA/ohpc-release-1.3-1.el7.x86_64.rpm" # ohpc v1.3 for Centos 7
   "8": "http://repos.openhpc.community/OpenHPC/2/CentOS_8/x86_64/ohpc-release-2-1.el8.x86_64.rpm" # ohpc v2 for Centos 8
-openhpc_slurm_configless: false
 openhpc_munge_key:
 openhpc_login_only_nodes: ''
 openhpc_module_system_install: true
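With this change `openhpc_slurm_configless` is derived from `openhpc_config` rather than defaulting to `false`. A minimal sketch of how a deployment might now enable configless operation (the variable values are illustrative assumptions, not part of this commit):

```yaml
# group_vars sketch -- hypothetical values
openhpc_config:
  SlurmctldParameters:
    - enable_configless   # presence of this entry makes openhpc_slurm_configless evaluate true
  FirstJobId: 13          # any other slurm.conf overrides can sit alongside it
```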

filter_plugins/group_hosts.py renamed to filter_plugins/slurm_conf.py

Lines changed: 28 additions & 5 deletions

@@ -12,6 +12,8 @@
 # License for the specific language governing permissions and limitations
 # under the License.

+# NB: To test this from the repo root run:
+#   ansible-playbook -i tests/inventory -i tests/inventory-mock-groups tests/filter.yml

 from ansible import errors
 import jinja2
@@ -30,11 +32,25 @@ def _get_hostvar(context, var_name, inventory_hostname=None):
     namespace = context["hostvars"][inventory_hostname]
     return namespace.get(var_name)

-@jinja2.contextfilter
-def group_hosts(context, group_names):
-    return {g:_group_hosts(context["groups"].get(g, [])) for g in sorted(group_names)}
+def hostlist_expression(hosts):
+    """ Group hostnames using Slurm's hostlist expression format.
+
+        E.g. with an inventory containing:
+
+        [compute]
+        dev-foo-0 ansible_host=localhost
+        dev-foo-3 ansible_host=localhost
+        my-random-host
+        dev-foo-4 ansible_host=localhost
+        dev-foo-5 ansible_host=localhost
+        dev-compute-0 ansible_host=localhost
+        dev-compute-1 ansible_host=localhost
+
+        Then "{{ groups['compute'] | hostlist_expression }}" will return:
+
+        ["dev-foo-[0,3-5]", "dev-compute-[0-1]", "my-random-host"]
+    """

-def _group_hosts(hosts):
     results = {}
     unmatchable = []
     for v in hosts:
@@ -58,9 +74,16 @@ def _group_numbers(numbers):
         prev = v
     return ','.join(['{}-{}'.format(u[0], u[-1]) if len(u) > 1 else str(u[0]) for u in units])

+def error(condition, msg):
+    """ Raise an error if condition is not True """
+
+    if not condition:
+        raise errors.AnsibleFilterError(msg)
+
 class FilterModule(object):

     def filters(self):
         return {
-            'group_hosts': group_hosts
+            'hostlist_expression': hostlist_expression,
+            'error': error,
         }
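The NB comment above points at `tests/filter.yml` for the real exercise of these filters. As a rough standalone sketch (the playbook name, host list and expected output are assumptions based on the docstring, not the actual test file), the renamed filters might be used like this:

```yaml
# filter-demo.yml -- hypothetical; run from the repo root so the filter plugin is picked up
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Collapse hostnames into Slurm hostlist expressions
      debug:
        msg: "{{ ['dev-foo-0', 'dev-foo-3', 'dev-foo-4', 'my-random-host'] | hostlist_expression }}"
        # per the docstring this should print ["dev-foo-[0,3-4]", "my-random-host"]

    - name: Use the error filter as an inline guard
      debug:
        msg: "{{ (groups['all'] | length > 0) | error('inventory must not be empty') }}"
        # error() raises AnsibleFilterError when the condition is false; otherwise it returns None
```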

molecule/README.md

Lines changed: 1 addition & 0 deletions

@@ -21,6 +21,7 @@ test10 | 1 | N | As for #5 but then tries to ad
 test11 | 1 | N | As for #5 but then deletes a node (actually changes the partition due to molecule/ansible limitations)
 test12 | 1 | N | As for #5 but enabling job completion and testing `sacct -c`
 test13 | 1 | N | As for #5 but tests `openhpc_config` variable.
+test14 | 1 |   | As for #5 but also tests `extra_nodes` via State=DOWN nodes.

 # Local Installation & Running

molecule/test13/verify.yml

Lines changed: 2 additions & 2 deletions

@@ -13,8 +13,8 @@
       command: scontrol show config
       register: slurm_config
     - assert:
-        that: "item in slurm_config.stdout"
+        that: "item in (slurm_config.stdout_lines | map('replace', ' ', ''))"
         fail_msg: "FAILED - {{ item }} not found in slurm config"
       loop:
         - SlurmctldSyslogDebug=error
-        - SlurmctldSyFirstJobId=13
+        - FirstJobId=13
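(For context on this fix: `scontrol show config` prints each setting as e.g. `FirstJobId = 13`, with key and value separated by padded spaces, so bare `Key=Value` loop items cannot match the raw output directly; stripping spaces from each line makes the comparison exact.)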

molecule/test14/converge.yml

Lines changed: 30 additions & 0 deletions

@@ -0,0 +1,30 @@
+---
+- name: Converge
+  hosts: all
+  tasks:
+    - name: "Include ansible-role-openhpc"
+      include_role:
+        name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}"
+      vars:
+        openhpc_enable:
+          control: "{{ inventory_hostname in groups['testohpc_login'] }}"
+          batch: "{{ inventory_hostname in groups['testohpc_compute'] }}"
+          runtime: true
+        openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}"
+        openhpc_slurm_partitions:
+          - name: "compute"
+            extra_nodes:
+              # Need to specify IPs for the non-existent State=DOWN nodes, because otherwise even in this state slurmctld will exclude a node with no lookup information from the config.
+              # We use invalid IPs here (i.e. starting 0.) to flag the fact the nodes shouldn't exist.
+              # Note this has to be done via slurm config rather than /etc/hosts due to Docker limitations on modifying the latter.
+              - NodeName: fake-x,fake-y
+                NodeAddr: 0.42.42.0,0.42.42.1
+                State: DOWN
+                CPUs: 1
+              - NodeName: fake-2cpu-[3,7-9]
+                NodeAddr: 0.42.42.3,0.42.42.7,0.42.42.8,0.42.42.9
+                State: DOWN
+                CPUs: 2
+        openhpc_cluster_name: testohpc
+        openhpc_slurm_configless: true

molecule/test14/molecule.yml

Lines changed: 60 additions & 0 deletions

@@ -0,0 +1,60 @@
+---
+name: single partition, group is partition
+driver:
+  name: docker
+platforms:
+  - name: testohpc-login-0
+    image: ${MOLECULE_IMAGE}
+    pre_build_image: true
+    groups:
+      - testohpc_login
+    command: /sbin/init
+    tmpfs:
+      - /run
+      - /tmp
+    volumes:
+      - /sys/fs/cgroup:/sys/fs/cgroup:ro
+    networks:
+      - name: net1
+    docker_networks:
+      - name: net1
+        driver_options:
+          com.docker.network.driver.mtu: ${DOCKER_MTU:-1500} # 1500 is docker default
+  - name: testohpc-compute-0
+    image: ${MOLECULE_IMAGE}
+    pre_build_image: true
+    groups:
+      - testohpc_compute
+    command: /sbin/init
+    tmpfs:
+      - /run
+      - /tmp
+    volumes:
+      - /sys/fs/cgroup:/sys/fs/cgroup:ro
+    networks:
+      - name: net1
+    docker_networks:
+      - name: net1
+        driver_options:
+          com.docker.network.driver.mtu: ${DOCKER_MTU:-1500} # 1500 is docker default
+  - name: testohpc-compute-1
+    image: ${MOLECULE_IMAGE}
+    pre_build_image: true
+    groups:
+      - testohpc_compute
+    command: /sbin/init
+    tmpfs:
+      - /run
+      - /tmp
+    volumes:
+      - /sys/fs/cgroup:/sys/fs/cgroup:ro
+    networks:
+      - name: net1
+    docker_networks:
+      - name: net1
+        driver_options:
+          com.docker.network.driver.mtu: ${DOCKER_MTU:-1500} # 1500 is docker default
+provisioner:
+  name: ansible
+verifier:
+  name: ansible

molecule/test14/verify.yml

Lines changed: 12 additions & 0 deletions

@@ -0,0 +1,12 @@
+---
+
+- name: Check slurm hostlist
+  hosts: testohpc_login
+  tasks:
+    - name: Get slurm partition info
+      command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace
+      register: sinfo
+    - name:
+      assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
+        that: "sinfo.stdout_lines == ['compute*,up,60-00:00:00,6,down*,fake-2cpu-[3,7-9],fake-x,fake-y', 'compute*,up,60-00:00:00,2,idle,testohpc-compute-[0-1]']"
+        fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}"

molecule/test5/molecule.yml

Lines changed: 12 additions & 0 deletions

@@ -16,6 +16,10 @@ platforms:
       - /sys/fs/cgroup:/sys/fs/cgroup:ro
     networks:
       - name: net1
+    docker_networks:
+      - name: net1
+        driver_options:
+          com.docker.network.driver.mtu: ${DOCKER_MTU:-1500} # 1500 is docker default
   - name: testohpc-compute-0
     image: ${MOLECULE_IMAGE}
     pre_build_image: true
@@ -29,6 +33,10 @@ platforms:
       - /sys/fs/cgroup:/sys/fs/cgroup:ro
     networks:
       - name: net1
+    docker_networks:
+      - name: net1
+        driver_options:
+          com.docker.network.driver.mtu: ${DOCKER_MTU:-1500} # 1500 is docker default
   - name: testohpc-compute-1
     image: ${MOLECULE_IMAGE}
     pre_build_image: true
@@ -42,6 +50,10 @@ platforms:
       - /sys/fs/cgroup:/sys/fs/cgroup:ro
     networks:
       - name: net1
+    docker_networks:
+      - name: net1
+        driver_options:
+          com.docker.network.driver.mtu: ${DOCKER_MTU:-1500} # 1500 is docker default
 provisioner:
   name: ansible
 verifier:
