diff --git a/README.md b/README.md
index 8edd9a8..28a5f4d 100644
--- a/README.md
+++ b/README.md
@@ -2,27 +2,27 @@
 # stackhpc.openhpc
 
-This Ansible role installs packages and performs configuration to provide an OpenHPC v2.x Slurm cluster.
+This Ansible role installs packages and performs configuration to provide a Slurm cluster. By default this uses packages from [OpenHPC](https://openhpc.community/) but it can also use user-provided Slurm binaries.
 
 As a role it must be used from a playbook, for which a simple example is given below. This approach means it is totally modular with no assumptions about available networks or any cluster features except for some hostname conventions. Any desired cluster filesystem or other required functionality may be freely integrated using additional Ansible roles or other approaches.
 
 The minimal image for nodes is a RockyLinux 8 GenericCloud image.
 
+## Task files
+
+This role provides four task files which can be selected by using the `tasks_from` parameter of Ansible's `import_role` or `include_role` modules:
+- `main.yml`: Runs `install-ohpc.yml` and `runtime.yml`. Default if no `tasks_from` parameter is used.
+- `install-ohpc.yml`: Installs repos and packages for OpenHPC.
+- `install-generic.yml`: Installs systemd units etc. for user-provided binaries.
+- `runtime.yml`: Slurm/service configuration.
+
 ## Role Variables
 
+Variables only relevant for the `install-ohpc.yml` or `install-generic.yml` task files are marked as such below.
+
 `openhpc_extra_repos`: Optional list. Extra Yum repository definitions to configure, following the format of the Ansible
-[yum_repository](https://docs.ansible.com/ansible/2.9/modules/yum_repository_module.html) module. Respected keys for
-each list element:
-* `name`: Required
-* `description`: Optional
-* `file`: Required
-* `baseurl`: Optional
-* `metalink`: Optional
-* `mirrorlist`: Optional
-* `gpgcheck`: Optional
-* `gpgkey`: Optional
-
-`openhpc_slurm_service_enabled`: boolean, whether to enable the appropriate slurm service (slurmd/slurmctld).
+[yum_repository](https://docs.ansible.com/ansible/2.9/modules/yum_repository_module.html) module.
+
+`openhpc_slurm_service_enabled`: Optional boolean, whether to enable the appropriate slurm service (slurmd/slurmctld). Default `true`.
 
 `openhpc_slurm_service_started`: Optional boolean. Whether to start slurm services. If set to false, all services will be stopped. Defaults to `openhpc_slurm_service_enabled`.
@@ -30,7 +30,7 @@ each list element:
 `openhpc_slurm_control_host_address`: Optional string. IP address or name to use for the `openhpc_slurm_control_host`, e.g. to use a different interface than is resolved from `openhpc_slurm_control_host`.
 
-`openhpc_packages`: additional OpenHPC packages to install.
+`openhpc_packages`: Optional list. Additional OpenHPC packages to install (`install-ohpc.yml` only).
 
 `openhpc_enable`:
 * `control`: whether to enable control host
@@ -44,7 +44,15 @@ each list element:
 `openhpc_login_only_nodes`: Optional. If using "configless" mode specify the name of an ansible group containing nodes which are login-only nodes (i.e. not also control nodes), if required. These nodes will run `slurmd` to contact the control node for config.
 
-`openhpc_module_system_install`: Optional, default true. Whether or not to install an environment module system. If true, lmod will be installed. If false, You can either supply your own module system or go without one.
+`openhpc_module_system_install`: Optional, default true. Whether or not to install an environment module system. If true, lmod will be installed. If false, you can either supply your own module system or go without one (`install-ohpc.yml` only).
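As review context for the task-file split introduced above, a minimal playbook sketch may help. The host group name, binary paths and enabled flags here are illustrative assumptions, not role defaults:

```yaml
# Sketch: install Slurm from user-provided binaries via the new
# install-generic.yml task file, then run the usual runtime config.
# "openhpc_compute" and /opt/slurm/* are hypothetical values.
- hosts: openhpc_compute
  become: yes
  tasks:
    - name: Install Slurm from user-provided binaries
      ansible.builtin.include_role:
        name: stackhpc.openhpc
        tasks_from: install-generic.yml
      vars:
        openhpc_enable:
          batch: true
          runtime: true
        openhpc_sbin_dir: /opt/slurm/sbin
        openhpc_bin_dir: /opt/slurm/bin
```

Omitting `tasks_from` (or passing `main.yml`) keeps the previous OpenHPC-package behaviour.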
+
+`openhpc_generic_packages`: Optional. List of system packages to install, see `defaults/main.yml` for details (`install-generic.yml` only).
+
+`openhpc_sbin_dir`: Optional. Path to slurm daemon binaries such as `slurmctld`, default `/usr/sbin` (`install-generic.yml` only).
+
+`openhpc_bin_dir`: Optional. Path to Slurm user binaries such as `sinfo`, default `/usr/bin` (`install-generic.yml` only).
+
+`openhpc_lib_dir`: Optional. Path to Slurm libraries, default `/usr/lib64/slurm` (`install-generic.yml` only).
 
 ### slurm.conf
 
@@ -122,6 +130,16 @@ that this is *not the same* as the Ansible `omit` [special variable](https://doc
 `openhpc_state_save_location`: Optional. Absolute path for Slurm controller state (`slurm.conf` parameter [StateSaveLocation](https://slurm.schedmd.com/slurm.conf.html#OPT_StateSaveLocation))
 
+`openhpc_slurmd_spool_dir`: Optional. Absolute path for slurmd state (`slurm.conf` parameter [SlurmdSpoolDir](https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir))
+
+`openhpc_slurm_conf_template`: Optional. Path of Jinja template for the `slurm.conf` configuration file. Default is the `slurm.conf.j2` template in the role. **NB:** The required templating is complex; if you only need to set specific parameters, use `openhpc_config` instead.
+
+`openhpc_slurm_conf_path`: Optional. Path to write the templated `slurm.conf` configuration file to. Default `/etc/slurm/slurm.conf`.
+
+`openhpc_gres_template`: Optional. Path of Jinja template for the `gres.conf` configuration file. Default is the `gres.conf.j2` template in the role.
+
+`openhpc_cgroup_template`: Optional. Path of Jinja template for the `cgroup.conf` configuration file. Default is the `cgroup.conf.j2` template in the role.
+
 #### Accounting
 
 By default, no accounting storage is configured. OpenHPC v1.x and un-updated OpenHPC v2.0 clusters support file-based accounting storage which can be selected by setting the role variable `openhpc_slurm_accounting_storage_type` to `accounting_storage/filetxt`[1](#slurm_ver_footnote). Accounting for OpenHPC v2.1 and updated OpenHPC v2.0 clusters requires the Slurm database daemon, `slurmdbd` (although job completion may be a limited alternative, see [below](#Job-accounting)). To enable accounting:
diff --git a/defaults/main.yml b/defaults/main.yml
index 8b9d2e6..8f597ec 100644
--- a/defaults/main.yml
+++ b/defaults/main.yml
@@ -49,8 +49,12 @@ openhpc_cgroup_default_config:
 openhpc_config: {}
 openhpc_cgroup_config: {}
 openhpc_gres_template: gres.conf.j2
+openhpc_cgroup_template: cgroup.conf.j2
 openhpc_state_save_location: /var/spool/slurm
+openhpc_slurmd_spool_dir: /var/spool/slurm
+openhpc_slurm_conf_path: /etc/slurm/slurm.conf
+openhpc_slurm_conf_template: slurm.conf.j2
 
 # Accounting
 openhpc_slurm_accounting_storage_host: "{{ openhpc_slurmdbd_host }}"
@@ -80,6 +84,15 @@ openhpc_enable:
   database: false
   runtime: false
 
+# Only used for install-generic.yml:
+openhpc_generic_packages:
+  - munge
+  - mariadb-connector-c # only required on slurmdbd
+  - hwloc-libs # only required on slurmd
+openhpc_sbin_dir: /usr/sbin # path to slurm daemon binaries (e.g. slurmctld)
+openhpc_bin_dir: /usr/bin # path to slurm user binaries (e.g. sinfo)
+openhpc_lib_dir: /usr/lib64/slurm # path to slurm libraries
+
 # Repository configuration
 openhpc_extra_repos: []
 
@@ -127,12 +140,9 @@ ohpc_default_extra_repos:
     gpgcheck: true
     gpgkey: "https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-8"
 
-# Concatenate all repo definitions here
-ohpc_repos: "{{ ohpc_openhpc_repos[ansible_distribution_major_version] + ohpc_default_extra_repos[ansible_distribution_major_version] + openhpc_extra_repos }}"
-
 openhpc_munge_key_b64:
 openhpc_login_only_nodes: ''
-openhpc_module_system_install: true
+openhpc_module_system_install: true # only works for install-ohpc.yml/main.yml
 
 # Auto detection
 openhpc_ram_multiplier: 0.95
diff --git a/tasks/install-generic.yml b/tasks/install-generic.yml
new file mode 100644
index 0000000..a767797
--- /dev/null
+++ b/tasks/install-generic.yml
@@ -0,0 +1,72 @@
+- include_tasks: pre.yml
+
+- name: Create a list of slurm daemons
+  set_fact:
+    _ohpc_daemons: "{{ _ohpc_daemon_map | dict2items | selectattr('value') | items2dict | list }}"
+  vars:
+    _ohpc_daemon_map:
+      slurmctld: "{{ openhpc_enable.control }}"
+      slurmd: "{{ openhpc_enable.batch }}"
+      slurmdbd: "{{ openhpc_enable.database }}"
+
+- name: Ensure extra repos
+  ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module]
+  loop: "{{ openhpc_extra_repos }}"
+  loop_control:
+    label: "{{ item.name }}"
+
+- name: Install system packages
+  dnf:
+    name: "{{ openhpc_generic_packages }}"
+
+- name: Create Slurm user
+  user:
+    name: slurm
+    comment: SLURM resource manager
+    home: /etc/slurm
+    shell: /sbin/nologin
+
+- name: Create Slurm unit files
+  template:
+    src: "{{ item }}.service.j2"
+    dest: /lib/systemd/system/{{ item }}.service
+    owner: root
+    group: root
+    mode: ug=rw,o=r
+  loop: "{{ _ohpc_daemons }}"
+  register: _slurm_systemd_units
+
+- name: Get current library locations
+  shell:
+    cmd: "ldconfig -v | grep -v ^$'\t'" # noqa: no-tabs risky-shell-pipe
+  register: _slurm_ldconfig
+  changed_when: false
+
+- name: Add library locations to ldd search path
+  copy:
+    dest: /etc/ld.so.conf.d/slurm.conf
+    content: "{{ openhpc_lib_dir }}"
+    owner: root
+    group: root
+    mode: ug=rw,o=r
+  when: openhpc_lib_dir not in _ldd_paths
+  vars:
+    _ldd_paths: "{{ _slurm_ldconfig.stdout_lines | map('split', ':') | map('first') }}"
+
+- name: Reload Slurm unit files
+  # Can't do just this from systemd module
+  command: systemctl daemon-reload # noqa: command-instead-of-module no-changed-when no-handler
+  when: _slurm_systemd_units.changed
+
+- name: Prepend $PATH with slurm user binary location
+  lineinfile:
+    path: /etc/environment
+    line: "{{ new_path }}"
+    regexp: "^{{ new_path | regex_escape }}"
+    owner: root
+    group: root
+    mode: u=rw,go=r
+  vars:
+    new_path: PATH="{{ openhpc_bin_dir }}:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin"
+
+- meta: reset_connection # to get new environment
diff --git a/tasks/install.yml b/tasks/install-ohpc.yml
similarity index 75%
rename from tasks/install.yml
rename to tasks/install-ohpc.yml
index b3a9b91..fd07bbd 100644
--- a/tasks/install.yml
+++ b/tasks/install-ohpc.yml
@@ -3,16 +3,14 @@
 - include_tasks: pre.yml
 
 - name: Ensure OpenHPC repos
-  ansible.builtin.yum_repository:
-    name: "{{ item.name }}"
-    description: "{{ item.description | default(omit) }}"
-    file: "{{ item.file }}"
-    baseurl: "{{ item.baseurl | default(omit) }}"
-    metalink: "{{ item.metalink | default(omit) }}"
-    mirrorlist: "{{ item.mirrorlist | default(omit) }}"
-    gpgcheck: "{{ item.gpgcheck | default(omit) }}"
-    gpgkey: "{{ item.gpgkey | default(omit) }}"
-  loop: "{{ ohpc_repos }}"
+  ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module]
+  loop: "{{ ohpc_openhpc_repos[ansible_distribution_major_version] }}"
+  loop_control:
+    label: "{{ item.name }}"
+
+- name: Ensure extra repos
+  ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module]
+  loop: "{{ ohpc_default_extra_repos[ansible_distribution_major_version] + openhpc_extra_repos }}"
   loop_control:
     label: "{{ item.name }}"
diff --git a/tasks/main.yml b/tasks/main.yml
index bd10aaa..2f9569c 100644
--- a/tasks/main.yml
+++ b/tasks/main.yml
@@ -8,7 +8,7 @@
 - name: Install packages
   block:
-    - include_tasks: install.yml
+    - include_tasks: install-ohpc.yml
   when: openhpc_enable.runtime | default(false) | bool
   tags: install
diff --git a/tasks/runtime.yml b/tasks/runtime.yml
index c83cd32..7f9b0ae 100644
--- a/tasks/runtime.yml
+++ b/tasks/runtime.yml
@@ -11,12 +11,19 @@
 - name: Ensure Slurm directories exists
   file:
-    path: "{{ openhpc_state_save_location }}"
+    path: "{{ item.path }}"
     owner: slurm
     group: slurm
-    mode: 0755
+    mode: '0755'
     state: directory
-  when: inventory_hostname == openhpc_slurm_control_host
+  loop:
+    - path: "{{ openhpc_state_save_location }}" # StateSaveLocation
+      enable: control
+    - path: "{{ openhpc_slurm_conf_path | dirname }}"
+      enable: control
+    - path: "{{ openhpc_slurmd_spool_dir }}" # SlurmdSpoolDir
+      enable: batch
+  when: "openhpc_enable[item.enable] | default(false) | bool"
 
 - name: Retrieve Munge key from control host
   # package install generates a node-unique one
@@ -32,7 +39,7 @@
     dest: "/etc/munge/munge.key"
     owner: munge
     group: munge
-    mode: 0400
+    mode: '0400'
   register: _openhpc_munge_key_copy
 
 - name: Ensure JobComp logfile exists
@@ -41,7 +48,7 @@
     state: touch
     owner: slurm
     group: slurm
-    mode: 0644
+    mode: '0644'
     access_time: preserve
     modification_time: preserve
   when: openhpc_slurm_job_comp_type == 'jobcomp/filetxt'
@@ -49,7 +56,7 @@
 - name: Template slurmdbd.conf
   template:
     src: slurmdbd.conf.j2
-    dest: /etc/slurm/slurmdbd.conf
+    dest: "{{ openhpc_slurm_conf_path | dirname }}/slurmdbd.conf"
     mode: "0600"
     owner: slurm
     group: slurm
@@ -58,11 +65,11 @@
 - name: Template slurm.conf
   template:
-    src: slurm.conf.j2
-    dest: /etc/slurm/slurm.conf
+    src: "{{ openhpc_slurm_conf_template }}"
+    dest: "{{ openhpc_slurm_conf_path }}"
     owner: root
     group: root
-    mode: 0644
+    mode: '0644'
   when: openhpc_enable.control | default(false)
   notify:
     - Restart slurmctld service
@@ -72,7 +79,7 @@
 - name: Create gres.conf
   template:
     src: "{{ openhpc_gres_template }}"
-    dest: /etc/slurm/gres.conf
+    dest: "{{ openhpc_slurm_conf_path | dirname }}/gres.conf"
     mode: "0600"
     owner: slurm
     group: slurm
@@ -85,8 +92,8 @@
 - name: Template cgroup.conf
   # appears to be required even with NO cgroup plugins: https://slurm.schedmd.com/cgroups.html#cgroup_design
   template:
-    src: cgroup.conf.j2
-    dest: /etc/slurm/cgroup.conf
+    src: "{{ openhpc_cgroup_template }}"
+    dest: "{{ openhpc_slurm_conf_path | dirname }}/cgroup.conf"
     mode: "0644" # perms/ownership based off src from ohpc package
     owner: root
     group: root
@@ -96,15 +103,6 @@
   register: ohpc_cgroup_conf
 
 # NB uses restart rather than reload as this is needed in some cases
-- name: Remove local tempfile for slurm.conf templating
-  ansible.builtin.file:
-    path: "{{ _slurm_conf_tmpfile.path }}"
-    state: absent
-  when: _slurm_conf_tmpfile.path is defined
-  delegate_to: localhost
-  changed_when: false # so molecule doesn't fail
-  become: no
-
 - name: Ensure Munge service is running
   service:
     name: munge
@@ -129,7 +127,9 @@
     changed_when: true
   when:
     - openhpc_slurm_control_host in ansible_play_hosts
-    - hostvars[openhpc_slurm_control_host].ohpc_slurm_conf.changed or hostvars[openhpc_slurm_control_host].ohpc_cgroup_conf.changed or hostvars[openhpc_slurm_control_host].ohpc_gres_conf.changed # noqa no-handler
+    - hostvars[openhpc_slurm_control_host].ohpc_slurm_conf.changed or
+      hostvars[openhpc_slurm_control_host].ohpc_cgroup_conf.changed or
+      hostvars[openhpc_slurm_control_host].ohpc_gres_conf.changed # noqa no-handler
   notify:
     - Restart slurmd service
@@ -143,7 +143,7 @@
     create: yes
     owner: root
     group: root
-    mode: 0644
+    mode: '0644'
   when:
     - openhpc_enable.batch | default(false)
   notify:
diff --git a/templates/slurm.conf.j2 b/templates/slurm.conf.j2
index ffd4057..725cad7 100644
--- a/templates/slurm.conf.j2
+++ b/templates/slurm.conf.j2
@@ -2,7 +2,7 @@ ClusterName={{ openhpc_cluster_name }}
 
 # PARAMETERS
 {% for k, v in openhpc_default_config | combine(openhpc_config) | items %}
-{% if v != "omit" %}{# allow removing items using setting key: null #}
+{% if v != "omit" %}{# allow removing items by setting key: omit #}
 {% if k != 'SlurmctldParameters' %}{# handled separately due to configless mode #}
 {{ k }}={{ v | join(',') if (v is sequence and v is not string) else v }}
 {% endif %}
diff --git a/templates/slurmctld.service.j2 b/templates/slurmctld.service.j2
new file mode 100644
index 0000000..86d73d2
--- /dev/null
+++ b/templates/slurmctld.service.j2
@@ -0,0 +1,22 @@
+[Unit]
+Description=Slurm controller daemon
+After=network-online.target munge.service
+Wants=network-online.target
+ConditionPathExists={{ openhpc_slurm_conf_path }}
+
+[Service]
+Type=simple
+EnvironmentFile=-/etc/sysconfig/slurmctld
+EnvironmentFile=-/etc/default/slurmctld
+ExecStart={{ openhpc_sbin_dir }}/slurmctld -D -s -f {{ openhpc_slurm_conf_path }} $SLURMCTLD_OPTIONS
+ExecReload=/bin/kill -HUP $MAINPID
+LimitNOFILE=65536
+TasksMax=infinity
+
+# Uncomment the following lines to disable logging through journald.
+# NOTE: It may be preferable to set these through an override file instead.
+#StandardOutput=null
+#StandardError=null
+
+[Install]
+WantedBy=multi-user.target
diff --git a/templates/slurmd.service.j2 b/templates/slurmd.service.j2
new file mode 100644
index 0000000..501d0e9
--- /dev/null
+++ b/templates/slurmd.service.j2
@@ -0,0 +1,25 @@
+[Unit]
+Description=Slurm node daemon
+After=munge.service network-online.target remote-fs.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+EnvironmentFile=-/etc/sysconfig/slurmd
+EnvironmentFile=-/etc/default/slurmd
+ExecStart={{ openhpc_sbin_dir }}/slurmd -D -s $SLURMD_OPTIONS
+ExecReload=/bin/kill -HUP $MAINPID
+KillMode=process
+LimitNOFILE=131072
+LimitMEMLOCK=infinity
+LimitSTACK=infinity
+Delegate=yes
+TasksMax=infinity
+
+# Uncomment the following lines to disable logging through journald.
+# NOTE: It may be preferable to set these through an override file instead.
+#StandardOutput=null
+#StandardError=null
+
+[Install]
+WantedBy=multi-user.target
diff --git a/templates/slurmdbd.service.j2 b/templates/slurmdbd.service.j2
new file mode 100644
index 0000000..591f1d5
--- /dev/null
+++ b/templates/slurmdbd.service.j2
@@ -0,0 +1,22 @@
+[Unit]
+Description=Slurm DBD accounting daemon
+After=network-online.target munge.service mysql.service mysqld.service mariadb.service
+Wants=network-online.target
+ConditionPathExists={{ openhpc_slurm_conf_path | dirname + '/slurmdbd.conf' }}
+
+[Service]
+Type=simple
+EnvironmentFile=-/etc/sysconfig/slurmdbd
+EnvironmentFile=-/etc/default/slurmdbd
+ExecStart={{ openhpc_sbin_dir }}/slurmdbd -D -s $SLURMDBD_OPTIONS
+ExecReload=/bin/kill -HUP $MAINPID
+LimitNOFILE=65536
+TasksMax=infinity
+
+# Uncomment the following lines to disable logging through journald.
+# NOTE: It may be preferable to set these through an override file instead.
+#StandardOutput=null
+#StandardError=null
+
+[Install]
+WantedBy=multi-user.target
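Since the `slurm.conf.j2` change hinges on the literal string `"omit"` dropping a parameter, a short vars sketch may help; the parameter names below are illustrative, not necessarily role defaults:

```yaml
# Sketch: inventory/group_vars fragment. Setting a value to the literal
# string "omit" removes that parameter from the rendered slurm.conf;
# note this is *not* Ansible's `omit` special variable.
openhpc_config:
  SlurmctldDebug: debug    # override or add a parameter
  ReturnToService: omit    # drop this parameter entirely
```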