Add possibility to install a custom compiled slurm package#32

Open
ahmam wants to merge 21 commits into galaxyproject:main from mila-iqia:mila

@ahmam commented Apr 28, 2023

Add the possibility to install custom Slurm packages. The objective is to install the most recent version of Slurm on Debian or Ubuntu.

ahmam and others added 5 commits April 24, 2023 14:52
* Update main.yml

* Update defaults/main.yml

Co-authored-by: Bruno Travouillon <devel@travouillon.fr>

* install custom packages

* install custom packages

* install custom packages

* add custom debain repositories

* install custom packages

* install custom packages

* fix typo dynamique vars

* fix the FQCN

* add configure custom repos

* add example of custom repo

* add new  README.md

Co-authored-by: Bruno Travouillon <devel@travouillon.fr>

* remove custom package installtion

* remove condition slurm_apt_repository

* fix typo

* remove slurm_apt_repository condition

* remove slurm_apt_repository condition

* fix custom log directory creation

* move var to first part

* rename the var to static

* fix var

---------

Co-authored-by: Bruno Travouillon <devel@travouillon.fr>
Co-authored-by: ahmed <you@example.com>
(cherry picked from commit 712cf32)
@@ -0,0 +1,20 @@
---

- name: Check for existence of cluster in db.

[ansible-lint] reported by reviewdog 🐶
no-changed-when: Commands should not change things if nothing needs doing
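
One common way to satisfy this rule for a read-only check is to declare that the task never changes anything. The sketch below is illustrative only: the `sacctmgr` invocation and the `slurm_cluster_check` variable name are assumptions, not the task from this PR.

```yaml
# Hypothetical task body; only the task name comes from the diff above.
- name: Check for existence of cluster in db.
  ansible.builtin.command:
    cmd: sacctmgr --noheader list cluster
  register: slurm_cluster_check   # assumed variable name
  changed_when: false             # a query never modifies the database
```

With `changed_when: false`, the task always reports "ok" instead of "changed", which keeps repeated runs idempotent in the play recap.
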

@@ -0,0 +1,20 @@
---

- name: Check for existence of cluster in db.

[ansible-lint] reported by reviewdog 🐶
risky-shell-pipe: Shells that use pipes should set the pipefail option
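
A minimal sketch of how a piped check could set `pipefail`, assuming the task pipes `sacctmgr` output into `grep`. The command line and the `slurm_cluster_name` and `slurm_cluster_check` variables are hypothetical and are not taken from this PR.

```yaml
# Hypothetical task body; only the task name comes from the diff above.
- name: Check for existence of cluster in db.
  ansible.builtin.shell:
    cmd: |
      set -o pipefail
      sacctmgr --noheader list cluster | grep "{{ slurm_cluster_name }}"
    executable: /bin/bash         # pipefail is not supported by every /bin/sh
  register: slurm_cluster_check   # assumed variable name
  changed_when: false
  # grep exits with 1 when nothing matches; treat only higher codes as errors
  failed_when: slurm_cluster_check.rc > 1
```

Without `pipefail`, a failure of `sacctmgr` would be hidden by the exit status of `grep`, the last command in the pipe.
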

btravouillon and others added 7 commits October 2, 2023 11:49
SlurmctldHost can be defined more than once in slurm.conf to define a
primary and backup Slurm controllers. If SlurmctldHost is defined as a
list in `slurm_config`, then the template will add one line per list
item in slurm.conf.
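
As an illustration of the list form described in this commit message, a variables file might look like the sketch below. The host names are placeholders, and the rendered lines shown in the comments are inferred from the commit message rather than copied from the template.

```yaml
# Hypothetical host names; `slurm_config` is the role variable named above.
slurm_config:
  SlurmctldHost:
    - ctl-primary
    - ctl-backup

# Expected lines in the rendered slurm.conf, one per list item:
#   SlurmctldHost=ctl-primary
#   SlurmctldHost=ctl-backup
```
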

When ansible-playbook is called with `-e slurm_start_services=false`,
the variable is passed as a string, and any non-empty string evaluates
as true. Apply the `| bool` filter to evaluate the variable as a boolean.
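
A short sketch of the pattern, assuming a service task guarded by the variable. The task itself is hypothetical; only `slurm_start_services` and the `| bool` filter come from the commit message.

```yaml
# Hypothetical task; the `when` condition shows the fix described above.
- name: Start slurmd
  ansible.builtin.service:
    name: slurmd
    state: started
  # "false" passed with -e key=value is a string; `| bool` converts it to False
  when: slurm_start_services | bool
```
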

While the new feature to create a new SlurmDBD cluster could be nice in
some cases, it causes some issues when one wants to reinstall a node
with an existing database. IMHO, it is safer to disable this by default.
One should opt-in for Ansible to create a new cluster.

Make sure to execute the correct Slurm after the upgrade. Otherwise,
this may cause issues when reloading the service during logrotate since
some files changed on the system.
Comment on lines +6 to +9

- name: Include Configure custom repositories
  ansible.builtin.include_tasks: repositories-Debian.yml
  when: slurm_configure_repos

Should this support RHEL-like distros as well?
At the very least, this task should not run when the host's OS family is not Debian.
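
One possible way to address the second point, sketched under the assumption that facts are gathered for the host so that `ansible_os_family` is available:

```yaml
# Same task as in the diff, with an extra guard on the OS family.
- name: Include Configure custom repositories
  ansible.builtin.include_tasks: repositories-Debian.yml
  when:
    - slurm_configure_repos
    - ansible_os_family == "Debian"   # skip on RHEL-like and other families
```

Conditions given as a list are combined with a logical AND, so the include only runs when both hold.
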

Upon changes, some tasks will trigger the reload handler while others
will trigger the restart handler. During initial installation of Slurm
with the role, this can lead to a race condition where reload and
restart execute too close together, which can cause the second
action to fail.

When a restart handler is notified, there should be no need to trigger
the reload as well. Move the restart handlers before the reload handlers
to make sure those are executed first. Register their result in a
variable. Test if the variable is defined in the reload handler (which
means that the restart handler was executed).

Also remove the duplicate 'Reload slurmdbd' handler, which was defined
twice.
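
The ordering and guard described in this commit message could be sketched as follows. The handler names and the `slurmd_restarted` variable are assumptions made for illustration, not the names used in the role.

```yaml
# handlers/main.yml (sketch): handlers run in the order they are defined here,
# not in the order they are notified, so restart always comes before reload.
- name: Restart slurmd
  ansible.builtin.service:
    name: slurmd
    state: restarted
  register: slurmd_restarted          # hypothetical variable name

- name: Reload slurmd
  ansible.builtin.service:
    name: slurmd
    state: reloaded
  # The variable only exists if the restart handler actually ran.
  when: slurmd_restarted is not defined
```

If only the reload handler is notified, the restart handler never runs, the variable stays undefined, and the reload proceeds as usual.
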
@btravouillon

@ahmam I guess you should use another branch than mila for this PR since this is our soft-fork branch where we apply all our changes.

automatic node rebooting through Slurm. The implementation includes:

- New reboot_program.j2 template that handles node rebooting using
  either systemd or traditional init systems
- Configuration variable slurm_reboot_program to enable/disable the
  feature
- Task to install and configure the reboot program script
- Documentation in README.md explaining how to use the feature

The reboot program includes logging of reboot attempts and proper error
handling. It will attempt to use systemd's reboot command first, falling
back to traditional shutdown command if systemd is not available.

No argument is passed to the reboot script.

Also update the README to make it more readable.

Create a new variable slurm_topology_config to define the content of the
topology.conf file and ensure TopologyPlugin is defined in slurm.conf
when the topology must be configured.

The file topology.conf must exist when `TopologyPlugin` is defined in
slurm.conf, otherwise slurmd won't be able to start.
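
A rough sketch of how the two settings might be paired in a variables file. The commit message does not show the expected shape of `slurm_topology_config`, so the multi-line string used here is an assumption, as are the switch and node names.

```yaml
# Assumption: slurm_topology_config accepts the literal content of topology.conf.
slurm_config:
  TopologyPlugin: topology/tree

slurm_topology_config: |
  SwitchName=leaf1 Nodes=node[01-04]
  SwitchName=leaf2 Nodes=node[05-08]
  SwitchName=spine Switches=leaf[1-2]
```

Keeping both values in the same place makes it harder to enable `TopologyPlugin` without also providing the topology.conf content that slurmd requires.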