@bertiethorpe bertiethorpe commented Jan 29, 2025

Adds a rebuild role to the appliance. The role uses https://github.com/stackhpc/slurm-openstack-tools//slurm_openstack_tools/reboot.py, a RebootProgram running on the control node, which:

  • Reads each compute node's hostvars.yml file to get the "target" image defined for that node.
  • Uses the OpenStack API to check the node's current image.
  • If the two do not match, reimages the node with the target image via OpenStack.
  • If no target image is defined (i.e. the node does not have this functionality enabled), or if the target matches the current image, carries out a normal reboot (NB: not a rebuild) via OpenStack.
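
The decision described in the list above can be sketched roughly as follows. This is only an illustration of the rebuild-versus-reboot choice, not the code in reboot.py: the function name `choose_action` and the string return values are hypothetical, and the steps that read hostvars.yml and call the OpenStack API are left out.

```python
from typing import Optional


def choose_action(target_image: Optional[str], current_image: str) -> str:
    """Decide what to do with a node (hypothetical helper, not from reboot.py).

    target_image:  image named in the node's hostvars, or None if not defined
    current_image: image the node is currently running, as reported by OpenStack
    """
    # Rebuild only when a target is defined and differs from the current image.
    if target_image and target_image != current_image:
        return "rebuild"
    # No target defined, or the images already match: plain reboot.
    return "reboot"


if __name__ == "__main__":
    # Image names below are made up for the example.
    print(choose_action("image-v2", "image-v1"))
    print(choose_action("image-v1", "image-v1"))
    print(choose_action(None, "image-v1"))
```

Keeping the comparison separate from the OpenStack calls means the rebuild/reboot choice can be checked without a cloud connection.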

Invoked via Slurm by running, e.g.:

# srun --reboot -N 2 uptime

bertiethorpe and others added 4 commits January 23, 2025 15:23
* define login nodes using tf module

* Apply suggestions from code review

Co-authored-by: Matt Anson <[email protected]>

* tweak README to explain compute groups

* try to clarify login/compute groups

---------

Co-authored-by: Matt Anson <[email protected]>
* change terraform references to opentofu in docs

* remove wider reference to terraform

* Update environments/README.md

Co-authored-by: Steve Brasier <[email protected]>

* Update environments/common/README.md

Co-authored-by: Steve Brasier <[email protected]>

---------

Co-authored-by: Steve Brasier <[email protected]>
@bertiethorpe bertiethorpe requested a review from a team as a code owner January 29, 2025 14:03
@bertiethorpe bertiethorpe requested a review from sjpb February 4, 2025 10:55

@sjpb sjpb left a comment

Some minor tweaks. I had a comment re. defaults or something but TBH I can't remember what that was - it was for a file not changed in this PR.

@bertiethorpe bertiethorpe requested a review from sjpb February 6, 2025 14:40
@bertiethorpe bertiethorpe changed the title from "Slurm Rebuild" to "Slurm controlled reimaging RebootProgram for compute-init" Feb 11, 2025
@bertiethorpe bertiethorpe changed the title from "Slurm controlled reimaging RebootProgram for compute-init" to "Slurm controlled reimaging RebootProgram" Feb 11, 2025
@sjpb sjpb changed the title from "Slurm controlled reimaging RebootProgram" to "Support compute node rebuild/reboot via Slurm RebootProgram" Feb 11, 2025

@sjpb sjpb left a comment

LGTM

@sjpb sjpb merged commit 112aa6e into main Feb 11, 2025
2 checks passed
@sjpb sjpb deleted the feat/slurm-rebuild branch February 11, 2025 11:41