Skip to content

Conversation

bertiethorpe
Copy link
Member

No description provided.

@bertiethorpe bertiethorpe requested a review from a team as a code owner November 13, 2024 13:18
@bertiethorpe bertiethorpe requested a review from sjpb November 13, 2024 13:18
@sjpb sjpb marked this pull request as draft November 13, 2024 13:19
@bertiethorpe
Copy link
Member Author

Tested by running (from deploy host)
ansible-playbook -v ansible/adhoc/rebuild.yml --limit compute
ansible-playbook -v ansible/extras.yml --tags compute_init

Then in the compute node:
sudo rm /var/lib/ansible-init.done
sudo systemctl restart ansible-init
sudo scontrol reconfigure

Check status of nodes:
scontrol show nodes

Run tests (from deploy host):
ansible-playbook ansible/adhoc/hpctests.yml -v

@bertiethorpe bertiethorpe marked this pull request as ready for review November 19, 2024 16:11
@bertiethorpe bertiethorpe marked this pull request as draft November 20, 2024 15:34
@bertiethorpe bertiethorpe marked this pull request as ready for review November 20, 2024 15:39
group: root
mode: 0644
loop:
- ../../basic_users/library/terminate_user_sessions.py
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this; there's no way there can be sessions for local users we want to remove running on boot, which is the only case we need this

loop:
- ../../basic_users/library/terminate_user_sessions.py
- ../../stackhpc.os-manila-mount/library/os_manila_share.py
- ../../stackhpc.openhpc/library/sacct_cluster.py
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed, not used at all (as shown by grep). Some sleuthing found the taskfile which used it was removed in stackhpc.openhpc v0.22 as no longer required, so we should delete it from that role!

mode: 0644
loop:
- ../../basic_users/filter_plugins/filter_keys.py
- ../../stackhpc.openhpc/filter_plugins/slurm_conf.py
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So some grepping shows this provides:

  • hostlist_expression: only used for control node templating slurm.conf and gres.conf, not relevant here
  • dict2parameters: only used for control node templating slurm.conf, not relevant here
  • error: can't find where this is used

So remove?

- ../../basic_users/filter_plugins/filter_keys.py
- ../../stackhpc.openhpc/filter_plugins/slurm_conf.py

- name: Add filter_plugins ansible.cfg
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok as a workaround. Should we move to ansible-init's own cfg definition at some point?

state: directory
owner: root
group: root
mode: 0755
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're putting secrets in here, is this OK?

@@ -0,0 +1,150 @@
---

- name: Ensure directories exist
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is running on group compute_init which via everything layout defaults to cluster.

So you're creating the directories etc. on EVERY node. Whereas we only want to do that on the control node.

cluster

[compute_init:children]
# Hosts to deploy compute initialisation ansible-init script to.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this description is right.

Deploying the actual compute init script is/will be done in the image build.

This should control which hosts get info templated out (and eventually, metadata set to turn on the feature, I think).


[compute_init:children]
# Hosts to deploy compute initialisation ansible-init script to.
cluster
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is right, it needs to either be builder or compute. I need to discuss.

nfs_enable:
server: "{{ inventory_hostname in groups['control'] }}"
clients: false
nfs_export: "/exports/cluster" # control node has to copy in /etc/hosts to here
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
nfs_export: "/exports/cluster" # control node has to copy in /etc/hosts to here
nfs_export: "/exports/cluster"

[ansible_init]
# Hosts to run linux-anisble-init

[compute_init]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments on everything.

@bertiethorpe bertiethorpe merged commit a5cbc58 into main Dec 20, 2024
@bertiethorpe bertiethorpe deleted the feat/compute-script branch December 20, 2024 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants