sjpb (Collaborator) commented Apr 23, 2025

See ansible/roles/nhc/README.md. Checks are templated out on each compute node during site.yml and then restored after a slurm-controlled rebuild, so that the post-rebuild checks verify the node is in the same state as it was before.

Note that this also moves the configuration of NFS-exported data for compute-init to the end of the site.yml playbook, which makes it easier to order roles that require compute-init support.

sjpb force-pushed the feat/nhc-v2 branch 3 times, most recently from 94e7006 to f999b12, June 18, 2025 08:04

sjpb (Collaborator, Author) commented Jun 18, 2025

Manual checks that this works:

  • Deploy a cluster at this branch
  • Remove the NFS mount and check that the node goes DOWN, then remount
  • Remove the ansible-init sentinel, run ansible-init again, and check that it works

sjpb (Collaborator, Author) commented Jun 18, 2025

Failed manual checks when deploying:

TASK [nhc : Fetch generated NHC config to control node storage] ********************************************************
Wednesday 18 June 2025  09:37:50 +0000 (0:00:00.134)       0:12:55.714 ******** 
fatal: [RL9-compute-1]: FAILED! => {}

MSG:

Unable to create local directories(/exports/cluster/hostconfigRL9-compute-1): [Errno 13] Permission denied: b'/exports'
fatal: [RL9-compute-0]: FAILED! => {}

MSG:

Unable to create local directories(/exports/cluster/hostconfigRL9-compute-0): [Errno 13] Permission denied: b'/exports'

sjpb (Collaborator, Author) commented Jun 18, 2025

Ok so the problem is:

  • The image we're using on this branch now has the nhc-ohpc package in it.
  • In CI, there is a point where the control node and login node are on that image (because TF has upgraded them to it) but the compute nodes aren't (because they are upgraded via slurm-controlled rebuild). We run site.yml at that point.
  • The slurm.yml playbook skips the openhpc installs for speed, which means the compute nodes don't get that package. The playbook then tries to configure nhc, which fails.

sjpb (Collaborator, Author) commented Jun 18, 2025

The commit above changes approach because of a problem: on the first slurm-controlled upgrade to this branch, site.yml is run while the compute nodes haven't yet been upgraded, so they don't have the nhc binaries needed to perform autoconfiguration.

sjpb (Collaborator, Author) commented Jun 20, 2025

The above is failing because the /etc/nhc directory doesn't exist. That directory is created when the nhc-ohpc package is installed, and the package is only in the new image, which the compute nodes don't yet have when site.yml runs after the login/control nodes have been upgraded.
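
To illustrate the ordering problem, here is a minimal Python sketch of a guard that configures NHC only on nodes where the package has already been installed. It assumes that the presence of /etc/nhc is a good-enough proxy for nhc-ohpc being installed. The function names and the idea of passing a per-node filesystem root are hypothetical, purely for illustration, and this is not necessarily the approach the PR ends up taking.

```python
from pathlib import Path


def nhc_installed(root: Path = Path("/")) -> bool:
    """Return True if the NHC config directory exists under `root`.

    Assumption: /etc/nhc is only created by installing the nhc-ohpc package.
    """
    return (root / "etc" / "nhc").is_dir()


def plan_nhc_config(node_roots: dict) -> tuple:
    """Split nodes into (configure, skip) lists.

    `node_roots` maps a node name to a filesystem root to inspect; this is a
    hypothetical stand-in for checking each node remotely.
    """
    configure, skip = [], []
    for name, root in sorted(node_roots.items()):
        (configure if nhc_installed(Path(root)) else skip).append(name)
    return configure, skip


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        upgraded = Path(tmp) / "upgraded"      # node on the new image
        stale = Path(tmp) / "stale"            # node still on the old image
        (upgraded / "etc" / "nhc").mkdir(parents=True)
        stale.mkdir()
        configure, skip = plan_nhc_config({"compute-0": upgraded, "compute-1": stale})
        print("configure:", configure)
        print("skip:", skip)
```

With a guard like this, a node still on the old image would be skipped on the first site.yml run and picked up once it has been rebuilt onto the new image.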

sjpb marked this pull request as ready for review June 20, 2025 14:23
sjpb requested a review from a team as a code owner June 20, 2025 14:23

wtripp180901 (Contributor) left a comment

I think most of these refactors have been reviewed previously, and the NHC installs all look sensible, so I'm happy to approve.

sjpb merged commit eabf59b into main Jun 20, 2025
7 checks passed
sjpb deleted the feat/nhc-v2 branch June 20, 2025 14:51