Add support for Node Health Checks #654
Manual checks that this works:
|
Failed manual checks when deploying:
|
Ok so the problem is:
|
The commit above changed approach because there is a problem: on the first slurm-controlled upgrade to this branch, site.yml is run while the compute nodes haven't been upgraded yet, so they don't have the nhc binaries needed to perform autoconfiguration.
The above is failing because the /etc/nhc directory does not exist. This is because that directory is created on install of the |
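As an illustration of the failure mode above, here is a minimal sketch of writing a checks file in a way that tolerates a missing configuration directory. It is plain Python rather than the role's actual Ansible tasks; the function name `write_nhc_conf`, the `nhc.conf` file name, and the example check line are assumptions made for illustration only.

```python
from pathlib import Path


def write_nhc_conf(conf_dir: Path, checks: list[str]) -> Path:
    """Write one check per line to <conf_dir>/nhc.conf.

    Hypothetical helper: it creates conf_dir first, so the write does not
    fail when the directory has not yet been created by a package install.
    """
    # parents=True / exist_ok=True make this safe to call repeatedly.
    conf_dir.mkdir(parents=True, exist_ok=True)
    conf_path = conf_dir / "nhc.conf"
    conf_path.write_text("\n".join(checks) + "\n")
    return conf_path


if __name__ == "__main__":
    import tempfile

    # Use a temporary directory in place of /etc/nhc for the demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_nhc_conf(
            Path(tmp) / "etc" / "nhc",
            ["* || check_fs_mount_rw -f /"],  # illustrative check line
        )
        print(path.read_text(), end="")
```

In an Ansible role the equivalent would more likely be a task that ensures the directory exists before the template task runs, or a condition that skips templating on nodes that have not been upgraded yet.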
I think most of these refactors have been reviewed previously, and the NHC installs all look sensible, so I'm happy to approve.
See ansible/roles/nhc/README.md. Checks are templated out on each compute node during site.yml and then restored after a slurm-controlled rebuild, so that checks after rebuild ensure the node is in the same state as it was before.
Note this also moves the configuration of nfs-exported data for compute-init to the end of the site.yml playbook, to make ordering of roles requiring compute-init support easier.
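The save-and-restore behaviour described above can be sketched roughly as follows. This is not the role's implementation: the functions `save_checks` and `restore_checks`, the per-host file naming, and the idea of a shared export directory standing in for the nfs-exported compute-init data are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def save_checks(node_conf: Path, export_dir: Path, hostname: str) -> Path:
    """Copy a node's templated checks file into a shared export directory.

    Hypothetical stand-in for what happens during site.yml.
    """
    export_dir.mkdir(parents=True, exist_ok=True)
    saved = export_dir / f"{hostname}.nhc.conf"
    shutil.copy2(node_conf, saved)
    return saved


def restore_checks(export_dir: Path, hostname: str, node_conf: Path) -> bool:
    """Restore the saved checks file onto a freshly rebuilt node.

    Returns False when nothing was saved for this host, so the caller can
    decide whether to skip health checks or fail.
    """
    saved = export_dir / f"{hostname}.nhc.conf"
    if not saved.exists():
        return False
    node_conf.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(saved, node_conf)
    return True
```

Because the restored file is a copy of what was templated before the rebuild, the checks run afterwards compare the node against its pre-rebuild state rather than against a freshly autodetected one.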