fix: prevent munge boot race when munge.key is not yet deployed #336
Open
mtibben wants to merge 2 commits into GoogleCloudPlatform:master
On compute and login nodes, munge.service is enabled in the image but munge.key is not deployed until the startup script runs. Without a ConditionPathExists check on the unit, munge fails at boot and enters a failed state, causing nodes to get stuck as NOT_RESPONDING+POWERING_UP. With ConditionPathExists, systemd skips the service (rather than failing it) when the key is absent, so the setup script can start it cleanly once the key is deployed.
Code Review
This pull request updates the munge.service template to include a ConditionPathExists directive, ensuring the munge key is present before the service starts. The review feedback suggests using an Ansible variable for the key path instead of hardcoding it to improve the role's flexibility.
munge.service is enabled in the image, so systemd starts it at boot. However, on compute and login nodes /etc/munge/munge.key is not present in the image; it is deployed at runtime by the node's startup process. This creates a race: munge fails and enters a failed state before the key arrives. When the startup script later calls systemctl restart munge, it succeeds, but the transient failure can leave dependent services in a bad state.

Adding ConditionPathExists=/etc/munge/munge.key to the [Unit] section causes systemd to skip (not fail) the service when the key is absent. The unit is not marked as failed, and once the key is deployed the startup script can start munge cleanly.

An earlier attempt to fix this in cluster-toolkit's startup scripts (GoogleCloudPlatform/cluster-toolkit#5423) is incomplete: it installs the drop-in at runtime during node setup, which protects subsequent reboots but not the first boot, because the race has already occurred before the startup script runs. Fixing it in the image is the correct approach.
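
For reference, a minimal sketch of the [Unit] section with the condition in place. Only the ConditionPathExists line comes from this PR; the Description value is an assumption for illustration and may not match the actual template in this repository.

```ini
[Unit]
# Assumed description line; the real template may differ.
Description=MUNGE authentication service
# If the key is missing, systemd skips activation instead of
# recording a failure, so the unit does not enter the failed state.
ConditionPathExists=/etc/munge/munge.key
```

systemd evaluates the condition each time the unit is started, so a later start or restart issued by the startup script, once the key is in place, should proceed normally.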