fix: prevent munge boot race when munge.key is not yet deployed #336

Open
mtibben wants to merge 2 commits into GoogleCloudPlatform:master from mtibben:fix-munge-boot-race

mtibben commented Mar 30, 2026

munge.service is enabled in the image, so systemd starts it at boot. However, on compute and login nodes /etc/munge/munge.key is not present in the image; it is deployed at runtime by the node's startup process. This creates a race: munge starts before the key arrives, fails, and enters a failed state. When the startup script later calls systemctl restart munge, the restart succeeds, but the transient failure can leave dependent services in a bad state.

Adding ConditionPathExists=/etc/munge/munge.key to the [Unit] section causes systemd to skip (not fail) the service when the key is absent. The unit is not marked as failed, and once the key is deployed the startup script can start munge cleanly.
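
For illustration, a minimal sketch of the relevant part of the unit file. Only the `ConditionPathExists=` line comes from this change; the rest of munge.service is left out here and is assumed to be unchanged.

```ini
[Unit]
# Other [Unit] directives from the existing munge.service are omitted.
# Skip (rather than fail) the start job when the key has not been deployed yet.
ConditionPathExists=/etc/munge/munge.key
```

systemd evaluates the condition only when a start is attempted, so the service does not come up on its own when the key appears. The startup script still has to start munge after deploying the key.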

An earlier attempt to fix this in cluster-toolkit's startup scripts (GoogleCloudPlatform/cluster-toolkit#5423) is incomplete: it installs the drop-in at runtime during node setup, which protects subsequent reboots but not the first boot, because the race has already occurred before the startup script runs. Fixing it in the image is the correct approach.

On compute and login nodes, munge.service is enabled in the image but
munge.key is not deployed until the startup script runs. Without this
condition, munge fails at boot and enters a failed state, causing nodes
to get stuck as NOT_RESPONDING+POWERING_UP.

ConditionPathExists causes systemd to skip (not fail) the service when
the key is absent, so it can be started cleanly by the setup script
once the key is deployed.

gemini-code-assist bot left a comment

Code Review

This pull request updates the munge.service template to include a ConditionPathExists directive, ensuring the munge key is present before the service starts. The review feedback suggests using an Ansible variable for the key path instead of hardcoding it to improve the role's flexibility.
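
One way the reviewer's suggestion might look in the templated unit file, as a rough sketch. The variable name `munge_key_path` is hypothetical and is not taken from the role; the Jinja2 `default` filter falls back to the path used in this PR when the variable is unset.

```ini
[Unit]
# Hypothetical Ansible variable; the role's actual variable name may differ.
ConditionPathExists={{ munge_key_path | default('/etc/munge/munge.key') }}
```

With the default in place, the rendered unit is identical to the hardcoded version unless a deployment overrides the variable.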
