fix: prevent munge startup race on compute node boot #5423

Draft
mtibben wants to merge 3 commits into GoogleCloudPlatform:develop from mtibben:fix-munge-startup-race

@mtibben mtibben commented Mar 30, 2026

On compute node first boot, systemd starts munge.service (which is enabled in the SLURM image) before the startup script has deployed /etc/munge/munge.key via setup_network_storage(). Munge fails and enters a failed systemd state.

The existing systemctl restart munge in setup_compute() runs after the key is deployed, but a prior failed state can leave the munge socket unreliable, so slurmd cannot authenticate with slurmctld even though both services appear to be running. This manifests as nodes stuck in NOT_RESPONDING+POWERING_UP with jobs hanging in CONFIGURING.
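To check whether a node has hit this state, one option is to look at what systemd reports for the unit. The following is a minimal sketch, not part of the PR: it parses the KEY=VALUE output of `systemctl show munge --property=ActiveState,Result`, and the helper names are made up for illustration.

```python
import subprocess


def parse_show_output(text: str) -> dict:
    """Parse the KEY=VALUE lines printed by `systemctl show`."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' separator
            props[key.strip()] = value.strip()
    return props


def munge_is_failed(props: dict) -> bool:
    """True when systemd reports the unit in the failed state."""
    return props.get("ActiveState") == "failed"


def read_munge_state() -> dict:
    """Query systemd on a live node (hypothetical helper, needs systemd)."""
    result = subprocess.run(
        ["systemctl", "show", "munge", "--property=ActiveState,Result"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_show_output(result.stdout)
```

On a node, `munge_is_failed(read_munge_state())` would report whether the boot-time start failed. The parsing helpers work on plain text, so they can be exercised without systemd.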

This fix installs a systemd drop-in for munge.service with ConditionPathExists=/etc/munge/munge.key before setup_network_storage() runs. This causes systemd to skip (not fail) the munge autostart when the key is absent, leaving the unit inactive rather than failed. A call to systemctl reset-failed munge also clears any failed state left by the current boot's race before munge is restarted with the key in place.
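A rough sketch of that sequence follows. It is not the PR's actual diff: the drop-in directory, the file name, and the function names are assumptions, and the `daemon-reload` step is added here on the assumption that systemd must re-read unit files to pick up a new drop-in.

```python
import subprocess
from pathlib import Path

# Assumed location and file name for the drop-in; the PR text names neither.
DROPIN_DIR = Path("/etc/systemd/system/munge.service.d")
DROPIN_NAME = "require-key.conf"  # hypothetical

MUNGE_KEY = "/etc/munge/munge.key"


def render_dropin(key_path: str = MUNGE_KEY) -> str:
    """Unit override that makes systemd skip munge while the key is absent."""
    return f"[Unit]\nConditionPathExists={key_path}\n"


def install_dropin(dropin_dir: Path = DROPIN_DIR, key_path: str = MUNGE_KEY) -> Path:
    """Write the drop-in; intended to run before the key is deployed."""
    dropin_dir.mkdir(parents=True, exist_ok=True)
    target = dropin_dir / DROPIN_NAME
    target.write_text(render_dropin(key_path))
    return target


def restart_commands() -> list:
    """Commands to run once the key is in place, in order."""
    return [
        ["systemctl", "daemon-reload"],          # assumption: reload to see the drop-in
        ["systemctl", "reset-failed", "munge"],  # clear failed state from the boot race
        ["systemctl", "restart", "munge"],       # start munge with the key present
    ]


def restart_munge(run=subprocess.run) -> None:
    """Run the restart sequence; `run` is injectable so it can be stubbed."""
    for cmd in restart_commands():
        # check=False: reset-failed may return non-zero, which should not abort.
        run(cmd, check=False)
```

Passing a stub as `run` lets the command order be checked on a machine without systemd.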

On first boot, systemd starts munge.service before the startup script
has a chance to deploy /etc/munge/munge.key via setup_network_storage().
Munge fails and enters a failed state, which can leave slurmd unable to
authenticate with slurmctld even after setup_compute() restarts munge
with the key in place.

Install a systemd drop-in (ConditionPathExists=/etc/munge/munge.key)
before deploying the key so munge is skipped (inactive) rather than
failing at boot. Also call systemctl reset-failed munge to clear any
failed state from the race before the restart.
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a race condition occurring during compute node boot where the munge service attempts to start before its required key is deployed. By introducing a conditional check via a systemd drop-in, the service remains inactive rather than entering a failed state, ensuring reliable authentication for slurmd upon subsequent startup.

Highlights

  • Munge Startup Race Condition: Implemented a systemd drop-in configuration to prevent munge.service from starting before the munge.key is available on compute nodes.
  • Systemd State Management: Added a reset-failed command to ensure any previous failed states are cleared before the service is properly initialized.

@github-actions github-actions bot added the "external" label (PR from external contributor) Mar 30, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request modifies the setup script to include a systemd override for the munge service, ensuring it only starts once the munge key is present to avoid boot-time race conditions. The review feedback suggests using the available dirs.munge object to reference the munge key path rather than hardcoding it, enhancing code maintainability.
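The suggestion can be illustrated roughly as follows. This is a sketch under assumptions: the real `dirs` object is not shown on this page, so a stand-in is built here, and it is assumed that `dirs.munge` is a path-like value pointing at the munge configuration directory.

```python
from pathlib import PurePosixPath
from types import SimpleNamespace

# Stand-in for the setup script's `dirs` object (assumed shape, not the real one).
dirs = SimpleNamespace(munge=PurePosixPath("/etc/munge"))


def munge_key_condition(dirs) -> str:
    """Build the condition line from dirs.munge instead of a hardcoded path."""
    return f"ConditionPathExists={dirs.munge / 'munge.key'}"
```

With the stand-in above, `munge_key_condition(dirs)` yields the same `ConditionPathExists=/etc/munge/munge.key` line as the hardcoded version, but follows `dirs.munge` if that location ever changes.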

@mtibben mtibben marked this pull request as ready for review March 30, 2026 04:32
@mtibben mtibben requested review from a team and samskillman as code owners March 30, 2026 04:32
@SwarnaBharathiMantena SwarnaBharathiMantena added the "release-module-improvements" label (Added to release notes under the "Module Improvements" heading) Mar 30, 2026
@SwarnaBharathiMantena

/gcbrun

mtibben commented Mar 30, 2026

Hmm, I've discovered this PR is not a complete fix.

The runtime drop-in in setup_compute() only protects subsequent reboots. On first boot, the drop-in doesn't exist yet when systemd starts munge, so munge still fails before the startup script runs. We've worked around this locally by baking the drop-in into our node image via our own Packer bootstrap layer.
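The ordering problem can be made concrete with a toy model. It is a simplification, not real systemd behaviour: it assumes the key is never available at the moment systemd autostarts munge, and that the startup script installs the drop-in later in every boot. The function names are illustrative.

```python
def autostart_outcome(dropin_present: bool, key_present: bool) -> str:
    """Toy model of what happens to munge.service at boot-time autostart."""
    if key_present:
        return "active"
    if dropin_present:
        return "inactive"  # ConditionPathExists unmet: start is skipped
    return "failed"        # munge starts without its key and exits


def simulate_boots(dropin_in_image: bool, boots: int = 2) -> list:
    """Return the autostart outcome for each successive boot."""
    dropin_present = dropin_in_image
    outcomes = []
    for _ in range(boots):
        # Assumption: the key is deployed only after systemd has tried munge.
        outcomes.append(autostart_outcome(dropin_present, key_present=False))
        # The startup script runs later in the boot and installs the drop-in.
        dropin_present = True
    return outcomes
```

In this model, `simulate_boots(dropin_in_image=False)` gives `["failed", "inactive"]`, which is the gap described above, while `simulate_boots(dropin_in_image=True)` gives `["inactive", "inactive"]`, which is why the drop-in needs to be baked into the image.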

For a complete fix, the drop-in would need to be in the base SLURM GCP image. I've created a PR at GoogleCloudPlatform/slurm-gcp#336
