Upgrade MUNGE in a running cluster
WARNING This procedure requires stopping both the compute fleet and the login node fleet. Both fleets are going to be replaced, which implies downtime, and you must also make sure you have backed up all local data to prevent data loss.
MUNGE is the service used by Slurm to authenticate cluster nodes. This guide describes how to upgrade MUNGE on your running cluster. The procedure has the following main steps:
- STEP 1 — Upgrade MUNGE on the Head Node
- STEP 2 — Upgrade MUNGE on Compute/Login Nodes
- STEP 3 — (Optional) Rotate the MUNGE key
- Keep note of the current version and back up the current configuration
munged --version
MUNGE_FILES=(
/etc/munge/
/usr/bin/munge
/usr/bin/remunge
/usr/bin/unmunge
/usr/sbin/mungekey
/usr/sbin/munged
/usr/lib/systemd/system/munge.service
/usr/lib*/libmunge*
)
sudo rsync -aR "${MUNGE_FILES[@]}" /opt/parallelcluster/sources/munge-backup
- Install new version of MUNGE
MUNGE_VERSION="0.5.18"
sudo mkdir -p /opt/parallelcluster/sources
# If you need to access the regional S3 bucket, you can update the S3 URL below
sudo curl -fsSL -o /opt/parallelcluster/sources/munge-${MUNGE_VERSION}.tar.xz https://us-east-1-aws-parallelcluster.s3.us-east-1.amazonaws.com/archives/dependencies/munge/munge-${MUNGE_VERSION}.tar.xz
sudo tar -xvf /opt/parallelcluster/sources/munge-${MUNGE_VERSION}.tar.xz -C /opt/parallelcluster/sources
cd /opt/parallelcluster/sources/munge-${MUNGE_VERSION}
grep -qi ubuntu /etc/os-release && MUNGE_LIBDIR="/usr/lib" || MUNGE_LIBDIR="/usr/lib64"
sudo ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var --libdir=${MUNGE_LIBDIR}
sudo make install
- Restart MUNGE
sudo systemctl daemon-reload
sudo systemctl restart munge
- Validate MUNGE is functional
# Expected: munge-0.5.18 (2026-02-10)
munged --version
# Expected: active (running)
sudo systemctl status munge
# Expected: STATUS: Success (0)
sudo munge -n | unmunge
# Expected: libmunge.so.2.0.1
cat "/proc/$(systemctl show munge --property=MainPID --value)/maps" | grep 'libmunge.so'
# Expected: job execution succeeded
srun hostname
To upgrade MUNGE on compute and login nodes, you have the following options:
- Option 1: patch the existing nodes with a custom action. While on the Head Node, copy the binaries and configuration to a shared folder.
MUNGE_FILES=(
/usr/bin/munge
/usr/bin/remunge
/usr/bin/unmunge
/usr/sbin/mungekey
/usr/sbin/munged
/usr/lib/systemd/system/munge.service
/usr/lib/sysusers.d/munge.conf # NOT REQUIRED on AL2
/usr/lib*/libmunge*
)
sudo rsync -aR "${MUNGE_FILES[@]}" /opt/parallelcluster/shared/.munge/
- Create a custom action script and store it in S3, e.g.
s3://<BUCKET>/PatchMunge.sh
#!/bin/bash
set -ex
sudo rsync -a /opt/parallelcluster/shared/.munge/usr/ /usr/
sudo systemctl daemon-reload
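Before uploading, you can sanity-check the script with bash's syntax checker; the upload command is sketched below (the bucket name is a placeholder, and `/tmp/PatchMunge.sh` is just an illustrative local path):

```shell
# Write the custom action script locally (same content as above)
cat > /tmp/PatchMunge.sh <<'EOF'
#!/bin/bash
set -ex
sudo rsync -a /opt/parallelcluster/shared/.munge/usr/ /usr/
sudo systemctl daemon-reload
EOF

# Check the syntax without executing anything
bash -n /tmp/PatchMunge.sh && echo "syntax OK"

# Then upload it (placeholder bucket):
# aws s3 cp /tmp/PatchMunge.sh s3://<BUCKET>/PatchMunge.sh
```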
- Stop the compute fleet and the login node fleet. To stop the login node fleet, scale it down to zero nodes by setting Count to 0 and updating the cluster.
- Configure the OnNodeStart custom action for every compute queue and login node pool (remember that custom actions also require read permission on the S3 object, e.g. by using AdditionalIamPolicies). If you are using login nodes, you also need to change the login node pool name in order to add the custom action.
CustomActions:
OnNodeStart:
Script: s3://<BUCKET>/PatchMunge.sh
- Submit a cluster update. The update will follow the configured QueueUpdateStrategy to apply the changes to running nodes.
pcluster update-cluster \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--cluster-configuration <CONFIG_FILE>
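If you have not set a QueueUpdateStrategy, it lives under SlurmSettings in the cluster config; a sketch assuming the standard ParallelCluster 3 schema (DRAIN lets running jobs finish before nodes are replaced):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    QueueUpdateStrategy: DRAIN  # or TERMINATE / COMPUTE_FLEET_STOP
```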
- Wait for the cluster status to be
UPDATE_COMPLETE
pcluster describe-cluster \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--query 'clusterStatus'
- Restart the compute fleet. If you are using login nodes, restore the original number of login nodes by setting Count back to its previous value and updating the cluster.
- Validate that MUNGE is working on a new compute node
# Expected:
# munge-0.5.18 (2026-02-10)
# STATUS: Success (0)
# hostname of your compute node
# libmunge.so.2.0.1
srun -w <NEW_NODE> bash -c "munged --version ; sudo munge -n | unmunge ; hostname ; cat \"/proc/\$(systemctl show munge --property=MainPID --value)/maps\" | grep 'libmunge.so'"
- Option 2: build a custom AMI. Launch an EC2 instance with the AMI you're using in your cluster (take the AMI ID from the cluster config).
- Download, build, and install MUNGE using the following script.
MUNGE_VERSION="0.5.18"
sudo mkdir -p /opt/parallelcluster/sources
# If you need to access the regional S3 bucket, you can update the S3 URL below
sudo curl -fsSL -o /opt/parallelcluster/sources/munge-${MUNGE_VERSION}.tar.xz https://us-east-1-aws-parallelcluster.s3.us-east-1.amazonaws.com/archives/dependencies/munge/munge-${MUNGE_VERSION}.tar.xz
sudo tar -xvf /opt/parallelcluster/sources/munge-${MUNGE_VERSION}.tar.xz -C /opt/parallelcluster/sources
cd /opt/parallelcluster/sources/munge-${MUNGE_VERSION}
grep -qi ubuntu /etc/os-release && MUNGE_LIBDIR="/usr/lib" || MUNGE_LIBDIR="/usr/lib64"
sudo ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var --libdir=${MUNGE_LIBDIR}
sudo make install
sudo systemctl daemon-reload
- Check the MUNGE version
# Expected:
# munge-0.5.18 (2026-02-10)
munged --version
- Clean up the instance in preparation for creating the new AMI
sudo /usr/local/sbin/ami_cleanup.sh
- Stop the instance
- Create an AMI from the instance and keep note of the AMI ID
- Use the new custom AMI in the cluster config CustomAmi for every queue and login node pool.
Scheduling:
Scheduler: slurm
SlurmQueues:
- Name: your-queue-name
...
Image:
CustomAmi: ami-XXXXXXXXXXXX
# Login node section required only if you have login nodes
LoginNodes:
Pools:
- Name: your-login-pool-name
...
Image:
CustomAmi: ami-XXXXXXXXXXXX
- Stop the compute fleet and the login node fleet. To stop the login node fleet, scale it down to zero nodes by setting Count to 0 and updating the cluster.
- Check that no login nodes are running.
- Update the cluster
pcluster update-cluster \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--cluster-configuration <CONFIG_FILE>
- Wait for the cluster status to be
UPDATE_COMPLETE
pcluster describe-cluster \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--query 'clusterStatus'
- Restart the compute fleet. If you are using login nodes, restore the original number of login nodes by setting Count back to its previous value and updating the cluster.
- Validate that MUNGE is working on a new compute node
# Expected:
# munge-0.5.18 (2026-02-10)
# STATUS: Success (0)
# hostname of your compute node
# libmunge.so.2.0.1
srun -w <NEW_NODE> bash -c "munged --version ; sudo munge -n | unmunge ; hostname ; cat \"/proc/\$(systemctl show munge --property=MainPID --value)/maps\" | grep 'libmunge.so'"
Updating the MUNGE key is optional, but recommended if you think your key may have been compromised. You have the following options:
- Option 1: rotate the key in place. First, stop the compute fleet
pcluster update-compute-fleet \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--status STOP_REQUESTED
- Stop daemons on the Head Node
sudo systemctl stop slurmctld
sudo systemctl stop slurmdbd # Optional: only if you are using Slurm Accounting
sudo systemctl stop slurmdrestd # Optional: only if you are using Slurm REST API
sudo systemctl stop munge
- Generate a new MUNGE key on the Head Node
sudo /usr/sbin/mungekey --verbose --force
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0600 /etc/munge/munge.key
- Share the MUNGE key from the Head Node with cluster nodes
for shared_folder in "/opt/parallelcluster/shared" "/opt/parallelcluster/shared_login_nodes"; do
sudo cp -p "/etc/munge/munge.key" "${shared_folder}/.munge/.munge.key"
done
- Verify you’re sharing the right key
# Expected: no diff, perms 600 munge:munge
for shared_folder in "/opt/parallelcluster/shared" "/opt/parallelcluster/shared_login_nodes"; do
sudo diff "/etc/munge/munge.key" "${shared_folder}/.munge/.munge.key"
sudo stat -c '%a %U:%G %n' "/etc/munge/munge.key" "${shared_folder}/.munge/.munge.key"
done
- Start daemons on the Head Node
sudo systemctl start munge
sudo systemctl start slurmctld
sudo systemctl start slurmdbd # Optional: only if you are using Slurm Accounting
sudo systemctl start slurmdrestd # Optional: only if you are using Slurm REST API
- Start the compute-fleet
pcluster update-compute-fleet \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--status START_REQUESTED
- If you have login nodes, terminate all of them, so that fresh login nodes will be provisioned using the new MUNGE key.
- Option 2: store a new key in AWS Secrets Manager. Generate a new MUNGE key body on the Head Node, or any other Linux host
dd if=/dev/random bs=128 count=1 2>/dev/null | base64 -w 0
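You can verify locally that the string you are about to store decodes back to the full 128 random bytes before creating the secret:

```shell
# Generate the key body as above and check its decoded length
key=$(dd if=/dev/random bs=128 count=1 2>/dev/null | base64 -w 0)

decoded_len=$(printf '%s' "$key" | base64 -d | wc -c)
echo "decoded key length: ${decoded_len} bytes"   # expect 128
```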
- Create a new MUNGE key on Secrets Manager and take note of the ARN
aws secretsmanager create-secret \
--region <REGION> \
--name <SECRET_NAME> \
--secret-string <GENERATED_KEY> \
--query "ARN" \
--output text
- Stop the compute fleet
pcluster update-compute-fleet \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--status STOP_REQUESTED
- Update the cluster config MungeKeySecretArn to use the new secret
Scheduling:
Scheduler: slurm
SlurmSettings:
MungeKeySecretArn: <NEW_MUNGE_KEY_SECRET_ARN>
- Update cluster
pcluster update-cluster \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--cluster-configuration <CONFIG_FILE>
- Wait for the cluster status to be
UPDATE_COMPLETE
pcluster describe-cluster \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--query 'clusterStatus'
- Start the compute-fleet
pcluster update-compute-fleet \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--status START_REQUESTED
- Verify MUNGE is working
# Expected:
# munge-0.5.18 (2026-02-10)
# STATUS: Success (0)
# hostname of your compute node
srun -w <NEW_NODE> bash -c "munged --version ; sudo munge -n | unmunge ; hostname"