Upgrade MUNGE in a running cluster
WARNING This procedure requires stopping both the compute fleet and the login node fleet. Both fleets are going to be replaced, which implies downtime, and you must also make sure you have backed up all local data to prevent data loss.
MUNGE is the service used by Slurm to authenticate cluster nodes. This guide describes how to upgrade MUNGE on your running cluster. The procedure has the following main steps:
- STEP 1 — Upgrade MUNGE on the Head Node
- STEP 2 — Upgrade MUNGE on Compute/Login Nodes
- STEP 3 — (Optional) Rotate the MUNGE key
- Keep note of the current version and back up the current configuration
munged --version
MUNGE_FILES=(
/etc/munge/
/usr/bin/munge
/usr/bin/remunge
/usr/bin/unmunge
/usr/sbin/mungekey
/usr/sbin/munged
/usr/lib/systemd/system/munge.service
/usr/lib*/libmunge*
)
sudo rsync -aR "${MUNGE_FILES[@]}" /opt/parallelcluster/sources/munge-backup
- Install new version of MUNGE
MUNGE_VERSION="0.5.18"
sudo mkdir -p /opt/parallelcluster/sources
# If you need to access the regional S3 bucket, you can update the S3 URL below
sudo curl -fsSL -o /opt/parallelcluster/sources/munge-${MUNGE_VERSION}.tar.xz https://us-east-1-aws-parallelcluster.s3.us-east-1.amazonaws.com/archives/dependencies/munge/munge-${MUNGE_VERSION}.tar.xz
sudo tar -xvf /opt/parallelcluster/sources/munge-${MUNGE_VERSION}.tar.xz -C /opt/parallelcluster/sources
cd /opt/parallelcluster/sources/munge-${MUNGE_VERSION}
grep -qi ubuntu /etc/os-release && MUNGE_LIBDIR="/usr/lib" || MUNGE_LIBDIR="/usr/lib64"
sudo ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var --libdir=${MUNGE_LIBDIR}
sudo make install
- Restart MUNGE
sudo systemctl daemon-reload
sudo systemctl restart munge
- Validate MUNGE is functional
# Expected: munge-0.5.18 (2026-02-10)
munged --version
# Expected: active (running)
sudo systemctl status munge
# Expected: STATUS: Success (0)
sudo munge -n | unmunge
# Expected: libmunge.so.2.0.1
cat "/proc/$(systemctl show munge --property=MainPID --value)/maps" | grep 'libmunge.so'
# Expected: job execution succeeded
srun hostname
To upgrade MUNGE on compute and login nodes, you have the following options:
- Option 1: patch the existing nodes with a custom action. While on the Head Node, copy the binaries and configuration to a shared folder.
MUNGE_FILES=(
/usr/bin/munge
/usr/bin/remunge
/usr/bin/unmunge
/usr/sbin/mungekey
/usr/sbin/munged
/usr/lib/systemd/system/munge.service
/usr/lib/sysusers.d/munge.conf # NOT REQUIRED on AL2
/usr/lib*/libmunge*
)
sudo rsync -aR "${MUNGE_FILES[@]}" /opt/parallelcluster/shared/.munge/
- Create a custom action script and store it in S3, e.g.
s3://<BUCKET>/PatchMunge.sh
#!/bin/bash
set -ex
sudo rsync -a /opt/parallelcluster/shared/.munge/usr/ /usr/
sudo systemctl daemon-reload
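Before uploading, you can sanity-check the script with bash's syntax checker; the upload command is sketched below (the bucket name is a placeholder, and `/tmp/PatchMunge.sh` is just an illustrative local path):

```shell
# Write the custom action script locally (same content as above)
cat > /tmp/PatchMunge.sh <<'EOF'
#!/bin/bash
set -ex
sudo rsync -a /opt/parallelcluster/shared/.munge/usr/ /usr/
sudo systemctl daemon-reload
EOF

# Check the syntax without executing anything
bash -n /tmp/PatchMunge.sh && echo "syntax OK"

# Then upload it (placeholder bucket):
# aws s3 cp /tmp/PatchMunge.sh s3://<BUCKET>/PatchMunge.sh
```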
- Stop the compute fleet and the login node fleet. To stop the login node fleet, scale it down to zero nodes by setting Count to 0 and updating the cluster.
- Configure the OnNodeStart custom action for every compute queue and login node pool (remember that custom actions also require read permission on the S3 object, e.g. by using AdditionalIamPolicies). If you are using login nodes, you also need to change the login node pool name in order to add the custom action.
CustomActions:
OnNodeStart:
Script: s3://<BUCKET>/PatchMunge.sh
- Submit a cluster update. The update will follow the configured QueueUpdateStrategy to apply the changes to running nodes.
pcluster update-cluster \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--cluster-configuration <CONFIG_FILE>
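If you have not set a QueueUpdateStrategy, it lives under SlurmSettings in the cluster config; a sketch assuming the standard ParallelCluster 3 schema (DRAIN lets running jobs finish before nodes are replaced):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    QueueUpdateStrategy: DRAIN  # or TERMINATE / COMPUTE_FLEET_STOP
```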
- Wait for the cluster status to be
UPDATE_COMPLETE
pcluster describe-cluster \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--query 'clusterStatus'
- Restart the compute fleet. If you are using login nodes, restore the original number of login nodes by setting Count back to its previous value and updating the cluster.
- Validate that MUNGE is working on a new compute node
# Expected:
# munge-0.5.18 (2026-02-10)
# STATUS: Success (0)
# hostname of your compute node
# libmunge.so.2.0.1
srun -w <NEW_NODE> bash -c "munged --version ; sudo munge -n | unmunge ; hostname ; cat \"/proc/\$(systemctl show munge --property=MainPID --value)/maps\" | grep 'libmunge.so'"
- Option 2: build a custom AMI. Launch an EC2 instance with the AMI you're using in your cluster (take the AMI ID from the cluster config).
- Download, build, and install MUNGE using the following script.
MUNGE_VERSION="0.5.18"
sudo mkdir -p /opt/parallelcluster/sources
# If you need to access the regional S3 bucket, you can update the S3 URL below
sudo curl -fsSL -o /opt/parallelcluster/sources/munge-${MUNGE_VERSION}.tar.xz https://us-east-1-aws-parallelcluster.s3.us-east-1.amazonaws.com/archives/dependencies/munge/munge-${MUNGE_VERSION}.tar.xz
sudo tar -xvf /opt/parallelcluster/sources/munge-${MUNGE_VERSION}.tar.xz -C /opt/parallelcluster/sources
cd /opt/parallelcluster/sources/munge-${MUNGE_VERSION}
grep -qi ubuntu /etc/os-release && MUNGE_LIBDIR="/usr/lib" || MUNGE_LIBDIR="/usr/lib64"
sudo ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var --libdir=${MUNGE_LIBDIR}
sudo make install
sudo systemctl daemon-reload
- Check the MUNGE version
# Expected:
# munge-0.5.18 (2026-02-10)
munged --version
- Clean up the instance in preparation for creating the new AMI
sudo /usr/local/sbin/ami_cleanup.sh
- Stop the instance
- Create an AMI from the instance and keep note of the AMI ID
- Use the new custom AMI in the cluster config CustomAmi for every queue and login node pool.
Scheduling:
Scheduler: slurm
SlurmQueues:
- Name: your-queue-name
...
Image:
CustomAmi: ami-XXXXXXXXXXXX
# Login node section required only if you have login nodes
LoginNodes:
Pools:
- Name: your-login-pool-name
...
Image:
CustomAmi: ami-XXXXXXXXXXXX
- Stop the compute fleet and the login node fleet. To stop the login node fleet, scale it down to zero nodes by setting Count to 0 and updating the cluster.
- Check that no login nodes are running.
- Update the cluster
pcluster update-cluster \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--cluster-configuration <CONFIG_FILE>
- Wait for the cluster status to be
UPDATE_COMPLETE
pcluster describe-cluster \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--query 'clusterStatus'
- Restart the compute fleet. If you are using login nodes, restore the original number of login nodes by setting Count back to its previous value and updating the cluster.
- Validate that MUNGE is working on a new compute node
# Expected:
# munge-0.5.18 (2026-02-10)
# STATUS: Success (0)
# hostname of your compute node
# libmunge.so.2.0.1
srun -w <NEW_NODE> bash -c "munged --version ; sudo munge -n | unmunge ; hostname ; cat \"/proc/\$(systemctl show munge --property=MainPID --value)/maps\" | grep 'libmunge.so'"
Updating the MUNGE key is optional, but recommended if you think your key may have been compromised. You have the following options:
- Option 1: rotate the key in place. First, stop the compute fleet
pcluster update-compute-fleet \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--status STOP_REQUESTED
- Stop daemons on the Head Node
sudo systemctl stop slurmctld
sudo systemctl stop slurmdbd # Optional: only if you are using Slurm Accounting
sudo systemctl stop slurmdrestd # Optional: only if you are using Slurm REST API
sudo systemctl stop munge
- Generate a new MUNGE key on the Head Node
sudo /usr/sbin/mungekey --verbose --force
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0600 /etc/munge/munge.key
- Share the MUNGE key from the Head Node with cluster nodes
for shared_folder in "/opt/parallelcluster/shared" "/opt/parallelcluster/shared_login_nodes"; do
sudo cp -p "/etc/munge/munge.key" "${shared_folder}/.munge/.munge.key"
done
- Verify you’re sharing the right key
# Expected: no diff, perms 600 munge:munge
for shared_folder in "/opt/parallelcluster/shared" "/opt/parallelcluster/shared_login_nodes"; do
sudo diff "/etc/munge/munge.key" "${shared_folder}/.munge/.munge.key"
sudo stat -c '%a %U:%G %n' "/etc/munge/munge.key" "${shared_folder}/.munge/.munge.key"
done
- Start daemons on the Head Node
sudo systemctl start munge
sudo systemctl start slurmctld
sudo systemctl start slurmdbd # Optional: only if you are using Slurm Accounting
sudo systemctl start slurmdrestd # Optional: only if you are using Slurm REST API
- Start the compute-fleet
pcluster update-compute-fleet \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--status START_REQUESTED
- If you have login nodes, terminate all of them, so that fresh login nodes will be provisioned using the new MUNGE key.
- Option 2: store a new key in AWS Secrets Manager. Generate a new MUNGE key body on the Head Node, or any other Linux host
dd if=/dev/random bs=128 count=1 2>/dev/null | base64 -w 0
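You can verify locally that the string you are about to store decodes back to the full 128 random bytes before creating the secret:

```shell
# Generate the key body as above and check its decoded length
key=$(dd if=/dev/random bs=128 count=1 2>/dev/null | base64 -w 0)

decoded_len=$(printf '%s' "$key" | base64 -d | wc -c)
echo "decoded key length: ${decoded_len} bytes"   # expect 128
```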
- Create a new MUNGE key on Secrets Manager and take note of the ARN
aws secretsmanager create-secret \
--region <REGION> \
--name <SECRET_NAME> \
--secret-string <GENERATED_KEY> \
--query "ARN" \
--output text
- Stop the compute fleet
pcluster update-compute-fleet \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--status STOP_REQUESTED
- Update the cluster config MungeKeySecretArn to use the new secret
Scheduling:
Scheduler: slurm
SlurmSettings:
MungeKeySecretArn: <NEW_MUNGE_KEY_SECRET_ARN>
- Update cluster
pcluster update-cluster \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--cluster-configuration <CONFIG_FILE>
- Wait for the cluster status to be
UPDATE_COMPLETE
pcluster describe-cluster \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--query 'clusterStatus'
- Start the compute-fleet
pcluster update-compute-fleet \
--region <REGION> \
--cluster-name <CLUSTER_NAME> \
--status START_REQUESTED
- Verify MUNGE is working
# Expected:
# munge-0.5.18 (2026-02-10)
# STATUS: Success (0)
# hostname of your compute node
srun -w <NEW_NODE> bash -c "munged --version ; sudo munge -n | unmunge ; hostname"