(3.9.0 ‐ 3.14.0) Performance degradation on tightly coupled workloads at scale

hgreebe edited this page Nov 12, 2025 · 1 revision

The issue

Starting with ParallelCluster 3.9.0, tightly coupled MPI workloads on large clusters can experience performance degradation. To support in-place cluster updates on compute and login nodes, which allow shared storage to be mounted and unmounted without replacing the nodes, we introduced a process on the compute nodes that periodically checks for updates. Although the process is lightweight, it runs periodically and may affect the performance of some workloads.

Affected ParallelCluster versions, OSes and schedulers

All ParallelCluster versions from 3.9.0 to 3.14.0 on all OSes.

Mitigation

This mitigation uses a custom cookbook that disables the process that periodically checks for updates on compute nodes, which prevents the performance degradation. The custom cookbook is supported only on ParallelCluster version 3.14.0.

Please note that this mitigation disables in-place dynamic file system mounting. This means that a cluster update with changes to the shared storage will not be applied in place to running compute/login nodes, but will instead be applied like any other cluster update, i.e. according to the selected QueueUpdateStrategy.
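As a rough illustration of where that strategy is selected, the fragment below sketches the relevant part of a Slurm cluster configuration. Only the keys needed to show the setting are included, and the value DRAIN is an assumption chosen for the example, not a recommendation; check the ParallelCluster documentation for the values supported by your version.

```yaml
# Illustrative fragment only; the rest of the cluster configuration is omitted.
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    # Assumed value for this example. With the mitigation applied, shared
    # storage changes reach compute nodes according to this strategy
    # instead of being mounted in place on running nodes.
    QueueUpdateStrategy: DRAIN
```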

Follow these steps to resolve the issue:

  1. Upgrade to ParallelCluster version 3.14.0
  2. Use a custom cookbook that disables the component supporting in-place updates by adding the following to your cluster configuration file:
DevSettings:
  Cookbook:
    ChefCookbook: https://us-east-1-aws-parallelcluster.s3.us-east-1.amazonaws.com/patches/3.14.0/disable-cfnhup-on-compute-nodes/aws-parallelcluster-cookbook-3.14.0.tgz