(3.9.0 ‐ 3.14.0) Performance degradation on tightly coupled workloads at scale
Starting with ParallelCluster 3.9.0, tightly coupled MPI workloads can experience performance degradation on large clusters. To support in-place cluster updates on compute and login nodes, which allow shared storage to be mounted and unmounted without replacing the nodes, we introduced a process on the compute nodes that periodically checks for updates. Although the process is lightweight, it runs periodically and may affect the performance of some workloads.
The issue affects all ParallelCluster versions from 3.9.0 to 3.14.0, on all OSes.
The mitigation uses a custom cookbook that disables the process that periodically checks for updates on compute nodes, which prevents the performance degradation. The custom cookbook is only supported for ParallelCluster version 3.14.0.
Please note that this mitigation disables in-place dynamic file system mounting. A cluster update that changes the shared storage will not be applied to running compute/login nodes. It will instead be applied like any other cluster update, i.e. according to the selected QueueUpdateStrategy.
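As an illustration of where the update strategy is selected, the fragment below is a minimal sketch of the relevant part of a cluster configuration file, assuming a Slurm cluster. The value `DRAIN` is chosen only for the example; check the ParallelCluster documentation for the valid values and pick the one that fits your workload.

```yaml
# Sketch only: QueueUpdateStrategy lives under Scheduling/SlurmSettings.
# DRAIN is an illustrative choice, not a recommendation from this page.
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    QueueUpdateStrategy: DRAIN
```

With the mitigation in place, shared storage changes submitted through a cluster update are rolled out to compute nodes according to this setting rather than being mounted in place.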
To apply the mitigation, follow these steps:
- Upgrade to ParallelCluster version 3.14.0
- Use a custom cookbook that disables the component supporting in-place updates by adding the following to your cluster configuration file:
DevSettings:
  Cookbook:
    ChefCookbook: https://us-east-1-aws-parallelcluster.s3.us-east-1.amazonaws.com/patches/3.14.0/disable-cfnhup-on-compute-nodes/aws-parallelcluster-cookbook-3.14.0.tgz