From dcf3714b4b80782c1952f21e172fbcf5472ad51c Mon Sep 17 00:00:00 2001
From: Xuanqi He
Date: Tue, 2 Sep 2025 14:00:15 -0400
Subject: [PATCH 1/4] [3.14.0][Changelog] Address formatting/wording issue in 3.14.0 CLI changelog followup

---
 CHANGELOG.md | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index f16c400b74..e5444d7c5f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,25 +4,19 @@ CHANGELOG
 3.14.0
 ------
 
-**DEPRECATIONS**
-- The configuration parameter `LoginNodes/Pools/Ssh/KeyName` has been deprecated. The CLI now returns a warning message when it is used in the cluster configuration.
-  See https://github.com/aws/aws-parallelcluster/issues/6811.
-
 **ENHANCEMENTS**
-- Add support for p6e-gb200 instances via capacity blocks.
-- Echo chef-client log when a node fails to bootstrap. This helps with investigating bootstrap failures in cases CloudWatch logs are not available.
+- Add basic support for p6e-gb200 instances via capacity blocks. AL2 is not supported and IMEX configuration must be done manually.
+- Chef-client logs are now available in the instance console to help investigating node bootstrap failures in cases CloudWatch logs are not available.
 - Add `build-image` support for kernel 6.12 of Amazon Linux 2023. The official ParallelCluster Amazon Linux 2023 AMIs use kernel 6.12.
 - Support `prioritized` and `capacity-optimized-prioritized` Allocation Strategy. This allows users to prioritize subnets for instance placement to optimize costs and performance.
+- Support DCV on Amazon Linux 2023.
 
 **CHANGES**
 - Install nvidia-imex for all OSs except AL2.
 - Ubuntu 20.04 is no longer supported.
 - Remove `UnkillableStepTimeout` from slurm.conf and let slurm set this value.
-- Support DCV on Amazon Linux 2023.
 - Upgrade Python runtime used by Lambda functions to python3.12 (from python3.9).
 - Remove `berkshelf`. All cookbooks are local and do not need `berkshelf` dependency management.
-- The build-image command now deploys a global role that is used to automatically delete the build-image stack after images either succeed or fail the build.
-  The role is meant to exist even after the stack has been deleted. This is to prevent build-image stack deletion failures, reported in https://github.com/aws/aws-parallelcluster/issues/5914.
 - Add the configuration parameter `HeadNode/SharedStorageEfsSettings/Encrypted` to enable encryption on the EFS file system used for the head node internal shared storage.
 - Add validator that warns against using non GPU instances with DCV.
 - Upgrade Slurm to version 24.11.6 (from 24.05.8).
@@ -38,14 +32,22 @@ CHANGELOG
 - Upgrade CUDA Toolkit to version 12.8.1 (from 12.8.0) for all OSs except AL2.
 - Upgrade DCGM to version 4.2.3 (from 3.3.6) for all OSs except AL2.
 - Upgrade Python to 3.12.11 (from 3.12.8) for all OSs except AL2.
+- Upgrade Python to 3.9.23 (from 3.9.20) for AL2.
 - Upgrade Intel MPI Library to 2021.16.0 (from 2021.13.1).
+- Upgrade DCV to version 2024.0-19030.
 
 **BUG FIXES**
+- The `build-image` command now deploys a global role that is used to automatically delete the build-image stack after images either succeed or fail the build.
+  The role is meant to exist even after the stack has been deleted. This is to prevent build-image stack deletion failures, reported in https://github.com/aws/aws-parallelcluster/issues/5914.
 - Fix an issue where Security Group validation failed when a rule contained both IPv4 ranges (IpRanges) and security group references (UserIdGroupPairs).
 - Fix `build-image` failure on Rocky 9, occurring when the parent image does not ship the latest kernel version on the latest Rocky minor version.
 - Fix AWS Batch cluster creation failures in China when the OS is Amazon Linux 2023.
 - Fix cluster id mismatch issue by deleting the file `/var/spool/slurm.state/clustername` before configuring Slurm accounting.
 
+**DEPRECATIONS**
+- The configuration parameter `LoginNodes/Pools/Ssh/KeyName` has been deprecated, and it will be removed in future releases. The CLI now returns a warning message when it is used in the cluster configuration.
+  See https://github.com/aws/aws-parallelcluster/issues/6811.
+
 3.13.2
 ------
 

From 18864ba78683db0ee904bb07b89fb3cd15fc323d Mon Sep 17 00:00:00 2001
From: Xuanqi He
Date: Wed, 3 Sep 2025 15:52:33 -0400
Subject: [PATCH 2/4] Add LIMITATIONS section, make p6e-gb200 support info more detailed, move ubuntu2004 not support info to DEPRECATION section.

---
 CHANGELOG.md | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index e5444d7c5f..57990e034d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,15 +5,18 @@ CHANGELOG
 ------
 
 **ENHANCEMENTS**
-- Add basic support for p6e-gb200 instances via capacity blocks. AL2 is not supported and IMEX configuration must be done manually.
-- Chef-client logs are now available in the instance console to help investigating node bootstrap failures in cases CloudWatch logs are not available.
+- Support for P6e-GB200 instances. ParallelCluster sets up Slurm topology plugin to handle P6e-GB200 UltraServers. See limitations section for important additional setup requirements.
+- Echo chef-client logs in the instance console when a node fails to bootstrap. This helps with investigating bootstrap failures in cases CloudWatch logs are not available.
 - Add `build-image` support for kernel 6.12 of Amazon Linux 2023. The official ParallelCluster Amazon Linux 2023 AMIs use kernel 6.12.
 - Support `prioritized` and `capacity-optimized-prioritized` Allocation Strategy. This allows users to prioritize subnets for instance placement to optimize costs and performance.
 - Support DCV on Amazon Linux 2023.
 
+**LIMITATIONS**
+- P6e-GB200 instances are only tested on Amazon Linux 2023, Ubuntu 20.04 and Ubuntu 24.04.
+- Using IMEX on P6e-GB200 requires additional setup. Please refer to .
+
 **CHANGES**
 - Install nvidia-imex for all OSs except AL2.
-- Ubuntu 20.04 is no longer supported.
 - Remove `UnkillableStepTimeout` from slurm.conf and let slurm set this value.
 - Upgrade Python runtime used by Lambda functions to python3.12 (from python3.9).
 - Remove `berkshelf`. All cookbooks are local and do not need `berkshelf` dependency management.
@@ -37,8 +40,8 @@ CHANGELOG
 - Upgrade DCV to version 2024.0-19030.
 
 **BUG FIXES**
-- The `build-image` command now deploys a global role that is used to automatically delete the build-image stack after images either succeed or fail the build.
-  The role is meant to exist even after the stack has been deleted. This is to prevent build-image stack deletion failures, reported in https://github.com/aws/aws-parallelcluster/issues/5914.
+- Prevent `build-image` stack deletion failures by deploying a global role that automatically deletes the `build-image` stack after images either succeed or fail the build.
+  The role is meant to exist even after the stack has been deleted. See https://github.com/aws/aws-parallelcluster/issues/5914.
 - Fix an issue where Security Group validation failed when a rule contained both IPv4 ranges (IpRanges) and security group references (UserIdGroupPairs).
 - Fix `build-image` failure on Rocky 9, occurring when the parent image does not ship the latest kernel version on the latest Rocky minor version.
 - Fix AWS Batch cluster creation failures in China when the OS is Amazon Linux 2023.
@@ -47,6 +50,7 @@ CHANGELOG
 **DEPRECATIONS**
 - The configuration parameter `LoginNodes/Pools/Ssh/KeyName` has been deprecated, and it will be removed in future releases. The CLI now returns a warning message when it is used in the cluster configuration.
   See https://github.com/aws/aws-parallelcluster/issues/6811.
+- Ubuntu 20.04 is no longer supported.
 
 3.13.2
 ------

From bcda34128a7fd324d15b2726ecc131b3429bd481 Mon Sep 17 00:00:00 2001
From: Xuanqi He
Date: Thu, 4 Sep 2025 10:41:01 -0400
Subject: [PATCH 3/4] Address comments

---
 CHANGELOG.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 57990e034d..c4c9b24613 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,7 +7,7 @@ CHANGELOG
 **ENHANCEMENTS**
 - Support for P6e-GB200 instances. ParallelCluster sets up Slurm topology plugin to handle P6e-GB200 UltraServers. See limitations section for important additional setup requirements.
 - Echo chef-client logs in the instance console when a node fails to bootstrap. This helps with investigating bootstrap failures in cases CloudWatch logs are not available.
-- Add `build-image` support for kernel 6.12 of Amazon Linux 2023. The official ParallelCluster Amazon Linux 2023 AMIs use kernel 6.12.
+- Add `build-image` support for Amazon Linux 2023 AMIs based on kernel 6.12 (in addition to 6.1).
 - Support `prioritized` and `capacity-optimized-prioritized` Allocation Strategy. This allows users to prioritize subnets for instance placement to optimize costs and performance.
 - Support DCV on Amazon Linux 2023.
 
@@ -38,6 +38,7 @@ CHANGELOG
 - Upgrade Python to 3.9.23 (from 3.9.20) for AL2.
 - Upgrade Intel MPI Library to 2021.16.0 (from 2021.13.1).
 - Upgrade DCV to version 2024.0-19030.
+- Upgrade the official ParallelCluster Amazon Linux 2023 AMIs to kernel 6.12 (from 6.1).
 
 **BUG FIXES**
 - Prevent `build-image` stack deletion failures by deploying a global role that automatically deletes the `build-image` stack after images either succeed or fail the build.

From a713b648d31cee35caab3e732ff74d78f75976b8 Mon Sep 17 00:00:00 2001
From: Xuanqi He
Date: Thu, 4 Sep 2025 12:52:16 -0400
Subject: [PATCH 4/4] Fix an error, we tested GB200 on U22, not U20.

---
 CHANGELOG.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index c4c9b24613..72f3668123 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,7 +12,7 @@ CHANGELOG
 - Support DCV on Amazon Linux 2023.
 
 **LIMITATIONS**
-- P6e-GB200 instances are only tested on Amazon Linux 2023, Ubuntu 20.04 and Ubuntu 24.04.
+- P6e-GB200 instances are only tested on Amazon Linux 2023, Ubuntu 22.04 and Ubuntu 24.04.
 - Using IMEX on P6e-GB200 requires additional setup. Please refer to .
 
 **CHANGES**