[Scaling] Remove usage of cfn-init in Compute Fleet #2875

himani2411 · 2025-02-10T16:07:38Z

Description of changes

This PR changes the Create and Update Paths of Compute Nodes in a cluster. We need to remove the usage of cfn-init because of CFN API throttling issues and to improve fleet scaling time.

For the Create Path, we revert to an approach we used in ParallelCluster 3.8.0.

In the CLI, we rely on the cloud-native approach of using write_files directives as part of user_data.sh (changes in [Scaling] Removing usage of cfn-init for compute fleet aws-parallelcluster#6655).
In the Cookbook, we add changes that are used/executed during an update of the node:
* Create the shared sub-directory /opt/parallelcluster/shared/dna, which is used for storing dna.json.
* Create the script /opt/parallelcluster/scripts/share_compute_fleet_dna.py, which is executed only by the root user on the HeadNode.
* Create /opt/parallelcluster/scripts/cfn-hup-update-action.sh, which is executed only by the root user on the Compute node. This script monitors the shared /dna directory and runs the cookbook Update recipes on the node.
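
To illustrate the "monitors the shared /dna directory" step in the last bullet, here is a minimal Python sketch of polling for a file until it appears. The real hook is a shell script (cfn-hup-update-action.sh) whose contents are not shown in this description; the function name `wait_for_file` and the timeout and polling values are assumptions made only for illustration.

```python
import os
import time


def wait_for_file(path, timeout_s=60.0, poll_s=1.0):
    """Poll until `path` exists as a regular file or the timeout expires.

    Returns True if the file showed up in time, False otherwise.
    Hypothetical helper: the actual update hook is a shell script.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.isfile(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A compute node could call such a helper on its own file under /opt/parallelcluster/shared/dna and run the update recipes only when it returns True. How the per-node file is named is not specified here.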

For the Update Path, we rely on the HeadNode to share dna.json and extra.json for each node as per their Launch Templates, using the EC2 DescribeLaunchTemplateVersions API.

In the CLI, the CFN Launch Templates are updated.
In the Cookbook:
* The HeadNode runs a new script to get the latest dna.json and store it in the shared directory (as part of the fetch_dna_files resource).
* cfn-hup invokes an update hook action script, which monitors the shared directory for the latest dna.json files and runs the cookbook update recipes.
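
The HeadNode step above can be sketched roughly as follows. This is not the script from this PR: it only shows the boto3 call for DescribeLaunchTemplateVersions and how the returned base64-encoded UserData could be decoded. It assumes the UserData is a MIME multipart document with a text/cloud-config part (the part that would carry the write_files directives); the function names are hypothetical.

```python
import base64
from email import message_from_string


def fetch_user_data(launch_template_id, version="$Latest", region=None):
    """Return the base64-encoded UserData of one launch template version."""
    import boto3  # imported lazily so the parsing helper below has no AWS dependency

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_launch_template_versions(
        LaunchTemplateId=launch_template_id, Versions=[version]
    )
    return response["LaunchTemplateVersions"][0]["LaunchTemplateData"]["UserData"]


def cloud_config_part(user_data_b64):
    """Decode UserData and return its text/cloud-config MIME part, or None.

    Assumption: the UserData is MIME multipart, as cloud-init accepts.
    """
    decoded = base64.b64decode(user_data_b64).decode("utf-8")
    message = message_from_string(decoded)
    for part in message.walk():
        if part.get_content_type() == "text/cloud-config":
            return part.get_payload(decode=True).decode("utf-8")
    return None
```

Extracting the write_files entry for /tmp/dna.json from the returned cloud-config text would additionally need a YAML parser, which is left out of this sketch.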

Dependent on CLI aws/aws-parallelcluster#6655

Tests

Same as aws/aws-parallelcluster#6655

References

Link to impacted open issues.
Link to related PRs in other packages (i.e. cookbook, node).
Link to documentation useful to understand the changes.

Checklist

Make sure you are pointing to the right branch.
If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
Check all commits' messages are clear, describing what and why vs how.
Make sure to have added unit tests or integration tests to cover the new/modified code.
Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

cookbooks/aws-parallelcluster-environment/files/cfn_hup_configuration/get_compute_user_data.py

cookbooks/aws-parallelcluster-environment/recipes/config.rb

cookbooks/aws-parallelcluster-environment/resources/cfn_hup_configuration.rb

cookbooks/aws-parallelcluster-environment/test/controls/cfn_hup_configuration_spec.rb

...oks/aws-parallelcluster-environment/templates/cfn_hup_configuration/cfn-hook-update.conf.erb

cookbooks/aws-parallelcluster-platform/resources/fetch_dna_files.rb

gmarciani · 2025-02-17T16:42:20Z

cookbooks/aws-parallelcluster-environment/files/cfn_hup_configuration/get_compute_user_data.py

+import logging
+from retrying import retry
+
+SHARED_LOCATION = "/opt/parallelcluster/"


The shared folder is defined by the cookbook attribute default['cluster']['shared_dir'] = "#{node['cluster']['base_dir']}/shared", so this path must be taken from the cookbook attribute rather than hard-coded.
I'm aware that elsewhere in the cookbook we did not comply with this best practice, but since we are here, let's do it the best way.

himani2411 · Feb 17, 2025

I can't do that without converting it into an erb file, and if I do that I won't be able to write Python unit tests.

gmarciani · Feb 17, 2025

What about making it a script argument passed when it is invoked?

himani2411 · Feb 18, 2025

In future iterations I want to run the script for LoginNodes too, and I don't see the point in re-surfacing this value as an argument for both Login and Compute Nodes.

gmarciani · Feb 19, 2025

The reason for doing that is that cluster creation is supposed to succeed even if I change this path via a custom cookbook attribute. That said, I agree we can address this in a follow-up PR, because we have many other scripts where we hard-wired this path rather than taking it from the attributes, so you're not introducing any regression.
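
For reference, the "script argument" option suggested in this thread could look roughly like the sketch below. It is not part of this PR: the flag name `--shared-dir` and the default value are assumptions for illustration. The idea is that the cookbook would pass node['cluster']['shared_dir'] when invoking the script, while the script stays a plain Python file that can be unit tested.

```python
import argparse

# Assumed default, matching the shared directory mentioned in the PR description.
DEFAULT_SHARED_DIR = "/opt/parallelcluster/shared"


def parse_args(argv=None):
    """Parse command-line arguments; `--shared-dir` is a hypothetical flag."""
    parser = argparse.ArgumentParser(description="Share DNA files with the compute fleet")
    parser.add_argument(
        "--shared-dir",
        default=DEFAULT_SHARED_DIR,
        help="Shared directory, normally passed in from the cookbook attribute",
    )
    return parser.parse_args(argv)
```

With a default in place, existing invocations keep working, and a custom cookbook attribute only changes the value passed at call time.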

cookbooks/aws-parallelcluster-environment/files/cfn_hup_configuration/get_compute_user_data.py

...luster-environment/templates/cfn_hup_configuration/ComputeFleet/cfn-hup-update-action.sh.erb

cookbooks/aws-parallelcluster-platform/resources/fetch_dna_files.rb

cookbooks/aws-parallelcluster-environment/files/cfn_hup_configuration/get_compute_user_data.py

* Separate cfn-hup update hook for ComputeFleet
* Add `get_compute_user_data.py` script to get LaunchTemplates and parse them to write the relevant DNA files
* Add invocation of script get_compute_user_data.py by HeadNode during an update
* Writing dna.json files for each Launch Template
* Using launch template logical id for update action script
* Update cfn-hup hook action script for Compute
* Change the owner, group and mode of dna and extra files in tmp
* Share extra.json to Compute nodes
* Adding cleanup operation after an update
* Update config_cfn_hup to be streamlined for node-specific configuration files

…d cleaning up dna.json and extra.json during an update
* Renaming the files and folders to cfn_hup_configuration
* Deleting old recipe config_cfn_hup_spec.rb

…r access
* Add Proxy if being used.

* Correcting Kitchen and Unit tests
* Adding share_compute_fleet_dna.py for tox checks

himani2411 · 2025-02-19T21:05:45Z

Added # nosec B108 in share_compute_fleet_dna.py and test_share_compute_fleet_dna.py because the script needs to parse the UserData to extract /tmp/dna.json.

The skip-* labels were added to this PR for the same reason.

himani2411 added skip-changelog-update 3.x labels Feb 10, 2025

himani2411 requested review from a team as code owners February 10, 2025 16:07

github-advanced-security bot found potential problems Feb 10, 2025

cookbooks/aws-parallelcluster-environment/files/cfn_hup_configuration/get_compute_user_data.py (fixed)

cookbooks/aws-parallelcluster-environment/files/cfn_hup_configuration/get_compute_user_data.py (fixed)

github-advanced-security bot found potential problems Feb 10, 2025

cookbooks/aws-parallelcluster-environment/files/cfn_hup_configuration/get_compute_user_data.py (fixed)

cookbooks/aws-parallelcluster-environment/files/cfn_hup_configuration/get_compute_user_data.py (fixed)

himani2411 mentioned this pull request Feb 10, 2025

[Scaling] Removing usage of cfn-init for compute fleet aws/aws-parallelcluster#6655

Merged

gmarciani reviewed Feb 17, 2025

cookbooks/aws-parallelcluster-environment/files/cfn_hup_configuration/get_compute_user_data.py (outdated, resolved)

gmarciani reviewed Feb 17, 2025

cookbooks/aws-parallelcluster-environment/files/cfn_hup_configuration/get_compute_user_data.py (outdated, resolved)

gmarciani reviewed Feb 17, 2025

...luster-environment/templates/cfn_hup_configuration/ComputeFleet/cfn-hup-update-action.sh.erb (outdated, resolved)

cookbooks/aws-parallelcluster-platform/resources/fetch_dna_files.rb (resolved)

gmarciani reviewed Feb 17, 2025

cookbooks/aws-parallelcluster-environment/files/cfn_hup_configuration/get_compute_user_data.py (outdated, resolved)

himani2411 force-pushed the develop-cfn-init-remove branch 6 times, most recently from 2ede568 to a89978c, on February 19, 2025 17:06

gmarciani previously approved these changes Feb 19, 2025

Himani Anil Deshpande added 8 commits February 19, 2025 14:37

[UnitTest] Add unit test for cfn_hup_configuration resource

988e583

Adding fetch_dna_files resource for HeadNode invocation of sharing an…

f44b38c

…d cleaning up dna.json and extra.json during an update
* Renaming the files and folders to cfn_hup_configuration
* Deleting old recipe config_cfn_hup_spec.rb

Remove usage of get_compute_user_data.py from fetch_config resource

a644f17

Remove usage of Login Node from get_compute_user_data.py

879de41

[Unit Test] Unit test for fetch_dna_files resource

5498c50

[Kitchen Test] Test for cfn_hup_configuration resource

2eff925

Make cfn-hup-update-action.sh executable Only by root

ba1e074

Himani Anil Deshpande added 3 commits February 19, 2025 14:37

Explicit return statement at end of the function

c8d8cb6

Change permission for cfn-hup files and directories for only root use…

ea3c566

…r access * Add Proxy if being used.

Add unit test for share_compute_fleet_dna.py

adc3aff

himani2411 dismissed gmarciani’s stale review via a9bb8a6 February 19, 2025 19:37

himani2411 force-pushed the develop-cfn-init-remove branch from a89978c to a9bb8a6 on February 19, 2025 19:37

himani2411 enabled auto-merge (squash) February 19, 2025 19:50

himani2411 added the skip-security-exclusions-check (Skip the checks regarding the security exclusions) and skip-recursive-deletion-check (Skip the checks regarding the use of recursive deletion) labels Feb 19, 2025

himani2411 force-pushed the develop-cfn-init-remove branch 2 times, most recently from 64f6bba to 177c975, on February 19, 2025 20:36

Cookstyle and code-linters corrections

016d25d

* Correcting Kitchen and Unit tests * Adding share_compute_fleet_dna.py for tox checks

himani2411 force-pushed the develop-cfn-init-remove branch from 177c975 to 016d25d on February 19, 2025 20:51

gmarciani approved these changes Feb 19, 2025

himani2411 merged commit c2541d5 into aws:develop Feb 19, 2025
28 of 30 checks passed

gmarciani Feb 17, 2025

himani2411 Feb 17, 2025

gmarciani Feb 17, 2025

himani2411 Feb 18, 2025

gmarciani Feb 19, 2025
