Skip to content

Conversation

@himani2411
Copy link
Contributor

@himani2411 himani2411 commented Feb 10, 2025

Description of changes

Changing the Create and Update Path of Compute Nodes in a cluster as we need to remove usage of cfn-init due to CFN API throttling issues and improve fleet scaling time.

For Create Path we revert to a approach we used in ParallelCluster 3.8.0

  • In CLI will rely on using cloud native approach of using write_files directives as part of user_data.sh ( Changes in [Scaling] Removing usage of cfn-init for compute fleet aws-parallelcluster#6655)
  • In Cookbook, we add changes which we use/excute during Updation of the node
    * create shared sub-directory /opt/parallelcluster/shared/dna which is used for storing dna.json
    * create script /opt/parallelcluster/scripts/share_compute_fleet_dna.py which is executed only by root user on HeadNode
    * Create /opt/parallelcluster/scripts/cfn-hup-update-action.sh which is executed only by root user on Compute node . This script monitors the shared /dna directory and runs cookbook Update recipes on the node.

For Update Path we will rely on HeadNode to share dna.json and extra.json for each node as per their Launch Templates
using EC2 DescribeLaunchTemplateVersions API

  • In CLI, the CFN Launch templates are updated.
  • In Cookbook,
    * HeadNode run a new script to get latest dna.json and store it in shared directory ( as part of fetch_dna_files resource )
    * cfn-hup invokes an update hook action script which monitors the shared directory for checking latest dna.json files and runs cookbook update recipes.

Dependent on CLI aws/aws-parallelcluster#6655

Tests

Same as aws/aws-parallelcluster#6655

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

import logging
from retrying import retry

SHARED_LOCATION = "/opt/parallelcluster/"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the shared folder is defined by the cookbook attribute default['cluster']['shared_dir'] = "#{node['cluster']['base_dir']}/shared" , so it must be set by the cookbook attribute.
I'm aware that elsewhere in the cookbook we did not comply with this best practice, but since we are here, let's do it the best way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cant without converting it into erb file and if I do that I wont be able to write python unit tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about making is a script argument passed when it is invoked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In future iterations I want to run the script for LoginNodes too and I dont see the point in re-surfacing this value as an argument for both Login and Compute Nodes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for doing that is because the cluster creation is supposed to succeed even if I change this path via a custom cookbook attribute. That said, I agree we can address this in a follow up Pr because we have many other scripts were we hard wired this path rather than taking it from the attribtues, so you're not introducing any regression.

@himani2411 himani2411 force-pushed the develop-cfn-init-remove branch 6 times, most recently from 2ede568 to a89978c Compare February 19, 2025 17:06
gmarciani
gmarciani previously approved these changes Feb 19, 2025
Himani Anil Deshpande added 8 commits February 19, 2025 14:37
* Separate cfn-hup update hook for ComputeFleet
* Add `get_compute_user_data.py` script to parse and get LaunchTemplates and parse them to write relevant DNA files.
* Add invocation of script get_compute_user_data.py by headNode during an update
* Writing dna.json files for each Launch template
* Using launch template logical id for update action script
* Update cfn-hup hook action script for Compute
* chnage the owner, group and mode of dna and extra files in tmp
* Share extra.json to Compute nodes
* adding cleanup operation after an update
* Update config_cfn_hup to be streamlined for node-specific configuration files
…d cleaning up dna.json and extra.json during an update

* Renaming the files and folders to cfn_hup_configuration

* Deleting old recipie config_cfn_hup_spec.rb
@himani2411 himani2411 force-pushed the develop-cfn-init-remove branch from a89978c to a9bb8a6 Compare February 19, 2025 19:37
@himani2411 himani2411 enabled auto-merge (squash) February 19, 2025 19:50
@himani2411 himani2411 added skip-security-exclusions-check Skip the checks regarding the security exclusions skip-recursive-deletion-check Skip the checks regarding the use of recursive deletion. labels Feb 19, 2025
@himani2411 himani2411 force-pushed the develop-cfn-init-remove branch 2 times, most recently from 64f6bba to 177c975 Compare February 19, 2025 20:36
* Correcting Kitchen and Unit tests
* Adding share_compute_fleet_dna.py for tox checks
@himani2411 himani2411 force-pushed the develop-cfn-init-remove branch from 177c975 to 016d25d Compare February 19, 2025 20:51
@himani2411
Copy link
Contributor Author

himani2411 commented Feb 19, 2025

Added # nosec B108 in share_compute_fleet_dna.py and test_share_compute_fleet_dna.py as need to parse through UserData to extract /tmp/dna.json

Same reason for adding skip-* labels in this PR

@himani2411 himani2411 merged commit c2541d5 into aws:develop Feb 19, 2025
28 of 30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3.x skip-changelog-update skip-recursive-deletion-check Skip the checks regarding the use of recursive deletion. skip-security-exclusions-check Skip the checks regarding the security exclusions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants