@gmarciani
Contributor

Description of changes

Merge develop into release branch for release 3.13.0.

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

hanwen-cluster and others added 23 commits February 11, 2025 12:38
The information is shown in the output of `scontrol show nodes`:
```
NodeName=queue-on-demand-dy-compute-resource-2-2 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=4 CPUEfctv=4 CPUTot=4 CPULoad=0.95
   AvailableFeatures=dynamic,c5.xlarge,compute-resource-2
   ActiveFeatures=dynamic,c5.xlarge,compute-resource-2
   Gres=(null)
   NodeAddr=192.168.127.110 NodeHostName=queue-on-demand-dy-compute-resource-2-2 Version=24.05.2
   OS=Linux 5.10.233-224.894.amzn2.x86_64 #1 SMP Mon Jan 27 16:52:48 UTC 2025
   RealMemory=7782 AllocMem=0 FreeMem=6431 Sockets=4 Boards=1
   State=ALLOCATED+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1000 Owner=N/A MCS_label=N/A
   Partitions=queue-on-demand
   BootTime=2025-02-10T21:22:00 SlurmdStartTime=2025-02-10T21:25:05
   LastBusyTime=2025-02-10T21:25:05 ResumeAfterTime=None
   CfgTRES=cpu=4,mem=7782M,billing=4
   AllocTRES=cpu=4
   CurrentWatts=0 AveWatts=0

   InstanceId=i-0eb8d995282xxxx11 InstanceType=c5.xlarge
```

reference: https://slurm.schedmd.com/slurmd.html
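
As a rough illustration of how these fields could be consumed, the sketch below pulls `NodeName`, `InstanceId`, and `InstanceType` out of text shaped like the output above. The function name `extract_instance_info` and the regular expression are assumptions made for this example and are not part of the change; the sample is an abbreviated copy of the output shown above, and the code assumes a single node record.

```python
import re

# Matches only the three keys of interest; values are assumed to contain no spaces.
_FIELD_RE = re.compile(r"\b(NodeName|InstanceId|InstanceType)=(\S+)")

# Abbreviated copy of the `scontrol show nodes` output quoted above.
SAMPLE = """\
NodeName=queue-on-demand-dy-compute-resource-2-2 Arch=x86_64 CoresPerSocket=1
   NodeAddr=192.168.127.110 NodeHostName=queue-on-demand-dy-compute-resource-2-2 Version=24.05.2
   State=ALLOCATED+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1000 Owner=N/A MCS_label=N/A

   InstanceId=i-0eb8d995282xxxx11 InstanceType=c5.xlarge
"""


def extract_instance_info(scontrol_output: str) -> dict[str, str]:
    """Return the node name and EC2 instance fields found in one node record."""
    return {key: value for key, value in _FIELD_RE.findall(scontrol_output)}


if __name__ == "__main__":
    for key, value in extract_instance_info(SAMPLE).items():
        print(f"{key}: {value}")
```

For output covering several nodes, the text would need to be split per record first, since this sketch keeps only the last value seen for each key.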

Signed-off-by: Hanwen <[email protected]>
…ver.

We remove it because:
  1/ it is no longer supported by the NVIDIA installer.
  2/ it was an unsafe workaround introduced in 3.8.0 (#2404), which was meant to be short-term but ended up staying for a long time.
  3/ in 3.12.0 we introduced logic to install NVIDIA drivers using the gcc version used to compile the kernel (#2852).
* Add cfn-hup configuration resource

* Separate cfn-hup update hook for ComputeFleet
* Add `get_compute_user_data.py` script to get LaunchTemplates and parse them to write the relevant DNA files.
* Add invocation of script get_compute_user_data.py by headNode during an update
* Writing dna.json files for each Launch template
* Using launch template logical id for update action script
* Update cfn-hup hook action script for Compute
* Change the owner, group, and mode of dna and extra files in tmp
* Share extra.json to Compute nodes
* adding cleanup operation after an update
* Update config_cfn_hup to be streamlined for node-specific configuration files

* [UnitTest] Add unit test for cfn_hup_configuration resource

* Adding fetch_dna_files resource for HeadNode invocation of sharing and cleaning up dna.json and extra.json during an update

* Renaming the files and folders to cfn_hup_configuration

* Deleting old recipe config_cfn_hup_spec.rb

* Remove usage of get_compute_user_data.py from fetch_config resource

* Remove usage of Login Node from get_compute_user_data.py

* [Unit Test] Unit test for fetch_dna_files resource

* [Kitchen Test] Test for cfn_hup_configuration resource

* Make cfn-hup-update-action.sh executable only by root

* Explicit return statement at end of the function

* Change permission for cfn-hup files and directories for only root user access

* Add Proxy if being used.

* Add unit test for share_compute_fleet_dna.py

* Cookstyle and code-linters corrections

* Correcting Kitchen and Unit tests
* Adding share_compute_fleet_dna.py for tox checks

---------

Co-authored-by: Himani Anil Deshpande <[email protected]>
Co-authored-by: Himani Anil Deshpande <[email protected]>
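
The root-only permission changes listed above can be pictured with a small sketch. It does not reproduce the cookbook's code: the helper names `restrict_to_owner` and `is_owner_only` and the `0o700` mode are assumptions for illustration, and the sketch only checks permission bits, not that the owner is root.

```python
import os
import stat
import tempfile


def restrict_to_owner(path: str, mode: int = 0o700) -> None:
    """Set permissions so that only the file owner can read, write, or execute."""
    os.chmod(path, mode)


def is_owner_only(path: str) -> bool:
    """Return True when no group or other permission bits are set on path."""
    file_mode = stat.S_IMODE(os.stat(path).st_mode)
    return (file_mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


if __name__ == "__main__":
    # Use a throwaway file so the sketch never touches real configuration.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        script = handle.name
    os.chmod(script, 0o755)
    print("before:", is_owner_only(script))
    restrict_to_owner(script)
    print("after:", is_owner_only(script))
    os.remove(script)
```

A check like `is_owner_only` could back a unit test asserting that an update hook script is not readable or executable by other users.
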
* Do not install lustre on ubuntu2404

* Fix jq spec test failure due to deprecation of --argfile
* Add logs that are visible in chef-client.log

Co-authored-by: Himani Anil Deshpande <[email protected]>
…FN Bootstrap scripts. In particular, we expect all OSes to use Python 3.12.8, except for AL2, which uses 3.9.20.
* Revert "[Test] Fix Python version in kitchen tests related to AWS Batch and CFN Bootstrap scripts. In particular, we expect all OSes to use Python 3.12.8, except for AL2, which uses 3.9.20."

This reverts commit 4574ee5.

* Revert "Upgrade python version to 3.12 for all OSs except AL2 and remove unnecessary python dep installation (#2869)"

This reverts commit 45431e6.

---------

Co-authored-by: Himani Anil Deshpande <[email protected]>
…ing DevSettings (#2883)

Co-authored-by: Himani Anil Deshpande <[email protected]>
* Upgrade NVIDIA driver and cuda version

* Update CHANGELOG

* Fix unit tests
* Upgrade cfn bootstrap

* Upgrade python version to 3.12 for all OSs except AL2

* [Test] Fix Python version in kitchen tests related to AWS Batch and CFN Bootstrap scripts. In particular, we expect all OSes to use Python 3.12.8, except for AL2, which uses 3.9.20

* Update CHANGELOG
…bs with multi-user (#2893)

* Fix an issue where containerized jobs executed through Pyxis/Enroot in a multi-user environment (integrated with Active Directory) would fail due to a permission error.
* Download http-parser from s3

* Fix isolated spec test
…ures on Rocky 9.5+ when directory service is used.
@gmarciani gmarciani requested review from a team as code owners March 11, 2025 22:20
@gmarciani gmarciani added the skip-security-exclusions-check Skip the checks regarding the security exclusions label Mar 11, 2025
@codecov

codecov bot commented Mar 11, 2025

Codecov Report

Attention: Patch coverage is 53.09735% with 53 lines in your changes missing coverage. Please review.

Please upload report for BASE (release-3.13@3a6db7c). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...s/cfn_hup_configuration/share_compute_fleet_dna.py | 53.09% | 53 Missing ⚠️ |
Additional details and impacted files
@@               Coverage Diff               @@
##             release-3.13    #2898   +/-   ##
===============================================
  Coverage                ?   75.50%           
===============================================
  Files                   ?       23           
  Lines                   ?     2356           
  Branches                ?        0           
===============================================
  Hits                    ?     1779           
  Misses                  ?      577           
  Partials                ?        0           
| Flag | Coverage Δ |
|---|---|
| unittests | 75.50% <53.09%> (?) |
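
The percentages in this report can be reproduced from the counts it lists. The sketch below is an illustration only: the helper `coverage_percent` is a made-up name, the patch total of 113 lines is inferred from 53 missing lines at 53.09735% (60 of 113), and truncating to two decimals is an assumption that happens to match the displayed 75.50% and 53.09%, not a statement about how Codecov computes them.

```python
def coverage_percent(hits: int, total: int) -> float:
    """Return hits/total as a percentage truncated (not rounded) to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Integer arithmetic avoids floating-point surprises before truncation.
    return (hits * 10000 // total) / 100


if __name__ == "__main__":
    # Project totals from the report: 1779 hits and 577 misses over 2356 lines.
    project = coverage_percent(1779, 1779 + 577)
    # Patch totals: 53 missing lines out of an inferred 113 changed lines.
    patch = coverage_percent(113 - 53, 113)
    print(f"project: {project:.2f}%")
    print(f"patch:   {patch:.2f}%")
```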

Flags with carried forward coverage won't be shown.

@gmarciani gmarciani closed this Mar 11, 2025
Labels

3.x skip-changelog-update skip-security-exclusions-check Skip the checks regarding the security exclusions

5 participants