
Conversation

Contributor

@himani2411 himani2411 commented Aug 13, 2025

Description of changes

  • Use the --force-configuration option when creating the topology configuration
  • Ignore checks such as Capacity Block and Matching Instance Type (see the sketch after this list)
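
A minimal sketch of how such a flag can gate the capacity checks (hypothetical helper and field names, not the actual pcluster_topology_generator.py code):

def passes_capacity_checks(queue):
    # Placeholder for the Capacity Block reservation and matching
    # instance-type checks that --force-configuration bypasses.
    return queue.get("uses_capacity_block", False) and queue.get("matching_instance_type", False)

def generate_block_topology(queues, block_sizes, force_configuration=False):
    # Emit one BlockName line per queue; a queue is skipped unless it
    # passes the capacity checks or force_configuration is set.
    lines = []
    for index, queue in enumerate(queues, start=1):
        if force_configuration or passes_capacity_checks(queue):
            lines.append(f"BlockName=Block{index}  Nodes={queue['node_range']}")
    lines.append(f"BlockSizes={block_sizes}")
    return lines

print("\n".join(generate_block_topology(
    [{"node_range": "queue1-st-cr1-[1-2]"}], "2", force_configuration=True)))
# BlockName=Block1  Nodes=queue1-st-cr1-[1-2]
# BlockSizes=2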

Tests

  • Unit tests
  • Cluster creation with the DevSettings below:
DevSettings:
  Cookbook:
    ChefCookbook: https://github.com/himani2411/aws-parallelcluster-cookbook/tarball/slurm-topo-simulate
    ExtraChefAttributes: |
      {"cluster":{"p6egb200_block_sizes":"2", "slurm":{"block_topology":{"force_configuration":"true"}}}}

Tests show the file exists:

[root@ip-192-168-11-241 ~]# cat /opt/slurm/etc/topology.conf
# This file is automatically generated by pcluster

BlockName=Block1  Nodes=queue1-st-cr1-[1-2]
BlockSizes=2
[root@ip-192-168-11-241 ~]# cat /opt/slurm/etc/slurm.conf | grep include
# WARNING!!! The slurm_parallelcluster.conf file included below can be updated by the pcluster process.
include slurm_parallelcluster.conf
# WARNING!!! The custom_slurm_settings_include_file_slurm.conf file included below can be updated by the pcluster process.
include pcluster/custom_slurm_settings_include_file_slurm.conf
# WARNING!!! The slurm_parallelcluster_topology.conf file included below can be updated by the pcluster process.
include slurm_parallelcluster_topology.conf
[root@ip-192-168-11-241 ~]#
[root@ip-192-168-11-241 ~]# cat /opt/slurm/etc/slurm_parallelcluster_topology.conf
# slurm_parallelcluster_topology.conf is managed by the pcluster processes.
# Do not modify.
# Please use CustomSlurmSettings in the ParallelCluster configuration file to add user-specific slurm configuration
# options
# TOPOLOGY Plugin
TopologyPlugin=topology/block
[root@ip-192-168-11-241 ~]#
[root@ip-192-168-11-241 ~]# scontrol show topology
BlockName=Block1 BlockIndex=0 Nodes=queue1-st-cr1-[1-2] BlockSize=2
[root@ip-192-168-11-241 ~]#
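
A small helper of the kind one might use in an integration test to assert on this output (hypothetical, not from the ParallelCluster test suite):

import re
import subprocess

def get_topology_blocks():
    # Parse `scontrol show topology` into {BlockName: Nodes} pairs.
    out = subprocess.run(["scontrol", "show", "topology"],
                         capture_output=True, text=True, check=True).stdout
    return {m.group(1): m.group(2)
            for m in re.finditer(r"BlockName=(\S+).*?Nodes=(\S+)", out)}

# On the head node above this would return:
# {"Block1": "queue1-st-cr1-[1-2]"}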
  2. Force-update the cluster to add another queue with one static node:
DevSettings:
  Cookbook:
    ChefCookbook: https://github.com/himani2411/aws-parallelcluster-cookbook/tarball/slurm-topo-simulate
    ExtraChefAttributes: |
      {"cluster":{"p6egb200_block_sizes":"1,2", "slurm":{"block_topology":{"force_configuration":"true"}}}}

Tests showing the files are updated:

[root@ip-192-168-11-241 etc]# cat topology.conf
# This file is automatically generated by pcluster

BlockName=Block1  Nodes=queue1-st-cr1-[1-2]
BlockName=Block2  Nodes=queue2-st-cr2-[1-1]
BlockSizes=1,2
[root@ip-192-168-11-241 etc]# scontrol show topology
BlockName=Block1 BlockIndex=0 Nodes=queue1-st-cr1-[1-2] BlockSize=1
BlockName=Block2 BlockIndex=1 Nodes=queue2-st-cr2-[1-1] BlockSize=1
AggregatedBlock=Block[1-2] BlockIndex=2 Nodes=queue1-st-cr1-[1-2],queue2-st-cr2-1 BlockSize=2
[root@ip-192-168-11-241 etc]#cat slurm_parallelcluster_topology.conf
# slurm_parallelcluster_topology.conf is managed by the pcluster processes.
# Do not modify.
# Please use CustomSlurmSettings in the ParallelCluster configuration file to add user-specific slurm configuration
# options
# TOPOLOGY Plugin
TopologyPlugin=topology/block
[root@ip-192-168-11-241 etc]#
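
The AggregatedBlock row reflects how the topology/block plugin interprets BlockSizes=1,2: the first value is the base block size in nodes and each subsequent value defines a coarser aggregation level, so the two one-node blocks are grouped into a single size-2 aggregate. A toy illustration of that grouping (illustrative only, not Slurm code):

def aggregate(block_names, block_sizes):
    # block_sizes[0] is the base block size; each later size groups the
    # base blocks into aggregates of size // block_sizes[0] blocks.
    levels = {block_sizes[0]: [[b] for b in block_names]}
    for size in block_sizes[1:]:
        step = size // block_sizes[0]
        levels[size] = [block_names[i:i + step]
                        for i in range(0, len(block_names), step)]
    return levels

print(aggregate(["Block1", "Block2"], [1, 2]))
# {1: [['Block1'], ['Block2']], 2: [['Block1', 'Block2']]}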
  3. Force-update to remove the force configuration from DevSettings, with no changes to the queues (so it should still update the file and break the topology):
DevSettings:
  Cookbook:
    ChefCookbook: https://github.com/himani2411/aws-parallelcluster-cookbook/tarball/slurm-topo-simulate
    ExtraChefAttributes: |
      {"cluster":{"p6egb200_block_sizes":"1,2"}}

and the resulting changes in the files:

 cat topology.conf
# This file is automatically generated by pcluster

BlockSizes=1,2

[root@ip-192-168-11-241 etc]# scontrol show topology
scontrol: error: No topology information available
[root@ip-192-168-11-241 etc]#

  4. Force-update to remove the ExtraChefAttributes and set the queues' MinCount to 0 (so the queues are updated), and verify that the files are deleted:
DevSettings:
  Cookbook:
    ChefCookbook: https://github.com/himani2411/aws-parallelcluster-cookbook/tarball/slurm-topo-simulate

The files are indeed deleted/cleaned up:

[root@ip-192-168-11-241 etc]# ls -al
total 72
drwxr-xr-x  4 root root 16384 Aug 13 21:14 .
drwxr-xr-x. 9 root root    92 Aug 13 20:43 ..
-rw-r--r--  1 root root   249 Aug 13 20:43 cgroup.conf
-rw-r--r--  1 root root   174 Aug 13 20:43 gres.conf
drwxr-xr-x  3 root root 16384 Aug 13 20:52 pcluster
drwxr-xr-x  4 root root    38 Aug 13 20:43 scripts
-rw-r--r--  1 root root  2300 Aug 13 20:43 slurm.conf
-rwxr-xr-x  1 root root   233 Aug 13 20:43 slurm.csh
-rwxr-xr-x  1 root root   140 Aug 13 20:43 slurm.sh
-rw-r--r--  1 root root   476 Aug 13 21:14 slurm_parallelcluster.conf
-rw-r--r--  1 root root   155 Aug 13 21:14 slurm_parallelcluster_cgroup.conf
-rw-r--r--  1 root root   232 Aug 13 21:14 slurm_parallelcluster_gres.conf
-rw-r--r--  1 root root   168 Aug 13 21:14 slurm_parallelcluster_slurmdbd.conf
-rw-r--r--  1 root root   237 Aug 13 21:05 slurm_parallelcluster_topology.conf
[root@ip-192-168-11-241 etc]# cat slurm.conf | grep 'include'
# WARNING!!! The slurm_parallelcluster.conf file included below can be updated by the pcluster process.
include slurm_parallelcluster.conf
# WARNING!!! The custom_slurm_settings_include_file_slurm.conf file included below can be updated by the pcluster process.
include pcluster/custom_slurm_settings_include_file_slurm.conf
# WARNING!!! The slurm_parallelcluster_topology.conf file included below can be updated by the pcluster process.
include slurm_parallelcluster_topology.conf
[root@ip-192-168-11-241 etc]# cat slurm_parallelcluster_topology.conf
# slurm_parallelcluster_topology.conf is managed by the pcluster processes.
# Do not modify.
# Please use CustomSlurmSettings in the ParallelCluster configuration file to add user-specific slurm configuration
# options
# TOPOLOGY Plugin
[root@ip-192-168-11-241 etc]# scontrol show topology
[root@ip-192-168-11-241 etc]#
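
The cleanup behavior above suggests the generator empties the include file (dropping the TopologyPlugin line) and removes topology.conf when no block topology applies. A hedged sketch of that logic (hypothetical paths and helper, written in Python rather than the actual Chef cookbook code):

import os

# Abbreviated version of the managed header shown above.
HEADER = ("# slurm_parallelcluster_topology.conf is managed by the pcluster processes.\n"
          "# Do not modify.\n"
          "# TOPOLOGY Plugin\n")

def write_topology_include(has_blocks, etc_dir="/opt/slurm/etc"):
    # With no blocks, keep the managed header but omit TopologyPlugin and
    # delete topology.conf, matching the post-cleanup state shown above.
    body = HEADER + ("TopologyPlugin=topology/block\n" if has_blocks else "")
    with open(os.path.join(etc_dir, "slurm_parallelcluster_topology.conf"), "w") as f:
        f.write(body)
    if not has_blocks:
        try:
            os.remove(os.path.join(etc_dir, "topology.conf"))
        except FileNotFoundError:
            pass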

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@himani2411 himani2411 requested a review from a team as a code owner August 13, 2025 20:31
@himani2411 himani2411 added the 3.x label Aug 13, 2025
@himani2411 himani2411 force-pushed the slurm-topo-simulate branch from 5a3d89f to d5476c7 on August 13, 2025 21:20, with the commit message:

…conf

* Ignore checks like Capacity Block and Matching Instance Type
* We do not update the file if queues are not updated and p6egb200_block_sizes does not exist

codecov bot commented Aug 13, 2025

Codecov Report

❌ Patch coverage is 85.71429% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.22%. Comparing base (6127e18) to head (e8a8c18).
⚠️ Report is 24 commits behind head on develop.

Files with missing lines Patch % Lines
...ad_node_slurm/slurm/pcluster_topology_generator.py 85.71% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3008      +/-   ##
===========================================
- Coverage    75.50%   75.22%   -0.29%     
===========================================
  Files           23       24       +1     
  Lines         2356     2446      +90     
===========================================
+ Hits          1779     1840      +61     
- Misses         577      606      +29     
Flag Coverage Δ
unittests 75.22% <85.71%> (-0.29%) ⬇️

Flags with carried forward coverage won't be shown.


The review thread below refers to this guard change in the Chef recipe (the removed guard first, the new guard last):

not_if { ::File.exist?(node['cluster']['previous_cluster_config_path']) && topology_generator_command_args.nil? }
"#{topology_generator_command_args}"\
"#{topology_generator_extra_args}"
not_if { topology_generator_command_args.nil? }
Contributor

Why was the previous_cluster_config_path check removed?

Contributor Author

We don't need to check for the existence of this file. This is also the check that was causing the errors in our integ tests, since node['cluster']['previous_cluster_config_path'] will exist on every update.
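
As a hedged Python rendering of the two guards' decision logic (the recipe itself is Chef/Ruby; names are illustrative):

def should_skip_generator(previous_config_exists, command_args, old_guard=False):
    if old_guard:
        # Removed guard: only skipped when the previous cluster config
        # existed AND no generator args were produced; since that file
        # exists on every update, the skip was effectively tied to it.
        return previous_config_exists and command_args is None
    # New guard: skip whenever no generator arguments were produced,
    # regardless of the previous cluster config.
    return command_args is None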

Contributor Author

I will add a follow-up PR to solve the cluster update issue, as this change doesn't solve it.

@himani2411 himani2411 force-pushed the slurm-topo-simulate branch from e8a8c18 to 94d2a72 Compare August 14, 2025 20:43
@himani2411 himani2411 merged commit 2d4cf30 into aws:develop Aug 14, 2025
28 of 30 checks passed
