@hgreebe hgreebe commented Jul 15, 2025

Description of changes

  • Covers a known Slurm error where the cluster ID saved in /var/spool/slurm.state/clustername does not match the cluster ID that slurmdbd has
  • This fix needs to be in both clear_slurm_accounting and config_slurm_accounting
    • If a cluster is created without Slurm accounting and then updated to enable it, the update runs config_slurm_accounting and hits the cluster ID mismatch
    • If a cluster is created with Slurm accounting and then updated, the update runs clear_slurm_accounting and hits the cluster ID mismatch
  • Example error message:
[2025-07-10T00:54:46.362] fatal: CLUSTER ID MISMATCH.
slurmctld has been started with "ClusterID=4018"  from the state files in StateSaveLocation, but the DBD thinks it should be "3073".
Running multiple clusters from a shared StateSaveLocation WILL CAUSE CORRUPTION.
Remove /var/spool/slurm.state/clustername to override this safety check if this is intentional.
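
For anyone triaging this in logs, here is a minimal sketch of pulling the two IDs out of the message above. The helper name `parse_cluster_id_mismatch` and the regular expression are illustrative assumptions based only on the sample message; they are not part of the cookbook or of Slurm.

```ruby
# Hypothetical helper (not part of the cookbook): extract both cluster IDs
# from a slurmctld "CLUSTER ID MISMATCH" message.
# The pattern is an assumption derived from the sample message in this PR.
MISMATCH_PATTERN = /ClusterID=(\d+)".*?DBD thinks it should be "(\d+)"/m

def parse_cluster_id_mismatch(log_text)
  match = MISMATCH_PATTERN.match(log_text)
  return nil unless match

  # state_file_id: ID loaded from the state files in StateSaveLocation
  # dbd_id:        ID the accounting daemon expects
  { state_file_id: Integer(match[1]), dbd_id: Integer(match[2]) }
end

sample = <<~LOG
  [2025-07-10T00:54:46.362] fatal: CLUSTER ID MISMATCH.
  slurmctld has been started with "ClusterID=4018"  from the state files in StateSaveLocation, but the DBD thinks it should be "3073".
LOG

ids = parse_cluster_id_mismatch(sample)
puts "state file: #{ids[:state_file_id]}, DBD: #{ids[:dbd_id]}" if ids
```

The helper returns nil when the text contains no mismatch message, so it can be run over arbitrary log excerpts.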

Tests

  • Created a new AMI with the cookbook changes and ran the test_slurm_accounting and test_slurm integ tests.
  • Tested that with one job running and one job pending, stopping slurmctld, deleting /var/spool/slurm.state/clustername, and then restarting slurmctld leaves the Slurm state unchanged.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hgreebe hgreebe requested review from a team as code owners July 15, 2025 15:03
codecov bot commented Jul 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.50%. Comparing base (6127e18) to head (3b3c60c).
Report is 11 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #2994   +/-   ##
========================================
  Coverage    75.50%   75.50%           
========================================
  Files           23       23           
  Lines         2356     2356           
========================================
  Hits          1779     1779           
  Misses         577      577           
Flag Coverage Δ
unittests 75.50% <ø> (ø)

@gmarciani

"if I have one job running and one job pending and then stop slurmctld and delete /var/spool/slurm.state/clustername, once I restart slurmctld, the slurm state remains the same."

Do we have an integ test capturing the same scenario when we execute the cluster update?
If not, it would be nice to have one that captures it.

code <<-CLUSTERSTATE
rm /var/spool/slurm.state/clustername
CLUSTERSTATE
only_if { ::File.exist?('/var/spool/slurm.state/clustername') }

@gmarciani gmarciani Jul 15, 2025

If we want to use bash, why use an only_if guard rather than rm -f?
Also, why use bash at all instead of the file resource with action delete?
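
To illustrate the point, here is a rough plain-Ruby sketch of an idempotent delete. It shows why no existence guard is needed when the removal itself tolerates a missing file, which is how `rm -f` behaves and how Chef's file resource with action delete behaves. The helper name `remove_cluster_state_file` is hypothetical, and a temporary directory stands in for /var/spool/slurm.state so the sketch can run anywhere.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper (not part of the cookbook): remove the cluster name
# state file if it is present, and do nothing if it is already gone.
# Returns true when a file was actually removed, false otherwise.
def remove_cluster_state_file(path)
  existed = File.exist?(path)
  FileUtils.rm_f(path) # like `rm -f`: does not raise when the file is missing
  existed
end

Dir.mktmpdir do |dir|
  # Assumption: a temp path stands in for /var/spool/slurm.state/clustername,
  # and the file content here is only a placeholder.
  state_file = File.join(dir, "clustername")
  File.write(state_file, "placeholder\n")

  puts remove_cluster_state_file(state_file) # prints true: file was removed
  puts remove_cluster_state_file(state_file) # prints false: harmless no-op
end
```

Having a single helper like this called from both recipes would also avoid repeating the same block twice.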

@gmarciani

With this change we are fixing a bug. Can we surface it in the changelog?

action %i(disable stop)
end

bash "Remove existing cluster name state file" do

@gmarciani gmarciani Jul 15, 2025

This logic is duplicated. What about reducing the duplication by moving it into a function and calling that function, or by using the file resource to delete the file?

CHANGELOG.md Outdated
**CHANGES**
- Ubuntu 20.04 is no longer supported.
- Upgrade Slurm to version 24.11.5.
- Addressed cluster id mismatch known issue by deleting the file `/var/spool/slurm.state/clustername` before configuring slurm accounting.

@gmarciani gmarciani Jul 18, 2025

Typo: extra indent, and Slurm must be capitalized

@hgreebe hgreebe enabled auto-merge (squash) July 22, 2025 11:10
@hgreebe hgreebe merged commit 59cce6b into aws:develop Jul 22, 2025
28 of 30 checks passed