hehe7318 commented Mar 24, 2025

Description of changes

  • Grant pcluster-admin NOPASSWD sudo permission to run the killall command. This fixes the following error:
2025-03-24 19:52:57,641 - [slurm_plugin.computemgtd:_self_terminate] - INFO - Killing slurm processes
sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
sudo: a password is required
2025-03-24 19:52:57,649 - [common.utils:_run_command] - ERROR - Command '['sudo', 'killall', '-9', '--quiet', 'slurmd', 'slurmstepd']' returned non-zero exit status 1.
2025-03-24 19:52:57,649 - [slurm_plugin.computemgtd:wrapper] - ERROR - Failed when self terminating compute instance with exception CalledProcessError, message: Command '['sudo', 'killall', '-9', '--quiet', 'slurmd', 'slurmstepd']' returned non-zero exit status 1.
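
The log above shows sudo asking for a password while no terminal is attached. As a rough sketch of how one might check for that condition ahead of time, the Python snippet below asks sudo, in non-interactive mode, whether the kill command is permitted. The function name `can_run_without_password` and the injectable `runner` parameter are hypothetical and not part of the node package; the result also depends on the host's sudoers settings (for example `listpw`), so treat it as an illustration rather than the daemon's actual logic.

```python
import subprocess
from typing import Callable, Sequence

# The command taken verbatim from the error log (without the leading "sudo").
KILL_COMMAND = ["killall", "-9", "--quiet", "slurmd", "slurmstepd"]


def can_run_without_password(
    command: Sequence[str],
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True if `sudo -n -l <command>` exits with status 0.

    Hypothetical helper: `runner` exists only so the check can be
    exercised without a real sudo installation.
    """
    # -n: never prompt for a password (fail instead of waiting on a terminal).
    # -l <command>: ask the sudo policy whether the command is allowed,
    #               without executing it.
    result = runner(
        ["sudo", "-n", "-l", *command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0
```

On a compute node, `can_run_without_password(KILL_COMMAND)` would be expected to return False before the sudoers change and True after it, assuming no other rule already covers the command.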

Tests

  • Manual test:
  1. Increase SlurmdTimeout to 10000s.
  2. SSH into a regular static node and grant pcluster-admin NOPASSWD sudo permission for the killall command in /etc/sudoers.d/sudo99-parallelcluster-slurm.
  3. Load the custom node package from PR: [Shutdown] Kill slurmd and slurmstepd before shutting aws-parallelcluster-node#656
  4. On the head node, submit a job that takes the network down: sbatch -w q1-st-cr1-1 -a 1-1 --exclusive --no-requeue --wrap='sudo ifconfig ens5 down && sleep 600'
  5. Expected result: the node shuts down after the 10-minute timeout and enters the DOWN state in 20 minutes.
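
Step 2 adds a sudoers rule by hand, but the exact line is not shown in this PR description. The sketch below renders one plausible form of such a rule in Python. The helper `render_sudoers_entry`, the path `/usr/bin/killall`, and the choice to pin the arguments are all assumptions made for illustration; the rule actually shipped may differ.

```python
import posixpath
from typing import Sequence


def render_sudoers_entry(user: str, command_path: str, args: Sequence[str] = ()) -> str:
    """Render a single NOPASSWD sudoers rule for one command.

    Hypothetical helper: sudoers expects a fully qualified command path,
    so relative names are rejected here.
    """
    if not posixpath.isabs(command_path):
        raise ValueError(f"sudoers needs an absolute command path, got {command_path!r}")
    # Listing arguments restricts the rule to that argument list;
    # leaving them out would allow the command with any arguments.
    command = " ".join([command_path, *args])
    return f"{user} ALL = (root) NOPASSWD: {command}"


# Assumed killall location; verify on the target image (e.g. with `command -v killall`).
entry = render_sudoers_entry(
    "pcluster-admin",
    "/usr/bin/killall",
    ["-9", "--quiet", "slurmd", "slurmstepd"],
)
print(entry)
```

Before installing a hand-edited file under /etc/sudoers.d, it is worth checking its syntax with `visudo -cf <file>`, since a malformed sudoers file can break sudo on the node.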

References

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
