Description
ParallelCluster 3.11.1
A user is trying to launch a job using srun and receives the following error:
$ srun --time=08:00:00 --job-name=redacted --cpus-per-task 8 --mem=0 --partition=od-128-gb --exclusive --pty --x11 -L NeededLicense ./run.sh
srun: error: Node failure on od-r7i-4xl-dy-od-128-gb-8-cores-4
srun: error: Nodes od-r7i-4xl-dy-od-128-gb-8-cores-4 are still not ready
srun: error: Something is wrong with the boot of the nodes.
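In case it's useful for debugging, I assume the first place to look is the node reason on the head node, something like this (node name taken from the error above):
$ sinfo -R
$ scontrol show node od-r7i-4xl-dy-od-128-gb-8-cores-4 | grep -i reason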
In reviewing the clustermgtd log I see the following:
2025-04-05 13:53:24,125 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2025-04-05 13:53:24,127 - [slurm_plugin.slurm_resources:is_backing_instance_valid] - WARNING - Node state check: no corresponding instance in EC2 for node od-r7i-4xl-dy-od-128-gb-8-cores-4(10.6.11.250), node state: ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP
2025-04-05 13:53:24,127 - [slurm_plugin.slurm_resources:is_backing_instance_valid] - WARNING - EC2 instance availability for node od-r7i-4xl-dy-od-128-gb-8-cores-4 has timed out.
2025-04-05 13:53:24,129 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2025-04-05 13:53:24,131 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy dynamic nodes: (x1) ['od-r7i-4xl-dy-od-128-gb-8-cores-4(10.6.11.250)']
2025-04-05 13:53:24,131 - [slurm_plugin.clustermgtd:_handle_unhealthy_dynamic_nodes] - INFO - Setting unhealthy dynamic nodes to down and power_down.
2025-04-05 13:53:24,174 - [slurm_plugin.slurm_resources:is_bootstrap_failure] - WARNING - Node bootstrap error: Node od-r7i-4xl-dy-od-128-gb-8-cores-4(10.6.11.250) is in power up state without valid backing instance, node state: ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP
2025-04-05 13:53:24,176 - [slurm_plugin.clustermgtd:_handle_bootstrap_failure_nodes] - WARNING - Found the following bootstrap failure nodes: (x1) ['od-r7i-4xl-dy-od-128-gb-8-cores-4(10.6.11.250)']
2025-04-05 13:53:24,176 - [slurm_plugin.clustermgtd:_handle_protected_mode_process] - INFO - Partitions bootstrap failure count: {'od-r7i-4xl': {'od-128-gb-8-cores': 1}}, cluster will be set into protected mode if protected failure count reaches threshold 10
Looks like the instance had some issue starting up.
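Since the instance is already terminated I can't log in to it, so I assume its bootstrap logs (cloud-init and friends) would only be visible through the cluster's CloudWatch log group, if anything was shipped before termination. Something along these lines, with the pcluster CLI flags from memory and the cluster name redacted:
$ pcluster list-cluster-log-streams --cluster-name <cluster-name> | grep 10-6-11-250
$ pcluster get-cluster-log-events --cluster-name <cluster-name> --log-stream-name <stream-name-from-above>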
The /var/log/parallelcluster/slurm_resume.log provides no insights.
In reviewing the AWS EC2 console I can find the instance, and it just shows Terminated.
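The console doesn't show why it was terminated, but I assume the state reason is still queryable for a while after termination, and I believe it would surface a capacity error if that was the cause. For example (instance ID taken from the console):
$ aws ec2 describe-instances --instance-ids <instance-id-from-console> \
    --query 'Reservations[].Instances[].{State:State.Name,Reason:StateReason.Message,Transition:StateTransitionReason}'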
Is it possible that there was no such instance type available for on-demand launch? That seems hard to believe on a Saturday afternoon. And if no instance was available, why did something seem to get assigned and even receive an IP address?
Could this be an issue where the scheduler expects the instance to be up within a certain amount of time, but the machine takes longer to boot/start than expected and the scheduler gives up? (What is the scheduler's expectation of when the node should be up?)
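For reference, I assume ResumeTimeout is the setting that controls how long Slurm waits for a powering-up node before giving up; on the head node I can read it with:
$ scontrol show config | grep -iE 'ResumeTimeout|SuspendTimeout'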
This is the second or third time I've seen this error in the past few months. Any guidance on how to debug this or where to look?
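Happy to collect and share more logs; I assume pcluster export-cluster-logs is the right way to grab a full bundle (flags from memory):
$ pcluster export-cluster-logs --cluster-name <cluster-name> --bucket <s3-bucket-for-logs>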