
Commit dfbdfa9

Add integration test for scontrol update nodelist sorting bug (#4785)

Slurm 22.05 (up to 22.05.7) sorts the nodes in the nodelist provided as the nodename field to the `scontrol update` command. If `scontrol update nodename=nodelist nodeaddr=nodeaddrlist` is given, this causes a mismatch between the nodenames and the nodeaddrs, because the order of the nodeaddrlist is not changed to match the reordering of the nodelist. The bug will be fixed in Slurm 22.05.8 and later.

Signed-off-by: Jacopo De Amicis <[email protected]>
1 parent 4c9d78a commit dfbdfa9
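To make the failure mode concrete, here is a minimal Python sketch of the mismatch (the node names and addresses are hypothetical; this illustrates the behavior and is not Slurm source code):

```python
# Nodelists as a user might pass them to `scontrol update`:
nodenames = ["queue2-st-resource2-1", "queue1-st-resource1-1"]  # user-provided (unsorted) order
nodeaddrs = ["10.0.2.10", "10.0.1.10"]                          # addresses matching that order

# Buggy pairing (Slurm 22.05.0-22.05.7): only the nodename list is sorted
# before the two lists are paired up.
buggy = dict(zip(sorted(nodenames), nodeaddrs))
# {'queue1-st-resource1-1': '10.0.2.10', 'queue2-st-resource2-1': '10.0.1.10'}  <- swapped

# Correct pairing: both lists keep the user-provided order.
correct = dict(zip(nodenames, nodeaddrs))
# {'queue2-st-resource2-1': '10.0.2.10', 'queue1-st-resource1-1': '10.0.1.10'}

assert buggy != correct  # sorting only one list assigns the wrong addresses
```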

File tree

4 files changed: +94 -0 lines changed

- CHANGELOG.md
- tests/integration-tests/configs/common/common.yaml
- tests/integration-tests/tests/schedulers/test_slurm.py
- tests/integration-tests/tests/schedulers/test_slurm/test_scontrol_update_nodelist_sorting/pcluster.config.yaml

CHANGELOG.md

Lines changed: 7 additions & 0 deletions
```diff
@@ -1,5 +1,12 @@
 CHANGELOG
 =========
+
+3.4.1
+-----
+
+**BUG FIXES**
+- Fix an issue with the Slurm scheduler that might incorrectly apply updates to its internal registry of compute nodes. This might result in EC2 instances becoming inaccessible or backed by an incorrect instance type.
+
 3.4.0
 -----

```

tests/integration-tests/configs/common/common.yaml

Lines changed: 6 additions & 0 deletions
```diff
@@ -511,6 +511,12 @@ schedulers:
         instances: {{ common.INSTANCES_DEFAULT_X86 }}
         oss: ["ubuntu2004"]
         schedulers: ["slurm"]
+  test_slurm.py::test_scontrol_update_nodelist_sorting:
+    dimensions:
+      - regions: ["ca-central-2"]
+        instances: {{ common.INSTANCES_DEFAULT_X86 }}
+        oss: ["alinux2"]
+        schedulers: ["slurm"]
   test_slurm_accounting.py::test_slurm_accounting:
     dimensions:
       - regions: ["us-east-1", "ap-south-1"]
```

tests/integration-tests/tests/schedulers/test_slurm.py

Lines changed: 50 additions & 0 deletions
```diff
@@ -560,6 +560,56 @@ def test_update_slurm_reconfigure_race_condition(
     )


+@pytest.mark.usefixtures("region", "os", "instance", "scheduler")
+def test_scontrol_update_nodelist_sorting(
+    pcluster_config_reader,
+    clusters_factory,
+    test_datadir,
+    scheduler_commands_factory,
+):
+    """
+    Test that scontrol update node follows the order of the nodelist provided by the user.
+
+    In Slurm 22.05 the scontrol update node logic was modified and a sorting routine was
+    introduced, which changed the order of the nodes in the nodelist.
+    If `scontrol update node nodename=nodelist nodeaddr=nodeaddrlist` is called, only the
+    nodelist is sorted (not the nodeaddrlist), causing mismatches between the Slurm
+    nodenames and the assigned addresses.
+
+    See https://bugs.schedmd.com/show_bug.cgi?id=15731
+    """
+
+    max_count_cr1 = max_count_cr2 = 4
+
+    cluster_config = pcluster_config_reader(
+        config_file="pcluster.config.yaml",
+        output_file="pcluster.config.initial.yaml",
+        max_count_cr1=max_count_cr1,
+        max_count_cr2=max_count_cr2,
+    )
+    cluster = clusters_factory(cluster_config)
+    remote_command_executor = RemoteCommandExecutor(cluster)
+    slurm_commands = scheduler_commands_factory(remote_command_executor)
+
+    assert_compute_node_states(slurm_commands, compute_nodes=None, expected_states=["idle~"])
+
+    nodes_in_queue1 = slurm_commands.get_compute_nodes("queue1", all_nodes=True)
+    nodes_in_queue2 = slurm_commands.get_compute_nodes("queue2", all_nodes=True)
+
+    # Create an unsorted list of nodes to be updated (queue2 comes alphabetically after queue1).
+    nodelist = f"{nodes_in_queue2[0]},{nodes_in_queue1[0]}"
+
+    # Stop clustermgtd since it may fix the situation under the hood if it calls scontrol update
+    # with a sorted list of nodes.
+    remote_command_executor.run_remote_command("sudo systemctl stop supervisord")
+
+    # Run scontrol update with the unsorted list of nodes.
+    remote_command_executor.run_remote_command(f"sudo -i scontrol update nodename={nodelist} nodeaddr={nodelist}")
+
+    assert_that(slurm_commands.get_node_attribute(nodes_in_queue1[0], "NodeAddr")).is_equal_to(nodes_in_queue1[0])
+    assert_that(slurm_commands.get_node_attribute(nodes_in_queue2[0], "NodeAddr")).is_equal_to(nodes_in_queue2[0])
+
+
 @pytest.mark.usefixtures("region", "os", "instance", "scheduler")
 def test_slurm_overrides(
     scheduler,
```
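For context, `get_compute_nodes`, `get_node_attribute`, and the other helpers above come from the integration-test framework's scheduler commands layer. As a rough standalone equivalent of the final assertions, here is a sketch that reads `NodeAddr` directly with `scontrol` (run on the head node; this is an approximation, not the framework's actual implementation):

```python
import subprocess

def get_node_attribute(node_name: str, attribute: str) -> str:
    """Read a single attribute (e.g. 'NodeAddr') from 'scontrol show node'."""
    # '-o' prints each node as one 'Key=Value Key=Value ...' line.
    output = subprocess.run(
        ["scontrol", "show", "node", "-o", node_name],
        check=True, capture_output=True, text=True,
    ).stdout
    # Whitespace splitting is fine for NodeAddr; values that may contain
    # spaces (e.g. Reason) would need a real parser.
    for token in output.split():
        key, sep, value = token.partition("=")
        if sep and key == attribute:
            return value
    raise KeyError(f"{attribute} not found for node {node_name}")

# With the bug present, a node updated via an unsorted nodelist would report
# another node's address here instead of its own name.
```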
tests/integration-tests/tests/schedulers/test_slurm/test_scontrol_update_nodelist_sorting/pcluster.config.yaml

Lines changed: 31 additions & 0 deletions
```diff
@@ -0,0 +1,31 @@
+Image:
+  Os: {{ os }}
+HeadNode:
+  InstanceType: {{ instance }}
+  Networking:
+    SubnetId: {{ public_subnet_id }}
+  Ssh:
+    KeyName: {{ key_name }}
+Scheduling:
+  Scheduler: slurm
+  SlurmQueues:
+    - Name: queue1
+      Networking:
+        SubnetIds:
+          - {{ private_subnet_id }}
+      ComputeResources:
+        - Name: resource1
+          Instances:
+            - InstanceType: {{ instance }}
+          MinCount: 0
+          MaxCount: {{ max_count_cr1 }}
+    - Name: queue2
+      Networking:
+        SubnetIds:
+          - {{ private_subnet_id }}
+      ComputeResources:
+        - Name: resource2
+          Instances:
+            - InstanceType: {{ instance }}
+          MinCount: 0
+          MaxCount: {{ max_count_cr2 }}
```
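The `{{ ... }}` placeholders in this file are Jinja template variables; the `pcluster_config_reader` fixture renders them with values derived from the test dimensions plus the keyword arguments the test passes (here `max_count_cr1=4` and `max_count_cr2=4`). A minimal sketch of that rendering step, with hypothetical values and without the fixture's extra path and default handling:

```python
from jinja2 import Template

with open("pcluster.config.yaml") as config_file:
    rendered = Template(config_file.read()).render(
        os="alinux2",                    # hypothetical values: the fixture
        instance="c5.xlarge",            # derives the real ones from the test
        key_name="my-key",               # dimensions and the AWS environment
        public_subnet_id="subnet-0aaa",
        private_subnet_id="subnet-0bbb",
        max_count_cr1=4,
        max_count_cr2=4,
    )
print(rendered)  # fully rendered cluster configuration
```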
