Commit 1067467

Merge branch 'v2.7.3' of https://github.com/oci-hpc/oci-hpc-clusternetwork-dev into v2.7.3
2 parents: 1f2c0c1 + 93513e0

File tree: 35 files changed (+1424, −11 lines)

README.md

Lines changed: 139 additions & 4 deletions
@@ -11,8 +11,7 @@
allow service compute_management to read app-catalog-listing in tenancy
allow group user to manage all-resources in compartment compartmentName
```

-## What is cluster resizing (resize.py) ?
-TODO

## What is cluster autoscaling ?
TODO
@@ -38,8 +37,144 @@ or:
`Allow dynamic-group instance_principal to manage all-resources in compartment compartmentName`

-# Resizing (via resize.py or OCI console)
-TODO

# Cluster Network Resizing (via resize.py)

Cluster resizing refers to the ability to add or remove nodes from an existing cluster network. It applies only to nodes with RDMA RoCEv2 (i.e. cluster network) NICs, so HPC clusters created using the BM.HPC2.36, BM.Optimized3.36, and BM.GPU4.8 shapes. Besides adding and removing nodes, the resize.py script can also be used to reconfigure the nodes.

Resizing an HPC cluster with a cluster network consists of two major sub-steps:

- Add/remove nodes (IaaS provisioning) in the cluster – uses the OCI Python SDK
- Configure the nodes (uses Ansible):
  - configures newly added nodes so they are ready to run jobs
  - reconfigures services such as Slurm on all nodes so they recognize the new nodes
  - updates the rest of the nodes when any nodes are removed (e.g. Slurm config, /etc/hosts, etc.)
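The add flow described above can be outlined in Python. This is an illustrative skeleton only; all names here are invented for illustration, and the real resize.py performs the provisioning through the OCI Python SDK and the configuration through Ansible playbooks.

```python
# Hypothetical outline of the "add" flow; names are invented for
# illustration and are not taken from the real resize.py.

def resize_add(count):
    steps = []
    # 1. IaaS provisioning via the OCI Python SDK (placeholder).
    steps.append(f"provision {count} node(s) in the cluster network")
    # 2. Ansible configuration of the newly added nodes.
    steps.append("configure newly added nodes to run jobs")
    # 3. Reconfigure services (e.g. Slurm) on all nodes.
    steps.append("reconfigure Slurm on all nodes to recognize new nodes")
    return steps

for step in resize_add(3):
    print(step)
```

The key design point is the ordering: instances must exist (step 1) before Ansible can configure them (steps 2 and 3).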
## resize.py usage

The resize.py script is deployed on the bastion node, at playbooks/resize.py, as part of the HPC cluster stack deployment.

```
python3 playbooks/resize.py -h
usage: resize.py [-h] [--compartment_ocid COMPARTMENT_OCID]
                 [--cluster_name CLUSTER_NAME] [--nodes NODES [NODES ...]]
                 [--slurm_only_update [{true,false}]]
                 [{add,remove,list,reconfigure}] [number]

Script to resize the CN

positional arguments:
  {add,remove,list,reconfigure}
                        Mode type. The add option implicitly configures the
                        newly added nodes and reconfigures/restarts services
                        like Slurm so all nodes recognize the new nodes.
                        Similarly, the remove option terminates nodes and
                        implicitly reconfigures/restarts services like Slurm
                        on the rest of the cluster nodes to remove references
                        to the deleted nodes.
  number                Number of nodes to add or delete if a list of
                        hostnames is not defined

optional arguments:
  -h, --help            show this help message and exit
  --compartment_ocid COMPARTMENT_OCID
                        OCID of the compartment, defaults to the compartment
                        OCID of the localhost
  --cluster_name CLUSTER_NAME
                        Name of the cluster to resize. Defaults to the name
                        included in the bastion
  --nodes NODES [NODES ...]
                        List of node hostnames to operate on
  --slurm_only_update [{true,false}]
                        Update /etc/hosts and the Slurm config, and restart
                        Slurm services only
```
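For reference, the help output above corresponds to an argparse interface along these lines. This is a reconstruction from the help text, not the script's actual source, so details may differ.

```python
# Parser reconstructed from the resize.py help text; the real script's
# parser may differ in details.
import argparse

parser = argparse.ArgumentParser(description="Script to resize the CN")
parser.add_argument("mode", nargs="?",
                    choices=["add", "remove", "list", "reconfigure"],
                    help="Mode type")
parser.add_argument("number", nargs="?", type=int,
                    help="Number of nodes to add or delete if a list of "
                         "hostnames is not defined")
parser.add_argument("--compartment_ocid", help="OCID of the compartment")
parser.add_argument("--cluster_name", help="Name of the cluster to resize")
parser.add_argument("--nodes", nargs="+",
                    help="List of node hostnames to operate on")
parser.add_argument("--slurm_only_update", nargs="?",
                    choices=["true", "false"], const="true")

# --nodes consumes all following hostnames (space separated).
args = parser.parse_args(["remove", "--nodes", "node1", "node2"])
print(args.mode, args.nodes)
```

Note how `--nodes` uses `nargs="+"`, which is why hostnames on the command line are space separated rather than comma separated.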
**Add nodes**

Adding nodes consists of the following sub-steps:
- Add nodes (IaaS provisioning) to the cluster – uses the OCI Python SDK
- Configure the nodes (uses Ansible):
  - configures the newly added nodes so they are ready to run jobs
  - reconfigures services such as Slurm on all nodes so they recognize the new nodes

Add one node:
```
python3 playbooks/resize.py add 1
```

Add three nodes:
```
python3 playbooks/resize.py add 3
```
**Remove nodes**

Removing nodes consists of the following sub-steps:
- Remove the node(s) (IaaS termination) from the cluster – uses the OCI Python SDK
- Reconfigure the rest of the nodes in the cluster (uses Ansible):
  - removes references to the removed node(s) on the remaining nodes (e.g. updates /etc/hosts, Slurm configs, etc.)

Remove a specific node:
```
python3 playbooks/resize.py remove --nodes inst-dpi8e-assuring-woodcock
```
or remove a list of nodes (space separated):
```
python3 playbooks/resize.py remove --nodes inst-dpi8e-assuring-woodcock inst-ed5yh-assuring-woodcock
```
or remove one node picked at random:
```
python3 playbooks/resize.py remove 1
```
or remove three nodes picked at random:
```
python3 playbooks/resize.py remove 3
```
**Reconfigure nodes**

This lets users reconfigure nodes of the cluster by re-running the Ansible tasks.

To do a **Slurm config update only** on all nodes of the cluster:
```
python3 playbooks/resize.py reconfigure --slurm_only_update true
```

To fully reconfigure all nodes of the cluster. This runs the same steps that are run when a new cluster is created, so if you manually changed configs that are created or updated as part of cluster configuration, this command will overwrite your manual changes:
```
python3 playbooks/resize.py reconfigure
```

To fully reconfigure only a specific node or nodes (space separated):
```
python3 playbooks/resize.py reconfigure [--nodes NODES [NODES ...]]
```
Example:
```
python3 playbooks/resize.py reconfigure --nodes inst-gsezk-topical-goblin inst-jvpps-topical-goblin
```
## Resizing (via OCI console)

**Things to consider:**
- If you resize from the OCI console to reduce the cluster network / instance pool size (scale down), the OCI platform decides which nodes to terminate (oldest nodes first).
- The OCI console only resizes the cluster network / instance pool; it does not execute the Ansible tasks (HPC Cluster Stack) required to configure the newly added nodes, or to update the existing nodes when a node is removed (e.g. updating /etc/hosts, Slurm config, etc.).
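Because the console skips the Ansible tasks, one plausible follow-up after a console resize is to re-run the reconfigure step from the bastion so /etc/hosts and the Slurm config match the new node set. This is a suggested usage based on resize.py's documented reconfigure mode, not an officially prescribed procedure:

```
python3 playbooks/resize.py reconfigure --slurm_only_update true
```

Use a full `reconfigure` (without `--slurm_only_update`) if newly added nodes also need their initial configuration.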
# Autoscaling

configure.sh

Lines changed: 13 additions & 1 deletion
@@ -8,6 +8,18 @@
#
execution=1

if [ -n "$1" ]; then
  playbook=$1
else
  playbook="/opt/oci-hpc/playbooks/site.yml"
fi

if [ -n "$2" ]; then
  inventory=$2
else
  inventory="/etc/ansible/hosts"
fi
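The two `if`/`else` blocks above give the script optional positional arguments with fallback defaults. A roughly equivalent standalone sketch (not part of configure.sh itself) uses Bash's `${parameter:-default}` expansion:

```shell
#!/bin/bash
# Default-argument handling via ${parameter:-default}: falls back to the
# default when the positional parameter is unset or empty.
playbook="${1:-/opt/oci-hpc/playbooks/site.yml}"
inventory="${2:-/etc/ansible/hosts}"
echo "playbook=$playbook inventory=$inventory"
```

The `-n "$1"` test in configure.sh behaves the same way for empty arguments, since an empty string also fails the non-empty test and falls through to the default.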
ssh_options="-i ~/.ssh/cluster.key -o StrictHostKeyChecking=no"

if [ -f /opt/oci-hpc/playbooks/inventory ] ; then
@@ -58,7 +70,7 @@ done

if [[ $execution -eq 1 ]] ; then
  ANSIBLE_HOST_KEY_CHECKING=False ansible --private-key ~/.ssh/cluster.key all -m setup --tree /tmp/ansible > /dev/null 2>&1
-  ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook --private-key ~/.ssh/cluster.key /opt/oci-hpc/playbooks/site.yml
+  ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook --private-key ~/.ssh/cluster.key $playbook -i $inventory
else
  cat <<- EOF > /tmp/motd
