Commit 1ff2d51

Merge branch 'v2.7.3' of https://github.com/oci-hpc/oci-hpc-clusternetwork-dev into v2.7.3
2 parents 01c081a + 8a01cbd commit 1ff2d51

29 files changed: 4 additions, 1055 deletions

README.md

Lines changed: 4 additions & 139 deletions
@@ -11,7 +11,8 @@ allow service compute_management to read app-catalog-listing in tenancy
 allow group user to manage all-resources in compartment compartmentName
 ```
 
-
+## What is cluster resizing (resize.py) ?
+TODO
 
 ## What is cluster autoscaling ?
 TODO
@@ -37,144 +38,8 @@ or:
 `Allow dynamic-group instance_principal to manage all-resources in compartment compartmentName`
 
 
-# Cluster Network Resizing (via resize.py)
-
-Cluster resizing refers to ability to add or remove nodes from an existing cluster network. It only applies to nodes with RDMA RoCEv2 (aka: cluster network) NICs, so HPC clusters created using BM.HPC2.36, BM.Optimized3.36 and BM.GPU4.8. Apart from add/remove, the resize.py script can also be used to reconfigure the nodes.
-
-Resizing of HPC cluster with Cluster Network consist of 2 major sub-steps:
-- Add/Remove node (IaaS provisioning) to cluster – uses OCI Python SDK
-- Configure the nodes (uses Ansible)
-- Configures newly added nodes to be ready to run the jobs
-- Reconfigure services like Slurm to recognize new nodes on all nodes
-- Update rest of the nodes, when any node/s are removed (eg: Slurm config, /etc/hosts, etc.)
-
-## resize.py usage
-
-The resize.py is deployed on the bastion node as part of the HPC cluster Stack deployment.
-
-```
-playbooks/resize.py
-
-python3 playbooks/resize.py -h
-usage: resize.py [-h] [--compartment_ocid COMPARTMENT_OCID]
-[--cluster_name CLUSTER_NAME] [--nodes NODES [NODES ...]]
-[--slurm_only_update [{true,false}]]
-[{add,remove,list,reconfigure}] [number]
-
-Script to resize the CN
-
-positional arguments:
-{add,remove,list,reconfigure}
-Mode type. add/remove node options, implicitly
-configures newly added nodes. Also implicitly
-reconfigure/restart services like Slurm to recognize
-new nodes. Similarly for remove option, terminates
-nodes and implicitly reconfigure/restart services like
-Slurm on rest of the cluster nodes to remove reference
-to deleted nodes.
-number Number of nodes to add or delete if a list of
-hostnames is not defined
-
-optional arguments:
--h, --help show this help message and exit
---compartment_ocid COMPARTMENT_OCID
-OCID of the compartment, defaults to the Compartment
-OCID of the localhost
---cluster_name CLUSTER_NAME
-Name of the cluster to resize. Defaults to the name
-included in the bastion
---nodes NODES [NODES ...]
-Number of nodes to add or delete if a list of
-hostnames is not defined
---slurm_only_update [{true,false}]
-To update /etc/hosts, slurm config and restart slurm
-services.
-[opc@assuring-woodcock-bastion ~]$
-
-
-```
-
-**Add nodes**
-
-Consist of the following sub-steps:
-- Add node (IaaS provisioning) to cluster – uses OCI Python SDK
-- Configure the nodes (uses Ansible)
-- Configures newly added nodes to be ready to run the jobs
-- Reconfigure services like Slurm to recognize new nodes on all nodes
-
-Add one node
-```
-python3 playbooks/resize.py add 1
-
-```
-
-Add three node
-```
-python3 playbooks/resize.py add 3
-
-```
-
-
-**Remove nodes**
-
-Consist of the following sub-steps:
-- Remove node/s (IaaS termination) from cluster – uses OCI Python SDK
-- Reconfigure rest of the nodes in the cluster (uses Ansible)
-- Remove reference to removed node/s on rest of the nodes (eg: update /etc/hosts, slurm configs, etc.)
-
-
-Remove specific node:
-```
-python3 playbooks/resize.py remove --nodes inst-dpi8e-assuring-woodcock
-```
-or
-
-Remove a list of nodes (space seperated):
-```
-python3 playbooks/resize.py remove --nodes inst-dpi8e-assuring-woodcock inst-ed5yh-assuring-woodcock
-```
-or
-Remove one node randomly:
-```
-python3 playbooks/resize.py remove 1
-```
-or
-Remove 3 nodes randomly:
-```
-python3 playbooks/resize.py remove 3
-
-```
-
-**Reconfigure nodes**
-
-This allows users to reconfigure nodes (Ansible tasks) of the cluster.
-
-If you would like to do a **slurm config update ONLY** on all nodes of the cluster.
-
-```
-python3 playbooks/resize.py reconfigure --slurm_only_update true
-```
-
-Full reconfiguration of all nodes of the cluster. This will run the same steps, which are ran when a new cluster is created. If you manually updated configs which are created/updated as part of cluster configuration, then this command will overwrite your manual changes.
-
-```
-python3 playbooks/resize.py reconfigure
-```
-
-If you would like to fully reconfigure ONLY a specific node/nodes (space seperated).
-
-```
-python3 playbooks/resize.py reconfigure [--nodes NODES [NODES ...]]
-Example: python3 resize.py reconfigure --nodes inst-gsezk-topical-goblin inst-jvpps-topical-goblin
-```
-
-
-
-## Resizing (via OCI console)
-**Things to consider:**
-- If you resize from OCI console to reduce cluster network/instance pool size(scale down), the OCI platform decides which node to terminate (oldest node first)
-- OCI console only resizes the Cluster Network/Instance Pool, but it doesn't execute the ansible tasks (HPC Cluster Stack) required to configure the newly added nodes or to update the existing nodes when a node is removed (eg: updating /etc/hosts, slurm config, etc).
-
+# Resizing (via resize.py or OCI console)
+TODO
 
 
 # Autoscaling

playbooks/resize_add_nodes.yml

Lines changed: 0 additions & 166 deletions
This file was deleted.

playbooks/resize_slurm_only_update.yml

Lines changed: 0 additions & 29 deletions
This file was deleted.

playbooks/roles/slurm-config-update/defaults/main.yml

Lines changed: 0 additions & 1 deletion
This file was deleted.

playbooks/roles/slurm-config-update/files/files

Whitespace-only changes.

playbooks/roles/slurm-config-update/handlers/main.yml

Lines changed: 0 additions & 20 deletions
This file was deleted.

playbooks/roles/slurm-config-update/tasks/cleanup.yml

Lines changed: 0 additions & 5 deletions
This file was deleted.
