Commit 8a01cbd

Revert "simplified resize.py script to only run on nodes which are mandatory and minimize ansible tasks ran by the script, add 2 new roles slurm-config-update and slurm-install-addnode to support updating only slurm config"
This reverts commit 8719093.
1 parent c07261d commit 8a01cbd

File tree

29 files changed: +4 −1010 lines

README.md

Lines changed: 4 additions & 94 deletions
````diff
@@ -11,7 +11,8 @@ allow service compute_management to read app-catalog-listing in tenancy
 allow group user to manage all-resources in compartment compartmentName
 ```
 
-
+## What is cluster resizing (resize.py) ?
+TODO
 
 ## What is cluster autoscaling ?
 TODO
````
````diff
@@ -37,99 +38,8 @@ or:
 `Allow dynamic-group instance_principal to manage all-resources in compartment compartmentName`
 
 
-# Cluster Network Resizing (via resize.py)
-
-Cluster resizing refers to ability to add or remove nodes from an existing cluster network. It only applies to nodes with RDMA RoCEv2 (aka: cluster network) NICs, so HPC clusters created using BM.HPC2.36, BM.Optimized3.36 and BM.GPU4.8. Apart from add/remove, the resize.py script can also be used to reconfigure the nodes.
-
-Resizing of HPC cluster with Cluster Network consist of 2 major sub-steps:
-- Add/Remove node (IaaS provisioning) to cluster – uses OCI Python SDK
-- Configure the nodes (uses Ansible)
-  - Configures newly added nodes to be ready to run the jobs
-  - Reconfigure services like Slurm to recognize new nodes on all nodes
-  - Update rest of the nodes, when any node/s are removed (eg: Slurm config, /etc/hosts, etc.)
-
-## resize.py usage
-
-The resize.py is deployed on the bastion node as part of the HPC cluster Stack deployment.
-
-```
-playbooks/resize.py
-```
-
-**Add nodes**
-
-Consist of the following sub-steps:
-- Add node (IaaS provisioning) to cluster – uses OCI Python SDK
-- Configure the nodes (uses Ansible)
-  - Configures newly added nodes to be ready to run the jobs
-  - Reconfigure services like Slurm to recognize new nodes on all nodes
-
-
-```
-python3 playbooks/resize.py.aug15 add 1
-
-```
-
-**Remove nodes**
-
-Consist of the following sub-steps:
-- Remove node/s (IaaS termination) from cluster – uses OCI Python SDK
-- Reconfigure rest of the nodes in the cluster (uses Ansible)
-  - Remove reference to removed node/s on rest of the nodes (eg: update /etc/hosts, slurm configs, etc.)
-
-
-Remove specific node:
-```
-python3 playbooks/resize.py.aug15 remove --nodes inst-dpi8e-assuring-woodcock
-```
-or
-
-Remove a list of nodes (space seperated):
-```
-python3 playbooks/resize.py.aug15 remove --nodes inst-dpi8e-assuring-woodcock inst-ed5yh-assuring-woodcock
-```
-or
-Remove one node randomly:
-```
-python3 playbooks/resize.py.aug15 remove 1
-```
-or
-Remove 3 nodes randomly:
-```
-python3 playbooks/resize.py.aug15 remove 3
-
-```
-
-**Reconfigure nodes**
-
-This allows users to reconfigure nodes (Ansible tasks) of the cluster.
-
-If you would like to do a slurm config update on all nodes of the cluster.
-
-```
-python3 playbooks/resize.py.aug15 reconfigure --slurm_only_update true
-```
-
-Full reconfiguration of all nodes of the cluster. This runs the same steps, which are ran when a new cluster is created. If you manually updated configs which are created/updated as part of cluster configuration, then this command will overwrite your manual changes.
-
-```
-python3 playbooks/resize.py.aug15 reconfigure
-```
-
-If you would like to fully reconfigure ONLY a specific node/nodes.
-
-```
-python3 playbooks/resize.py.aug15 reconfigure [--nodes NODES [NODES ...]]
-Example: python3 resize.py.aug15 reconfigure --nodes inst-gsezk-topical-goblin inst-jvpps-topical-goblin inst-ytuqj-topical-goblin
-```
-
-
-
-## Resizing (via OCI console)
-**Things to consider:**
-- If you resize from OCI console to reduce cluster network/instance pool size(scale down), the OCI platform decides which node to terminate (oldest node first)
-- OCI console only resizes the Cluster Network/Instance Pool, but it doesn't execute the ansible tasks (HPC Cluster Stack) required to configure the newly added nodes or to update the existing nodes when a node is removed (eg: updating /etc/hosts, slurm config, etc).
-
+# Resizing (via resize.py or OCI console)
+TODO
 
 
 # Autoscaling
````
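The README text removed by this revert describes a three-subcommand CLI: `add N`, `remove [N | --nodes NAME ...]`, and `reconfigure [--nodes ...] [--slurm_only_update true]`. As a rough sketch of that interface, the argument surface could be modeled with `argparse` as below. Note this is a hypothetical reconstruction inferred from the usage examples in the diff, not the actual `resize.py` source, which may parse its arguments differently.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the resize.py CLI surface implied by
    # the reverted README; the real script's argument handling may differ.
    parser = argparse.ArgumentParser(prog="resize.py")
    sub = parser.add_subparsers(dest="command", required=True)

    # `add N`: provision N new nodes (OCI SDK), then configure them (Ansible).
    p_add = sub.add_parser("add")
    p_add.add_argument("count", type=int)

    # `remove [N | --nodes NAME ...]`: terminate N random nodes, or the named
    # ones, then update /etc/hosts and Slurm config on the remaining nodes.
    p_rm = sub.add_parser("remove")
    p_rm.add_argument("count", type=int, nargs="?")
    p_rm.add_argument("--nodes", nargs="+")

    # `reconfigure [--nodes ...] [--slurm_only_update true]`: rerun the
    # cluster-configuration Ansible tasks, optionally Slurm config only.
    p_rc = sub.add_parser("reconfigure")
    p_rc.add_argument("--nodes", nargs="+")
    p_rc.add_argument("--slurm_only_update", default="false")
    return parser

args = build_parser().parse_args(
    ["remove", "--nodes", "inst-dpi8e-assuring-woodcock"]
)
print(args.command, args.nodes)  # → remove ['inst-dpi8e-assuring-woodcock']
```

This mirrors why the reverted commit could split out a `slurm_only_update` fast path: a Slurm-config-only run touches far fewer Ansible tasks than a full `reconfigure`.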

playbooks/resize_add_nodes.yml

Lines changed: 0 additions & 166 deletions
This file was deleted.

playbooks/resize_slurm_only_update.yml

Lines changed: 0 additions & 29 deletions
This file was deleted.

playbooks/roles/slurm-config-update/defaults/main.yml

Lines changed: 0 additions & 1 deletion
This file was deleted.

playbooks/roles/slurm-config-update/files/files

Whitespace-only changes.

playbooks/roles/slurm-config-update/handlers/main.yml

Lines changed: 0 additions & 20 deletions
This file was deleted.

playbooks/roles/slurm-config-update/tasks/cleanup.yml

Lines changed: 0 additions & 5 deletions
This file was deleted.

playbooks/roles/slurm-config-update/tasks/common.yml

Lines changed: 0 additions & 28 deletions
This file was deleted.
