Revert "simplified resize.py script to only run on nodes which are mandatory and minimize ansible tasks ran by the script, add 2 new roles slurm-config-update and slurm-install-addnode to support updating only slurm config"
README.md
Lines changed: 4 additions & 94 deletions
@@ -11,7 +11,8 @@ allow service compute_management to read app-catalog-listing in tenancy
 allow group user to manage all-resources in compartment compartmentName
 ```
 
-
+## What is cluster resizing (resize.py) ?
+TODO
 
 ## What is cluster autoscaling ?
 TODO
@@ -37,99 +38,8 @@ or:
 `Allow dynamic-group instance_principal to manage all-resources in compartment compartmentName`
 
 
-# Cluster Network Resizing (via resize.py)
-
-Cluster resizing refers to ability to add or remove nodes from an existing cluster network. It only applies to nodes with RDMA RoCEv2 (aka: cluster network) NICs, so HPC clusters created using BM.HPC2.36, BM.Optimized3.36 and BM.GPU4.8. Apart from add/remove, the resize.py script can also be used to reconfigure the nodes.
-
-Resizing of HPC cluster with Cluster Network consist of 2 major sub-steps:
-Full reconfiguration of all nodes of the cluster. This runs the same steps, which are ran when a new cluster is created. If you manually updated configs which are created/updated as part of cluster configuration, then this command will overwrite your manual changes.
-
-```
-python3 playbooks/resize.py.aug15 reconfigure
-```
-
-If you would like to fully reconfigure ONLY a specific node/nodes.
-- If you resize from OCI console to reduce cluster network/instance pool size(scale down), the OCI platform decides which node to terminate (oldest node first)
-- OCI console only resizes the Cluster Network/Instance Pool, but it doesn't execute the ansible tasks (HPC Cluster Stack) required to configure the newly added nodes or to update the existing nodes when a node is removed (eg: updating /etc/hosts, slurm config, etc).