Commit 25c5704

Merge pull request #19 from oracle-quickstart/2.10.1
2.10.1
2 parents 10a42e0 + 87caac8 commit 25c5704

129 files changed (+2612 -735 lines changed)

README.md

Lines changed: 43 additions & 3 deletions
@@ -265,7 +265,7 @@ Example:
 The name of the cluster must be
 queueName-clusterNumber-instanceType_keyword

-The keyword will need to match the one from /opt/oci-hpc/conf/queues.conf to be regirstered in Slurm
+The keyword will need to match the one from /opt/oci-hpc/conf/queues.conf to be registered in Slurm

 ### Cluster Deletion:
 ```
@@ -293,8 +293,8 @@ Example of cluster command to add a new user:
 ```cluster user add name```
 By default, a `privilege` group is created that has access to the NFS and can have sudo access on all nodes (Defined at the stack creation. This group has ID 9876) The group name can be modified.
 ```cluster user add name --gid 9876```
-To generate a user-specific key for passwordless ssh between nodes, use --ssh.
-```cluster user add name --ssh --gid 9876```
+To avoid generating a user-specific key for passwordless ssh between nodes, use --nossh.
+```cluster user add name --nossh --gid 9876```

 # Shared home folder

@@ -318,3 +318,43 @@ $ max_nodes --> Information about all the partitions and their respective cluste

 $ max_nodes --include_cluster_names xxx yyy zzz --> where xxx, yyy, zzz are cluster names. Provide a space separated list of cluster names to be considered for displaying the information about clusters and maximum number of nodes distributed evenly per partition

+
+## validation.py usage
+
+Use the alias "validate" to run the python script validation.py. You can run this script only from bastion.
+
+The script performs these checks.
+-> Check the number of nodes is consistent across resize, /etc/hosts, slurm, topology.conf, OCI console, inventory files.
+-> PCIe bandwidth check
+-> GPU Throttle check
+-> Check whether md5 sum of /etc/hosts file on nodes matches that on bastion
+
+Provide at least one argument: [-n NUM_NODES] [-p PCIE] [-g GPU_THROTTLE] [-e ETC_HOSTS]
+
+Optional argument with [-n NUM_NODES] [-p PCIE] [-g GPU_THROTTLE] [-e ETC_HOSTS]: [-cn CLUSTER_NAMES]
+Provide a file that lists each cluster on a separate line for which you want to validate the number of nodes and/or pcie check and/or gpu throttle check and/or /etc/hosts md5 sum.
+
+For pcie, gpu throttle, and /etc/hosts md5 sum check, you can either provide y or Y along with -cn or you can give the hostfile path (each host on a separate line) for each argument. For number of nodes check, either provide y or give y along with -cn.
+
+Below are some examples for running this script.
+
+validate -n y --> This will validate that the number of nodes is consistent across resize, /etc/hosts, slurm, topology.conf, OCI console, inventory files. The clusters considered will be the default cluster if any and cluster(s) found in /opt/oci-hpc/autoscaling/clusters directory. The number of nodes considered will be from the resize script using the clusters we got before.
+
+validate -n y -cn <cluster name file> --> This will validate that the number of nodes is consistent across resize, /etc/hosts, slurm, topology.conf, OCI console, inventory files. It will also check whether md5 sum of /etc/hosts file on all nodes matches that on bastion. The clusters considered will be from the file specified by -cn option. The number of nodes considered will be from the resize script using the clusters from the file.
+
+validate -p y -cn <cluster name file> --> This will run the pcie bandwidth check. The clusters considered will be from the file specified by -cn option. The number of nodes considered will be from the resize script using the clusters from the file.
+
+validate -p <pcie host file> --> This will run the pcie bandwidth check on the hosts provided in the file given. The pcie host file should have a host name on each line.
+
+validate -g y -cn <cluster name file> --> This will run the GPU throttle check. The clusters considered will be from the file specified by -cn option. The number of nodes considered will be from the resize script using the clusters from the file.
+
+validate -g <gpu check host file> --> This will run the GPU throttle check on the hosts provided in the file given. The gpu check host file should have a host name on each line.
+
+validate -e y -cn <cluster name file> --> This will run the GPU throttle check. The clusters considered will be from the file specified by -cn option. The number of nodes considered will be from the resize script using the clusters from the file.
+
+validate -e <md5 sum check host file> --> This will run the /etc/hosts md5 sum check on the hosts provided in the file given. The md5 sum check host file should have a host name on each line.
+
+You can combine all the options together such as:
+validate -n y -p y -g y -e y -cn <cluster name file>
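The /etc/hosts md5-sum check added above amounts to comparing one hash per node against the bastion's own hash. Below is a minimal sketch of that comparison, assuming passwordless ssh from the bastion and using made-up host names; it is an illustration, not the actual validation.py code.

```python
import hashlib
import subprocess

def local_md5(path="/etc/hosts"):
    # Hash the bastion's copy of the file.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def remote_md5(host, path="/etc/hosts"):
    # Assumes passwordless ssh from the bastion to the node.
    out = subprocess.run(["ssh", host, "md5sum", path],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()[0]

bastion_md5 = local_md5()
for host in ["compute-1-node-1", "compute-1-node-2"]:  # hypothetical host names
    status = "OK" if remote_md5(host) == bastion_md5 else "MISMATCH"
    print(f"{host}: {status}")
```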

autoscaling/crontab/autoscale_slurm.sh

Lines changed: 6 additions & 2 deletions
@@ -169,6 +169,11 @@ def getClusterName(node):
         for output in stdout.split('\n')[:-1]:
             if "Switches=" in output:
                 clusterName=output.split()[0].split('SwitchName=')[1]
+                break
+            elif "SwitchName=inactive-" in output:
+                continue
+            else:
+                clusterName=output.split()[0].split('SwitchName=')[1]
     elif len(stdout.split('\n')) == 2:
         clusterName=stdout.split('\n')[0].split()[0].split('SwitchName=')[1]
     if clusterName.startswith("inactive-"):
@@ -352,7 +357,7 @@ try:
         cluster_name=cluster[0]
         print ("Deleting cluster "+cluster_name)
         subprocess.Popen([script_path+'/delete_cluster.sh',cluster_name])
-        time.sleep(1)
+        time.sleep(5)

     for cluster_name in nodes_to_destroy.keys():
         print ("Resizing cluster "+cluster_name)
@@ -374,7 +379,6 @@ try:
         subprocess.Popen([script_path+'/resize.sh','--force','--cluster_name',cluster_name,'remove','--remove_unreachable','--nodes']+initial_nodes)
         if len(unreachable_nodes) > 0:
             subprocess.Popen([script_path+'/resize.sh','--cluster_name',cluster_name,'remove_unreachable','--nodes']+unreachable_nodes)
-
         time.sleep(1)

     for index,cluster in enumerate(cluster_to_build):
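The first hunk changes how getClusterName picks a switch name out of topology output: it now stops at the first line containing Switches= and skips SwitchName=inactive- placeholder entries instead of letting them overwrite the name. A standalone sketch of that parsing logic, with hypothetical scontrol-style sample lines (not the script itself):

```python
# Illustrative only: mimics the SwitchName parsing shown in the hunk above.
def parse_cluster_name(lines):
    cluster_name = None
    for line in lines:
        if "Switches=" in line:
            # A switch entry that lists child switches: take its name and stop.
            cluster_name = line.split()[0].split('SwitchName=')[1]
            break
        elif "SwitchName=inactive-" in line:
            # Placeholder switch for inactive nodes: ignore it.
            continue
        else:
            cluster_name = line.split()[0].split('SwitchName=')[1]
    return cluster_name

sample = [
    "SwitchName=inactive-queue-1 Nodes=node-[1-2]",      # hypothetical
    "SwitchName=compute-cluster-1 Switches=leaf-[1-4]",  # hypothetical
]
print(parse_cluster_name(sample))  # -> compute-cluster-1
```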

autoscaling/tf_init/bastion_update.tf

Lines changed: 9 additions & 3 deletions
@@ -22,10 +22,14 @@ resource "local_file" "inventory" {
     bastion_ip = var.bastion_ip,
     backup_name = var.backup_name,
     backup_ip = var.backup_ip,
+    login_name = var.login_name,
+    login_ip = var.login_ip,
     compute = var.node_count > 0 ? zipmap(local.cluster_instances_names, local.cluster_instances_ips) : zipmap([],[])
     public_subnet = var.public_subnet,
     private_subnet = var.private_subnet,
-    nfs = local.cluster_instances_names[0],
+    rdma_network = cidrhost(var.rdma_subnet, 0),
+    rdma_netmask = cidrnetmask(var.rdma_subnet),
+    nfs = var.use_scratch_nfs ? local.cluster_instances_names[0] : "",
     scratch_nfs = var.use_scratch_nfs,
     cluster_nfs = var.use_cluster_nfs,
     home_nfs = var.home_nfs,
@@ -53,7 +57,7 @@ resource "local_file" "inventory" {
     cluster_mount_ip = local.mount_ip,
     cluster_name = local.cluster_name,
     shape = var.cluster_network ? var.cluster_network_shape : var.instance_pool_shape,
-    instance_pool_ocpus=var.instance_pool_ocpus,
+    instance_pool_ocpus=local.instance_pool_ocpus,
     queue=var.queue,
     instance_type=var.instance_type,
     autoscaling_monitoring = var.autoscaling_monitoring,
@@ -63,7 +67,9 @@ resource "local_file" "inventory" {
     privilege_group_name = var.privilege_group_name,
     latency_check = var.latency_check
     bastion_username = var.bastion_username,
-    compute_username = var.compute_username
+    compute_username = var.compute_username,
+    pam = var.pam,
+    sacct_limits = var.sacct_limits
     })
   filename = "${local.bastion_path}/inventory"
 }

autoscaling/tf_init/inventory.tpl

Lines changed: 7 additions & 3 deletions
@@ -2,6 +2,8 @@
 ${bastion_name} ansible_host=${bastion_ip} ansible_user=${bastion_username} role=bastion
 [slurm_backup]
 %{ if backup_name != "" }${backup_name} ansible_host=${backup_ip} ansible_user=${bastion_username} role=bastion%{ endif }
+[login]
+%{ if login_name != "" }${login_name} ansible_host=${login_ip} ansible_user=${compute_username} role=login%{ endif }
 [compute_to_add]
 [compute_configured]
 %{ for host, ip in compute ~}
@@ -12,15 +14,15 @@ ${host} ansible_host=${ip} ansible_user=${compute_username} role=compute
 compute_to_add
 compute_configured
 [nfs]
-${nfs}
+%{ if nfs != "" }${nfs} ansible_user=${compute_username} role=nfs%{ endif }
 [all:children]
 bastion
 compute
 [all:vars]
 ansible_connection=ssh
 ansible_user=${compute_username}
-rdma_network=192.168.128.0
-rdma_netmask=255.255.240.0
+rdma_network=${rdma_network}
+rdma_netmask=${rdma_netmask}
 public_subnet=${public_subnet}
 private_subnet=${private_subnet}
 nvme_path=/mnt/localdisk/
@@ -62,3 +64,5 @@ privilege_group_name=${privilege_group_name}
 latency_check=${latency_check}
 compute_username=${compute_username}
 bastion_username=${bastion_username}
+pam = ${pam}
+sacct_limits=${sacct_limits}
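Here the hardcoded RDMA values become template variables; bastion_update.tf above fills them with cidrhost(var.rdma_subnet, 0) and cidrnetmask(var.rdma_subnet). A quick sketch of the equivalent computation, using 192.168.128.0/20 only as an example CIDR (it reproduces the old hardcoded defaults):

```python
import ipaddress

# Example only: derive the values that cidrhost(subnet, 0) and
# cidrnetmask(subnet) would produce for a given rdma_subnet CIDR.
rdma_subnet = "192.168.128.0/20"  # assumed example CIDR
net = ipaddress.ip_network(rdma_subnet)

rdma_network = str(net.network_address)  # -> 192.168.128.0
rdma_netmask = str(net.netmask)          # -> 255.255.240.0
print(rdma_network, rdma_netmask)
```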

autoscaling/tf_init/locals.tf

Lines changed: 4 additions & 1 deletion
@@ -3,6 +3,9 @@ locals {
   cluster_instances_ids = var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.id : data.oci_core_instance.instance_pool_instances.*.id
   cluster_instances_names = var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.display_name : data.oci_core_instance.instance_pool_instances.*.display_name
   image_ocid = var.unsupported ? var.image_ocid : var.image
+
+  shape = var.cluster_network ? var.cluster_network_shape : var.instance_pool_shape
+  instance_pool_ocpus = local.shape == "VM.DenseIO.E4.Flex" ? var.instance_pool_ocpus_denseIO_flex : var.instance_pool_ocpus
   // ips of the instances
   cluster_instances_ips = var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.private_ip : data.oci_core_instance.instance_pool_instances.*.private_ip

@@ -20,7 +23,7 @@ locals {
   // image = (var.cluster_network && var.use_marketplace_image == true) || (var.cluster_network == false && var.use_marketplace_image == false) ? var.image : data.oci_core_images.linux.images.0.id

   // is_bastion_flex_shape = length(regexall(".*VM.*.*Flex$", var.bastion_shape)) > 0 ? [var.bastion_ocpus]:[]
-  is_instance_pool_flex_shape = length(regexall(".*VM.*.*Flex$", var.instance_pool_shape)) > 0 ? [var.instance_pool_ocpus]:[]
+  is_instance_pool_flex_shape = length(regexall(".*VM.*.*Flex$", var.instance_pool_shape)) > 0 ? [local.instance_pool_ocpus]:[]

   // bastion_mount_ip = var.bastion_block ? element(concat(oci_core_volume_attachment.bastion_volume_attachment.*.ipv4, [""]), 0) : "none"

autoscaling/tf_init/versions.tf

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ terraform {
   required_providers {
     oci = {
       source = "oracle/oci"
-      version = "4.99.0"
+      version = "4.112.0"
     }
   }
 }

bastion.tf

Lines changed: 27 additions & 11 deletions
@@ -74,6 +74,7 @@ resource "null_resource" "bastion" {

   provisioner "remote-exec" {
     inline = [
+      "#!/bin/bash",
       "sudo mkdir -p /opt/oci-hpc",
       "sudo chown ${var.bastion_username}:${var.bastion_username} /opt/oci-hpc/",
       "mkdir -p /opt/oci-hpc/bin",
@@ -176,6 +177,7 @@ resource "null_resource" "bastion" {

   provisioner "remote-exec" {
     inline = [
+      "#!/bin/bash",
       "chmod 600 /home/${var.bastion_username}/.ssh/cluster.key",
       "cp /home/${var.bastion_username}/.ssh/cluster.key /home/${var.bastion_username}/.ssh/id_rsa",
       "chmod a+x /opt/oci-hpc/bin/*.sh",
@@ -201,12 +203,14 @@ resource "null_resource" "cluster" {
       bastion_ip = oci_core_instance.bastion.private_ip,
       backup_name = var.slurm_ha ? oci_core_instance.backup[0].display_name : "",
       backup_ip = var.slurm_ha ? oci_core_instance.backup[0].private_ip: "",
+      login_name = var.login_node ? oci_core_instance.login[0].display_name : "",
+      login_ip = var.login_node ? oci_core_instance.login[0].private_ip: "",
       compute = var.node_count > 0 ? zipmap(local.cluster_instances_names, local.cluster_instances_ips) : zipmap([],[])
       public_subnet = data.oci_core_subnet.public_subnet.cidr_block,
       private_subnet = data.oci_core_subnet.private_subnet.cidr_block,
       rdma_network = cidrhost(var.rdma_subnet, 0),
       rdma_netmask = cidrnetmask(var.rdma_subnet),
-      nfs = var.node_count > 0 ? local.cluster_instances_names[0] : "",
+      nfs = var.node_count > 0 && var.use_scratch_nfs ? local.cluster_instances_names[0] : "",
       home_nfs = var.home_nfs,
       create_fss = var.create_fss,
       home_fss = var.home_fss,
@@ -232,8 +236,8 @@
       cluster_mount_ip = local.mount_ip,
       autoscaling = var.autoscaling,
       cluster_name = local.cluster_name,
-      shape = var.cluster_network ? var.cluster_network_shape : var.instance_pool_shape,
-      instance_pool_ocpus = var.instance_pool_ocpus,
+      shape = local.shape,
+      instance_pool_ocpus = local.instance_pool_ocpus,
       queue=var.queue,
       monitoring = var.monitoring,
       hyperthreading = var.hyperthreading,
@@ -248,7 +252,14 @@
       pyxis = var.pyxis,
       privilege_sudo = var.privilege_sudo,
       privilege_group_name = var.privilege_group_name,
-      latency_check = var.latency_check
+      latency_check = var.latency_check,
+      pam = var.pam,
+      sacct_limits = var.sacct_limits,
+      inst_prin = var.inst_prin,
+      region = var.region,
+      tenancy_ocid = var.tenancy_ocid,
+      api_fingerprint = var.api_fingerprint,
+      api_user_ocid = var.api_user_ocid
       })

     destination = "/opt/oci-hpc/playbooks/inventory"
@@ -303,7 +314,7 @@
       private_subnet = data.oci_core_subnet.private_subnet.cidr_block,
       private_subnet_id = local.subnet_id,
       targetCompartment = var.targetCompartment,
-      instance_pool_ocpus = var.instance_pool_ocpus,
+      instance_pool_ocpus = local.instance_pool_ocpus,
       instance_pool_memory = var.instance_pool_memory,
       instance_pool_custom_memory = var.instance_pool_custom_memory,
       queue=var.queue,
@@ -325,14 +336,18 @@
       bastion_ip = oci_core_instance.bastion.private_ip,
       backup_name = var.slurm_ha ? oci_core_instance.backup[0].display_name : "",
       backup_ip = var.slurm_ha ? oci_core_instance.backup[0].private_ip: "",
+      login_name = var.login_node ? oci_core_instance.login[0].display_name : "",
+      login_ip = var.login_node ? oci_core_instance.login[0].private_ip: "",
       compute = var.node_count > 0 ? zipmap(local.cluster_instances_names, local.cluster_instances_ips) : zipmap([],[])
       public_subnet = data.oci_core_subnet.public_subnet.cidr_block,
       public_subnet_id = local.bastion_subnet_id,
       private_subnet = data.oci_core_subnet.private_subnet.cidr_block,
       private_subnet_id = local.subnet_id,
+      rdma_subnet = var.rdma_subnet,
       nfs = var.node_count > 0 ? local.cluster_instances_names[0] : "",
       scratch_nfs = var.use_scratch_nfs && var.node_count > 0,
       scratch_nfs_path = var.scratch_nfs_path,
+      use_scratch_nfs = var.use_scratch_nfs,
       slurm = var.slurm,
       rack_aware = var.rack_aware,
       slurm_nfs_path = var.add_nfs ? var.nfs_source_path : var.cluster_nfs_path
@@ -376,7 +391,9 @@
       private_deployment = var.private_deployment,
       use_multiple_ads = var.use_multiple_ads,
       bastion_username = var.bastion_username,
-      compute_username = var.compute_username
+      compute_username = var.compute_username,
+      pam = var.pam,
+      sacct_limits = var.sacct_limits
       })

     destination = "/opt/oci-hpc/conf/variables.tf"
@@ -409,7 +426,7 @@ provisioner "file" {
   }
   provisioner "file" {
     content = base64decode(var.api_user_key)
-    destination = "/opt/oci-hpc/autoscaling/credentials/key.initial"
+    destination = "/opt/oci-hpc/autoscaling/credentials/key.pem"
     connection {
       host = local.host
       type = "ssh"
@@ -420,13 +437,12 @@ provisioner "file" {

   provisioner "remote-exec" {
     inline = [
+      "#!/bin/bash",
       "chmod 755 /opt/oci-hpc/autoscaling/crontab/*.sh",
-      "chmod 755 /opt/oci-hpc/autoscaling/credentials/key.sh",
-      "/opt/oci-hpc/autoscaling/credentials/key.sh /opt/oci-hpc/autoscaling/credentials/key.initial /opt/oci-hpc/autoscaling/credentials/key.pem > /opt/oci-hpc/autoscaling/credentials/key.log",
       "chmod 600 /opt/oci-hpc/autoscaling/credentials/key.pem",
       "echo ${var.configure} > /tmp/configure.conf",
-      "timeout 2h /opt/oci-hpc/bin/configure.sh",
-      "exit_code=$?",
+      "timeout 2h /opt/oci-hpc/bin/configure.sh | tee /opt/oci-hpc/logs/initial_configure.log",
+      "exit_code=$${PIPESTATUS[0]}",
       "/opt/oci-hpc/bin/initial_monitoring.sh",
       "exit $exit_code" ]
     connection {
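The final hunk pipes configure.sh through tee to capture a log, so `$?` would now report tee's exit status rather than configure.sh's; the script therefore reads `PIPESTATUS[0]` (written as `$${PIPESTATUS[0]}` so Terraform does not interpolate it). A rough Python analogue of the same idea, keeping the producer's exit status while still logging its output; the command and log path below are placeholders, not the stack's actual files:

```python
import subprocess

# Stand-in for "timeout 2h configure.sh": any command with its own exit code.
cmd = ["bash", "-c", "echo configuring...; exit 3"]
log_path = "/tmp/initial_configure.log"  # placeholder log location

with open(log_path, "w") as log:
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:   # "tee": echo each line and record it
        print(line, end="")
        log.write(line)
    proc.wait()

exit_code = proc.returncode    # the producer's status, not the logger's
print("configure exited with", exit_code)
```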
