
Commit f0499b7

Merge pull request #28 from oracle-quickstart/2.10.3
2.10.3
2 parents 763d350 + 3c2978a commit f0499b7

105 files changed: +12846 −9565 lines changed


README.md

Lines changed: 52 additions & 0 deletions

@@ -37,6 +37,8 @@ The stack allows various combinations of OS. Here is a list of what has been tested:
 | OL7 | OL7 |
 | OL7 | OL8 |
 | OL7 | CentOS7 |
+| OL8 | OL8 |
+| OL8 | OL7 |
 | Ubuntu 20.04 | Ubuntu 20.04 |
 
 When switching to Ubuntu, make sure the username is changed from opc to ubuntu in the ORM for both the bastion and compute nodes.
@@ -358,3 +360,53 @@ You can combine all the options together such as:
 validate -n y -p y -g y -e y -cn <cluster name file>
 
 
+## /opt/oci-hpc/scripts/collect_logs.py
+
+This is a script to collect the NVIDIA bug report, sosreport, and console history logs.
+
+The script needs to be run from the bastion. If a host is not reachable over SSH, only its console history logs are collected.
+
+It requires the argument:
+--hostname <HOSTNAME>
+
+The argument --compartment-id <COMPARTMENT_ID> is optional (by default, the host is assumed to be in the same compartment as the bastion).
+
+HOSTNAME is the name of the node for which you need the above logs, and COMPARTMENT_ID is the OCID of the compartment where the node resides.
+
+The script collects all of the above logs and puts them in a node-specific folder under /home/{user}. It prints the folder name as its output.
+
+Assumption: to get the console history logs, the script expects the node name to be present in the /etc/hosts file.
+
+Examples:
+
+python3 collect_logs.py --hostname compute-permanent-node-467
+The nvidia bug report, sosreport, and console history logs for compute-permanent-node-467 are at /home/ubuntu/compute-permanent-node-467_06132023191024
+
+python3 collect_logs.py --hostname inst-jxwf6-keen-drake
+The nvidia bug report, sosreport, and console history logs for inst-jxwf6-keen-drake are at /home/ubuntu/inst-jxwf6-keen-drake_11112022001138
+
+for x in $(cat /home/opc/hostlist); do echo $x; python3 collect_logs.py --hostname $x; done
+compute-permanent-node-467
+The nvidia bug report, sosreport, and console history logs for compute-permanent-node-467 are at /home/ubuntu/compute-permanent-node-467_11112022011318
+compute-permanent-node-787
+The nvidia bug report, sosreport, and console history logs for compute-permanent-node-787 are at /home/ubuntu/compute-permanent-node-787_11112022011835
+
+where hostlist has the following contents:
+compute-permanent-node-467
+compute-permanent-node-787
+
+
+## Collect RDMA NIC Metrics and Upload to Object Storage
+
+OCI-HPC is deployed in the customer tenancy, so OCI service teams cannot access metrics from these OCI-HPC stack clusters. To overcome this, this release introduces a feature to collect RDMA NIC metrics and upload them to Object Storage. The Object Storage URL can then be shared with OCI service teams, who can access the metrics and use them for debugging purposes.
+
+To collect RDMA NIC metrics and upload them to Object Storage, follow these steps:
+
+Step 1: Create a PAR (PreAuthenticated Request).
+To create a PAR, select the check-box "Create Object Storage PAR" during Resource Manager stack creation. This check-box is enabled by default; selecting it creates a PAR.
+
+Step 2: Use the shell script upload_rdma_nic_metrics.sh to collect metrics and upload them to Object Storage.
+The metrics collection limit and interval can be configured through the config file rdma_metrics_collection_config.conf.
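
A quick pre-check for the collect_logs.py assumption above: the node name must appear in the bastion's /etc/hosts before console history logs can be gathered. A minimal sketch, using a placeholder hostname taken from the examples:

# Confirm the node name is present in /etc/hosts on the bastion
grep compute-permanent-node-467 /etc/hosts || echo "node not in /etc/hosts; console history collection may fail"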
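
For Step 1 above, the stack's check-box creates the PAR for you; as a hedged alternative, a bucket-level PAR can also be created manually with the OCI CLI. The bucket name, PAR name, and expiry below are placeholders, not values taken from the stack:

# Create a write-only PAR on an existing bucket (placeholder names and date)
oci os preauth-request create \
    --bucket-name rdma-metrics-bucket \
    --name rdma-metrics-par \
    --access-type AnyObjectWrite \
    --time-expires "2025-12-31T00:00:00Z"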
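
For Step 2 above, a minimal usage sketch as run from the bastion. The script directory is an assumption (mirroring the /opt/oci-hpc/scripts path used by collect_logs.py), and the keys inside rdma_metrics_collection_config.conf are defined by the stack, so they are not reproduced here:

# Review the collection limit/interval, then collect RDMA NIC metrics and upload via the PAR
cd /opt/oci-hpc/scripts        # assumed script location
cat rdma_metrics_collection_config.conf
./upload_rdma_nic_metrics.sh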

autoscaling/tf_init/bastion_update.tf

Lines changed: 1 addition & 1 deletion

@@ -16,7 +16,7 @@ resource "local_file" "hosts" {
 }
 
 resource "local_file" "inventory" {
-  depends_on = [oci_core_cluster_network.cluster_network]
+  depends_on = [oci_core_cluster_network.cluster_network, oci_core_cluster_network.cluster_network]
   content = templatefile("${local.bastion_path}/inventory.tpl", {
     bastion_name = var.bastion_name,
     bastion_ip = var.bastion_ip,

autoscaling/tf_init/cluster-network-configuration.tf

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 resource "oci_core_instance_configuration" "cluster-network-instance_configuration" {
-  count = var.cluster_network ? 1 : 0
+  count = ( ! var.compute_cluster ) && var.cluster_network ? 1 : 0
   depends_on = [oci_core_app_catalog_subscription.mp_image_subscription]
   compartment_id = var.targetCompartment
   display_name = local.cluster_name

autoscaling/tf_init/cluster-network.tf

Lines changed: 3 additions & 3 deletions

@@ -1,5 +1,5 @@
 resource "oci_core_volume" "nfs-cluster-network-volume" {
-  count = var.scratch_nfs_type_cluster == "block" && var.node_count > 0 ? 1 : 0
+  count = ( ! var.compute_cluster ) && var.scratch_nfs_type_cluster == "block" && var.node_count > 0 ? 1 : 0
   availability_domain = var.ad
   compartment_id = var.targetCompartment
   display_name = "${local.cluster_name}-nfs-volume"
@@ -9,7 +9,7 @@ resource "oci_core_volume" "nfs-cluster-network-volume" {
 }
 
 resource "oci_core_volume_attachment" "cluster_network_volume_attachment" {
-  count = var.scratch_nfs_type_cluster == "block" && var.node_count > 0 ? 1 : 0
+  count = ( ! var.compute_cluster ) && var.scratch_nfs_type_cluster == "block" && var.node_count > 0 ? 1 : 0
   attachment_type = "iscsi"
   volume_id = oci_core_volume.nfs-cluster-network-volume[0].id
   instance_id = local.cluster_instances_ids[0]
@@ -18,7 +18,7 @@ resource "oci_core_volume_attachment" "cluster_network_volume_attachment" {
 }
 
 resource "oci_core_cluster_network" "cluster_network" {
-  count = var.cluster_network && var.node_count > 0 ? 1 : 0
+  count = ( ! var.compute_cluster ) && var.cluster_network && var.node_count > 0 ? 1 : 0
   depends_on = [oci_core_app_catalog_subscription.mp_image_subscription, oci_core_subnet.private-subnet, oci_core_subnet.public-subnet]
   compartment_id = var.targetCompartment
   instance_pools {

Lines changed: 13 additions & 0 deletions

@@ -0,0 +1,13 @@
+resource "oci_core_compute_cluster" "compute_cluster" {
+  count = var.compute_cluster && var.cluster_network && var.node_count > 0 ? 1 : 0
+  #Required
+  availability_domain = var.ad
+  compartment_id = var.targetCompartment
+
+  #Optional
+  display_name = local.cluster_name
+  freeform_tags = {
+    "cluster_name" = local.cluster_name
+    "parent_cluster" = local.cluster_name
+  }
+}

Lines changed: 53 additions & 0 deletions

@@ -0,0 +1,53 @@
+resource "oci_core_volume" "nfs-compute-cluster-volume" {
+  count = var.compute_cluster && var.scratch_nfs_type_cluster == "block" && var.node_count > 0 ? 1 : 0
+  availability_domain = var.ad
+  compartment_id = var.targetCompartment
+  display_name = "${local.cluster_name}-nfs-volume"
+
+  size_in_gbs = var.cluster_block_volume_size
+  vpus_per_gb = split(".", var.cluster_block_volume_performance)[0]
+}
+
+resource "oci_core_volume_attachment" "compute_cluster_volume_attachment" {
+  count = var.compute_cluster && var.scratch_nfs_type_cluster == "block" && var.node_count > 0 ? 1 : 0
+  attachment_type = "iscsi"
+  volume_id = oci_core_volume.nfs-compute-cluster-volume[0].id
+  instance_id = oci_core_instance.compute_cluster_instances[0].id
+  display_name = "${local.cluster_name}-compute-cluster-volume-attachment"
+  device = "/dev/oracleoci/oraclevdb"
+}
+
+resource "oci_core_instance" "compute_cluster_instances" {
+  count = var.compute_cluster ? var.node_count : 0
+  depends_on = [oci_core_compute_cluster.compute_cluster]
+  availability_domain = var.ad
+  compartment_id = var.targetCompartment
+  shape = var.cluster_network_shape
+
+  agent_config {
+    is_management_disabled = true
+  }
+
+  display_name = "${local.cluster_name}-node-${var.compute_cluster_start_index+count.index}"
+
+  freeform_tags = {
+    "cluster_name" = local.cluster_name
+    "parent_cluster" = local.cluster_name
+    "user" = var.tags
+  }
+
+  metadata = {
+    ssh_authorized_keys = file("/home/${var.bastion_username}/.ssh/id_rsa.pub")
+    user_data = base64encode(data.template_file.config.rendered)
+  }
+  source_details {
+    source_id = local.cluster_network_image
+    source_type = "image"
+    boot_volume_size_in_gbs = var.boot_volume_size
+  }
+  compute_cluster_id = length(var.compute_cluster_id) > 2 ? var.compute_cluster_id : oci_core_compute_cluster.compute_cluster[0].id
+  create_vnic_details {
+    subnet_id = local.subnet_id
+    assign_public_ip = false
+  }
+}

autoscaling/tf_init/data.tf

Lines changed: 2 additions & 2 deletions

@@ -10,7 +10,7 @@ data "oci_core_services" "services" {
 }
 
 data "oci_core_cluster_network_instances" "cluster_network_instances" {
-  count = var.cluster_network && var.node_count > 0 ? 1 : 0
+  count = (! var.compute_cluster) && var.cluster_network && var.node_count > 0 ? 1 : 0
   cluster_network_id = oci_core_cluster_network.cluster_network[0].id
   compartment_id = var.targetCompartment
 }
@@ -22,7 +22,7 @@ data "oci_core_instance_pool_instances" "instance_pool_instances" {
 }
 
 data "oci_core_instance" "cluster_network_instances" {
-  count = var.cluster_network && var.node_count > 0 ? var.node_count : 0
+  count = (! var.compute_cluster) && var.cluster_network && var.node_count > 0 ? var.node_count : 0
   instance_id = data.oci_core_cluster_network_instances.cluster_network_instances[0].instances[count.index]["id"]
 }

autoscaling/tf_init/inventory.tpl

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 [bastion]
-${bastion_name} ansible_host=${bastion_ip} ansible_user=${bastion_username} role=bastion
+${bastion_name} ansible_host=${bastion_ip} ansible_user=${bastion_username} role=bastion ansible_python_interpreter=/usr/bin/python
 [slurm_backup]
 %{ if backup_name != "" }${backup_name} ansible_host=${backup_ip} ansible_user=${bastion_username} role=bastion%{ endif }
 [login]

autoscaling/tf_init/locals.tf

Lines changed: 3 additions & 3 deletions

@@ -1,13 +1,13 @@
 locals {
   // display names of instances
-  cluster_instances_ids = var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.id : data.oci_core_instance.instance_pool_instances.*.id
-  cluster_instances_names = var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.display_name : data.oci_core_instance.instance_pool_instances.*.display_name
+  cluster_instances_ids = var.compute_cluster ? oci_core_instance.compute_cluster_instances.*.id : var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.id : data.oci_core_instance.instance_pool_instances.*.id
+  cluster_instances_names = var.compute_cluster ? oci_core_instance.compute_cluster_instances.*.display_name : var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.display_name : data.oci_core_instance.instance_pool_instances.*.display_name
   image_ocid = var.unsupported ? var.image_ocid : var.image
 
   shape = var.cluster_network ? var.cluster_network_shape : var.instance_pool_shape
   instance_pool_ocpus = local.shape == "VM.DenseIO.E4.Flex" ? var.instance_pool_ocpus_denseIO_flex : var.instance_pool_ocpus
   // ips of the instances
-  cluster_instances_ips = var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.private_ip : data.oci_core_instance.instance_pool_instances.*.private_ip
+  cluster_instances_ips = var.compute_cluster ? oci_core_instance.compute_cluster_instances.*.private_ip : var.cluster_network ? data.oci_core_instance.cluster_network_instances.*.private_ip : data.oci_core_instance.instance_pool_instances.*.private_ip
 
   // subnet id derived either from created subnet or existing if specified
   subnet_id = var.private_deployment ? var.use_existing_vcn ? var.private_subnet_id : element(concat(oci_core_subnet.private-subnet.*.id, [""]), 1) : var.use_existing_vcn ? var.private_subnet_id : element(concat(oci_core_subnet.private-subnet.*.id, [""]), 0)

autoscaling/tf_init/outputs.tf

Lines changed: 1 addition & 1 deletion

@@ -8,5 +8,5 @@ output "ocids" {
   value = join(",", local.cluster_instances_ids)
 }
 output "cluster_ocid" {
-  value = var.cluster_network ? oci_core_cluster_network.cluster_network[0].id : oci_core_instance_pool.instance_pool[0].id
+  value = var.compute_cluster ? oci_core_compute_cluster.compute_cluster[0].id : var.cluster_network ? oci_core_cluster_network.cluster_network[0].id : oci_core_instance_pool.instance_pool[0].id
 }
