2 changes: 1 addition & 1 deletion a3/terraform/modules/cluster/mig-cos/README.md
@@ -45,7 +45,7 @@ No resources.
| <a name="input_enable_install_gpu"></a> [enable\_install\_gpu](#input\_enable\_install\_gpu) | Setting this to false will disable a built-in startup script which:<br>- installs GPU drivers<br>- configures docker auth<br>- installs iptable rules<br>- installs NCCL and GPUDirectTCPX plugin<br><br>Any installation replacements should be in the startup\_script variable | `bool` | `true` | no |
| <a name="input_filestore_new"></a> [filestore\_new](#input\_filestore\_new) | Configurations to mount newly created network storage. Each object describes NFS file-servers to be hosted in Filestore.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/filestore#inputs).<br><br>------------<br>`filestore_new.filestore_tier`<br><br>The service tier of the instance.<br><br>Possible values: `["BASIC_HDD", "BASIC_SSD", "HIGH_SCALE_SSD", "ENTERPRISE"]`.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/filestore#input_filestore_tier), [gcloud](https://cloud.google.com/sdk/gcloud/reference/filestore/instances/create#--tier).<br><br>------------<br>`filestore_new.local_mount`<br><br>Mountpoint for this filestore instance.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/filestore#input_local_mount).<br><br>------------<br>`filestore_new.size_gb`<br><br>Storage size of the filestore instance in GB.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/filestore#input_local_mount), [gcloud](https://cloud.google.com/sdk/gcloud/reference/filestore/instances/create#--file-share).<br><br>------------<br>`filestore_new.zone`<br><br>Location for filestore instance.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/filestore#input_zone). | <pre>list(object({<br> filestore_tier = string<br> local_mount = string<br> size_gb = number<br> zone = string<br> }))</pre> | `[]` | no |
| <a name="input_gcsfuse_existing"></a> [gcsfuse\_existing](#input\_gcsfuse\_existing) | Configurations to mount existing network storage. Each object describes Cloud Storage Buckets to be mounted with Cloud Storage FUSE.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/pre-existing-network-storage#inputs).<br><br>------------<br>`gcsfuse_existing.local_mount`<br><br>The mount point where the contents of the device may be accessed after mounting.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/pre-existing-network-storage#input_local_mount).<br><br>------------<br>`gcsfuse_existing.remote_mount`<br><br>Bucket name without “gs://”.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/pre-existing-network-storage#input_remote_mount). | <pre>list(object({<br> local_mount = string<br> remote_mount = string<br> }))</pre> | `[]` | no |
- | <a name="input_instance_groups"></a> [instance\_groups](#input\_instance\_groups) | Required Fields:<br>- `target_size`: The number of running instances for this managed instance group. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_group_manager#target_size), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-groups/managed/create#--size).<br>- `zone`: The zone that instances in this group should be created in. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_group_manager#zone), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-groups/managed/create#--zone).<br>- `machine_type`: (Optional) The name of a Google Compute Engine machine type. There are [many possible values](https://cloud.google.com/compute/docs/machine-resource). Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#machine_type), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--machine-type).<br>- `existing_resource_policy_name`: (Optional) The existing resource policy. | <pre>list(object({<br> zone = string<br> target_size = number<br> machine_type = optional(string, "a3-highgpu-8g")<br> existing_resource_policy_name = optional(string, null)<br> }))</pre> | n/a | yes |
+ | <a name="input_instance_groups"></a> [instance\_groups](#input\_instance\_groups) | Required Fields:<br>- `target_size`: The number of running instances for this managed instance group. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_group_manager#target_size), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-groups/managed/create#--size).<br>- `zone`: The zone that instances in this group should be created in. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_group_manager#zone), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-groups/managed/create#--zone).<br>- `machine_type`: (Optional) The name of a Google Compute Engine machine type. There are [many possible values](https://cloud.google.com/compute/docs/machine-resource). Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#machine_type), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--machine-type). | <pre>list(object({<br> zone = string<br> target_size = number<br> machine_type = optional(string, "a3-highgpu-8g")<br> compact_placement_policy = optional(object({<br> new_policy = optional(bool, false)<br> existing_policy_name = optional(string)<br> specific_reservation = optional(string)<br> }))<br> }))</pre> | n/a | yes |
| <a name="input_labels"></a> [labels](#input\_labels) | The resource labels (a map of key/value pairs) to be applied to the GPU cluster.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#labels), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--labels). | `map(string)` | `{}` | no |
| <a name="input_machine_image"></a> [machine\_image](#input\_machine\_image) | The image with which this disk will initialize. This image must be in the project `cos-cloud`.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#source_image).<br><br>------------<br>`machine_image.family`<br><br>The family of images from which the latest non-deprecated image will be selected. Conflicts with `machine_image.name`.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image#name), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--image-family).<br><br>------------<br>`machine_image.name`<br><br>The name of a specific image. Conflicts with `machine_image.family`.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image#name), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--image).<br><br>------------<br>`machine_image.project`<br><br>The project\_id to which this image belongs.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image#project), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--image-project). | <pre>object({<br> family = string<br> name = string<br> project = string<br> })</pre> | <pre>{<br> "family": "cos-stable",<br> "name": null,<br> "project": "cos-cloud"<br>}</pre> | no |
| <a name="input_maintenance_interval"></a> [maintenance\_interval](#input\_maintenance\_interval) | Specifies the frequency of planned maintenance events. 'PERIODIC' is the only supported value for maintenance\_interval.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#maintenance_interval). | `string` | `null` | no |
31 changes: 15 additions & 16 deletions a3/terraform/modules/cluster/mig-cos/main.tf
Collaborator

I think tests need to be modified for this.

@@ -70,22 +70,21 @@ module "compute_instance_template" {
    source = "../../common/instance_template"
    count  = length(var.instance_groups)

-   disk_size_gb                  = var.disk_size_gb
-   disk_type                     = var.disk_type
-   machine_image                 = var.machine_image
-   machine_type                  = var.instance_groups[count.index].machine_type
-   maintenance_interval          = var.maintenance_interval
-   metadata                      = local.metadata
-   project_id                    = var.project_id
-   region                        = var.region
-   resource_prefix               = var.resource_prefix
-   service_account               = var.service_account
-   use_compact_placement_policy  = var.use_compact_placement_policy
-   existing_resource_policy_name = var.instance_groups[count.index].existing_resource_policy_name
-   startup_script                = null
-   subnetwork_self_links         = module.network.subnetwork_self_links
-   network_self_links            = module.network.network_self_links
-   labels                        = merge(var.labels, { ghpc_role = "compute" })
+   compact_placement_policy = var.instance_groups[count.index].compact_placement_policy
+   disk_size_gb             = var.disk_size_gb
+   disk_type                = var.disk_type
+   machine_image            = var.machine_image
+   machine_type             = var.instance_groups[count.index].machine_type
+   maintenance_interval     = var.maintenance_interval
+   metadata                 = local.metadata
+   project_id               = var.project_id
+   region                   = var.region
+   resource_prefix          = var.resource_prefix
+   service_account          = var.service_account
+   startup_script           = null
+   subnetwork_self_links    = module.network.subnetwork_self_links
+   network_self_links       = module.network.network_self_links
+   labels                   = merge(var.labels, { ghpc_role = "compute" })
  }

module "compute_instance_group_manager" {
13 changes: 8 additions & 5 deletions a3/terraform/modules/cluster/mig-cos/variables.tf
@@ -20,13 +20,16 @@ variable "instance_groups" {
  - `target_size`: The number of running instances for this managed instance group. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_group_manager#target_size), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-groups/managed/create#--size).
  - `zone`: The zone that instances in this group should be created in. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_group_manager#zone), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-groups/managed/create#--zone).
  - `machine_type`: (Optional) The name of a Google Compute Engine machine type. There are [many possible values](https://cloud.google.com/compute/docs/machine-resource). Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#machine_type), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--machine-type).
- - `existing_resource_policy_name`: (Optional) The existing resource policy.
    EOT
    type = list(object({
-     zone                          = string
-     target_size                   = number
-     machine_type                  = optional(string, "a3-highgpu-8g")
-     existing_resource_policy_name = optional(string, null)
+     zone                     = string
+     target_size              = number
+     machine_type             = optional(string, "a3-highgpu-8g")
+     compact_placement_policy = optional(object({
+       new_policy           = optional(bool, false)

Collaborator

Can you please remove `use_compact_placement_policy`?

+       existing_policy_name = optional(string)
+       specific_reservation = optional(string)
+     }))
    }))
    nullable = false
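
For reviewers who want to see the new input shape from the caller's side, here is a minimal sketch of a `terraform.tfvars` entry for `instance_groups`. The zone names, sizes, and the policy name `my-existing-policy` are hypothetical placeholders, and reading `new_policy = true` as "create a new compact placement policy" is an assumption drawn from the field names and the README description, not something this diff confirms.

```hcl
# Sketch only: all values below are placeholders, not taken from this PR.
instance_groups = [
  {
    # Assumed meaning: ask the module to create a new compact placement policy.
    zone        = "us-central1-a"
    target_size = 4
    compact_placement_policy = {
      new_policy = true
    }
  },
  {
    # Assumed meaning: attach a policy that already exists.
    # machine_type is omitted, so it falls back to "a3-highgpu-8g".
    zone        = "us-central1-b"
    target_size = 2
    compact_placement_policy = {
      existing_policy_name = "my-existing-policy"
    }
  },
  {
    # compact_placement_policy is optional with no default, so it is null here.
    zone        = "us-central1-c"
    target_size = 1
  },
]
```

Because `compact_placement_policy` is declared with `optional(object({...}))`, callers that never used a placement policy should not need to change their existing `instance_groups` entries.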

3 changes: 1 addition & 2 deletions a3/terraform/modules/common/instance_template/README.md
@@ -42,9 +42,9 @@ No requirements.

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
+ | <a name="input_compact_placement_policy"></a> [compact\_placement\_policy](#input\_compact\_placement\_policy) | The flag to create and use a superblock level compact placement policy for the instances. Currently GCE supports using only 1 placement policy.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#resource_policies). | <pre>object({<br> new_policy = optional(bool, false)<br> existing_policy_name = optional(string)<br> specific_reservation = optional(string)<br> })</pre> | `null` | no |
| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | The size of the image in gigabytes for the boot disk of each instance.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#disk_size_gb), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--boot-disk-size). | `number` | n/a | yes |
| <a name="input_disk_type"></a> [disk\_type](#input\_disk\_type) | The GCE disk type for the boot disk of each instance.<br><br>Possible values: `["pd-ssd", "local-ssd", "pd-balanced", "pd-standard"]`<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#disk_type), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--boot-disk-type). | `string` | n/a | yes |
- | <a name="input_existing_resource_policy_name"></a> [existing\_resource\_policy\_name](#input\_existing\_resource\_policy\_name) | The name of the existing resource policy. <br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_resource_policy#name). | `string` | `null` | no |
| <a name="input_labels"></a> [labels](#input\_labels) | A set of key/value label pairs to assign to instances created from this template.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#labels), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--labels). | `map(string)` | n/a | yes |
| <a name="input_machine_image"></a> [machine\_image](#input\_machine\_image) | The image with which this disk will initialize.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#source_image).<br><br>------------<br>`machine_image.family`<br><br>The family of images from which the latest non-deprecated image will be selected. Conflicts with `machine_image.name`.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image#name), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--image-family).<br><br>------------<br>`machine_image.name`<br><br>The name of a specific image. Conflicts with `machine_image.family`.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image#name), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--image).<br><br>------------<br>`machine_image.project`<br><br>The project\_id to which this image belongs.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image#project), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--image-project). | <pre>object({<br> family = string<br> name = string<br> project = string<br> })</pre> | n/a | yes |
| <a name="input_machine_type"></a> [machine\_type](#input\_machine\_type) | The name of a Google Compute Engine machine type. There are [many possible values](https://cloud.google.com/compute/docs/machine-resource).<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#machine_type), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--machine-type). | `string` | n/a | yes |
@@ -57,7 +57,6 @@ No requirements.
| <a name="input_service_account"></a> [service\_account](#input\_service\_account) | Service account to attach to the instance.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#service_account).<br><br>------------<br>`service_account.email`<br><br>The service account e-mail address. If not given, the default Google Compute Engine service account is used.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#email), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--service-account).<br><br>------------<br>`service_account.scopes`<br><br>A list of service scopes. Both OAuth2 URLs and gcloud short names are supported. To allow full access to all Cloud APIs, use the `"cloud-platform"` scope. See a complete list of scopes [here](https://cloud.google.com/sdk/gcloud/reference/alpha/compute/instances/set-scopes#--scopes).<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#scopes), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--scopes). | <pre>object({<br> email = string,<br> scopes = set(string)<br> })</pre> | n/a | yes |
| <a name="input_startup_script"></a> [startup\_script](#input\_startup\_script) | Script to run at boot on each instance. This is here for convenience and will just be appended to `metadata` under the key `"startup-script"`. | `string` | n/a | yes |
| <a name="input_subnetwork_self_links"></a> [subnetwork\_self\_links](#input\_subnetwork\_self\_links) | The subnet self-links for all the VPCs. | `list(string)` | n/a | yes |
- | <a name="input_use_compact_placement_policy"></a> [use\_compact\_placement\_policy](#input\_use\_compact\_placement\_policy) | The flag to create and use a superblock level compact placement policy for the instances. Currently GCE supports using only 1 placement policy.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#resource_policies). | `bool` | `false` | no |
| <a name="input_use_static_naming"></a> [use\_static\_naming](#input\_use\_static\_naming) | Flag to determine whether to use static naming for the instance\_template name. With static naming, the instance\_template cannot be updated in place; it must be destroyed and then recreated.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#name_prefix). | `bool` | `false` | no |
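
The `compact_placement_policy` description notes that GCE currently supports only one placement policy. One way the module could enforce that at plan time is a `validation` block, sketched below. This is an illustration, not code from this PR: treating `new_policy = true` and `existing_policy_name` as mutually exclusive is an assumption, and the rule itself is hypothetical.

```hcl
variable "compact_placement_policy" {
  description = "Sketch only: same object shape as the input documented in the table above."
  type = object({
    new_policy           = optional(bool, false)
    existing_policy_name = optional(string)
    specific_reservation = optional(string)
  })
  default = null

  validation {
    # Hypothetical rule: accept a new policy or an existing one, never both.
    # A null object (no placement policy requested) always passes.
    condition = var.compact_placement_policy == null ? true : !(
      var.compact_placement_policy.new_policy &&
      var.compact_placement_policy.existing_policy_name != null
    )
    error_message = "Set either new_policy = true or existing_policy_name, not both."
  }
}
```

A check like this would surface a conflicting configuration during `terraform plan` rather than as an API error at apply time, which may also be a natural case to cover when the tests are updated.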

## Outputs