Example Blueprints

AI Hypercomputer

Additional blueprints optimized for AI workloads on modern GPUs are available at Google Cloud AI Hypercomputer. Documentation is available for GKE and for Slurm.

Instructions

Ensure project_id, zone, and region deployment variables are set correctly under vars before using an example blueprint.

NOTE: Deployment variables defined under vars are automatically passed to modules if the modules have an input that matches the variable name.

(Optional) Setting up a remote terraform state

There are two ways to specify terraform backends in Cluster Toolkit: a default setting that propagates to all groups, and a custom per-group configuration:

  • terraform_backend_defaults at top-level of YAML blueprint
  • terraform_backend within a deployment group definition

Examples of each are shown below. If both settings are used, then the custom per-group value is used without modification.

The following block will configure terraform to point to an existing GCS bucket to store and manage the terraform state. Add your own bucket name in place of <<BUCKET_NAME>> and (optionally) a service account in place of <<SERVICE_ACCOUNT>> in the configuration. If not set, the terraform state will be stored locally within the generated deployment directory.

Add this block to the top-level of your blueprint:

terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: <<BUCKET_NAME>>
    impersonate_service_account: <<SERVICE_ACCOUNT>>

All Terraform-supported backends are supported by the Toolkit. Specify the backend using type and its configuration block using configuration.

For the gcs backend, you must minimally supply the bucket configuration setting. The prefix setting is generated automatically as "blueprint_name/deployment_name/group_name" for each deployment group. This ensures uniqueness.
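
To see where a deployment group's state would land in the bucket, the default prefix can be reproduced by hand. The sketch below assumes the "blueprint_name/deployment_name/group_name" pattern described above and Terraform's gcs backend convention of storing the default workspace as `<prefix>/default.tfstate`. The `state_object` helper and all sample names are hypothetical.

```shell
# Hypothetical helper: print the GCS object that would hold a group's state,
# assuming the default prefix pattern blueprint_name/deployment_name/group_name
# and Terraform's default workspace file name (default.tfstate).
state_object() {
  bucket="$1"; blueprint_name="$2"; deployment_name="$3"; group_name="$4"
  printf 'gs://%s/%s/%s/%s/default.tfstate\n' \
    "$bucket" "$blueprint_name" "$deployment_name" "$group_name"
}

# Sample values (placeholders, not taken from a real deployment).
state_object my-state-bucket hpc-slurm hpc-slurm-demo primary
```

Because the group name is part of the prefix, two groups in the same deployment never write to the same state object.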

If you wish to specify a custom prefix, use a unique value for each group following this example:

deployment_groups:
- group: example_group
  terraform_backend:
    type: gcs
    configuration:
      bucket: your-bucket
      prefix: your/object/prefix

You can set the configuration using the CLI in the create and expand subcommands as well:

./gcluster create examples/hpc-slurm.yaml \
  --vars "project_id=${GOOGLE_CLOUD_PROJECT}" \
  --backend-config "bucket=${GCS_BUCKET}"

NOTE: The --backend-config argument accepts a comma-separated list of name=value pairs that set the Terraform backend configuration in blueprints. Only string values are supported. If you set the configuration in both the blueprint and the CLI, the tool uses the CLI values. The backend type defaults to "gcs".
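
When more than one backend setting is passed on the command line, the pairs go into a single comma-separated value. The following is a minimal sketch of assembling that value. The `join_backend_config` helper and the fallback bucket, project, and service account names are placeholders, and the script only prints the resulting command so that it can be reviewed before running.

```shell
# Join name=value pairs with commas, the list format --backend-config accepts.
# The function body runs in a subshell so the IFS change does not leak out.
join_backend_config() (
  IFS=,
  printf '%s\n' "$*"
)

# Placeholder values; replace with your bucket and service account.
GCS_BUCKET="${GCS_BUCKET:-my-state-bucket}"
SERVICE_ACCOUNT="${SERVICE_ACCOUNT:-tf-state@my-project.iam.gserviceaccount.com}"

backend_config="$(join_backend_config \
  "bucket=${GCS_BUCKET}" \
  "impersonate_service_account=${SERVICE_ACCOUNT}")"

# Print the command for review instead of running it.
printf './gcluster create examples/hpc-slurm.yaml --vars "project_id=%s" --backend-config "%s"\n' \
  "${GOOGLE_CLOUD_PROJECT:-my-project}" "$backend_config"
```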

Blueprint Descriptions

The example blueprints listed below labeled with the core badge (core-badge) are located in this folder and are developed and tested by the Cluster Toolkit team directly.

The community blueprints are contributed by the community (including the Cluster Toolkit team, partners, etc.) and are labeled with the community badge (community-badge). The community blueprints are located in the community folder.

Blueprints that are still in development and less stable are also labeled with the experimental badge (experimental-badge).

Creates a basic auto-scaling Slurm cluster with mostly default settings. The blueprint also creates a new VPC network, and a filestore instance mounted to /home.

There are 3 partitions in this example: debug, compute, and h3. The debug partition uses n2-standard-2 VMs, which should work out of the box without needing to request additional quota. The purpose of the debug partition is to make sure that first-time users are not immediately blocked by quota limitations.

Compute Partition

There is a compute partition that achieves higher performance. Any performance analysis should be done on the compute partition. By default it uses c2-standard-60 VMs with placement groups enabled. You may need to request additional quota for C2 CPUs in the region you are deploying in. You can select the compute partition using the -p compute argument when running srun.

H3 Partition

There is an h3 partition that uses the compute-optimized h3-standard-88 machine type. You can read more about the H3 machine series here.

Quota Requirements for hpc-slurm.yaml

For this example the following is needed in the selected region:

  • Cloud Filestore API: Basic HDD (Standard) capacity (GB): 1,024 GB
  • Compute Engine API: Persistent Disk SSD (GB): ~50 GB
  • Compute Engine API: Persistent Disk Standard (GB): ~50 GB static + 50 GB/node up to 1,250 GB
  • Compute Engine API: N2 CPUs: 2 for the login node and 2/node active in the debug partition up to 12
  • Compute Engine API: C2 CPUs: 4 for the controller node and 60/node active in the compute partition up to 1,204
  • Compute Engine API: H3 CPUs: 88/node active in the h3 partition up to 1760
    • The H3 CPU quota can be increased on the Cloud Console by navigating to IAM & Admin->Quotas or searching All Quotas and entering vm_family:H3 into the filter bar. From there, the quotas for each region may be selected and edited.
  • Compute Engine API: Affinity Groups: one for each job in parallel - only needed for the compute partition
  • Compute Engine API: Resource policies: one for each job in parallel - only needed for the compute partition
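
The "up to" figures in this list follow from a static amount plus a per-node amount multiplied by the partition's maximum node count. The sketch below reproduces two of them. The node count of 20 is inferred from the listed maxima ((1,204 - 4) / 60 = 20 and 1,760 / 88 = 20) rather than read from the blueprint, and `peak_quota` is a hypothetical helper.

```shell
# peak_quota STATIC PER_NODE MAX_NODES
# Quota needed when every node of a partition is active at once.
peak_quota() {
  echo $(( $1 + $2 * $3 ))
}

# C2 CPUs: 4 for the controller plus 60 per compute node, 20 nodes assumed.
peak_quota 4 60 20   # prints 1204
# H3 CPUs: 88 per node, 20 nodes assumed.
peak_quota 0 88 20   # prints 1760
```

If you lower a partition's maximum node count in the blueprint, rerun the arithmetic with the new count to see how much quota is actually needed.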

This advanced blueprint creates a Slurm cluster with several performance tunings enabled, along with tiered file systems for higher performance. Some of these features come with additional cost and require additional quotas.

The Slurm system deployed here connects to the default VPC of the project and creates a login node and the following seven partitions:

  • n2 with general-purpose n2-standard-2 nodes. Placement policies and exclusive usage are disabled, which means the nodes can be used for multiple jobs. Nodes will remain idle for 5 minutes before Slurm deletes them. This partition can be used for debugging and workloads that do not require high performance.
  • c2 with compute-optimized c2-standard-60 nodes based on Intel 3.9 GHz Cascade Lake processors.
  • c2d with compute-optimized c2d-standard-112 nodes based on third-generation AMD EPYC Milan processors.
  • c3 with compute-optimized c3-highcpu-176 nodes based on Intel Sapphire Rapids processors. When configured with Tier_1 networking, C3 nodes feature 200 Gbps low-latency networking.
  • h3 with compute-optimized h3-standard-88 nodes based on Intel Sapphire Rapids processors. H3 VMs can use the entire host network bandwidth and come with a default network bandwidth rate of up to 200 Gbps.
  • a208 with a2-ultragpu-8g nodes with 8 of the NVIDIA A100 GPU accelerators with 80GB of GPU memory each.
  • a216 with a2-megagpu-16g nodes with 16 of the NVIDIA A100 GPU accelerators with 40GB of GPU memory each.

For all partitions other than n2, compact placement policies are enabled by default and nodes are created and destroyed on a per-job basis. Furthermore, these partitions are configured with:

  • Faster networking: Google Virtual NIC (GVNIC) is used for the GPU partitions and Tier_1 is selected when available. Selecting Tier_1 automatically enables GVNIC.
  • SSD persistent disks for compute nodes. See the Storage options page for more details.

File systems:

  • The homefs mounted at /home uses the "BASIC_SSD" tier filestore with 2.5 TiB of capacity
  • The projectsfs is mounted at /projects and is a high scale SSD filestore instance with 10TiB of capacity.
  • The lustre-gcp is mounted at /lustre and is designed for highly parallel and random I/O performance. It has a minimum capacity of ~18TiB. See the GCP Managed Lustre module.

Quota Requirements for hpc-enterprise-slurm.yaml

For this example the following is needed in the selected region:

  • Cloud Filestore API: Basic SSD capacity (GB) per region: 2,560 GB
  • Cloud Filestore API: High Scale SSD capacity (GB) per region: 10,240 GiB - min quota request is 61,440 GiB
  • Compute Engine API: Persistent Disk SSD (GB): ~14,050 GB static + 100 GB/node up to 23,250 GB
  • Compute Engine API: Persistent Disk Standard (GB): ~396 GB static + 50 GB/node up to 596 GB
  • Compute Engine API: N2 CPUs: 116 for login and lustre and 2/node active in n2 partition up to 124.
  • Compute Engine API: C2 CPUs: 4 for controller node and 60/node active in c2 partition up to 1,204
  • Compute Engine API: C2D CPUs: 112/node active in c2d partition up to 2,240
  • Compute Engine API: C3 CPUs: 176/node active in c3 partition up to 3,520
  • Compute Engine API: H3 CPUs: 88/node active in h3 partition up to 1,408
  • Compute Engine API: A2 CPUs: 96/node active in a208 and a216 partitions up to 3,072
  • Compute Engine API: NVIDIA A100 80GB GPUs: 8/node active in a208 partition up to 128
  • Compute Engine API: NVIDIA A100 GPUs: 16/node active in a216 partition up to 256
  • Compute Engine API: Resource policies: one for each job in parallel - not needed for n2 partition

This example demonstrates how to create a partition with static compute nodes. See Best practices for static compute nodes for instructions on setting up a reservation and compact placement policy.

Before deploying this example, the following fields must be populated in the blueprint:

  project_id: ## Set GCP Project ID Here ##
  static_reservation_name:  ## Set your reservation name here ##
  static_reservation_machine_type: ## Machine must match reservation above ##
  static_node_count: ## Must be <= number of reserved machines ##

For more resources on static compute nodes see the following cloud docs pages:

For a similar, more advanced, example which demonstrates static node functionality with GPUs, see the ML Slurm A3 example.

Creates an auto-scaling Slurm cluster with TPU nodes.

Creates an auto-scaling Slurm cluster with TPU nodes.

For a tutorial on how to run a MaxText workload on a TPU partition using Slurm, follow hpc-slurm-tpu-maxtext.

This blueprint creates a custom Apptainer enabled image and builds an auto-scaling Slurm cluster using that image. You can deploy containerized workloads on that cluster as described here.

This blueprint deploys a cluster containing a pair of h4d-highmem-192-lssd VMs with RDMA networking enabled along with a filestore instance mounted to /home.

This blueprint provisions an HPC cluster running the Slurm scheduler with the machine learning frameworks PyTorch and TensorFlow pre-installed on every VM. The cluster has 2 partitions:

To provision the cluster, please run:

./gcluster create examples/ml-slurm.yaml --vars "project_id=${GOOGLE_CLOUD_PROJECT}"
./gcluster deploy ml-example-v6

After accessing the login node, you can activate the conda environment for each library with:

source /etc/profile.d/conda.sh
# to activate PyTorch
conda activate pytorch
# to activate TensorFlow
conda activate tf

An example benchmarking job for PyTorch can be run under Slurm:

cp /var/tmp/torch_test.* .
sbatch -N 1 --gpus-per-node=1 torch_test.sh

When you are done, clean up the resources in reverse order of creation:

./gcluster destroy ml-example-v6

Finally, browse to the Cloud Console to delete your custom image. It will be named beginning with ml-slurm followed by a date and timestamp for uniqueness.

This blueprint uses the Packer template module to create a custom VM image and uses it to provision an HPC cluster using the Slurm scheduler. By using a custom image, the cluster is able to begin running jobs sooner and more reliably because there is no need to install applications as VMs boot. This example takes the following steps:

  1. Creates a network with outbound internet access in which to build the image (see Custom Network).
  2. Creates a script that will be used to customize the image (see Toolkit Runners).
  3. Builds a custom Slurm image by executing the script on a standard Slurm image (see Packer Template).
  4. Deploys a Slurm cluster using the custom image (see Slurm Cluster Based on Custom Image).

Quota Requirements for image-builder.yaml

For this example the following is needed in the selected region:

  • Compute Engine API: Images (global, not regional quota): 1 image per invocation of packer build
  • Compute Engine API: Persistent Disk SSD (GB): ~50 GB
  • Compute Engine API: Persistent Disk Standard (GB): ~64 GB static + 32 GB/node up to 704 GB
  • Compute Engine API: N2 CPUs: 4 (for short-lived Packer VM and Slurm login node)
  • Compute Engine API: C2 CPUs: 4 for controller node and 60/node active in compute partition up to 1,204
  • Compute Engine API: Affinity Groups: one for each job in parallel - only needed for compute partition
  • Compute Engine API: Resource policies: one for each job in parallel - only needed for compute partition

Building and using the custom image

Create the deployment folder from the blueprint:

./gcluster create examples/image-builder.yaml --vars "project_id=${GOOGLE_CLOUD_PROJECT}"
./gcluster deploy image-builder-v6-001

Follow the on-screen prompts to approve the creation of each deployment group. For example, the network is created in the first deployment group, the VM image is created in the second group, and the third group uses the image to create an HPC cluster using the Slurm scheduler.

Why use a custom image?

Using a custom VM image can be more scalable and reliable than installing software using boot-time startup scripts because:

  • It avoids reliance on continued availability of package repositories.
  • VMs will join an HPC cluster and execute workloads more rapidly due to reduced boot-time configuration.
  • Machines are guaranteed to boot with a static software configuration chosen when the custom image was created. No potential for some machines to have different software versions installed due to apt/yum/pip installations executed after remote repositories have been updated.

Custom Network (deployment group 1)

A tool called Packer builds custom VM images by creating short-lived VMs, executing scripts on them, and saving the boot disk as an image that can be used by future VMs. The short-lived VM typically operates in a network that has outbound access to the internet for downloading software.

This deployment group creates a network using Cloud Nat and Identity-Aware Proxy (IAP) to allow outbound traffic and inbound SSH connections without exposing the machine to the internet on a public IP address.

Toolkit Runners (deployment group 1)

The Toolkit startup-script module supports boot-time configuration of VMs using "runners". Runners are configured as a series of scripts uploaded to Cloud Storage. A simple, standard VM startup script runs at boot-time, downloads the scripts from Cloud Storage and executes them in sequence.

The script in this example performs the trivial task of creating a file as a simple demonstration of functionality. You can use the startup-script module to address more complex scenarios.

Packer Template (deployment group 2)

The Packer module uses the startup-script module from the first deployment group and executes the script to produce a custom image.

Slurm Cluster Based on Custom Image (deployment group 3)

Once the Slurm cluster has been deployed we can test that our Slurm compute partition is using the custom image. Each compute node should contain the hello.txt file added by the startup-script.

  1. SSH into the login node imagebuild-slurm-login-001.
  2. Run a job that prints the contents of the added file:
$ srun -N 2 cat /usr/local/hello.txt
Hello World
Hello World
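
For larger node counts, reading the output by eye becomes impractical. The following sketch counts the matching lines instead. `check_hello` is a hypothetical helper, and the demonstration feeds it canned output so that it can be tried anywhere; on the login node you would pipe the srun command into it.

```shell
# check_hello EXPECTED_NODES
# Reads job output on stdin and succeeds only if exactly EXPECTED_NODES
# lines are the string "Hello World".
check_hello() {
  expected="$1"
  matches="$(grep -cx 'Hello World')"
  [ "$matches" -eq "$expected" ]
}

# Demonstration with canned output; on the login node, pipe srun instead:
#   srun -N 2 cat /usr/local/hello.txt | check_hello 2
if printf 'Hello World\nHello World\n' | check_hello 2; then
  echo "custom image verified on 2 nodes"
fi
```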

To avoid recurring charges for the resources provisioned by Cluster Toolkit, clean up the resources using the following command:

./gcluster destroy image-builder-v6-001

Follow the on-screen prompts to approve the deletion of each deployment group. Resources are removed in reverse order of creation: the cluster is destroyed first, followed by the resources in the primary deployment group.

Finally, browse to the Cloud Console to delete your custom image. It will be named beginning with my-slurm-image followed by a date and timestamp for uniqueness.

This example demonstrates how to use the Cluster Toolkit to set up a Google Cloud Batch job that mounts a Filestore instance and runs startup scripts.

The blueprint creates a Filestore and uses the startup-script module to mount and load "data" onto the shared storage. The batch-job-template module creates an instance template to be used for the Google Cloud Batch compute VMs and renders a Google Cloud Batch job template. A login node VM is created with instructions on how to SSH to the login node and submit the Google Cloud Batch job.

To provision the cluster, please run:

./gcluster create examples/serverless-batch.yaml --vars "project_id=${GOOGLE_CLOUD_PROJECT}"
./gcluster deploy hello-workload

When you are done, clean up the resources in reverse order of creation:

./gcluster destroy hello-workload

This blueprint demonstrates how to use Spack to run a real MPI job on Batch.

The blueprint contains the following:

  • A shared filestore filesystem.
  • A spack-setup module that generates a script to install Spack
  • A spack-execute module that builds the WRF application onto the shared filestore.
  • A startup-script module which uses the above script and stages job data.
  • A builder vm-instance which performs the Spack install and then shuts down.
  • A batch-job-template that builds a Batch job to execute the WRF job.
  • A batch-login VM that can be used to test and submit the Batch job.

Usage instructions:

  1. Spack install

    After terraform apply completes, you must wait for Spack installation to finish before running the Batch job. You will observe that a VM named spack-builder-0 has been created. This VM will automatically shut down once Spack installation has completed. When using a Spack cache this takes about 25 minutes. Without a Spack cache this will take about 2 hours. To view build progress or debug, you can inspect /var/log/messages and /var/log/spack.log on the builder VM.

  2. Access login node

    After the builder shuts down, you can ssh to the Batch login node named batch-wrf-batch-login. Instructions on how to ssh to the login node are printed to the terminal after a successful terraform apply. You can reprint these instructions by calling the following:

    terraform -chdir=batch-wrf/primary output instructions_batch-login

    Once on the login node you should be able to inspect the Batch job template found in the /home/batch-jobs directory. This Batch job will call a script found at /share/wrfv3/submit_wrfv3.sh. Note that the /share directory is shared between the login node and the Batch job.

  3. Submit the Batch job

    Use the command provided in the terraform output instructions to submit your Batch job and check its status. The Batch job may take several minutes to start and once running should complete within 5 minutes.

  4. Inspect results

    The Batch job will create a folder named /share/jobs/<unique id>. Once the job has finished this folder will contain the results of the job. You can inspect the rsl.out.0000 file for a summary of the job.
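
Because each run writes to a new /share/jobs/<unique id> folder, it can help to locate the newest one automatically. The sketch below is one way to do that on the login node. `latest_job_dir` is a hypothetical helper, and it assumes rsl.out.0000 sits directly inside the job folder.

```shell
# latest_job_dir [BASE_DIR]
# Print the most recently modified job folder under BASE_DIR
# (default: /share/jobs, the location named in step 4).
latest_job_dir() {
  base="${1:-/share/jobs}"
  # ls -t sorts newest first; -d lists the directories themselves.
  ls -td "$base"/*/ 2>/dev/null | head -n 1
}

job_dir="$(latest_job_dir)"
# The printed path ends with "/", so the file name is appended directly.
if [ -n "$job_dir" ] && [ -f "${job_dir}rsl.out.0000" ]; then
  tail -n 20 "${job_dir}rsl.out.0000"
else
  echo "no finished job found yet"
fi
```

Parsing ls output is acceptable here only because job folder names are simple identifiers without spaces or newlines.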

Creates a Managed Lustre file-system that is mounted in one client instance.

The GCP Managed Lustre file system is designed for high IO performance. It has a minimum capacity of ~18TiB and is mounted at /lustre.

After the creation of the file-system and the client instances, the startup scripts on the client instances will automatically install the lustre drivers, configure the mount-point, and mount the file system to the specified directory. This may take a few minutes after the VMs are created and can be verified by running:

watch df

Eventually you should see a line similar to:

<IP>:<remote_mount>  lustre   100G   15G  85G  15% <local_mount>

with remote_mount and local_mount reflecting the settings of the module, and IP set to the Lustre instance's IP address.
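
In a script, watching df interactively is not an option, so the mount can be polled instead. The following is a minimal sketch, assuming a Linux client where /proc/mounts lists the device, mount point, and file system type in its first three fields. Both function names are hypothetical, and /lustre stands in for whatever local_mount is set to.

```shell
# has_lustre_mount MOUNT_POINT [MOUNTS_FILE]
# Succeeds if MOUNT_POINT appears with file system type "lustre".
# MOUNTS_FILE defaults to /proc/mounts (fields: device, mount point, type, ...).
has_lustre_mount() {
  awk -v mp="$1" '$2 == mp && $3 == "lustre" { found = 1 } END { exit !found }' \
    "${2:-/proc/mounts}"
}

# wait_for_lustre MOUNT_POINT [TIMEOUT_SECONDS]
# Poll every 10 seconds until the mount appears or the timeout expires.
wait_for_lustre() {
  remaining="${2:-600}"
  until has_lustre_mount "$1"; do
    [ "$remaining" -le 0 ] && return 1
    sleep 10
    remaining=$(( remaining - 10 ))
  done
}

# Example call on a client VM:
#   wait_for_lustre /lustre && df -h /lustre
```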

Quota Requirements for pfs-managed-lustre-vm.yaml

For this example, the following is needed in the selected region:

  • Compute Engine API: Persistent Disk SSD (GB): ~800GB: 800GB MDT
  • Compute Engine API: Persistent Disk Standard (GB): ~328GB: 128 MGT, 200GB client-vm
  • Compute Engine API: Hyperdisk Balanced (GB): ~27432GB: 18432 GB OST Pool, 8*1125GB OST
  • Compute Engine API: N2 CPUs: ~34: 32 MGS, 2 client-vm
  • Compute Engine API: C3 CPUs: ~396: 44 MDS, 2*176 OSS

This Cluster Toolkit blueprint deploys a Google Kubernetes Engine (GKE) cluster integrated with Google Cloud Managed Lustre, providing a high-performance file system for demanding workloads.

Features

  • VPC Network: Sets up a new VPC, subnet, and secondary ranges for GKE pods and services.
  • Private Services Access: Configures Private Services Access, required for Managed Lustre.
  • Firewall Rules: Creates firewall rules to allow traffic between GKE nodes and the Managed Lustre instance (port 988).
  • Managed Lustre Instance: Provisions a Google Cloud Managed Lustre file system instance.
  • Service Accounts: Creates dedicated service accounts for GKE node pools and workloads with necessary IAM roles.
  • GKE Cluster: Deploys a GKE cluster with the Managed Lustre CSI driver enabled (enable_managed_lustre_csi: true).
  • Persistent Volume: Creates a Kubernetes PersistentVolume (PV) and PersistentVolumeClaim (PVC) to make the Managed Lustre instance accessible to pods.
  • GKE Node Pool: Sets up a node pool where application pods can run and mount the Lustre file system.

Requirements

  1. Cluster Toolkit: Ensure you have the Cluster Toolkit (gcluster) binary built and ready to use.
  2. GCP Project: A Google Cloud Project with necessary permissions to create VPCs, GKE clusters, Managed Lustre instances, and related resources.
  3. Quotas: Sufficient quotas for GCE, GKE, and Managed Lustre resources in the selected region. Note that Managed Lustre capacity and performance tiers have specific quota requirements. See Managed Lustre Performance Tiers and Quotas.
  4. GKE Version: The blueprint is configured for GKE version 1.33.x or later, as required by the Managed Lustre CSI driver.
  5. Location: Managed Lustre is only available in specific regions and zones. Verify and adjust based on Managed Lustre Locations.

Steps to deploy the blueprint

  1. Install Cluster Toolkit

    1. Install dependencies.
    2. Set up Cluster Toolkit.
  2. Switch to the Cluster Toolkit directory

    cd cluster-toolkit
  3. Get the IP address for your host machine

    curl ifconfig.me
  4. Update the vars block of the blueprint file

    1. project_id: ID of the project where you are deploying the cluster.
    2. deployment_name: Name of the deployment.
    3. region / zone: Ensure these support Managed Lustre.
    4. authorized_cidr: The IP address of your host machine (from step 3) in /32 CIDR notation.
    5. size_gib: Capacity of the Managed Lustre instance in GiB.
    6. per_unit_storage_throughput: Throughput in MB/s per TiB. The combination of size and throughput must match a valid performance tier.
  5. Build the Cluster Toolkit binary

    make
  6. Provision the GKE cluster

    ./gcluster deploy examples/gke-managed-lustre.yaml

    This process can take several minutes as it provisions the VPC, GKE cluster, Managed Lustre instance, and configures the CSI driver.
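
Steps 3 and 4 can be combined so that the authorized_cidr value is produced directly from the detected address. The sketch below is one way to do that. `to_cidr32` is a hypothetical helper that accepts only dotted-quad IPv4 input, and 203.0.113.7 is a documentation address used as a stand-in.

```shell
# to_cidr32 IPV4_ADDRESS
# Validate a dotted-quad IPv4 address and print it as a single-host CIDR.
to_cidr32() {
  ip="$1"
  case "$ip" in
    *[!0-9.]* | "" | .* | *. | *..*)
      echo "not an IPv4 address: $ip" >&2; return 1 ;;
  esac
  # Split on dots into the function's positional parameters.
  old_ifs="$IFS"; IFS=.
  set -- $ip
  IFS="$old_ifs"
  [ "$#" -eq 4 ] || { echo "not an IPv4 address: $ip" >&2; return 1; }
  for octet in "$@"; do
    [ "$octet" -le 255 ] || { echo "octet out of range: $octet" >&2; return 1; }
  done
  printf '%s/32\n' "$ip"
}

# On your machine, feed it the detected address:
#   to_cidr32 "$(curl -s ifconfig.me)"
to_cidr32 203.0.113.7   # prints 203.0.113.7/32
```

If the lookup returns an IPv6 address, the helper rejects it rather than producing an invalid /32 value.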

Accessing and Using Managed Lustre

  1. Configure kubectl: After successful deployment, configure kubectl to connect to your new GKE cluster:

    gcloud container clusters get-credentials $(vars.deployment_name) \
    --region $(vars.region) \
    --project $(vars.project_id)

    Replace $(vars.deployment_name), $(vars.region), and $(vars.project_id) with the actual values from your blueprint.

  2. Verify PVC: Check that the PersistentVolumeClaim has been created and is Bound:

    kubectl get pvc

    You should see a PVC named [LUSTRE_INSTANCE_PVC] with STATUS: Bound

    Note: [LUSTRE_INSTANCE_PVC] is the lustre_instance_id suffixed with -pvc.

  3. Example Pod: Create a file named lustre-client-pod.yaml to deploy a test pod that mounts the Lustre volume:

    apiVersion: v1
    kind: Pod
    metadata:
      name: lustre-client-pod
    spec:
      containers:
      - name: app
        image: busybox
        command: ["/bin/sh", "-c", "sleep 36000"] # Keep container running
        volumeMounts:
        - mountPath: "/mnt/lustre"
          name: lustre-volume
      volumes:
      - name: lustre-volume
        persistentVolumeClaim:
          claimName: [LUSTRE_INSTANCE_PVC] # Matches the PVC name

    Note: [LUSTRE_INSTANCE_PVC] is the lustre_instance_id suffixed with -pvc.

    Note: This is only an example pod that uses the busybox image.

  4. Deploy the Pod:

    kubectl apply -f lustre-client-pod.yaml
  5. Verify Mount: Once the pod is running, exec into it to check the mount:

    kubectl exec -it lustre-client-pod -- /bin/sh
    # Inside the pod:
    df -h /mnt/lustre
    mount | grep lustre
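
The PVC name used in steps 2 and 3 can be derived from the instance ID rather than typed by hand. The sketch below shows that derivation and a scripted check of the claim's phase. Both helpers are hypothetical, "my-lustre" is a placeholder instance ID, and `pvc_phase` needs kubectl credentials for the cluster, so it is defined here but not called.

```shell
# pvc_name LUSTRE_INSTANCE_ID
# The PVC name is the lustre_instance_id with "-pvc" appended (step 2).
pvc_name() {
  printf '%s-pvc\n' "$1"
}

# pvc_phase LUSTRE_INSTANCE_ID
# Ask the cluster for the claim's phase; "Bound" means it is ready to mount.
pvc_phase() {
  kubectl get pvc "$(pvc_name "$1")" -o jsonpath='{.status.phase}'
}

# "my-lustre" is a placeholder instance ID.
pvc_name my-lustre   # prints my-lustre-pvc
# With cluster credentials configured: pvc_phase my-lustre
```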

Clean Up

To destroy all resources created by this blueprint, run:

./gcluster destroy CLUSTER-NAME

Replace CLUSTER-NAME with the deployment_name used in blueprint vars block.

The Computer Aided Engineering (CAE) blueprint captures a reference architecture where the right cloud components are assembled to optimally cater to the requirements of computationally-intensive CAE workloads. Specifically, it is architected around Google Cloud’s VM families that provide a high memory bandwidth and a balanced memory/flop ratio, which is particularly useful for per-core licensed CAE software. The solution caters also to large CAE use cases, requiring multiple nodes that are tightly-coupled via MPI. Special high-memory shapes support even very memory-demanding workloads with up to 16GB/core. For file IO, different Google managed high performance NFS storage services are available. For very IO demanding workloads, third party parallel file systems can be integrated. The scheduling of the workloads is done by a workload manager.

The CAE blueprint is intended to be a starting point for more tailored explorations or installations of specific CAE codes, as provided by ISVs separately.

Detailed documentation is provided in this README.

Quota Requirements for cae-slurm.yaml

For this example the following is needed in the selected region:

  • Cloud Filestore API: Basic SSD capacity (GB) per region: 5,120 GB
  • Cloud Filestore API: High Scale SSD capacity (GB) per region: 10,240 GB
  • Compute Engine API: H3 CPUs: 88/node active in balance partition up to 880
  • Compute Engine API: C3-highmem CPUs: 176/node active in highmem partition up to 1,760
  • Compute Engine API: N1 CPUs: 8/node active in desktop node.
  • Compute Engine API: T4 GPUs: 1/node active in desktop node.
  • Compute Engine API: N2 CPUs: 8 for login and 16 for controller

This blueprint demonstrates how to use Cluster Toolkit to build a Slurm image on top of an existing image, hpc-rocky-linux-8 in the case of this example.

The blueprint contains 3 groups:

  1. The first group creates a network and generates the scripts that will install Slurm. This uses the Ansible Playbook contained in the Slurm on GCP repo.
  2. The second group executes the build using Packer to run the scripts from the first group. This can take ~30 min and will generate a custom Slurm image in your project.
  3. The third group deploys a demo cluster that uses the newly built image. For a real world use case the demo cluster can be swapped out for a more powerful slurm cluster from other examples.

Similar to the hpc-slurm.yaml example, but using Ubuntu 22.04 instead of CentOS 7. Other operating systems are supported by SchedMD for the Slurm on GCP project and images are listed here. Only the examples listed on this page have been tested by the Cluster Toolkit team.

The cluster will support 2 partitions named debug and compute. The debug partition is the default partition and runs on smaller n2-standard-2 nodes. The compute partition is not the default and must be selected in the srun command via the --partition flag. The compute partition runs on compute-optimized nodes of type c2-standard-60. The compute partition may require additional quota before use.

Quota Requirements for hpc-slurm-ubuntu2204.yaml

For this example the following is needed in the selected region:

  • Cloud Filestore API: Basic HDD (Standard) capacity (GB): 1,024 GB
  • Compute Engine API: Persistent Disk SSD (GB): ~50 GB
  • Compute Engine API: Persistent Disk Standard (GB): ~50 GB static + 50 GB/node up to 1,250 GB
  • Compute Engine API: N2 CPUs: 12
  • Compute Engine API: C2 CPUs: 4 for controller node and 60/node active in compute partition up to 1,204
  • Compute Engine API: Affinity Groups: one for each job in parallel - only needed for compute partition
  • Compute Engine API: Resource policies: one for each job in parallel - only needed for compute partition

This example provisions a Slurm cluster using AMD VM machine types. It automates the initial setup of Spack, including a script that can be used to install the AMD Optimizing C/C++ Compiler (AOCC) and compile OpenMPI with AOCC. It is more extensively discussed in a dedicated README for AMD examples.

This example demonstrates several different ways to use Google Cloud Storage (GCS) buckets in the Cluster Toolkit. There are two buckets referenced in the example:

  1. A GCS bucket that is created by the Cluster Toolkit (id: new-bucket).
  2. A GCS bucket that is created externally from the Cluster Toolkit but referenced by the blueprint (id: existing-bucket).

The created VM (id: workstation) references these GCS buckets with the use field. On VM startup gcsfuse will be installed, if not already on the image, and both buckets will be mounted under the directory specified by the local_mount option.

The wait-for-startup module (id: wait) makes sure that terraform does not exit before the buckets have been mounted.

To use the blueprint you must supply the project id and the name of an existing bucket:

./gcluster create community/examples/client-google-cloud-storage.yaml \
  --vars project_id=<project_id> \
  --vars existing_bucket_name=<name_of_existing_bucket>

Note: The service account used by the VM must have access to the buckets (roles/storage.objectAdmin). In this example the service account will default to the default compute service account.

Warning: In this example the bucket is mounted by root during startup. Due to the way permissions are handled by gcsfuse, this means that read or read/write permissions must be granted indiscriminately for all users, which could be a security concern depending on usage. To avoid this, you can manually mount as the user using the bucket (Read more).

Spack is an HPC software package manager. This example creates a small Slurm cluster with software installed using the spack-setup and spack-execute modules. The controller will install and configure spack, and install gromacs using spack. Spack is installed in a shared location (/sw) via filestore. This build leverages the startup-script module and can be applied in any cluster by using the output of spack-setup or startup-script modules.

The installation will occur as part of the Slurm startup-script. A warning message will be displayed upon SSHing to the login node, indicating that configuration is still active. To track the status of the overall startup script, run the following command on the login node:

sudo tail -f /var/log/messages

Spack-specific installation logs will be sent to the spack_log as configured in your blueprint, by default /var/log/spack.log on the login node.

sudo tail -f /var/log/spack.log

Once the Slurm and Spack configuration is complete, spack will be available on the login node. To use spack in the controller or compute nodes, the following command must be run first:

source /sw/spack/share/spack/setup-env.sh

To load the gromacs module, use spack:

spack load gromacs

NOTE: Installing spack compilers and libraries in this example can take hours to run on startup. To decrease this time in future deployments, consider including a spack build cache as described in the comments of the example.

Ramble is an experimentation framework which can drive the installation of software with Spack and create, execute, and analyze experiments using the installed software.

This example blueprint will deploy a Slurm cluster, install Spack and Ramble on it, and create a Ramble workspace (named gromacs).

NOTE: In this example the Ramble installation is owned by the spack-ramble user, so consider running sudo -i -u spack-ramble first.

The workspace can be set up using:

ramble workspace activate
ramble workspace setup

After setup is complete, the experiments can be executed using:

ramble workspace activate # If not active
ramble on

And after the experiments are complete, they can be analyzed using:

ramble workspace activate # If not active
ramble workspace analyze

The experiments defined by the workspace configuration are a 1, 2, 4, 8, and 16 node scaling study of the Lignocellulose benchmark for Gromacs.

This blueprint demonstrates the use of Slurm and Filestore, with compute nodes that have local SSD drives deployed.

Creates a basic auto-scaling Slurm cluster with mostly default settings. The blueprint also creates two new VPC networks, one configured for RDMA networking and the other for non-RDMA networking, along with two filestore instances mounted to /home and /apps. There is an h4d partition that uses compute-optimized h4d-highmem-192-lssd machine type.

The SchedMD Slinky Project deploys Slurm on Kubernetes. Slinky is particularly useful for:

  1. Those who prefer a Slurm workload management paradigm, but a cloud-native operational experience
  2. Those who want the flexibility of running HPC jobs with either Kubernetes-based scheduling or Slurm-based scheduling, all on the same platform

This blueprint creates a simple Slinky installation on top of Google Kubernetes Engine, with the following notable deviations from the Slinky quickstart setup:

  1. Two nodesets are implemented, following the pattern of an HPC nodeset and a debug nodeset.
  2. A login node is implemented.
  3. A lightweight, GCP-native metrics/monitoring system is adopted, rather than the Slinky-documented cluster-local Kube Prometheus Stack.
  4. Node affinities for system components, the login node, and compute nodesets are more explicitly defined, to improve stability, control, and HPC hardware utilization.

While H3 compute-optimized VMs are used for the HPC nodeset, the machine type can easily be switched (including to GPU-accelerated instances).

In order to create a static Slurm nodeset, which only requires one configuration to scale in/out (the nodeset's replicas setting), this example blueprint uses:

  • Autoscaling GKE node pools (via initial_node_count)
  • Non-autoscaling Slurm nodesets (via replicas), which sit 1:1 on top of the GKE nodes

If both of these settings were static, two changes would be required for nodeset scale outs - one at the Slurm level (nodeset replicas) and one at the infrastructure level (node pool node count) - so instead the node pool autoscales to "follow" the nodeset specification.

Scale in/out nodesets with a single kubectl command:

kubectl scale nodeset/slurm-compute-debug --replicas=5 -n slurm

Nodeset autoscaling is only possible with KEDA installation and configuration work, and this is not included in the example.

This blueprint demonstrates an advanced architecture that can be used to run GROMACS with GPUs and CPUs on Google Cloud. For full documentation, refer to this document.

This blueprint lets you create a high-throughput execution environment for Google DeepMind's AlphaFold 3 in your own GCP project. It uses the unmodified AlphaFold 3 package and provides a best-practices mapping of it to Google Cloud, leveraging Google Cloud's HPC technology.

We provide two simple examples that serve as basic templates for different ways of interacting with the AlphaFold 3 solution:

  • A Simple Job Launcher bash script that takes an AlphaFold 3 json file input (for the Datapipeline step or the Inference step) and submits it for processing to the AlphaFold 3 autoscaling Slurm cluster.
  • A Simple Service Launcher that has a central Python script that runs a loop monitoring directories on a provided GCS bucket for input files and which can be started as a system daemon on the controller-node, not requiring any user interaction with the AlphaFold 3 environment.

Before using this solution, please review the AlphaFold 3 Model Parameter Terms of Use. Please check that you/your organization are eligible for obtaining the weights and that your use falls within the allowed terms and complies with the Prohibited Use Policy.

See the AF3 Solution README for more details.

This blueprint uses GKE to provision a Kubernetes cluster with a system node pool (included in gke-cluster module) and an autoscaling compute node pool. It creates a VPC configured to be used by a VPC native GKE cluster with subnet secondary IP ranges defined.

The gke-job-template module is used to create a job file that can be submitted to the cluster using kubectl and will run on the specified node pool.

Steps to deploy the blueprint

  1. Install Cluster Toolkit

    1. Install dependencies.
    2. Set up Cluster Toolkit.
  2. Switch to the Cluster Toolkit directory

    cd cluster-toolkit
  3. Get the IP address for your host machine

    curl ifconfig.me
  4. Update the vars block of the blueprint file

    1. project_id: ID of the project where you are deploying the cluster.
    2. deployment_name: Name of the deployment.
    3. authorized_cidr: the IP address from the previous step in /32 CIDR notation (<your-ip-address>/32).
  5. Build the Cluster Toolkit binary

    make
  6. Provision the GKE cluster

    ./gcluster deploy examples/hpc-gke.yaml
  7. Run the job

    1. Connect to your cluster

      gcloud container clusters get-credentials CLUSTER_NAME --location=COMPUTE_REGION --project=PROJECT_ID
      • Update the CLUSTER_NAME to the deployment_name
      • Update the COMPUTE_REGION to the region used in blueprint vars
      • Update the PROJECT_ID to the project_id used in blueprint vars
    2. The output of the ./gcluster deploy on CLI includes a kubectl create command to create the job.

      kubectl create -f <job-yaml-path>

      This command creates a job that uses busybox image and prints Hello World. This result can be viewed by looking at the pod logs.

    3. List pods

      kubectl get pods
    4. Get the pod logs

      kubectl logs <pod-name>

Clean Up

To destroy all resources associated with creating the GKE cluster, from Cloud Shell run the following command:

./gcluster destroy CLUSTER-NAME

Replace CLUSTER-NAME with the deployment_name used in blueprint vars block.

This blueprint demonstrates how to set up a GPU GKE cluster using the Cluster Toolkit.

Warning: authorized_cidr variable must be entered for this example to work. See note below.

It includes:

  • Creation of a regional GKE cluster.

  • Creation of an autoscaling GKE node pool with g2 machines. Note: This blueprint has also been tested with a2 machines, but because a2 capacity is hard to find, the example uses g2 machines, which are easier to obtain. If using a2 machines, it is recommended to first obtain an automatic reservation.

    Example settings for a2 look like:

    source: modules/compute/gke-node-pool
    use: [gke_cluster]
    settings:
      disk_type: pd-balanced
      machine_type: a2-highgpu-2g

    Users only need to provide the machine type for the standard a2, a3 and g2 machine families; other settings such as type, count and gpu_driver_installation_config default to machine-family-specific values. See the gke-node-pool module for details.

    An N1 pool with an attached GPU sets guest_accelerator explicitly:

machine_type: n1-standard-1
guest_accelerator:
- type: nvidia-tesla-t4
  count: 1

Custom g2 pool with custom guest_accelerator config

machine_type: g2-custom-16-55296
disk_type: pd-balanced
guest_accelerator:
- type: nvidia-l4
  count: 1
  gpu_sharing_config:
    max_shared_clients_per_gpu: 2
    gpu_sharing_strategy: "TIME_SHARING"
  gpu_driver_installation_config:
    gpu_driver_version: "LATEST"
  • Configuration of the cluster using default drivers provided by GKE.
  • Creation of a job template yaml file that can be used to submit jobs to the GPU node pool.

Note: The Kubernetes API server will only allow requests from authorized networks. You must use the authorized_cidr variable to supply an authorized network which contains the IP address of the machine deploying the blueprint, for example --vars authorized_cidr=<your-ip-address>/32. This will allow Terraform to create the necessary DaemonSet on the cluster. You can use a service like whatismyip.com to determine your IP address.
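The authorized_cidr value is a single-host CIDR block built from the public IP address of the deploying machine. The following is a minimal shell sketch of that step, assuming a hypothetical helper named to_cidr (it is not part of the Cluster Toolkit) that validates an IPv4 address and appends /32:

```shell
# to_cidr is a hypothetical helper, not a Cluster Toolkit command.
# It validates a dotted-quad IPv4 address and prints it as a /32 CIDR block.
to_cidr() {
  ip=$1
  # Reject anything that is not four dot-separated groups of 1-3 digits.
  printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  # Split the address on dots and check that every octet is at most 255.
  old_ifs=$IFS
  IFS=.
  set -- $ip
  IFS=$old_ifs
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  printf '%s/32\n' "$ip"
}

# 203.0.113.7 is a documentation-only address (RFC 5737); substitute your own.
to_cidr 203.0.113.7
```

The printed value can then be supplied as --vars authorized_cidr=<your-ip-address>/32 when deploying.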

Once you have deployed the blueprint, follow the output instructions to fetch credentials for the created cluster and submit a job calling nvidia-smi.

This blueprint shows how to use different storage options with GKE in the toolkit.

Note

This blueprint also demonstrates support for Anywhere Cache. Anywhere Cache is a fully managed service that caches Cloud Storage data in Google Cloud. For each bucket, you can create a maximum of one cache per zone. For example, if a bucket is located in the us-east1 region, you could create a cache in us-east1-b and another cache in us-east1-c. For information on other parameters to enable Anywhere Cache, see Create a Cache. For more information, see the Anywhere Cache documentation.

The blueprint contains the following:

  • A K8s Job that uses a Filestore and a GCS bucket as shared file systems between pods.
  • A K8s Job that demonstrates different ephemeral storage options:
    • memory backed emptyDir
    • local SSD backed emptyDir
    • SSD persistent disk backed ephemeral volume
    • balanced persistent disk backed ephemeral volume

Note that when type local-ssd is used, the specified node pool must have local_ssd_count_ephemeral_storage specified.

When using either pd-ssd or pd-balanced ephemeral storage, a persistent disk will be created when the job is submitted. The disk will be automatically cleaned up when the job is deleted.

Note

The Kubernetes API server will only allow requests from authorized networks. The gke-persistent-volume module needs access to the Kubernetes API server to create a Persistent Volume and a Persistent Volume Claim. You must use the authorized_cidr variable to supply an authorized network which contains the IP address of the machine deploying the blueprint, for example --vars authorized_cidr=<your-ip-address>/32. You can use a service like whatismyip.com to determine your IP address.

This blueprint shows how to use managed hyperdisk storage options with GKE in the toolkit.

The blueprint contains the following:

  • A K8s Job that uses a managed hyperdisk storage volume option.
  • A K8s Job that demonstrates ML training workload with managed hyperdisk storage disk operation.
    • The sample training workload manifest will be generated under the gke-managed-hyperdisk/primary folder, as tensorflow-GUID.yaml
    • You can deploy this sample training workload using "kubectl apply -f tensorflow-GUID.yaml" to start the training

Warning: In this example blueprint, when the storage type Hyperdisk-balanced, Hyperdisk-extreme or Hyperdisk-throughput is specified in the gke-storage module, the lifecycle of the hyperdisk is managed by the blueprint. On a gcluster destroy operation, the hyperdisk storage that was created is also destroyed.

Note

The Kubernetes API server will only allow requests from authorized networks. The gke-cluster module needs access to the Kubernetes API server to create a Persistent Volume and a Persistent Volume Claim. You must use the authorized_cidr variable to supply an authorized network which contains the IP address of the machine deploying the blueprint, for example --vars authorized_cidr=<your-ip-address>/32. You can use a service like whatismyip.com to determine your IP address.

Steps to deploy the blueprint

  1. Install Cluster Toolkit

    1. Install dependencies.
    2. Set up Cluster Toolkit.
  2. Switch to the Cluster Toolkit directory

    cd cluster-toolkit
  3. Get the IP address for your host machine

    curl ifconfig.me
  4. Update the vars block of the blueprint file

    1. project_id: ID of the project where you are deploying the cluster.
    2. deployment_name: Name of the deployment.
    3. authorized_cidr: the IP address from the previous step in /32 CIDR notation (<your-ip-address>/32).
  5. Build the Cluster Toolkit binary

    make
  6. Provision the GKE cluster

    ./gcluster deploy examples/gke-managed-hyperdisk.yaml

Clean Up

To destroy all resources associated with creating the GKE cluster, from Cloud Shell run the following command:

./gcluster destroy CLUSTER-NAME

Replace CLUSTER-NAME with the deployment_name used in blueprint vars block.

Refer to AI Hypercomputer Documentation for instructions.

This blueprint shows how to provision a GKE cluster with A3 Mega machines in the toolkit. The steps are documented in Deploy an A3 Mega GKE cluster for ML training.

After the cluster and the node pool are provisioned, the following components are installed to enable GPUDirect for the A3 Mega machines.

  • NCCL plugin for GPUDirect TCPXO
  • NRI device injector plugin
  • Support for injecting the components required by GPUDirect (annotations, volumes, rxdm sidecar, etc.) into the user workload, in the form of a Kubernetes Job.
    • A sample workload is provided to show how a workload is updated with the injected components and how it can be deployed.
    • Users can apply the provided script to update and deploy their own workloads.

Note

The Kubernetes API server will only allow requests from authorized networks. The gke-cluster module needs access to the Kubernetes API server to apply a manifest. You must use the authorized_cidr variable to supply an authorized network which contains the IP address of the machine deploying the blueprint, for example --vars authorized_cidr=<your-ip-address>/32. You can use a service like whatismyip.com to determine your IP address.

Troubleshooting

Externally Managed Environment Error

If you see an error saying local-exec provisioner error or This environment is externally managed, use a virtual environment. The error is caused by a conflict between pip3 and the operating system's package manager (such as apt on Debian/Ubuntu-based systems).

  ## One time step of creating the venv
  VENV_DIR=~/venvp3
  python3 -m venv $VENV_DIR
  ## Enter your venv.
  source $VENV_DIR/bin/activate

This blueprint shows how to provision a GKE cluster with A3 High machines in the toolkit.

After the cluster and the node pool are provisioned, the following components are installed to enable GPUDirect for the A3 High machines.

  • NCCL plugin for GPUDirect TCPX
  • NRI device injector plugin
  • Support for injecting the components required by GPUDirect (annotations, volumes, rxdm sidecar, etc.) into the user workload, in the form of a Kubernetes Job, via a script.

Note

The Kubernetes API server will only allow requests from authorized networks. The gke-cluster module needs access to the Kubernetes API server to apply a manifest. You must use the authorized_cidr variable to supply an authorized network which contains the IP address of the machine deploying the blueprint, for example --vars authorized_cidr=<your-ip-address>/32. You can use a service like whatismyip.com to determine your IP address.

Troubleshooting

Externally Managed Environment Error

If you see an error saying local-exec provisioner error or This environment is externally managed, use a virtual environment. The error is caused by a conflict between pip3 and the operating system's package manager (such as apt on Debian/Ubuntu-based systems).

  ## One time step of creating the venv
  VENV_DIR=~/venvp3
  python3 -m venv $VENV_DIR
  ## Enter your venv.
  source $VENV_DIR/bin/activate

This blueprint provisions a GKE cluster with A3 High machines, pre-configured to support the GKE Inference Gateway. It automates the setup of necessary networking components, such as a proxy-only subnet, and installs the required Custom Resource Definitions (CRDs) on the cluster.

After successfully deploying this blueprint, you can proceed with deploying a sample workload with vLLM inferencing by following the official guide at Serve a model with GKE Inference Gateway.

This blueprint takes care of the initial infrastructure setup (e.g., network creation and CRD installation). You will need to follow the guide to install specific instances of InferencePool, HTTPRoute, and the Model Server deployment itself.

This folder holds multiple GKE blueprint examples that demonstrate different consumption options on GKE, covering hardware such as A3 Ultra (A3U), TPU v6e, and TPU 7x.

This blueprint provisions an auto-scaling HTCondor pool based upon the HPC VM Image.

Also see the tutorial, which walks through the use of this blueprint.

This blueprint provisions a cluster using the Slurm scheduler in a configuration tuned for the execution of many short-duration, loosely-coupled (non-MPI) jobs.

For more information see:

Monte Carlo Simulations for Value at Risk

This blueprint will take you through a tutorial on an FSI Value at Risk calculation using Cloud tools:

  • Batch
  • Pub/Sub
    • BigQuery pubsub subscription
  • BigQuery
  • Vertex AI Notebooks

See the full tutorial here.

This blueprint provisions an HPC cluster running Slurm for use with a Simcenter StarCCM+ tutorial.

The main tutorial is described on the Cluster Toolkit website.

This blueprint provisions a simple cluster for use with a Simcenter StarCCM+ tutorial.

The main tutorial is described on the Cluster Toolkit website.

This blueprint provisions a simple cluster for use with an Ansys Fluent tutorial.

The main tutorial is described on the Cluster Toolkit website.

The flux-cluster.yaml blueprint describes a flux-framework cluster where flux is deployed as the native resource manager.

See README

This blueprint demonstrates the use of the Slurm and Filestore modules in the service project of an existing Shared VPC. Before attempting to deploy the blueprint, one must first complete initial setup for provisioning Filestore in a Shared VPC service project. Depending on how the Shared VPC was created, one may have to perform a few additional manual steps to configure the VPC. One may need to create firewall rules allowing SSH in order to access the controller and login nodes. Also, since this blueprint doesn't use external IPs for compute nodes, one must set up Cloud NAT and IAP.

Next, update the blueprint to include the Shared VPC details. In the network configuration, update the Shared VPC details as shown below:

vars:
  project_id:  <service-project> # update with the ID of the service project in which the shared network will be used.
  host_project_id: <host-project> # update with the ID of the host project in which the shared network is created.
  deployment_name: hpc-small-shared-vpc
  region: us-central1
  zone: us-central1-c

deployment_groups:
- group: primary
  modules:
  - id: network1
    source: modules/network/pre-existing-vpc
    settings:
      project_id: $(vars.host_project_id)
      network_name: <shared-network> # update with the shared network name
      subnetwork_name: <shared-subnetwork> # update with the shared subnetwork name

This example shows how a TPU v6e cluster can be created and used to run a job that requires TPU capacity on GKE. Additional information on the TPU blueprint and associated changes is in this README.

This example shows how to set up an XPK-compatible GKE cluster - giving researchers a Slurm-like CLI experience but with lightweight Kueue and Kjob resources on the cluster side. The blueprint creates a low-cost, CPU-based XPK cluster, using a single n2-standard-32-2 slice.

Client-side installation of the XPK CLI is also required (see the prerequisites and installation in the XPK repository). Run gcloud config set compute/zone <zone> and gcloud config set project <project-id> so that the zone and project do not have to be repeated in XPK commands.

Attach the Filestore instance for use in workloads and jobs with the xpk storage command:

python3 xpk.py storage attach xpk-01-homefs \
  --cluster=xpk-01 \
  --type=gcpfilestore \
  --auto-mount=true \
  --mount-point=/home \
  --mount-options="" \
  --readonly=false \
  --size=1024 \
  --vol=nfsshare

After blueprint provisioning, XPK CLI installation, and storage setup, users can run interactive shells, workloads, and jobs:

# Start an interactive shell (somewhat analogous to a Slurm login node)
python3 xpk.py shell --cluster xpk-01
# Submit a workload (Kueue-based)
python3 xpk.py workload create \
  --cluster xpk-01 \
  --num-slices=1 \
  --device-type=n2-standard-32-2 \
  --workload xpk-test-workload \
  --command="ls /home"
# Run and manage jobs (kjob-focused)
python3 xpk.py run --cluster xpk-01 your-script.sh
python3 xpk.py batch --cluster xpk-01 your-script.sh
python3 xpk.py info --cluster xpk-01

This blueprint uses GKE to provision a Kubernetes cluster and an H4D node pool, along with networks and service accounts. Information about H4D machines can be found here. The deployment instructions can be found in the README.

This blueprint uses GKE to provision a Kubernetes cluster and a G4 node pool, along with networks and service accounts. Information about G4 machines can be found here. The deployment instructions can be found in the README.

This blueprint demonstrates how to provision NFS volumes as shared filesystems for compute VMs, using Google Cloud NetApp Volumes. It can be used as an alternative to Filestore in blueprints.

NetApp Volumes is a first-party Google service that provides NFS and/or SMB shared file-systems to VMs. It offers advanced data management capabilities and highly scalable capacity and performance.

NetApp Volumes provides:

  • robust support for NFSv3, NFSv4.x and SMB 2.1 and 3.x
  • a rich feature set
  • scalable performance
  • FlexCache: Caching of ONTAP-based volumes to provide high-throughput and low latency read access to compute clusters of on-premises data
  • Auto-tiering of unused data to optimize cost

Support for NetApp Volumes is split into two modules.

  • netapp-storage-pool provisions a storage pool. Storage pools are pre-provisioned storage capacity containers which host volumes. A pool also defines fundamental properties of all the volumes within, like the region, the attached network, the service level, CMEK encryption, Active Directory and LDAP settings.
  • netapp-volume provisions a volume inside an existing storage pool. A volume is a file-system which is shared using NFS or SMB. It provides advanced data management capabilities.

You can provision multiple volumes in a pool. For the Standard, Premium and Extreme service levels, the throughput capability depends on volume size and service level: every GiB of provisioned volume space adds 16, 64 or 128 KiBps of throughput capability, respectively.
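As a worked example of the per-GiB throughput figures for the Standard, Premium and Extreme service levels, the sketch below multiplies the volume size by the rate for the chosen level. The helper name volume_throughput_kibps is hypothetical, and the calculation ignores any service-side limits that may apply:

```shell
# volume_throughput_kibps is a hypothetical helper illustrating the arithmetic.
# Usage: volume_throughput_kibps <service-level> <size-in-GiB>
volume_throughput_kibps() {
  case "$1" in
    standard) per_gib=16 ;;
    premium)  per_gib=64 ;;
    extreme)  per_gib=128 ;;
    *) echo "unknown service level: $1" >&2; return 1 ;;
  esac
  # Throughput capability in KiBps = rate per GiB * provisioned GiB.
  echo $(( per_gib * $2 ))
}

# A 1024 GiB Premium volume: 64 KiBps * 1024 = 65536 KiBps (64 MiBps).
volume_throughput_kibps premium 1024
```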

Steps to deploy the blueprint

To provision the blueprint, run:

./gcluster create examples/netapp-volumes.yaml --vars "project_id=${GOOGLE_CLOUD_PROJECT}" --vars region=us-central1 --vars zone=us-central1-a
./gcluster deploy netapp-volumes

After the blueprint is deployed, you can log in to the created VM:

gcloud compute ssh --zone "us-central1-a" "netapp-volumes-0" --project ${GOOGLE_CLOUD_PROJECT} --tunnel-through-iap

A NetApp Volumes volume is provisioned and mounted to /home in all the provisioned VMs. A home directory for your user is created automatically:

pwd
df -h -t nfs

Clean Up

To destroy all resources created by this blueprint, run the following command:

./gcluster destroy netapp-volumes

This example shows how a TPU 7x cluster can be created and used to run a job that requires TPU capacity on GKE. Additional information on the TPU blueprint and associated changes is in this README.

This blueprint demonstrates how to use the gcloud community module to run arbitrary gcloud commands during deployment and destroy. It shows an example of creating and deleting a network, subnet, and VM instance.

Creates a basic auto-scaling Slurm cluster intended for EDA use cases. The blueprint also creates two new VPC networks: eda-net, which connects VMs, Slurm and storage, and an RDMA network called eda-rdma-net between the H4D nodes. It also creates four Google Cloud NetApp Volumes mounted to /home, /tools, /library and /scratch. There is an h4d partition that uses the compute-optimized h4d-highmem-192-lssd machine type.

The deployment instructions can be found in the README.

Creates a basic auto-scaling Slurm cluster intended for EDA use cases. The blueprint connects to one existing user network, which connects VMs, Slurm and storage, and creates an RDMA network called eda-rdma-net for low-latency communication between the compute nodes. There is an h4d partition that uses the compute-optimized h4d-highmem-192-lssd machine type.

Four pre-existing NFS volumes are mounted to /home, /tools, /library and /scratch. Using FlexCache volumes makes on-premises data available to Google Cloud compute without having to copy the data manually. This enables "burst to the cloud" use cases.

The deployment instructions can be found in the README.

Blueprint Schema

Similar documentation can be found on Google Cloud Docs.

A user-defined blueprint should follow this schema:

# Required: Name your blueprint.
blueprint_name: my-blueprint-name

# Top-level variables, these will be pulled from if a required variable is not
# provided as part of a module. Any variables can be set here by the user,
# labels will be treated differently as they will be applied to all created
# GCP resources.
vars:
  # Required: This will also be the name of the created deployment directory.
  deployment_name: first_deployment
  project_id: GCP_PROJECT_ID

  # https://cloud.google.com/compute/docs/regions-zones
  region: us-central1
  zone: us-central1-a

  # https://cloud.google.com/resource-manager/docs/creating-managing-labels
  labels:
    global_label: label_value

# Many modules can be added from local and remote directories.
deployment_groups:
- group: groupName
  modules:
  # Embedded module (part of the toolkit), prefixed with `modules/` or `community/modules`
  - id: <a unique id> # Required: Name of this module used to uniquely identify it.
    source: modules/role/module-name # Required
    kind: < terraform | packer > # Optional: Type of module, currently choose from terraform or packer. If not specified, `kind` will default to `terraform`
    # Optional: All configured settings for the module. For terraform, each
    # variable listed in variables.tf can be set here, and are mandatory if no
    # default was provided and are not defined elsewhere (like the top-level vars)
    settings:
      setting1: value1
      setting2:
        - value2a
        - value2b
      setting3:
        key3a: value3a
        key3b: value3b

  # GitHub module over SSH, prefixed with git@github.com
  - source: git@github.com:org/repo.git//path/to/module

  # GitHub module over HTTPS, prefixed with github.com
  - source: github.com/org/repo//path/to/module

  # Local absolute source, prefixed with /
  - source: /path/to/module

  # Local relative (to current working directory) source, prefixed with ./ or ../
  - source: ../path/to/module
  # NOTE: Do not reference toolkit modules by local source, use embedded source instead.

Writing an HPC Blueprint

The blueprint file is composed of three primary parts: top-level parameters, deployment variables and deployment groups. These are described in more detail below.

Blueprint Boilerplate

The following is a template that can be used to start writing a blueprint from scratch.

---
blueprint_name: # boilerplate-blueprint
toolkit_modules_url: # github.com/GoogleCloudPlatform/cluster-toolkit
toolkit_modules_version: # v1.38.0

vars:
  project_id: # my-project-id
  deployment_name: # boilerplate-001
  region: us-central1
  zone: us-central1-a

deployment_groups:
- group: primary
  modules:
  - id: # network1
    source: # modules/network/vpc

Top Level Parameters

  • blueprint_name (required): This name can be used to track resources and usage across multiple deployments that come from the same blueprint. blueprint_name is used as a value for the ghpc_blueprint label key, and must abide by label value naming constraints: blueprint_name must be at most 63 characters long, and can only contain lowercase letters, numeric characters, underscores and dashes.

  • toolkit_modules_url and toolkit_modules_version (optional): The blueprint schema provides the optional fields toolkit_modules_url and toolkit_modules_version to version a blueprint. When these fields are provided, any module in the blueprint with a reference to an embedded module in its source field will be updated to reference the specified GitHub source and toolkit version in the deployment folder. toolkit_modules_url specifies the base URL of the GitHub repository containing the modules and toolkit_modules_version specifies the version of the modules to use. toolkit_modules_url and toolkit_modules_version should be provided together when in use.
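The naming constraints on blueprint_name can be checked before a blueprint is deployed. The following is a minimal sketch, assuming a hypothetical helper named is_valid_blueprint_name (not a Toolkit command) that encodes only the rules stated above: 1 to 63 characters drawn from lowercase letters, digits, underscores and dashes.

```shell
# is_valid_blueprint_name is a hypothetical helper, not part of the Toolkit.
# It succeeds only when the name is 1-63 characters long and consists of
# lowercase letters, digits, underscores and dashes.
is_valid_blueprint_name() {
  # LC_ALL=C keeps [a-z] from matching uppercase letters in some locales.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9_-]{1,63}$'
}

if is_valid_blueprint_name "my-blueprint-name"; then
  echo "valid"
else
  echo "invalid"
fi
```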

Deployment Variables

vars:
  region: "us-west1"
  labels:
    "user-defined-deployment-label": "slurm-cluster"
  ...

Deployment variables are set under the vars field at the top level of the blueprint file. These variables can be explicitly referenced in modules as Blueprint Variables. Any module setting (inputs) not explicitly provided and matching exactly a deployment variable name will automatically be set to these values.

Deployment variables should be used with care. Module default settings with the same name as a deployment variable and not explicitly set will be overwritten by the deployment variable.

Deployment Variable "labels"

The “labels” deployment variable is a special case as it will be appended to labels found in module settings, whereas normally an explicit module setting would be left unchanged. This ensures that deployment-wide labels can be set alongside module specific labels. Precedence is given to the module specific labels if a collision occurs. Default module labels will still be overwritten by deployment labels.
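The precedence rule for explicitly set labels can be illustrated with a small sketch. The helper merge_labels below is hypothetical and only mimics the described behavior for labels written as key=value lines; it is not how the Toolkit implements label handling, and it does not model module default labels.

```shell
# merge_labels is a hypothetical helper that mimics the precedence described
# above. Both arguments are newline-separated key=value lists:
#   $1 = deployment-wide labels, $2 = module-specific labels.
merge_labels() {
  deployment_labels=$1
  module_labels=$2
  # Feed deployment labels first and module labels second; for a repeated key
  # the later (module) value replaces the earlier one.
  printf '%s\n%s\n' "$deployment_labels" "$module_labels" |
    awk -F= 'NF >= 2 { value[$1] = substr($0, length($1) + 2) }
             END { for (key in value) print key "=" value[key] }' |
    sort
}

# "env" is set in both places, so the module value (prod) is kept, while the
# deployment-wide "team" label is appended.
merge_labels "team=hpc
env=dev" "env=prod
ghpc_role=compute"
```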

The Cluster Toolkit uses special reserved labels for monitoring each deployment. These are set automatically, but can be overridden in vars or module settings. They include:

  • ghpc_blueprint: The name of the blueprint the deployment was created from
  • ghpc_deployment: The name of the specific deployment
  • ghpc_role: See below

A module role is a default label applied to modules (ghpc_role), which conveys what role that module plays within a larger HPC environment.

The modules provided with the Cluster Toolkit have been divided into roles matching the names of folders in the modules/ and community/modules directories (compute, file-system etc.).

When possible, custom modules should use these roles so that they match other modules defined by the toolkit. If a custom module does not fit into these roles, a new role can be defined.

Deployment Groups

Deployment groups allow distinct sets of modules to be defined and deployed as a group. A deployment group can only contain modules of a single kind, for example a deployment group may not mix packer and terraform modules.

For terraform modules, a top-level main.tf will be created for each deployment group so different groups can be created or destroyed independently.

A deployment group is made of 2 fields, group and modules. They are described in more detail below.

Group

Defines the name of the group. Each group must have a unique name. The name will be used to create the subdirectory in the deployment directory.

Modules

Modules are the building blocks of an HPC environment. They can be composed in a blueprint file to create complex deployments. Several modules are provided by default in the modules folder.

To learn more about how to refer to a module in a blueprint file, please consult the modules README file.

Variables, expressions, and functions

Variables can be used to refer both to values defined elsewhere in the blueprint and to the output and structure of other modules.

Note

"Bracket-less" access to elements of a collection is not supported; use brackets instead. For example, write pink.lime[0].salmon rather than pink.lime.0.salmon.

Blueprint expressions

Expressions in a blueprint file can refer to deployment variables or the outputs of other modules. Expressions can only be used within vars, module settings, and terraform_backend blocks. The entire expression is wrapped in $(). The syntax is as follows:

vars:
  zone: us-central1-a
  num_nodes: 2

deployment_groups:
  - group: primary
    modules:
      - id: resource1
        source: path/to/module/1
        ...
      - id: resource2
        source: path/to/module/2
        ...
        settings:
          key1: $(vars.zone)
          key2: $(resource1.name)
          # access nested fields
          key3: $(resource1.nodes[0].private_ip)
          # arithmetic expression
          key4: $(vars.num_nodes + 5)
          # string interpolation
          key5: $(resource1.name)_$(vars.zone)
          # multiline string interpolation
          key6: |
            #!/bin/bash
            echo "Hello $(vars.project_id) from $(vars.region)"
          # use a function, supported by Terraform
          key7: $(jsonencode(resource1.config))
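Expressions are also permitted in terraform_backend blocks. A minimal sketch, assuming a hypothetical deployment variable `state_bucket` that names an existing GCS bucket:

```yaml
vars:
  state_bucket: my-terraform-state   # hypothetical bucket name

deployment_groups:
  - group: primary
    terraform_backend:
      type: gcs
      configuration:
        bucket: $(vars.state_bucket)   # resolved from the deployment variable above
    modules:
      - id: resource1
        source: path/to/module/1       # placeholder path
```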

Escape expressions

Under circumstances where the expression notation conflicts with the content of a setting or string, for instance when defining a startup-script runner that uses a subshell like in the example below, a non-quoted backslash (\) can be used as an escape character. It preserves the literal value of the next character that follows: \$(not.bp_var) evaluates to $(not.bp_var).

deployment_groups:
  - group: primary
    modules:
      - id: resource1
        source: path/to/module/1
        settings:
          key1: |
            #!/bin/bash
            echo \$(cat /tmp/file1)    ## Evaluates to "echo $(cat /tmp/file1)"

Functions

Blueprints support a number of functions that can be used within expressions to manipulate variables:

  • merge, flatten - same as Terraform's functions with the same names
  • ghpc_stage - copies the referenced file to the deployment directory

Expressions in the settings block of Terraform modules can additionally use any function available in Terraform.
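A small sketch of merge and flatten inside blueprint expressions follows. All variable names, setting keys, and the module path are hypothetical; the functions behave as their Terraform counterparts do:

```yaml
vars:
  base_tags:
    owner: hpc-team
  extra_tags:
    purpose: benchmark
  zones_a: [us-central1-a]
  zones_b: [us-central1-b, us-central1-c]

deployment_groups:
  - group: primary
    modules:
      - id: resource1
        source: path/to/module/1   # placeholder path
        settings:
          # merge combines the two maps: {owner: hpc-team, purpose: benchmark}
          key1: $(merge(vars.base_tags, vars.extra_tags))
          # flatten collapses the nested lists into one list of three zones
          key2: $(flatten([vars.zones_a, vars.zones_b]))
```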

ghpc_stage

Using local files in a blueprint can be challenging: relative paths may become invalid relative to the deployment directory, or the deployment directory may be moved to another machine.

To avoid these issues, the ghpc_stage function can be used to copy a file (or a whole directory) to the deployment directory. The returned value is the path to the staged file relative to the root of the deployment group directory.

  ...
  - id: script
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: hi.sh
        source: $(ghpc_stage("path/relative/to/blueprint/hi.sh"))
        # or stage the whole directory
        # source: $(ghpc_stage("path"))/hi.sh
        # or use it as input to another function
        # content: $(file(ghpc_stage("path/hi.sh")))

The ghpc_stage function will always look first in the path specified in the blueprint. If the file is not found at this path then ghpc_stage will look for the staged file in the deployment folder, if a deployment folder exists. This means that you can redeploy a blueprint (gcluster deploy <blueprint> -w) so long as you have the deployment folder from the original deployment, even if locally referenced files are not available.

Completed Migration to Slurm-GCP v6

Slurm-GCP v5 users should read Slurm-GCP v5 EOL for information on v5 retirement and feature highlights for v6. Slurm-GCP v6 is the only supported option within the Toolkit.