GKE G4 Blueprint

This blueprint uses GKE to provision a Kubernetes cluster and a G4 node pool, along with the supporting networks and service accounts. More information about G4 machines is available in the Google Cloud documentation.

NOTE: The required GKE version for G4 support is >= 1.32.11-gke.1174000.
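
As a rough way to check a version string against this minimum, the sketch below compares two GKE versions using version-aware sorting. The helper name version_ge and the candidate version are illustrative assumptions, not part of the blueprint, and the sketch assumes a sort implementation that supports the -V option (GNU coreutils does).

```shell
#!/bin/sh
# Minimum GKE version for G4 support, taken from the note above.
MIN_GKE_VERSION="1.32.11-gke.1174000"

# version_ge A B: succeed when version A is greater than or equal to B.
# Hypothetical helper; relies on `sort -V` (version sort) to order the two.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Illustrative candidate only; substitute the version reported for your cluster.
candidate="1.33.0-gke.1000000"
if version_ge "$candidate" "$MIN_GKE_VERSION"; then
  echo "$candidate meets the G4 minimum"
else
  echo "$candidate is older than $MIN_GKE_VERSION"
fi
```

The versions offered in a given location can be listed with gcloud container get-server-config and passed through the same comparison.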

Steps to deploy the G4 blueprint

Install Cluster Toolkit
1. Install dependencies.
2. Set up Cluster Toolkit.
Switch to the Cluster Toolkit directory:
```
cd cluster-toolkit
```
Get the public IP address of your host machine:
```
curl ifconfig.me
```
Create a Cloud Storage bucket to store the state of the Terraform deployment:
```
gcloud storage buckets create gs://BUCKET_NAME \
--default-storage-class=STANDARD \
--location=COMPUTE_REGION \
--uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning
```
Replace the following variables:
- BUCKET_NAME: the name of the new Cloud Storage bucket.
- COMPUTE_REGION: the compute region where you want to store the state of the Terraform deployment.

Update the vars block of the gke-g4-deployment.yaml file:
1. project_id: ID of the project where you are deploying the cluster.
2. deployment_name: Name of the deployment.
3. region: Compute region used for the deployment.
4. zone: Compute zone used for the deployment.
5. machine_type: The VM shape. See allowed values at https://cloud.google.com/compute/docs/gpus#rtx-6000-gpus.
6. num_gpus: Number of GPUs attached to the VM. The count for each machine type is listed at https://cloud.google.com/compute/docs/gpus#rtx-6000-gpus.
7. static_node_count: Number of nodes to create.
8. authorized_cidr: Replace <your-ip-address> in <your-ip-address>/32 with the IP address of your host machine.
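
The authorized_cidr value is the host IP address obtained earlier with a /32 suffix. The sketch below shows one way to build it, assuming the address is IPv4 (ifconfig.me can return an IPv6 address on some networks, which this check rejects). The helper name to_cidr is hypothetical, and the sample address is a documentation-only value from RFC 5737.

```shell
#!/bin/sh
# to_cidr IP: print IP/32 if IP is a dotted-quad IPv4 address, else fail.
# Hypothetical helper, not part of Cluster Toolkit.
to_cidr() {
  ip="$1"
  # Each octet must be in the range 0-255.
  octet='(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
  if printf '%s\n' "$ip" | grep -Eq "^(${octet}\.){3}${octet}\$"; then
    printf '%s/32\n' "$ip"
  else
    echo "not an IPv4 address: $ip" >&2
    return 1
  fi
}

# Documentation-only sample address; in practice pass the output of
# `curl ifconfig.me` instead.
to_cidr 203.0.113.7
```

The printed value can then be pasted into the authorized_cidr field of the vars block.
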
Build the Cluster Toolkit binary:
```
make
```

Provision the GKE cluster

```
./gcluster deploy -d examples/gke-g4/gke-g4-deployment.yaml examples/gke-g4/gke-g4.yaml
```

These four options are displayed:

(D)isplay full proposed changes,
(A)pply proposed changes,
(S)top and exit,
(C)ontinue without applying

Type a and press Enter to apply the changes and create the cluster.

NCCL Tests for GKE G4

This directory contains a manifest to run NVIDIA NCCL performance tests on the GKE G4 cluster.

Overview

Because G4 machines support neither RDMA networking nor the Google gIB plugin, G4 instances use standard TCP/IP networking. The NCCL test provided here is configured to build from source: it uses the nvidia/cuda development image to clone and compile nccl-tests at runtime, so that the latest compatible tests are run.

Running the Test

Deploy the GKE G4 Cluster: Ensure you have deployed the cluster using the gke-g4 blueprint.
Configure the Test Manifest: Open nccl-test.yaml and update the following fields to match your cluster configuration:
- cloud.google.com/gke-nodepool: Ensure this matches your deployed nodepool name (default in blueprint is g4-standard-96-g4-pool).
- nvidia.com/gpu (limits/requests): Set this to the number of GPUs on your node (for example, 1, 4, or 8).
- Command argument -g 2: Update the -g flag in the command to match the number of GPUs.
- NCCL_P2P_LEVEL: Set this to "SYS" if you are using 8-GPU g4-standard-384 machines. Otherwise, leave it as "PHB".
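
The NCCL_P2P_LEVEL rule in the list above can be written as a small lookup. This is only a sketch that encodes the rule as stated in this README; the helper name nccl_p2p_level is hypothetical and is not used by the manifest.

```shell
#!/bin/sh
# nccl_p2p_level MACHINE_TYPE: print the NCCL_P2P_LEVEL value this README
# suggests for the given G4 machine type. Hypothetical helper.
nccl_p2p_level() {
  case "$1" in
    g4-standard-384) echo "SYS" ;;  # 8-GPU machine shape
    *)               echo "PHB" ;;  # any other machine shape
  esac
}

# Example: look up the value for a g4-standard-96 node pool.
nccl_p2p_level g4-standard-96
```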

Apply the Job:

```
kubectl apply -f examples/gke-g4/nccl-test.yaml
```

View Results: Wait for the job to complete, then check the logs:
```
# Find the pod name
kubectl get pods

# View logs
kubectl logs <POD_NAME>
```
You should see output indicating the bus bandwidth achieved during the all_reduce_perf test.
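
The nccl-tests binaries normally end their output with a summary line that begins with "# Avg bus bandwidth". Assuming that format, a small filter such as the hypothetical avg_busbw helper below can pull the figure out of the pod logs. Treat it as a sketch, since the exact log layout may differ between nccl-tests versions.

```shell
#!/bin/sh
# avg_busbw: read nccl-tests output on stdin and print the average bus
# bandwidth value. Hypothetical helper; assumes a summary line of the form
# "# Avg bus bandwidth    : <number>".
avg_busbw() {
  awk -F':' '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage (needs access to the cluster, so it is left commented out here):
#   kubectl logs <POD_NAME> | avg_busbw
```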

Clean Up

To destroy all resources associated with creating the GKE cluster, run the following command:

```
./gcluster destroy CLUSTER-NAME
```

Replace CLUSTER-NAME with the name of your cluster. For clusters created with Cluster Toolkit, the cluster name is based on the deployment_name set in the vars block of the deployment file.
