This blueprint uses GKE to provision a Kubernetes cluster and a G4 node pool, along with networks and service accounts. More information about G4 machines can be found here:
NOTE: The required GKE version for G4 support is >= 1.32.11-gke.1174000.
- Install Cluster Toolkit:
  - Install dependencies.
  - Set up Cluster Toolkit.
- Switch to the Cluster Toolkit directory:

  ```shell
  cd cluster-toolkit
  ```

- Get the IP address for your host machine:

  ```shell
  curl ifconfig.me
  ```
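The IP address returned above is used later as the `authorized_cidr` value. A minimal sketch of building that value (the IP here is hard-coded for illustration; in practice use the output of `curl ifconfig.me`):

```shell
# Sketch: turn the host IP into the /32 CIDR the blueprint expects.
# MY_IP is a placeholder value; substitute your own address.
MY_IP="203.0.113.7"
AUTHORIZED_CIDR="${MY_IP}/32"
echo "${AUTHORIZED_CIDR}"
```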
- Create a Cloud Storage bucket to store the state of the Terraform deployment:

  ```shell
  gcloud storage buckets create gs://BUCKET_NAME \
      --default-storage-class=STANDARD \
      --location=COMPUTE_REGION \
      --uniform-bucket-level-access
  gcloud storage buckets update gs://BUCKET_NAME --versioning
  ```

  Replace the following variables:

  - `BUCKET_NAME`: the name of the new Cloud Storage bucket.
  - `COMPUTE_REGION`: the compute region where you want to store the state of the Terraform deployment.
- Update the `vars` block of the `gke-g4-deployment.yaml` file:
  - `project_id`: ID of the project where you are deploying the cluster.
  - `deployment_name`: Name of the deployment.
  - `region`: Compute region used for the deployment.
  - `zone`: Compute zone used for the deployment.
  - `machine_type`: The VM shape. See allowed values at https://cloud.google.com/compute/docs/gpus#rtx-6000-gpus.
  - `num_gpus`: Number of GPUs in the VM. See https://cloud.google.com/compute/docs/gpus#rtx-6000-gpus.
  - `static_node_count`: Number of nodes to create.
  - `authorized_cidr`: Update the IP address in `<your-ip-address>/32`.
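As a sketch, a filled-in `vars` block might look like the following. Every value is hypothetical; check the machine-type page linked above for valid `machine_type`/`num_gpus` combinations, and substitute your own project, region, and IP address:

```shell
# Write a hypothetical vars block to a scratch file; all values below are
# illustrative placeholders, not recommended settings.
cat > /tmp/gke-g4-vars-example.yaml <<'EOF'
vars:
  project_id: my-gcp-project        # hypothetical project ID
  deployment_name: gke-g4-demo
  region: us-central1
  zone: us-central1-a
  machine_type: g4-standard-96
  num_gpus: 4                       # must match the machine type
  static_node_count: 2
  authorized_cidr: 203.0.113.7/32   # your host IP + /32
EOF
cat /tmp/gke-g4-vars-example.yaml
```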
- Build the Cluster Toolkit binary:

  ```shell
  make
  ```
- Provision the GKE cluster:

  ```shell
  ./gcluster deploy -d examples/gke-g4/gke-g4-deployment.yaml examples/gke-g4/gke-g4.yaml
  ```

  These four options are displayed:

  ```text
  (D)isplay full proposed changes, (A)pply proposed changes, (S)top and exit, (C)ontinue without applying
  ```

  Type `a` and press Enter to create the cluster.
This directory contains a manifest to run NVIDIA NCCL performance tests on the GKE G4 cluster.
As RDMA networking and the Google gIB plugin are not supported for G4 machines, the G4 instances use standard TCP/IP networking. The NCCL test provided here is configured to build from source. It uses the nvidia/cuda development image to clone and compile nccl-tests at runtime, ensuring the latest compatible tests are run.
- Deploy the GKE G4 Cluster: Ensure you have deployed the cluster using the `gke-g4` blueprint.
- Configure the Test Manifest: Open `nccl-test.yaml` and update the following fields to match your cluster configuration:
  - `cloud.google.com/gke-nodepool`: Ensure this matches your deployed node pool name (the default in the blueprint is `g4-standard-96-g4-pool`).
  - `nvidia.com/gpu` (limits/requests): Set this to the number of GPUs on your node (e.g., 1, 4, or 8).
  - Command argument `-g 2`: Update the `-g` flag in the command to match the number of GPUs.
  - `NCCL_P2P_LEVEL`: Set this to `SYS` if using 8-GPU `g4-standard-384` machines; otherwise leave it as `PHB`.
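The edits above can also be scripted. The sketch below uses `sed` on a stand-in fragment of the manifest; the field names follow the list above, but the exact layout of your `nccl-test.yaml` may differ, so treat this as illustrative rather than a drop-in command:

```shell
# Sketch: bump the GPU count in a manifest fragment with sed.
# The heredoc stands in for the relevant lines of nccl-test.yaml.
GPUS=4
cat > /tmp/nccl-test-fragment.yaml <<'EOF'
        resources:
          limits:
            nvidia.com/gpu: 2
        args: ["-b", "8", "-e", "8G", "-f", "2", "-g", "2"]
EOF
# Rewrite the GPU resource limit and the -g flag to match GPUS.
sed -i "s|nvidia.com/gpu: .*|nvidia.com/gpu: ${GPUS}|" /tmp/nccl-test-fragment.yaml
sed -i "s|\"-g\", \"[0-9]*\"|\"-g\", \"${GPUS}\"|" /tmp/nccl-test-fragment.yaml
cat /tmp/nccl-test-fragment.yaml
```

To use this against the real file, point the `sed` commands at your copy of `nccl-test.yaml` after confirming the patterns match its contents.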
- Apply the Job:

  ```shell
  kubectl apply -f examples/gke-g4/nccl-test.yaml
  ```
- View Results: Wait for the job to complete, then check the logs:

  ```shell
  # Find the pod name
  kubectl get pods
  # View logs
  kubectl logs <POD_NAME>
  ```

  You should see output indicating the bus bandwidth achieved during the `all_reduce_perf` test.
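The summary line that nccl-tests prints can be pulled out of saved logs. A minimal sketch using a canned log excerpt (the bandwidth figure is made up; in practice pipe `kubectl logs <POD_NAME>` into the `awk` command):

```shell
# Sketch: extract the average bus bandwidth from saved NCCL test logs.
# The excerpt below mimics the summary lines nccl-tests prints at the end.
cat > /tmp/nccl-log.txt <<'EOF'
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 42.13
EOF
awk -F: '/Avg bus bandwidth/ {gsub(/ /, "", $2); print $2}' /tmp/nccl-log.txt
```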
To destroy all resources associated with creating the GKE cluster, run the following command:

```shell
./gcluster destroy CLUSTER-NAME
```

Replace `CLUSTER-NAME` with the name of your cluster. For clusters created with Cluster Toolkit, the cluster name is based on the `deployment_name` set in the `vars` block of the deployment file.