|
| 1 | +# Configuring the environment for running benchmark recipes on a GKE Cluster with A4 High Node Pools |
| 2 | + |
| 3 | +This [guide](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute) outlines the steps to configure the environment required to run benchmark recipes on a [Google Kubernetes Engine (GKE) cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview) with [A4 High](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) node pools. |
| 4 | + |
| 5 | +## Prerequisites |
| 6 | + |
| 7 | +Before you begin, ensure you have completed the following: |
| 8 | + |
| 9 | +1. Create a Google Cloud project with billing enabled. |
| 10 | + |
| 11 | + a. To create a project, see [Creating and managing projects](https://cloud.google.com/resource-manager/docs/creating-managing-projects). |
| 12 | + b. To enable billing, see [Verify the billing status of your projects](https://cloud.google.com/billing/docs/how-to/verify-billing-enabled). |
| 13 | + |
| 14 | +2. Enable the following APIs: |
| 15 | + |
| 16 | + - [Service Usage API](https://console.cloud.google.com/apis/library/serviceusage.googleapis.com). |
| 17 | + - [Google Kubernetes Engine API](https://console.cloud.google.com/flows/enableapi?apiid=container.googleapis.com). |
| 18 | + - [Cloud Storage API](https://console.cloud.google.com/flows/enableapi?apiid=storage.googleapis.com). |
| 19 | + - [Artifact Registry API](https://console.cloud.google.com/flows/enableapi?apiid=artifactregistry.googleapis.com). |
| 20 | + |
| 21 | +3. Request enough GPU quota. Each `a4-highgpu-8g` machine has 8 B200 GPUs attached. |
| 22 | + 1. To view quotas, see [View the quotas for your project](https://cloud.google.com/docs/quotas/view-manage). |
| 23 | + In the Filter field, select **Dimensions (e.g. location)** and |
| 24 | + specify [`gpu_family:NVIDIA_B200`](https://cloud.google.com/compute/resource-usage#gpu_quota). |
| 25 | + 1. If you don't have enough quota, [request a higher quota](https://cloud.google.com/docs/quotas/view-manage#requesting_higher_quota). |
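As an alternative to the console links above, the four APIs can likely be enabled from the command line. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The service names are taken from the console links above; `enable_required_apis` is a hypothetical helper name used only for illustration.

```bash
# Hypothetical helper: enable the four APIs listed in the prerequisites
# for a single project. The project ID is supplied by the caller.
enable_required_apis() {
  local project_id="$1"
  gcloud services enable \
    serviceusage.googleapis.com \
    container.googleapis.com \
    storage.googleapis.com \
    artifactregistry.googleapis.com \
    --project="${project_id}"
}

# Usage (substitute your own project ID):
# enable_required_apis "PROJECT_ID"
```

Wrapping the command in a function keeps the list of services in one place, so the same call can be reused for several projects.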
| 26 | + |
| 27 | +## Reserve capacity |
| 28 | + |
| 29 | +To ensure that your workloads have the A4 High GPU resources required for these |
| 30 | +instructions, you can create a [future reservation request](https://cloud.google.com/compute/docs/instances/future-reservations-overview). |
| 31 | +With this request, you can reserve blocks of capacity for a defined duration in the |
| 32 | +future. At that date and time in the future, Compute Engine automatically |
| 33 | +provisions the blocks of capacity by creating on-demand reservations that you |
| 34 | +can immediately consume by provisioning node pools for this cluster. |
| 35 | + |
| 36 | +Additionally, as your reserved capacity might span multiple |
| 37 | +[blocks](https://cloud.google.com/ai-hypercomputer/docs/terminology#block), we recommend that you create |
| 38 | +GKE nodes on a specific block within your reservation. |
| 39 | + |
| 40 | +Complete the following steps to request capacity and gather the required information |
| 41 | +to create nodes on a specific block within your reservation: |
| 42 | + |
| 43 | +1. [Request capacity](https://cloud.google.com/ai-hypercomputer/docs/request-capacity). |
| 44 | + |
| 45 | +1. To get the names of the blocks that are available for your reservation, |
| 46 | + run the following command: |
| 47 | + |
| 48 | + ```sh |
| 49 | + gcloud beta compute reservations blocks list RESERVATION_NAME \ |
| 50 | + --zone=ZONE --format "value(name)" |
| 51 | + ``` |
| 52 | + Replace the following: |
| 53 | + |
| 54 | + * `RESERVATION_NAME`: the name of your reservation. |
| 55 | + * `ZONE`: the compute zone of your reservation. |
| 56 | + |
| 57 | + The output has the following format: `BLOCK_NAME`. |
| 58 | + For example, the output might be similar to the following: `example-res1-block-0001`. |
| 59 | + |
| 60 | +1. If you want to target specific blocks within a reservation when |
| 61 | + provisioning GKE node pools, you must specify the full reference |
| 62 | + to your block as follows: |
| 63 | + |
| 64 | + ```none |
| 65 | + RESERVATION_NAME/reservationBlocks/BLOCK_NAME |
| 66 | + ``` |
| 67 | + |
| 68 | + For example, using the example output in the preceding step, the full path is as follows: `example-res1/reservationBlocks/example-res1-block-0001` |
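The two steps above (listing the block names and building the full reference) can be combined in a small script. This is a minimal sketch under the assumption that the `gcloud beta compute reservations blocks list` command shown earlier prints one block name per line. The function names `reservation_block_ref` and `list_reservation_block_refs` are hypothetical.

```bash
# Hypothetical helper: build the full block reference from a reservation
# name and a block name, in the RESERVATION_NAME/reservationBlocks/BLOCK_NAME form.
reservation_block_ref() {
  printf '%s/reservationBlocks/%s\n' "$1" "$2"
}

# Hypothetical helper: list every block in a reservation and print the
# full reference for each one. Assumes one block name per output line.
list_reservation_block_refs() {
  local reservation="$1"
  local zone="$2"
  local block
  gcloud beta compute reservations blocks list "${reservation}" \
    --zone="${zone}" --format="value(name)" |
  while IFS= read -r block; do
    if [ -n "${block}" ]; then
      reservation_block_ref "${reservation}" "${block}"
    fi
  done
}

# Usage (substitute your own reservation name and zone):
# list_reservation_block_refs "RESERVATION_NAME" "ZONE"
```

With the example names from the preceding steps, `reservation_block_ref "example-res1" "example-res1-block-0001"` prints `example-res1/reservationBlocks/example-res1-block-0001`.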
| 69 | + |
| 70 | +## The environment |
| 71 | + |
| 72 | +The environment comprises the following components: |
| 73 | + |
| 74 | +- Client workstation: used to prepare, submit, and monitor ML workloads. |
| 75 | +- [Google Cloud Storage (GCS) Bucket](https://cloud.google.com/storage/docs): used for storing |
| 76 | + datasets and logs. |
| 77 | +- [Artifact Registry](https://cloud.google.com/artifact-registry/docs/overview): serves as a |
| 78 | + private container registry for storing and managing Docker images used in the deployment. |
| 79 | +- [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview) |
| 80 | + Cluster with A4 High Node Pools: provides a managed Kubernetes environment to run benchmark |
| 81 | + recipes. |
| 82 | + |
| 83 | +## Set up the client workstation |
| 84 | + |
| 85 | +You can use either Google Cloud Shell or a local machine as the client workstation. |
| 86 | + |
| 87 | +### Google Cloud Shell |
| 88 | + |
| 89 | +We recommend using [Google Cloud Shell](https://cloud.google.com/shell/docs) as it |
| 90 | +comes with all necessary components pre-installed. |
| 91 | + |
| 92 | +### Local client |
| 93 | +If you prefer to use your local machine, ensure that it has the following |
| 94 | +components installed: |
| 95 | + |
| 96 | +1. Google Cloud SDK. To install, see |
| 97 | + [Install the gcloud CLI](https://cloud.google.com/sdk/docs/install). |
| 98 | +2. kubectl. To install, see the |
| 99 | + [Kubernetes documentation](https://kubernetes.io/docs/tasks/tools/#kubectl). |
| 100 | +3. Helm. To install, see the [Helm documentation](https://helm.sh/docs/intro/quickstart/). |
| 101 | +4. Docker. To install, see the [Docker documentation](https://docs.docker.com/engine/install/). |
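Before continuing on a local machine, it can help to confirm that the four tools listed above are on your `PATH`. The following is a minimal sketch; `missing_tools` is a hypothetical helper name, and the check only confirms that each command exists, not that its version is suitable.

```bash
# Hypothetical helper: print the name of every given command that
# cannot be found on the PATH. Prints nothing when all are present.
missing_tools() {
  local tool
  for tool in "$@"; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      printf '%s\n' "${tool}"
    fi
  done
}

# Usage: check the tools required by this guide.
# missing_tools gcloud kubectl helm docker
```

An empty result means all listed commands were found; any printed name still needs to be installed by using the links above.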
| 102 | + |
| 103 | + |
| 104 | +## Set up a Google Cloud Storage bucket |
| 105 | + |
| 106 | +```bash |
| 107 | +gcloud storage buckets create gs://<BUCKET_NAME> --location=<BUCKET_LOCATION> --no-public-access-prevention |
| 108 | +``` |
| 109 | + |
| 110 | +Replace the following: |
| 111 | + |
| 112 | +- `BUCKET_NAME`: the name of your bucket. The name must comply with the |
| 113 | + [Cloud Storage bucket naming conventions](https://cloud.google.com/storage/docs/buckets#naming). |
| 114 | +- `BUCKET_LOCATION`: the location of your bucket. The bucket must be located in |
| 115 | + the same region as the GKE cluster. |
| 116 | + |
| 117 | +Add an IAM binding that allows workloads authenticated through Workload Identity (with the `default` Kubernetes service account in the `default` namespace) to access Cloud Storage objects: |
| 118 | + |
| 119 | + ```bash |
| 120 | + PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)") |
| 121 | + gcloud storage buckets add-iam-policy-binding gs://<BUCKET_NAME> \ |
| 122 | + --role=roles/storage.objectUser \ |
| 123 | + --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/default/sa/default \ |
| 124 | + --condition=None |
| 125 | + ``` |
| 126 | + |
| 127 | +Replace the following: |
| 128 | + |
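| 129 | +- `BUCKET_NAME`: the name of the bucket that you created in the previous step. The command also assumes that the `PROJECT_ID` shell variable is set to your Google Cloud project ID. |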
| 130 | + |
| 131 | +## Set up an Artifact Registry |
| 132 | + |
| 133 | +- If you use Cloud KMS for repository encryption, create your artifact registry by using the |
| 134 | + [instructions here](https://cloud.google.com/artifact-registry/docs/repositories/create-repos#create-repo-gcloud-docker). |
| 135 | +- If you don't use Cloud KMS, you can create your repository by using the following command: |
| 136 | + |
| 137 | + ```bash |
| 138 | + gcloud artifacts repositories create <REPOSITORY> \ |
| 139 | + --repository-format=docker \ |
| 140 | + --location=<LOCATION> \ |
| 141 | + --description="<DESCRIPTION>" |
| 142 | + ``` |
| 143 | + Replace the following: |
| 144 | + |
| 145 | + - `REPOSITORY`: the name of the repository. For each repository location in a project, |
| 146 | + repository names must be unique. |
| 147 | + - `LOCATION`: the regional or multi-regional location for the repository. You can omit this |
| 148 | + flag if you set a default region. |
| 149 | + - `DESCRIPTION`: a description of the repository. Don't include sensitive data because |
| 150 | + repository descriptions are not encrypted. |
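Once the repository exists, Docker images are addressed with a path of the form `LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG`. The following is a minimal sketch of how a workstation might build that path and authenticate Docker to the registry host. The function names, `IMAGE`, and `TAG` are hypothetical placeholders that do not appear elsewhere in this guide.

```bash
# Hypothetical helper: build the Artifact Registry image path from
# location, project ID, repository, image name, and tag.
artifact_image_uri() {
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical helper: register gcloud as a Docker credential helper
# for the registry host of the given location.
configure_docker_auth() {
  gcloud auth configure-docker "$1-docker.pkg.dev"
}

# Usage (substitute your own values):
# configure_docker_auth "LOCATION"
# docker push "$(artifact_image_uri LOCATION PROJECT_ID REPOSITORY IMAGE TAG)"
```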
| 151 | + |
| 152 | + |
| 153 | +## Create a GKE Cluster with A4 High Node Pools |
| 154 | + |
| 155 | +Follow [this guide]() for |
| 156 | +detailed instructions to create a GKE cluster with A4 High node pools and required GPU driver versions. |
| 157 | + |
| 158 | +The guide uses [Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/overview) to create your GKE cluster quickly while incorporating best practices. The deployment includes the following: |
| 159 | + |
| 160 | +- Creation of the necessary VPC networks and subnets. |
| 161 | +- Creation of a GKE cluster with multi-networking enabled. |
| 162 | +- Creation of an A4 High node pool with NVIDIA B200 GPUs. |
| 163 | +- Installation of the required components for GPUDirect-RDMA and NCCL plugin. |
| 164 | + |
| 165 | +1. [Launch Cloud Shell](https://cloud.google.com/shell/docs/launching-cloud-shell). You can use a |
| 166 | + different environment; however, we recommend Cloud Shell because the |
| 167 | + dependencies for Cluster Toolkit are pre-installed. If you |
| 168 | + don't want to use Cloud Shell, follow the instructions to [install |
| 169 | + dependencies](https://cloud.google.com/cluster-toolkit/docs/setup/install-dependencies) to prepare a |
| 170 | + different environment. |
| 171 | + |
| 172 | +1. Clone the Cluster Toolkit from the git repository: |
| 173 | + |
| 174 | + ```sh |
| 175 | + cd ~ |
| 176 | + git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git |
| 177 | + ``` |
| 178 | +1. Install the Cluster Toolkit: |
| 179 | + |
| 180 | + ```sh |
| 181 | + cd cluster-toolkit && git checkout main && make |
| 182 | + ``` |
| 183 | + |
| 184 | +1. Create a Cloud Storage bucket to store the state of the Terraform |
| 185 | + deployment: |
| 186 | + |
| 187 | + ```sh |
| 188 | + gcloud storage buckets create gs://TF_STATE_BUCKET_NAME \ |
| 189 | + --default-storage-class=STANDARD \ |
| 190 | + --location=COMPUTE_REGION \ |
| 191 | + --uniform-bucket-level-access |
| 192 | + gcloud storage buckets update gs://TF_STATE_BUCKET_NAME --versioning |
| 193 | + ``` |
| 194 | + |
| 195 | + Replace the following variables: |
| 196 | + |
| 197 | + * `TF_STATE_BUCKET_NAME`: the name of the new Cloud Storage bucket. |
| 198 | + * `COMPUTE_REGION`: the compute region where you want to store the state of the Terraform deployment. |
| 199 | + |
| 200 | +1. In the [`examples/gke-a4-highgpu/gke-a4-highgpu-deployment.yaml`](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/gke-a4-highgpu/gke-a4-highgpu-deployment.yaml) |
| 201 | + file, replace the following variables in the `terraform_backend_defaults` and |
| 202 | + `vars` sections to match the specific values for your deployment: |
| 203 | + |
| 204 | + * `BUCKET_NAME`: the name of the Cloud Storage bucket you created in the |
| 205 | + previous step to store the state of Terraform deployment. |
| 206 | + * `PROJECT_ID`: your Google Cloud project ID. |
| 207 | + * `COMPUTE_REGION`: the compute region for the cluster. |
| 208 | + * `COMPUTE_ZONE`: the compute zone for the node pool of A4 High machines. |
| 209 | + * `IP_ADDRESS/SUFFIX`: the IP address range that you want to allow to |
| 210 | + connect with the cluster. This CIDR block must include the IP address of |
| 211 | + the machine that calls Terraform. |
| 212 | + * `RESERVATION_NAME`: the name of your reservation. |
| 213 | + * `BLOCK_NAME`: the name of a specific block within the reservation. |
| 214 | + * `NODE_COUNT`: the number of A4 High nodes in your cluster. |
| 215 | + |
| 216 | + To modify advanced settings, edit |
| 217 | + `examples/gke-a4-highgpu/gke-a4-highgpu.yaml`. |
| 218 | + |
| 219 | +1. Generate [Application Default Credentials (ADC)](https://cloud.google.com/docs/authentication/provide-credentials-adc#google-idp) |
| 220 | + to provide access to Terraform. |
| 221 | + |
| 222 | +1. Deploy the blueprint to provision the GKE infrastructure |
| 223 | + using A4 High machine types: |
| 224 | + |
| 225 | + ```sh |
| 226 | + cd ~/cluster-toolkit |
| 227 | + ./gcluster deploy -d \ |
| 228 | + examples/gke-a4-highgpu/gke-a4-highgpu-deployment.yaml \ |
| 229 | + examples/gke-a4-highgpu/gke-a4-highgpu.yaml |
| 230 | + ``` |
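The credential step in the list above does not show a command. With a user account, ADC is commonly generated by running `gcloud auth application-default login`. The following is a minimal sketch that runs the login only when no usable credentials are found. `ensure_adc` is a hypothetical helper name, and the check assumes that `gcloud auth application-default print-access-token` fails when ADC is not configured.

```bash
# Hypothetical helper: generate Application Default Credentials only
# if a token cannot already be obtained from existing credentials.
ensure_adc() {
  if gcloud auth application-default print-access-token >/dev/null 2>&1; then
    echo "ADC already configured"
  else
    # Opens a browser-based login flow for your user account.
    gcloud auth application-default login
  fi
}

# Usage: run once before `./gcluster deploy`.
# ensure_adc
```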
| 231 | + |
| 232 | +## Clean up |
| 233 | + |
| 234 | +To avoid recurring charges for the resources used on this page, clean up the |
| 235 | +resources provisioned by Cluster Toolkit, including the |
| 236 | +VPC networks and GKE cluster: |
| 237 | + |
| 238 | + ```sh |
| 239 | + ./gcluster destroy gke-a4-high/ |
| 240 | + ``` |
| 241 | + |
| 242 | + |
| 243 | +## What's next |
| 244 | + |
| 245 | +Once you have set up your GKE cluster with A4 High node pools, you can proceed to deploy and |
| 246 | +run your [benchmark recipes](../README.md#benchmarks-support-matrix). |
| 247 | + |
| 248 | +## Get help |
| 249 | + |
| 250 | +If you encounter any issues or have questions about this setup, use one of the following |
| 251 | +resources: |
| 252 | + |
| 253 | +- Consult the [official GKE documentation](https://cloud.google.com/kubernetes-engine/docs). |
| 254 | +- Check the issues section of this repository for known problems and solutions. |
| 255 | +- Reach out to Google Cloud support. |