- 99e8186454e10c4bf07dce0f91b0ab0c76ca1a72 Fix README for multinode inference for DeepSeek R1 with A...
GitOrigin-RevId: 99e8186454e10c4bf07dce0f91b0ab0c76ca1a72
File changed: inference/a3mega/deepseek-r1-671b/sglang-serving-gke/README.md (+28/-12 lines)
```diff
@@ -2,24 +2,23 @@
 This recipe outlines the steps to benchmark inference of a DeepSeek R1 671B model using [SGLang](https://github.com/sgl-project/sglang/tree/main) on an [A3 Mega GKE Node pool](https://cloud.google.com/kubernetes-engine) with multiple nodes.
 
-In order to run this recipe, we use the [LeaderWorkerSet API](https://github.com/kubernetes-sigs/lws) in Kubernetes to spin up multiple nodes and handle distributed inference.
-Job configuration and deployment - Helm chart is used to configure and deploy the
-[Kubernetes Index Job](https://kubernetes.io/blog/2021/04/19/introducing-indexed-jobs).
-This job encapsulates inference of DeepSeek R1 671B model using SGLang. The chart generates the manifest, adhering to best practices for using GPUDirect-TCPXO with Google Kubernetes Engine (GKE), which includes setting optimal values for NVIDIA NCCL and the TCPXO NCCL plugin.
+The recipe uses the [LeaderWorkerSet API](https://github.com/kubernetes-sigs/lws) in Kubernetes to spin up multiple nodes and handle the distributed inference workload. LWS enables treating multiple Pods as a group, simplifying the management of distributed model serving.
+LeaderWorkerSet Deployment - Helm chart is used to configure and deploy multi-node inference
+using the [LeaderWorkerSet API](https://github.com/kubernetes-sigs/lws), provisioning leader
+and worker pods for distributed inference of the DeepSeek R1 671B model using SGLang. The chart generates the manifest, adhering to best practices for using GPUDirect-TCPXO with Google Kubernetes Engine (GKE), which includes setting optimal values for NVIDIA NCCL and the TCPXO NCCL plugin.
 
 ## Prerequisites
 
 Before running this recipe, ensure your environment is configured as follows:
 
 - A GKE cluster with the following setup:
   - An A3 Mega node pool (2 nodes, 16 GPUs)
-  - Topology-aware scheduling enabled
 - An Artifact Registry repository to store the Docker image.
 - A Google Cloud Storage (GCS) bucket to store results.
   *Important: This bucket must be in the same region as the GKE cluster*.
```
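The LeaderWorkerSet pattern introduced above can be illustrated with a minimal manifest sketch: one replica group whose leader and worker Pods are created and managed together. This is a hypothetical illustration only — the resource name, group size, and image placeholders are assumptions, not the Helm chart's actual generated manifest:

```yaml
# Hypothetical LWS sketch -- not the chart's generated output.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang-deepseek-r1        # hypothetical name
spec:
  replicas: 1                     # one group serving the model
  leaderWorkerTemplate:
    size: 2                       # 2 A3 Mega nodes (8 GPUs each) per group
    leaderTemplate:
      spec:
        containers:
        - name: sglang-leader
          image: <SGLANG_IMAGE>:<SGLANG_VERSION>
    workerTemplate:
      spec:
        containers:
        - name: sglang-worker
          image: <SGLANG_IMAGE>:<SGLANG_VERSION>
```

The key point is `leaderWorkerTemplate.size`: LWS schedules the leader and its workers as one unit, which is what makes multi-node tensor-parallel serving manageable as a single Kubernetes object.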
```diff
@@ -74,7 +73,7 @@ From your client, complete the following steps:
 - `<ARTIFACT_REGISTRY>`: the full name of your Artifact
   Registry in the following format: *LOCATION*-docker.pkg.dev/*PROJECT_ID*/*REPOSITORY*
 - `<SGLANG_IMAGE>`: the name of the SGLang image
-- `<SGLANG_VERSION>`: the version of the SGLang image
+- `<SGLANG_VERSION>`: the version of the SGLang image. We recommend running the recipe with SGLang v0.4.3.post2-cu125-srt.
 
 1. Set the default project:
```
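The placeholders above combine into a single image reference. A quick shell sketch, assuming made-up project, location, and repository names:

```shell
# Hypothetical example values -- substitute your own.
ARTIFACT_REGISTRY=us-central1-docker.pkg.dev/my-project/my-repo
SGLANG_IMAGE=sglang
SGLANG_VERSION=v0.4.3.post2-cu125-srt

# Full image reference used when building/pushing and in the manifests.
IMAGE="${ARTIFACT_REGISTRY}/${SGLANG_IMAGE}:${SGLANG_VERSION}"
echo "${IMAGE}"
# -> us-central1-docker.pkg.dev/my-project/my-repo/sglang:v0.4.3.post2-cu125-srt
```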
````diff
@@ -155,13 +154,29 @@ The recipe uses the helm chart to run the above steps.
     --dry-run=client -o yaml | kubectl apply -f -
   ```
 
-2. Install the LeaderWorkerSet API (LWS). Please follow the instructions [here](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md#install-a-released-version) to install LWS.
+2. Install the LeaderWorkerSet API (LWS). Please follow the instructions [here](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md#install-a-released-version) to install a specific version of the LWS API.
````
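The pinned-version install in step 2 can be sketched as follows. The release tag below is a placeholder and the manifest URL pattern is taken from the linked LWS install guide, so verify both against that page before use:

```shell
# Pin an LWS release rather than tracking main (tag is a placeholder).
LWS_VERSION=v0.5.1

# Release-manifest URL pattern from the LWS install guide.
LWS_MANIFEST="https://github.com/kubernetes-sigs/lws/releases/download/${LWS_VERSION}/manifests.yaml"
echo "${LWS_MANIFEST}"

# Apply against your cluster (needs kubectl access; shown for illustration):
# kubectl apply --server-side -f "${LWS_MANIFEST}"
```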
```diff
@@ -317,12 +333,12 @@ To clean up the resources created by this recipe, complete the following steps:
 
 ### Running the recipe on a cluster that does not use the default configuration.
 
-If you created your cluster using the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a3-ultra.md), it is configured with default settings that include the names for networks and subnetworks used for communication between:
+If you created your cluster using the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a3-mega.md), it is configured with default settings that include the names for networks and subnetworks used for communication between:
 
-- The host to external services.
+- The host to external services.
 - GPU-to-GPU communication.
 
-For clusters with this default configuration, the Helm chart can automatically generate the [required networking annotations in a Pod's metadata](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom#configure-pod-manifests-rdma). Therefore, you can use the streamlined command to install the chart, as described in the the [Single A3 Ultra Node Benchmarking using FP8 Quantization](#single-a3-ultra-node-benchmarking-using-fp8-quantization) section.
+For clusters with this default configuration, the Helm chart can automatically generate the [required networking annotations in a Pod's metadata](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom#configure-pod-manifests-rdma). Therefore, you can use the streamlined command to install the chart, as described in the [Multi node inference benchmark of DeepSeek R1 671B with SGLang on A3 Mega GKE Node Pool](#multi-node-inference-benchmark-of-deepseek-r1-671b-with-sglang-on-a3-mega-gke-node-pool) section.
 
 To configure the correct networking annotations for a cluster that uses non-default names for GKE Network resources, you must provide the names of the GKE Network resources in your cluster when installing the chart. Use the following example command, remembering to replace the example values with the actual names of your cluster's GKE Network resources:
```