Commit b990765

Author: Copybara (committed)
Copybara import of gpu-recipes:
- 99e8186454e10c4bf07dce0f91b0ab0c76ca1a72 Fix README for multinode inference for DeepSeek R1 with A... GitOrigin-RevId: 99e8186454e10c4bf07dce0f91b0ab0c76ca1a72
1 parent eeadb34 commit b990765

File tree

1 file changed: +28 -12 lines

- inference/a3mega/deepseek-r1-671b/sglang-serving-gke


inference/a3mega/deepseek-r1-671b/sglang-serving-gke/README.md

Lines changed: 28 additions & 12 deletions
@@ -2,24 +2,23 @@
 
 This recipe outlines the steps to benchmark inference of a DeepSeek R1 671B model using [SGLang](https://github.com/sgl-project/sglang/tree/main) on an [A3 Mega GKE Node pool](https://cloud.google.com/kubernetes-engine) with multiple nodes.
 
-In order to run this recipe, we use the [LeaderWorkerSet API](https://github.com/kubernetes-sigs/lws) in Kubernetes to spin up multiple nodes and handle distributed inference.
+The recipe uses the [LeaderWorkerSet API](https://github.com/kubernetes-sigs/lws) in Kubernetes to spin up multiple nodes and handle the distributed inference workload. LWS enables treating multiple Pods as a group, simplifying the management of distributed model serving.
 
 ## Orchestration and deployment tools
 
 For this recipe, the following setup is used:
 
 - Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
-- Job configuration and deployment - Helm chart is used to configure and deploy the
-  [Kubernetes Indexed Job](https://kubernetes.io/blog/2021/04/19/introducing-indexed-jobs).
-  This job encapsulates inference of the DeepSeek R1 671B model using SGLang. The chart generates the manifest, adhering to best practices for using GPUDirect-TCPXO with Google Kubernetes Engine (GKE), which includes setting optimal values for NVIDIA NCCL and the TCPXO NCCL plugin.
+- LeaderWorkerSet deployment - A Helm chart is used to configure and deploy multi-node inference
+  using the [LeaderWorkerSet API](https://github.com/kubernetes-sigs/lws), provisioning leader
+  and worker Pods for distributed inference of the DeepSeek R1 671B model using SGLang. The chart generates the manifest, adhering to best practices for using GPUDirect-TCPXO with Google Kubernetes Engine (GKE), which includes setting optimal values for NVIDIA NCCL and the TCPXO NCCL plugin.
 
 ## Prerequisites
 
 Before running this recipe, ensure your environment is configured as follows:
 
 - A GKE cluster with the following setup:
   - An A3 Mega node pool (2 nodes, 16 GPUs)
-  - Topology-aware scheduling enabled
 - An Artifact Registry repository to store the Docker image.
 - A Google Cloud Storage (GCS) bucket to store results.
   *Important: This bucket must be in the same region as the GKE cluster*.
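To make the leader/worker grouping concrete, the kind of LeaderWorkerSet object the chart deploys can be pictured with a minimal sketch. This is illustrative only, not the chart's rendered manifest: the resource name, image, and container spec are placeholder assumptions; only the `leaderworkerset.x-k8s.io/v1` API and the group semantics come from the LWS API.

```shell
# Minimal LeaderWorkerSet sketch (placeholder names and image; not the chart's
# actual output). `size: 2` mirrors the recipe's two A3 Mega nodes: one leader
# Pod plus one worker Pod form a single serving group.
cat <<'EOF' | kubectl apply --server-side -f -
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang-deepseek-r1        # hypothetical name
spec:
  replicas: 1                     # one serving group
  leaderWorkerTemplate:
    size: 2                       # leader + 1 worker = 2 nodes
    workerTemplate:
      spec:
        containers:
        - name: sglang
          image: example.registry/sglang:placeholder   # placeholder image
EOF
```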
@@ -74,7 +73,7 @@ From your client, complete the following steps:
    - `<ARTIFACT_REGISTRY>`: the full name of your Artifact
      Registry in the following format: *LOCATION*-docker.pkg.dev/*PROJECT_ID*/*REPOSITORY*
    - `<SGLANG_IMAGE>`: the name of the SGLang image
-   - `<SGLANG_VERSION>`: the version of the SGLang image
+   - `<SGLANG_VERSION>`: the version of the SGLang image. We recommend running the recipe with SGLang v0.4.3.post2-cu125-srt.
 
 1. Set the default project:
 
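The placeholders above are typically set as environment variables before installing the chart. A hedged sketch, in which every value is an example and not a real resource:

```shell
# Example placeholder values; substitute your own registry, image, and version.
export ARTIFACT_REGISTRY=us-central1-docker.pkg.dev/my-project/my-repo
export SGLANG_IMAGE=sglang
export SGLANG_VERSION=v0.4.3.post2-cu125-srt   # version the recipe recommends

# The chart's job.image.repository and job.image.tag values combine into this
# full image reference:
echo "${ARTIFACT_REGISTRY}/${SGLANG_IMAGE}:${SGLANG_VERSION}"
```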
@@ -155,13 +154,29 @@ The recipe uses the helm chart to run the above steps.
    --dry-run=client -o yaml | kubectl apply -f -
    ```
 
-2. Install the LeaderWorkerSet API (LWS). Please follow the instructions [here](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md#install-a-released-version) to install LWS.
+2. Install the LeaderWorkerSet API (LWS). Follow the instructions [here](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md#install-a-released-version) to install a released version of the LWS API.
+
+   ```bash
+   kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml
+   ```
+
+   Validate that the LeaderWorkerSet controller is running in the `lws-system` namespace using the following command:
+
+   ```bash
+   kubectl get pod -n lws-system
+   ```
+
+   The output is similar to the following:
+
+   ```bash
+   NAME                                      READY   STATUS    RESTARTS   AGE
+   lws-controller-manager-56956867cb-4km9g   1/1     Running   0          24h
+   ```
 
 3. Install the helm chart to prepare the model.
 
    ```bash
    cd $RECIPE_ROOT
-   /usr/local/bin/helm/helm install -f values.yaml \
+   helm install -f values.yaml \
    --set job.image.repository=${ARTIFACT_REGISTRY}/${SGLANG_IMAGE} \
    --set clusterName=${CLUSTER_NAME} \
    --set job.image.tag=${SGLANG_VERSION} \
@@ -201,6 +216,7 @@ The recipe uses the helm chart to run the above steps.
    ```bash
    kubectl port-forward svc/$USER-serving-deepseek-r1-model-svc 30000:30000
    ```
+
 8. Make the API requests to the service.
 
    ```bash
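Once the port-forward above is active, a request can be sent to the forwarded port. A hedged sketch, assuming SGLang's OpenAI-compatible chat completions endpoint; the model name and prompt are placeholders, not values taken from the recipe:

```shell
# Send a chat completion request through the forwarded port (assumes the
# port-forward is running; model name and prompt are placeholders).
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```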
@@ -306,7 +322,7 @@ To clean up the resources created by this recipe, complete the following steps:
 1. Uninstall the helm chart.
 
    ```bash
-   /usr/local/bin/helm/helm uninstall $USER-serving-deepseek-r1-model
+   helm uninstall $USER-serving-deepseek-r1-model
    ```
 
 2. Delete the Kubernetes Secret.
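The secret deletion in step 2 can be sketched as follows. The secret name here is a hypothetical placeholder; use the name of the secret created earlier in the recipe:

```shell
# Delete the Kubernetes Secret created earlier in the recipe.
# "hf-token-secret" is a hypothetical placeholder name.
kubectl delete secret hf-token-secret
```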
@@ -317,12 +333,12 @@ To clean up the resources created by this recipe, complete the following steps:
 
 ### Running the recipe on a cluster that does not use the default configuration
 
-If you created your cluster using the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a3-ultra.md), it is configured with default settings that include the names for networks and subnetworks used for communication between:
+If you created your cluster using the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a3-mega.md), it is configured with default settings that include the names of the networks and subnetworks used for communication between:
 
-- The host to external services.
+- The host and external services.
 - GPU-to-GPU communication.
 
-For clusters with this default configuration, the Helm chart can automatically generate the [required networking annotations in a Pod's metadata](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom#configure-pod-manifests-rdma). Therefore, you can use the streamlined command to install the chart, as described in the [Single A3 Ultra Node Benchmarking using FP8 Quantization](#single-a3-ultra-node-benchmarking-using-fp8-quantization) section.
+For clusters with this default configuration, the Helm chart can automatically generate the [required networking annotations in a Pod's metadata](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom#configure-pod-manifests-rdma). Therefore, you can use the streamlined command to install the chart, as described in the [Multi node inference benchmark of DeepSeek R1 671B with SGLang on A3 Mega GKE Node Pool](#multi-node-inference-benchmark-of-deepseek-r1-671b-with-sglang-on-a3-mega-gke-node-pool) section.
 
 To configure the correct networking annotations for a cluster that uses non-default names for GKE Network resources, you must provide the names of the GKE Network resources in your cluster when installing the chart. Use the following example command, remembering to replace the example values with the actual names of your cluster's GKE Network resources:
 