
Commit d9e8da3

Author: Copybara

Copybara import of gpu-recipes:

- df21bfa2ad601ffa7fd05920fb471d47f388a459 Add initial helm chart for nccl tests
- f25b9c7ae253d20ee539828232d6938e4aed9897 vLLM A3Ultra single node serving of DeepSeek R1
- bdb5462570d2e7dfb7ef0789a867ba0e7fe586d9 Multi-host inference recipe for DeepSeek R1 671B with vLL...

GitOrigin-RevId: bdb5462570d2e7dfb7ef0789a867ba0e7fe586d9

1 parent: ffebc26

File tree: 33 files changed (+4338 −4 lines)

README.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -41,13 +41,15 @@ Welcome to the reproducible benchmark recipes repository for GPUs! This reposito
 | Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
 | ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
 | **DeepSeek R1 671B** | [A3 Mega (NVIDIA H100)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) | SGLang | Inference | GKE | [Link](./inference/a3mega/deepseek-r1-671b/sglang-serving-gke/README.md)
+| **DeepSeek R1 671B** | [A3 Mega (NVIDIA H100)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) | vLLM | Inference | GKE | [Link](./inference/a3mega/deepseek-r1-671b/vllm-serving-gke/README.md)
 
 ### Inference benchmarks A3 Ultra
 
 | Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
 | ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
 | **Llama-3.1-405B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a3ultra/llama-3.1-405b/trtllm-inference-gke/single-node/README.md)
 | **DeepSeek R1 671B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | SGLang | Inference | GKE | [Link](./inference/a3ultra/deepseek-r1-671b/sglang-serving-gke/README.md)
+| **DeepSeek R1 671B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | vLLM | Inference | GKE | [Link](./inference/a3ultra/deepseek-r1-671b/vllm-serving-gke/README.md)
 
 
 ## Repository structure
```

inference/a3mega/deepseek-r1-671b/sglang-serving-gke/README.md

Lines changed: 4 additions & 4 deletions
````diff
@@ -187,12 +187,12 @@ The recipe uses the helm chart to run the above steps.
 
 4. To view the logs for the deployment, you can run
    ```bash
-   kubectl logs -f job/$USER-serving-deepseek-r1-model
+   kubectl logs -f service/$USER-serving-deepseek-r1-model-svc
    ```
 
 5. Verify if the deployment has started by running
    ```bash
-   kubectl get deployment/$USER-serving-deepseek-r1-model
+   kubectl get service/$USER-serving-deepseek-r1-model-svc
    ```
 
 6. Once the deployment has started, you will see logs similar to this:
@@ -275,9 +275,9 @@ The recipe uses the helm chart to run the above steps.
    ./stream_chat.sh "Which is bigger 9.9 or 9.11 ?"
    ```
 
-10. To run benchmarks for inference, you can use the default benchamrking tool from SGLang like this
+10. To run benchmarks for inference, you can use the default benchmarking tool from SGLang like this
    ```bash
-   kubectl exec -it $USER-serving-deepseek-r1-model-0 -- /bin/bash -c "python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1100 --random-input 1000 --random-output 1000 --host 0.0.0.0 --port 30000 --output-file /gcs/benchmark_logs/sglang/ds_1000_1000_1100_output.jsonl"
+   kubectl exec -it service/$USER-serving-deepseek-r1-model-svc -- /bin/bash -c "python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1100 --random-input 1000 --random-output 1000 --host 0.0.0.0 --port 30000 --output-file /gcs/benchmark_logs/sglang/ds_1000_1000_1100_output.jsonl"
    ```
 
 Once the benchmark is done, you can find the results in the GCS Bucket. You should see logs similar to this:
````
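These changes address the workload through its Service rather than through a Job or a pod name; `kubectl logs` and `kubectl exec` both accept `service/<name>` and resolve it to one of the pods the Service selects. A quick sanity-check sketch (standard kubectl commands, not taken from the recipe):

```bash
# Confirm the Service exists and see which pod IPs currently back it.
kubectl get service/$USER-serving-deepseek-r1-model-svc
kubectl get endpoints $USER-serving-deepseek-r1-model-svc

# logs/exec via service/<name> pick a single backing pod; to target a
# specific pod instead, list the pods and use its name directly.
kubectl get pods -o wide
```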

inference/a3mega/deepseek-r1-671b/vllm-serving-gke/README.md

Lines changed: 370 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 85 additions & 0 deletions
This new file is a Bash client that streams a chat completion from the model server at `localhost:8000` and splits DeepSeek R1's `<think>` reasoning from the final answer:

```bash
#!/bin/bash

# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


[ $# -eq 0 ] && {
  echo "Error: No prompt provided."
  echo "Usage: $0 \"Your prompt here\""
  exit 1
}

start_time=$(date +%s.%N)
temp_file="/tmp/temp_response.txt"
: > "$temp_file"  # start empty so output from a previous run is not appended

# Format the JSON payload to send to the model, with streaming enabled.
json_payload=$(jq -n \
  --arg prompt "$1" \
  '{
    model: "deepseek-ai/DeepSeek-R1",
    messages: [
      {role: "system", content: "You are a helpful AI assistant"},
      {role: "user", content: $prompt}
    ],
    temperature: 0.6,
    top_p: 0.95,
    max_tokens: 2048,
    stream: true
  }')

echo "Streaming response:"
echo "----------------"

# Send the request to the model and stream the response. Each SSE line is
# prefixed with "data: "; the payload is JSON until the final "[DONE]".
curl -sN "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$json_payload" | while IFS= read -r line; do
    [[ -z $line ]] && continue

    line=${line#data: }
    [[ $line == "[DONE]" ]] && continue

    content=$(jq -r '.choices[0].delta.content // empty' <<< "$line")
    [[ -n $content ]] && {
      echo -n "$content"
      echo -n "$content" >> "$temp_file"
    }
done

echo -e "\n\n----------------"

[[ ! -s $temp_file ]] && {
  echo "Error: No response received from the API or an error occurred during streaming." >&2
  rm -f "$temp_file"
  exit 1
}

# Parse the response and extract the reasoning and the final answer.
full_content=$(<"$temp_file")

[[ $full_content =~ \<think\>([[:print:][:space:]]*)\</think\> ]] && \
  reasoning="${BASH_REMATCH[1]}" || reasoning=""

final_answer=$(sed 's/.*<\/think>//; s/^[[:space:]]*//; s/[[:space:]]*$//' <<< "$full_content")

execution_time=$(bc <<< "$(date +%s.%N) - $start_time")

echo -e "\nParsed Results:"
echo "----------------"
echo -e "Reasoning:\n$reasoning"
echo -e "\nFinal Answer:\n$final_answer"
echo -e "\nExecution time: $execution_time seconds"

rm "$temp_file"
```
Lines changed: 60 additions & 0 deletions
The other new file is a Helm values file for the deployment:

```yaml
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

targetPlatform: "gke"

clusterName:
queue:

huggingface:
  secretName: hf-secret
  secretData:
    token: "hf_api_token"

model:
  name: deepseek-ai/DeepSeek-R1
  tp_size: 8
  pp_size: 2

job:
  image:
    repository:
    tag:
  gpus: 16

volumes:
  ssdMountPath: "/ssd"
  gcsMounts:
  - bucketName:
    mountPath: "/gcs"

gpuPlatformSettings:
  useHostPlugin: false
  ncclPluginImage: "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.8-1"
  rxdmImage: "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.14"
  ncclBuildType: 223

network:
  ncclSettings:
  - name: NCCL_DEBUG
    value: "VERSION"
  subnetworks[]:

vllm:
  replicaCount: 1

service:
  type: ClusterIP
  ports:
    http: 8000
```