---
date: '2025-03-09T09:30:00-08:00'
draft: false
title: 'DeepSeek-R1 671B Multi-Host Deployment in AIBrix'
disableShare: true
hideSummary: true
searchHidden: false
ShowReadingTime: false
ShowWordCount: false
ShowBreadCrumbs: true
ShowPostNavLinks: true
ShowRssButtonInSectionTermList: false
UseHugoToc: true
ShowToc: true
tocopen: true
---

This blog post introduces how to deploy DeepSeek-R1 with AIBrix. DeepSeek-R1 demonstrates remarkable proficiency in reasoning tasks thanks to its step-by-step training process. It has 671B total parameters, of which 37B are active per token, and supports a 128K context length. Because of its sheer size, however, the deployment process is more complex, and AIBrix provides the tooling needed to deploy and manage such distributed inference services efficiently.

![deepseek-r1](/images/deepseek-r1/deepseek-performance.jpg)

ref: https://huggingface.co/deepseek-ai/DeepSeek-R1/resolve/main/figures/benchmark.jpg

## Prerequisites

Before deploying DeepSeek-R1 on AIBrix, a few preliminary tasks must be completed, such as downloading the model weights to object storage or a shared file system and building a customized container image. This blog focuses on the critical steps rather than covering every detail; check our [code samples and tutorial](https://github.com/vllm-project/aibrix/tree/main/samples/deepseek-r1) for more details.

### Cluster Configuration

DeepSeek-R1 671B requires 16 GPUs with at least 80 GB of memory each. We used the following instance specifications for testing; you can use a similar setup in your environment (a quick node capacity check is sketched after the list).

- Cloud: Volcano Engine
- Instance: ecs.ebmhpcpni3l.48xlarge * 2
- CPU: 192 vCPU
- Memory: 2048 GiB DRAM
- GPU: NVIDIA H20-SXM5-96GB * 8
- Network: 400 Gbps * 8 RDMA + 96 Gbps
- Disk: Local NVMe 3576 GiB * 4
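
Before moving on, it is worth confirming that each node actually exposes eight allocatable GPUs to Kubernetes. The check below is a generic sketch, not specific to Volcano Engine.

```bash
# Each of the two GPU nodes should report 8 allocatable NVIDIA GPUs.
kubectl describe nodes | grep -E "^Name:|nvidia.com/gpu"
```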

### vLLM Image

The image used for this deployment is `aibrix/vllm-openai:v0.7.3.self.post1`, a custom image built by AIBrix. There are two main reasons for this custom build:

- In the upstream v0.7.3, there was an [issue](https://github.com/vllm-project/vllm/issues/13136) related to legacy NCCL versions that caused occasional system hangs. We addressed this by upgrading `nvidia-nccl-cu12` to 2.25.1 to improve communication stability.
- In v0.7.3, a regression related to Ray overwrote our previous modifications in vLLM. To mitigate this, we reintroduced `ray[default,adag]` to provide better probe support for high availability and fault detection.

If you would like to build the image yourself, you can use the following Dockerfile.
47+
48+
```dockerfile
49+
FROM vllm/vllm-openai:v0.7.3
50+
RUN pip3 install -U ray[default,adag]==2.40.0
51+
RUN pip3 install -U nvidia-nccl-cu12
52+
ENTRYPOINT [""]
53+
```
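
Building and pushing the image is then the usual Docker workflow. A minimal sketch is shown below; `your-registry.example.com` is a placeholder for your own registry.

```bash
# Build the custom image from the Dockerfile above and push it to your own registry.
# "your-registry.example.com" is a placeholder; replace it with your actual registry.
docker build -t your-registry.example.com/aibrix/vllm-openai:v0.7.3.self.post1 .
docker push your-registry.example.com/aibrix/vllm-openai:v0.7.3.self.post1
```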

> Note: For users in China, you can prefix the image name with `aibrix-container-registry-cn-beijing.cr.volces.com/` to pull the image from our registry.
> For instance, instead of just `aibrix/vllm-openai:v0.7.3.self.post1`, use `aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.7.3.self.post1`. The same rule applies to `aibrix/runtime:v0.2.1`.

### Model Weights

Users can select different storage options for the [model weights](https://huggingface.co/deepseek-ai/DeepSeek-R1) according to their cloud service providers. Here, we will discuss four common scenarios:

- **HuggingFace**: A pod can pull the model weights directly from HuggingFace. However, this is not recommended for DeepSeek-R1: the varying tensor sizes result in numerous random reads, which significantly reduce network and I/O efficiency.
- **Persistent Volume**: Cloud providers such as AWS with Lustre or Google Cloud offer persistent disks through their [Container Storage Interface (CSI)](https://kubernetes-csi.github.io/docs/). Users can simply mount a [Persistent Volume Claim (PVC)](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) into the pod for seamless access to the model weights stored on these disks.
- **Object Storage with AI Runtime**: Users can store the model weights in object storage services such as Amazon S3 or Google Cloud Storage (GCS). In this case, the AIBrix AI Runtime automatically downloads the model to the host volume. This approach offers flexibility and scalability, leveraging the advantages of object storage for large amounts of data.
- **Local Disk**: For local disk storage, an additional step is required to download the model weights onto the local disk; a minimal download sketch follows the table below. We assume a local volume is available and can be mounted into the pod. This option may suit environments where local storage offers performance benefits, or where there are specific security or latency requirements.

| Storage Options | Description | Sample Files |
| --- | --- | --- |
| HuggingFace | no volume needed | [Link](https://github.com/vllm-project/aibrix/tree/main/samples/deepseek-r1/deepseek-r1-huggingface.yaml) |
| Persistent Volume | models volume, PVC | [Link](https://github.com/vllm-project/aibrix/tree/main/samples/deepseek-r1/deepseek-r1-pvc.yaml) |
| Object Storage (S3/GCS) with AIBrix AI Runtime | models volume, HostPath | [Link](https://github.com/vllm-project/aibrix/tree/main/samples/deepseek-r1/deepseek-r1-ai-runtime.yaml) |
| Local Disk | models volume, HostPath + InitContainer | [Link](https://github.com/vllm-project/aibrix/tree/main/samples/deepseek-r1/deepseek-r1-local-nvme.yaml) |
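
For the Local Disk option (or a PVC you populate yourself), the weights need to be placed on the volume ahead of time. Below is a minimal sketch using `huggingface-cli`; the target directory `/mnt/nvme0/models/deepseek-ai/DeepSeek-R1` is an assumption and should match whatever path your HostPath or PVC exposes.

```bash
# Minimal sketch: pre-download the weights onto local NVMe (or a mounted PVC).
# The target directory is an assumption; use the path your volume actually exposes.
pip3 install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-R1 \
  --local-dir /mnt/nvme0/models/deepseek-ai/DeepSeek-R1
```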

### High-Performance Network

To get the best network communication performance out of RDMA, the pod spec needs to be configured for it: the `k8s.volcengine.com/pod-networks` annotation is set as shown below, and `vke.volcengine.com/rdma: "8"` is requested at the pod resource level. This is just one example on Volcano Engine cloud; you will need to make the corresponding changes for your own cloud environment.

```yaml
k8s.volcengine.com/pod-networks: |
  [
    {
      "cniConf":{
          "name":"rdma"
      }
    },
    ....
    {
      "cniConf":{
          "name":"rdma"
      }
    }
  ]
```

Besides that, we also need the `IPC_LOCK` capability and shared memory support (typically an `emptyDir` volume with `medium: Memory` mounted at `/dev/shm`).

```yaml
securityContext:
  capabilities:
    add:
      - IPC_LOCK
```
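
As an optional sanity check, you can confirm that the GPU nodes advertise the RDMA extended resource requested above and, once the pods are running, that the RDMA devices are visible from inside a container. The pod name below is illustrative, and `ibv_devices` assumes the rdma-core tools are present in the image.

```bash
# Each GPU node should advertise the RDMA extended resource (8 per node in this setup).
kubectl describe nodes | grep -E "^Name:|vke.volcengine.com/rdma"
# After deployment, optionally list RDMA devices inside a pod (requires rdma-core tools in the image).
kubectl exec -it deepseek-r1-671b-head-xxxxx -- ibv_devices
```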

## Installation

AIBrix [v0.2.1](https://github.com/vllm-project/aibrix/releases/tag/v0.2.1) is required for multi-node deployment. When deploying AIBrix, note that the AIBrix images are mainly hosted on DockerHub, so deploying in environments with restricted DockerHub access can be challenging. To work around this, check our tutorial on overriding the control plane images with your own registry, which enables a smooth deployment of a customized AIBrix.

Keep in mind that some aspects depend on your cloud environment, for example the `ReadWriteMany` volume provisioner or local disks. We use [Volcano Cloud](https://www.volcengine.com/) as the reference here, but users are advised to check their own cloud infrastructure. While we can offer some general recommendations, limited resources have kept us from testing every cloud platform thoroughly, and we encourage the community to contribute by submitting a Pull Request (PR) to help improve support for different clouds.

```bash
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-dependency-v0.2.1.yaml
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-core-v0.2.1.yaml
```
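
Before deploying the model, it is worth verifying that the control plane came up cleanly. The namespaces below reflect a default AIBrix and Envoy Gateway installation; adjust them if your setup differs.

```bash
# All AIBrix control plane pods and the Envoy Gateway pods should reach Running state.
kubectl get pods -n aibrix-system
kubectl get pods -n envoy-gateway-system
```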

## How AIBrix Supports DeepSeek-R1

AIBrix plays a crucial role in supporting the DeepSeek-R1 671B model deployment.
It provides a comprehensive platform that enables distributed orchestration, efficient traffic routing, and intelligent scaling capabilities. These features are essential for handling the large-scale and resource-intensive nature of the DeepSeek-R1 671B model.

Before we jump into the deployment, we will briefly cover the `RayClusterFleet`, `Gateway Plugin`, and `Autoscaler` capabilities relevant to this case.

![deepseek-r1](/images/deepseek-r1/deepseek-deployment.png)

`RayClusterFleet` plays a pivotal role in managing the distributed inference orchestration. It provisions pods and constructs a Ray cluster, within which the vLLM server is launched. Each mini Ray cluster thus constitutes an inference replica.

In a multi-node environment, the vLLM HTTP server runs only on the head node; the remaining GPU nodes function as workers, with no HTTP service running on them. Correspondingly, the AIBrix router routes requests **exclusively** to the head node, and the autoscaler fetches metrics **solely** from the service pod.
This distributed configuration ensures that the orchestration, routing, and autoscaling mechanisms operate effectively. By managing multi-node setups in a manner analogous to single-node operations, it streamlines the overall deployment process for very large models like DeepSeek-R1.

## Model Deployment

First, make sure you adjust the network and object storage configuration, for example when using [S3](https://aibrix.readthedocs.io/latest/features/runtime.html#download-from-s3). `DOWNLOADER_ALLOW_FILE_SUFFIX` has to be set to `json, safetensors, py` for DeepSeek-R1.

Then run the following commands to deploy the model and the associated KV-cache-based autoscaling strategy. Depending on the network speed between compute and object storage, the deployment may take up to 20 minutes.

```bash
kubectl apply -f deepseek-r1-ai-runtime.yaml
kubectl apply -f deepseek-r1-autoscaling.yaml
```

After a while, you should see running pods similar to the ones shown below.

```bash
kubectl get pods
NAME                                                          READY   STATUS    RESTARTS   AGE
deepseek-r1-671b-7ffb754f75-ggnzf-head-7xr6q                  1/1     Running   0          25m
deepseek-r1-671b-7ffb754f75-ggnzf-worker-group-worker-gj456   1/1     Running   0          25m
```
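
If the pods stay not ready for a long time, following the head pod's logs is the easiest way to see whether the weights are still downloading or the engine is still loading them. The pod name below comes from the listing above and will differ in your cluster.

```bash
# Follow the head pod's logs until the vLLM OpenAI-compatible server reports it is serving.
kubectl logs -f deepseek-r1-671b-7ffb754f75-ggnzf-head-7xr6q
```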

## Send Requests

Expose the endpoint with one of the following options.

```bash
# Option 1: Kubernetes cluster with LoadBalancer support
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"

# Option 2: Dev environment without LoadBalancer support, use port forwarding instead
kubectl -n envoy-gateway-system port-forward service/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &
ENDPOINT="localhost:8888"
```

```bash
curl http://${ENDPOINT}/v1/chat/completions \
  -H "Content-Type: application/json" -H "routing-strategy: least-request" \
  -d '{
    "model": "deepseek-r1-671b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'
```
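
For interactive use you will usually want tokens streamed back as they are generated. The request below is the same call with the standard OpenAI-compatible `stream` flag, which vLLM supports; the response then arrives as server-sent events.

```bash
# Same endpoint, but streaming the completion back as server-sent events (SSE).
curl -N http://${ENDPOINT}/v1/chat/completions \
  -H "Content-Type: application/json" -H "routing-strategy: least-request" \
  -d '{
    "model": "deepseek-r1-671b",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Summarize the rules of chess in two sentences."}
    ]
  }'
```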

> Note: The `-H "routing-strategy: least-request"` header can be removed if you would like to use the default Kubernetes routing strategy.

You should see a response similar to the one below.

```bash
{"id":"chatcmpl-d26583d2-96e5-42c4-a322-133c7d0e505d","object":"chat.completion","created":1740967604,"model":"deepseek-r1-671b","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nOkay, the user is asking which team won the World Series in 2020. Let me recall, the World Series is the championship series of Major League Baseball (MLB) in the United States. I remember that 2020 was a unique year because of the COVID-19 pandemic, which affected the schedule and format of the season. The season was shortened, and there were some changes to the playoff structure.\n\nI think the Los Angeles Dodgers won the World Series around that time. Let me verify. The 2020 World Series was held at a neutral site, which was Globe Life Field in Arlington, Texas, to minimize travel and reduce the risk of COVID-19 spread. The Dodgers faced the Tampa Bay Rays. The Dodgers were led by players like Mookie Betts, Corey Seager, and Clayton Kershaw. They won the series in six games. The clinching game was Game 6, where the Dodgers beat the Rays 3-1. That victory gave the Dodgers their first title since 1988, ending a long drought.\n\nWait, let me make sure I got the opponent right. Was it the Rays or another team? Yes, I'm pretty confident it was the Rays because earlier in the playoffs, teams like the Braves and Dodgers were in the National League, while the Rays were the American League champions. The Rays had a strong team with players like Randy Arozarena, who had a standout postseason. But the Dodgers ultimately triumphed. So the answer should be the Los Angeles Dodgers. Let me double-check a reliable source if I'm unsure. Confirming now... yes, the Dodgers won the 2020 World Series against the Tampa Bay Rays in six games. So the user needs to know both the winner and maybe a bit of context, like it being in a neutral location. Okay, ready to provide a concise answer with those details.\n</think>\n\nThe Los Angeles Dodgers won the 2020 World Series, defeating the Tampa Bay Rays in six games. This championship marked the Dodgers' first title since 1988. Notably, the 2020 series was held at Globe Life Field in Arlington, Texas—a neutral site—due to COVID-19 health and safety protocols.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":19,"total_tokens":472,"completion_tokens":453,"prompt_tokens_details":null},"prompt_logprobs":null}%
```

## Monitoring

We assume you have [Prometheus](https://prometheus.io/) set up in the cluster; you can then deploy a `ServiceMonitor` to let it scrape metrics from the DeepSeek deployment.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: deepseek-r1-svc-discover
  namespace: default
  labels:
    volcengine.vmp: "true"
spec:
  endpoints:
    - port: service
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      ray.io/node-type: head
```
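
If Prometheus does not pick up the target, you can hit the metrics endpoint directly. The sketch below assumes vLLM's default metrics port 8000 on the head pod and the usual `/metrics` path; adjust the pod name and port to your setup.

```bash
# Port-forward the head pod and confirm vLLM metrics are exposed (default port 8000 assumed).
kubectl port-forward deepseek-r1-671b-7ffb754f75-ggnzf-head-7xr6q 8000:8000 &
curl -s localhost:8000/metrics | grep "^vllm:" | head
```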

You can use our pre-built [dashboard](https://github.com/vllm-project/aibrix/tree/main/samples/deepseek-r1/static/AIBrix%20Engine%20Dashboard%20(vLLM)-1741078999667.json) to view your model's performance.

![dashboard](/images/deepseek-r1/deepseek-dashboard.png)

> Note: After importing the dashboard into Grafana, you may need to make minor changes, such as adjusting `labels` in the PromQL queries, to match your Prometheus setup.

## Questions?

If you have any questions, please feel free to reach out to us in the Slack channel [#AIBrix](https://vllm-dev.slack.com/archives/C08EQ883CSV). We'd love to support your use case!

- GitHub Code: [https://github.com/vllm-project/aibrix](https://github.com/vllm-project/aibrix)
- GitHub Issues Page: [Issues](https://github.com/vllm-project/aibrix/issues)
- Slack Channel: [#AIBrix](https://vllm-dev.slack.com/archives/C08EQ883CSV)