
Commit d7724fb

dtrawins and pgladkows authored
multi GPU scalability (openvinotoolkit#3155) (openvinotoolkit#3230)
* multi GPU scalability (openvinotoolkit#3155)
* Update demos/continuous_batching/scaling/README.md

Co-authored-by: Patrycja Gładkowska <[email protected]>
1 parent 795fb09 commit d7724fb

File tree

1 file changed: +160 -15 lines changed

demos/continuous_batching/scaling/README.md

Lines changed: 160 additions & 15 deletions
@@ -1,4 +1,6 @@
-# Scaling on a dual CPU socket server {#ovms_demos_continuous_batching_scaling}
+# Scaling on a dual CPU socket server and multi-GPU hosts {#ovms_demos_continuous_batching_scaling}
+
+## Scaling on dual CPU sockets
 
 > **Note**: This demo uses Docker and has been tested only on Linux hosts
 
@@ -10,7 +12,7 @@ It deploys 6 instances of the model server allocated to different NUMA nodes on
 
 ![drawing](./loadbalancing.png)
 
-## Start the Model Server instances
+### Start the Model Server instances
 
 Let's assume we have a dual CPU socket server with six NUMA nodes.
 ```bash
@@ -23,21 +25,31 @@ NUMA node3 CPU(s): 96-127,288-319
 NUMA node4 CPU(s): 128-159,320-351
 NUMA node5 CPU(s): 160-191,352-383
 ```
-Following the prework from [demo](../README.md) start the instances like below:
+
+Export the model:
+```bash
+curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/1/demos/common/export_models/export_model.py -o export_model.py
+pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/1/demos/common/export_models/requirements.txt
+mkdir models
+python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --model_name Meta-Llama-3-8B-Instruct_FP16 --weight-format fp16 --model_repository_path models
+```
+
 ```bash
-docker run --cpuset-cpus $(lscpu | grep node0 | cut -d: -f2) -d --rm -p 8003:8003 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8003 --config_path /workspace/config.json
-docker run --cpuset-cpus $(lscpu | grep node1 | cut -d: -f2) -d --rm -p 8004:8004 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8004 --config_path /workspace/config.json
-docker run --cpuset-cpus $(lscpu | grep node2 | cut -d: -f2) -d --rm -p 8005:8005 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8005 --config_path /workspace/config.json
-docker run --cpuset-cpus $(lscpu | grep node3 | cut -d: -f2) -d --rm -p 8006:8006 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8006 --config_path /workspace/config.json
-docker run --cpuset-cpus $(lscpu | grep node4 | cut -d: -f2) -d --rm -p 8007:8007 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8007 --config_path /workspace/config.json
-docker run --cpuset-cpus $(lscpu | grep node5 | cut -d: -f2) -d --rm -p 8008:8008 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8008 --config_path /workspace/config.json
+docker run --cpuset-cpus $(lscpu | grep node0 | cut -d: -f2) -d --rm -p 8003:8003 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_FP16:/model:ro openvino/model_server:latest --rest_port 8003 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
+docker run --cpuset-cpus $(lscpu | grep node1 | cut -d: -f2) -d --rm -p 8004:8004 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_FP16:/model:ro openvino/model_server:latest --rest_port 8004 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
+docker run --cpuset-cpus $(lscpu | grep node2 | cut -d: -f2) -d --rm -p 8005:8005 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_FP16:/model:ro openvino/model_server:latest --rest_port 8005 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
+docker run --cpuset-cpus $(lscpu | grep node3 | cut -d: -f2) -d --rm -p 8006:8006 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_FP16:/model:ro openvino/model_server:latest --rest_port 8006 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
+docker run --cpuset-cpus $(lscpu | grep node4 | cut -d: -f2) -d --rm -p 8007:8007 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_FP16:/model:ro openvino/model_server:latest --rest_port 8007 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
+docker run --cpuset-cpus $(lscpu | grep node5 | cut -d: -f2) -d --rm -p 8008:8008 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_FP16:/model:ro openvino/model_server:latest --rest_port 8008 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
 ```
 Confirm in the logs that the containers loaded the models successfully.
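Before putting the instances behind a load balancer, a single one can be queried directly to confirm it serves the model. A minimal check using `curl`, assuming the OpenAI-compatible `/v3/chat/completions` endpoint that the benchmark commands below also use:

```bash
# Send one chat completion request to the instance pinned to NUMA node0 (port 8003).
curl -s http://localhost:8003/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 20
      }'
```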
 
-## Start Nginx load balancer
+### Start Nginx load balancer
 
-The configuration below is a basic example distributing the clients between two started instances.
+The configuration below is a basic example distributing the clients between six started instances.
 ```
+worker_processes 16;
+worker_rlimit_nofile 40000;
 events {
     worker_connections 10000;
 }
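The hunk above shows only the top of the Nginx configuration; the unchanged remainder defines the upstream pool. For reference, a sketch of a complete six-instance `nginx.conf`, following the pattern of the four-instance config added later in this commit (ports 8003-8008 mirror the containers started above):

```bash
# Generate a six-instance nginx.conf; the structure follows the four-instance
# example added later in this commit, extended to ports 8003-8008.
cat > nginx.conf <<'EOF'
worker_processes 16;
worker_rlimit_nofile 40000;
events {
    worker_connections 10000;
}
stream {
    upstream ovms-cluster {
        least_conn;
        server localhost:8003;
        server localhost:8004;
        server localhost:8005;
        server localhost:8006;
        server localhost:8007;
        server localhost:8008;
    }
    server {
        listen 80;
        proxy_pass ovms-cluster;
    }
}
EOF
```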
@@ -63,10 +75,15 @@ Start the Nginx container with:
 docker run -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro -d --net=host -p 80:80 nginx
 ```
 
-## Testing the scalability
+### Testing the scalability
 
-Start benchmarking script like in [demo](../README.md), pointing to the load balancer port and host.
+Let's use the benchmark_serving.py script from the vLLM repository:
 ```bash
+git clone --branch v0.7.3 --depth 1 https://github.com/vllm-project/vllm
+cd vllm
+pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
+cd benchmarks
+curl -L https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json
 python benchmark_serving.py --host localhost --port 80 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 6000 --request-rate 20
 Initial test run completed. Starting main benchmark run...
 Traffic request rate: 20
@@ -89,6 +106,134 @@ Median TPOT (ms): 238.49
 P99 TPOT (ms): 261.74
 ```
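To quantify the gain from scaling, the same benchmark can also be run against a single instance instead of the load balancer. A sketch, reusing the command above but pointing at port 8003 and with a smaller prompt count (both choices are illustrative):

```bash
# Baseline run against one model server instance (port 8003) rather than the
# Nginx load balancer on port 80.
python benchmark_serving.py --host localhost --port 8003 \
  --endpoint /v3/chat/completions --backend openai-chat \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 --request-rate 20
```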
 
-# Scaling in Kubernetes
+## Scaling horizontally on a multi GPU host
+
+Throughput scalability on multi-GPU systems can be achieved by starting multiple instances, one assigned to each card. The commands below were executed on a host with 4 Battlemage B580 GPU cards.
+
+### Start the Model Server instances
+
+```bash
+ls -1 /dev/dri/
+by-path
+card0
+card1
+card2
+card3
+renderD128
+renderD129
+renderD130
+renderD131
+```
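Each `renderD12x` node corresponds to one of the GPUs. If it is unclear which render node maps to which physical card, the `by-path` entries listed above can be inspected with standard Linux tooling:

```bash
# The by-path symlinks show the PCI address behind each card/render node.
ls -l /dev/dri/by-path/
```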
+
+Export the model:
+```bash
+python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --model_name Meta-Llama-3-8B-Instruct_INT4 --weight-format int4 --model_repository_path models --target_device GPU --cache 4
+```
+
+```bash
+docker run --device /dev/dri/renderD128 -d --rm -p 8003:8003 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest --rest_port 8003 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
+docker run --device /dev/dri/renderD129 -d --rm -p 8004:8004 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest --rest_port 8004 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
+docker run --device /dev/dri/renderD130 -d --rm -p 8005:8005 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest --rest_port 8005 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
+docker run --device /dev/dri/renderD131 -d --rm -p 8006:8006 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest --rest_port 8006 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
+```
+Confirm in the logs that the containers loaded the models successfully.
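Besides the logs, a quick way to verify that all four containers are running and publishing their ports is a standard Docker listing:

```bash
# List running model server containers together with their published ports.
docker ps --filter ancestor=openvino/model_server:latest --format '{{.Names}}\t{{.Ports}}\t{{.Status}}'
```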
+
+### Start Nginx load balancer
+
+The configuration below is a basic example distributing the clients between the four started instances.
+```
+worker_processes 16;
+worker_rlimit_nofile 40000;
+events {
+    worker_connections 10000;
+}
+stream {
+    upstream ovms-cluster {
+        least_conn;
+        server localhost:8003;
+        server localhost:8004;
+        server localhost:8005;
+        server localhost:8006;
+    }
+    server {
+        listen 80;
+        proxy_pass ovms-cluster;
+    }
+}
+
+```
+Start the Nginx container with:
+```bash
+docker run -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro -d --net=host -p 80:80 nginx
+```
+
+### Testing the scalability
+
+Start the benchmarking script as in the [demo](../README.md), pointing it to the load balancer host and port.
+```bash
+python benchmark_serving.py --host localhost --port 80 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 4000 --request-rate 20
+Initial test run completed. Starting main benchmark run...
+Traffic request rate: 20
+
+============ Serving Benchmark Result ============
+Successful requests: 4000
+Benchmark duration (s): 241.01
+Total input tokens: 888467
+Total generated tokens: 729546
+Request throughput (req/s): 16.60
+Output token throughput (tok/s): 3027.02
+Total Token throughput (tok/s): 6713.44
+---------------Time to First Token----------------
+Mean TTFT (ms): 1286.58
+Median TTFT (ms): 931.86
+P99 TTFT (ms): 4392.03
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms): 92.25
+Median TPOT (ms): 97.33
+P99 TPOT (ms): 122.52
+```
+
+
+## Multi GPU configuration for loading models exceeding a single card's VRAM
+
+It is possible to load models that are larger than the memory capacity of a single GPU card.
+Below is an example of deploying a 32B-parameter LLM model on 2 BMG cards.
+This configuration currently doesn't support continuous batching. It processes requests sequentially, so it is effective mainly in single-client use cases.
+Continuous batching with the multi GPU configuration will be added soon.
+
+### Start the Model Server instances
+
+Export the model:
+```bash
+python export_model.py text_generation --source_model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --model_name DeepSeek-R1-Distill-Qwen-32B_INT4 --weight-format int4 --model_repository_path models --target_device HETERO:GPU.0,GPU.1 --pipeline_type LM
+```
+
+```bash
+docker run --device /dev/dri -d --rm -p 8000:8000 -u 0 -v $(pwd)/models/DeepSeek-R1-Distill-Qwen-32B_INT4:/model:ro openvino/model_server:latest --rest_port 8000 --model_name deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --model_path /model
+```
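There is no load balancer in this setup, so requests go straight to port 8000. A minimal smoke test with `curl`, assuming the same OpenAI-compatible endpoint used by the benchmark command below:

```bash
# Send a single chat completion request directly to the model server on port 8000.
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "messages": [{"role": "user", "content": "What is OpenVINO Model Server?"}],
        "max_tokens": 64
      }'
```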
+
+### Testing the scalability
+
+Start the benchmarking script as in the [demo](../README.md), pointing it directly to the model server host and port.
+
+```bash
+python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --max-concurrency 1
+
+============ Serving Benchmark Result ============
+Successful requests: 10
+Benchmark duration (s): 232.18
+Total input tokens: 1372
+Total generated tokens: 2287
+Request throughput (req/s): 0.04
+Output token throughput (tok/s): 9.85
+Total Token throughput (tok/s): 15.76
+--------------Time to First Token---------------
+Mean TTFT (ms): 732.52
+Median TTFT (ms): 466.59
+P99 TTFT (ms): 1678.66
+----Time per Output Token (excl. 1st token)-----
+Mean TPOT (ms): 64.23
+Median TPOT (ms): 52.06
+P99 TPOT (ms): 132.36
+```
 
-TBD
