# Scaling on a dual CPU socket server and multi-GPU hosts {#ovms_demos_continuous_batching_scaling}

## Scaling on dual CPU sockets

> **Note**: This demo uses Docker and has been tested only on Linux hosts

This demo deploys 6 instances of the model server allocated to different NUMA nodes on the server.

![drawing](./loadbalancing.png)

### Start the Model Server instances

Let's assume we have a dual CPU socket server with six NUMA nodes:
```bash
NUMA node0 CPU(s): 0-31,192-223
NUMA node1 CPU(s): 32-63,224-255
NUMA node2 CPU(s): 64-95,256-287
NUMA node3 CPU(s): 96-127,288-319
NUMA node4 CPU(s): 128-159,320-351
NUMA node5 CPU(s): 160-191,352-383
```

Export the model:
```bash
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/1/demos/common/export_models/export_model.py -o export_model.py
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/1/demos/common/export_models/requirements.txt
mkdir models
python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --model_name Meta-Llama-3-8B-Instruct_FP16 --weight-format fp16 --model_repository_path models
```

```bash
docker run --cpuset-cpus $(lscpu | grep node0 | cut -d: -f2) -d --rm -p 8003:8003 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_FP16:/model:ro openvino/model_server:latest --rest_port 8003 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
docker run --cpuset-cpus $(lscpu | grep node1 | cut -d: -f2) -d --rm -p 8004:8004 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_FP16:/model:ro openvino/model_server:latest --rest_port 8004 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
docker run --cpuset-cpus $(lscpu | grep node2 | cut -d: -f2) -d --rm -p 8005:8005 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_FP16:/model:ro openvino/model_server:latest --rest_port 8005 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
docker run --cpuset-cpus $(lscpu | grep node3 | cut -d: -f2) -d --rm -p 8006:8006 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_FP16:/model:ro openvino/model_server:latest --rest_port 8006 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
docker run --cpuset-cpus $(lscpu | grep node4 | cut -d: -f2) -d --rm -p 8007:8007 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_FP16:/model:ro openvino/model_server:latest --rest_port 8007 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
docker run --cpuset-cpus $(lscpu | grep node5 | cut -d: -f2) -d --rm -p 8008:8008 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_FP16:/model:ro openvino/model_server:latest --rest_port 8008 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
```
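
Equivalently, the six commands above can be generated in a short loop; a minimal sketch that assumes the same ports (8003-8008) and NUMA node names as above:

```bash
# Start one model server instance per NUMA node, one REST port per instance (8003-8008).
for i in 0 1 2 3 4 5; do
  port=$((8003 + i))
  docker run --cpuset-cpus $(lscpu | grep node$i | cut -d: -f2) -d --rm -p $port:$port \
    -v $(pwd)/models/Meta-Llama-3-8B-Instruct_FP16:/model:ro openvino/model_server:latest \
    --rest_port $port --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
done
```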

Confirm in the logs that the containers loaded the models successfully.
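
For example, a quick way to tail the startup logs of all instances (a sketch; it assumes no other `openvino/model_server` containers run on this host):

```bash
# Print the last log lines of every running model server container.
for id in $(docker ps -q --filter ancestor=openvino/model_server:latest); do
  echo "=== container $id ==="
  docker logs "$id" 2>&1 | tail -n 5
done
```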

### Start Nginx load balancer

The configuration below is a basic example distributing the clients across the six started instances. Save it as `nginx.conf`:
```
worker_processes 16;
worker_rlimit_nofile 40000;
events {
    worker_connections 10000;
}
stream {
    upstream ovms-cluster {
        least_conn;
        server localhost:8003;
        server localhost:8004;
        server localhost:8005;
        server localhost:8006;
        server localhost:8007;
        server localhost:8008;
    }
    server {
        listen 80;
        proxy_pass ovms-cluster;
    }
}
```
Start the Nginx container with:
```bash
docker run -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro -d --net=host -p 80:80 nginx
```
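
Before running the benchmark, you can optionally send a single request through the load balancer to confirm that traffic reaches the model server instances; a sketch using the same model name as in the deployment above:

```bash
# A single chat completion request routed through Nginx on port 80.
curl -s http://localhost/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 16
      }'
```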

### Testing the scalability

Let's use the benchmark_serving script from the vLLM repository:
```bash
git clone --branch v0.7.3 --depth 1 https://github.com/vllm-project/vllm
cd vllm
pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
cd benchmarks
curl -L https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json

python benchmark_serving.py --host localhost --port 80 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 6000 --request-rate 20
Initial test run completed. Starting main benchmark run...
Traffic request rate: 20
...
Median TPOT (ms):                        238.49
P99 TPOT (ms):                           261.74
```

## Scaling horizontally on a multi-GPU host

Throughput scalability on multi-GPU systems can be achieved by starting multiple instances, each assigned to a different card. The commands below were executed on a host with 4 Battlemage B580 GPU cards.

### Start the Model Server instances

List the GPU devices available on the host:
```bash
ls -1 /dev/dri/
by-path
card0
card1
card2
card3
renderD128
renderD129
renderD130
renderD131
```

Export the model:
```bash
python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --model_name Meta-Llama-3-8B-Instruct_INT4 --weight-format int4 --model_repository_path models --target_device GPU --cache 4
```

```bash
docker run --device /dev/dri/renderD128 -d --rm -p 8003:8003 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest --rest_port 8003 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
docker run --device /dev/dri/renderD129 -d --rm -p 8004:8004 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest --rest_port 8004 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
docker run --device /dev/dri/renderD130 -d --rm -p 8005:8005 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest --rest_port 8005 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
docker run --device /dev/dri/renderD131 -d --rm -p 8006:8006 -u 0 -v $(pwd)/models/Meta-Llama-3-8B-Instruct_INT4:/model:ro openvino/model_server:latest --rest_port 8006 --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path /model
```
Confirm in the logs that the containers loaded the models successfully.
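
To double-check that each container is bound to its intended card, you can watch per-device utilization while sending requests; a minimal sketch, assuming the `intel_gpu_top` tool (from the igt-gpu-tools package) is installed on the host:

```bash
# Live utilization of the first card; repeat with renderD129-renderD131 for the other instances.
sudo intel_gpu_top -d drm:/dev/dri/renderD128
```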

### Start Nginx load balancer

The configuration below is a basic example distributing the clients across the four started instances. Save it as `nginx.conf`:
```
worker_processes 16;
worker_rlimit_nofile 40000;
events {
    worker_connections 10000;
}
stream {
    upstream ovms-cluster {
        least_conn;
        server localhost:8003;
        server localhost:8004;
        server localhost:8005;
        server localhost:8006;
    }
    server {
        listen 80;
        proxy_pass ovms-cluster;
    }
}
```
Start the Nginx container with:
```bash
docker run -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro -d --net=host -p 80:80 nginx
```

### Testing the scalability

Start the benchmarking script like in the [demo](../README.md), pointing it at the load balancer host and port:
```bash
python benchmark_serving.py --host localhost --port 80 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 4000 --request-rate 20
Initial test run completed. Starting main benchmark run...
Traffic request rate: 20

============ Serving Benchmark Result ============
Successful requests:                     4000
Benchmark duration (s):                  241.01
Total input tokens:                      888467
Total generated tokens:                  729546
Request throughput (req/s):              16.60
Output token throughput (tok/s):         3027.02
Total Token throughput (tok/s):          6713.44
---------------Time to First Token----------------
Mean TTFT (ms):                          1286.58
Median TTFT (ms):                        931.86
P99 TTFT (ms):                           4392.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          92.25
Median TPOT (ms):                        97.33
P99 TPOT (ms):                           122.52
```

## Multi-GPU configuration for loading models exceeding a single card's VRAM

It is possible to load models bigger than the capacity of a single GPU card.
Below is an example of deploying a 32B parameter LLM on 2 Battlemage (BMG) cards.
This configuration currently doesn't support continuous batching. It processes requests sequentially, so it is effective mainly in a single-client use case.
Continuous batching with a multi-GPU configuration will be added soon.

### Start the Model Server instances

Export the model:
```bash
python export_model.py text_generation --source_model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --model_name DeepSeek-R1-Distill-Qwen-32B_INT4 --weight-format int4 --model_repository_path models --target_device HETERO:GPU.0,GPU.1 --pipeline_type LM
```

```bash
docker run --device /dev/dri -d --rm -p 8000:8000 -u 0 -v $(pwd)/models/DeepSeek-R1-Distill-Qwen-32B_INT4:/model:ro openvino/model_server:latest --rest_port 8000 --model_name deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --model_path /model
```
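
Loading and compiling a 32B model across two GPUs can take several minutes. Before benchmarking, you can confirm the model is ready; a sketch using the model server's configuration status endpoint:

```bash
# The model should be reported with state AVAILABLE once loading completes.
curl -s http://localhost:8000/v1/config
```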

### Testing the deployment

Start the benchmarking script like in the [demo](../README.md), pointing it directly at the model server port:

```bash
python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --max-concurrency 1

============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  232.18
Total input tokens:                      1372
Total generated tokens:                  2287
Request throughput (req/s):              0.04
Output token throughput (tok/s):         9.85
Total Token throughput (tok/s):          15.76
--------------Time to First Token---------------
Mean TTFT (ms):                          732.52
Median TTFT (ms):                        466.59
P99 TTFT (ms):                           1678.66
----Time per Output Token (excl. 1st token)-----
Mean TPOT (ms):                          64.23
Median TPOT (ms):                        52.06
P99 TPOT (ms):                           132.36
```