How to achieve maximum performance? #15215

gaby · 2025-03-20T12:19:02Z

gaby
Mar 20, 2025

I have a server with +10 GPU's, and over 1TB of RAM. I'm current running Mistral Small Instruct with vLLM v0.8.0 in Docker.

In v0.8.0 a new command for running benchmarks was added, when running these I get pretty much the same performance from running:

vLLM with 1 GPU (1 req/s faster than multi-gpu)
vLLM with all the gpus and --tensor-parallel-size set to 4 or 6

The only difference I see is that the logs say I have a way bigger KV cache size. Throughput wise I'm getting the same tokens generated and Req/s.

These are the flags i'm using:

gpu_memory_utilization 0.95
tensor_parallel_size 1/4/6 makes no difference other than KV cache size
max_num_batched_tokens 2048, 8183, 32000 all made minimal difference
enable_chunked_prefill=True

What am I missing? Why is the throughput not increasing, is this a vLLM issue or a vllm bench serve issue?

Is there any way of using some of that RAM to help vLLM? Right now the system is barely using 10GB of RAM.

Doan-IT · 2025-08-08T09:18:13Z

Doan-IT
Aug 8, 2025

Throughput Issues When Handling Multiple Concurrent Requests

I'm currently experiencing a performance issue while deploying a large language model (LLM) on the backend. It appears the total system throughput is a nearly fixed number, approximately 200 tokens/second, which seems to be dependent on the existing hardware.

When multiple requests are sent simultaneously—for instance, 3 or 4 requests—the throughput for each individual request drops significantly. Specifically, each request receives a throughput equal to the total fixed throughput divided by the number of concurrent requests.

This results in a very slow response time for each user, which negatively impacts the overall user experience.

I would greatly appreciate any support and shared experiences from those who have handled similar logic and deployment issues, as I aim to improve the system's performance.

Thank you for your time and consideration.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

How to achieve maximum performance? #15215

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

How to achieve maximum performance? #15215

Uh oh!

gaby Mar 20, 2025

Replies: 1 comment

Uh oh!

Doan-IT Aug 8, 2025

gaby
Mar 20, 2025

Doan-IT
Aug 8, 2025