Throughput Issues When Handling Multiple Concurrent Requests

I'm running into a performance issue while deploying a large language model (LLM) on the backend. The total system throughput appears to be roughly fixed at about 200 tokens/second, presumably bounded by the existing hardware. When multiple requests arrive at the same time (for instance, 3 or 4 requests), the throughput of each individual request drops sharply: each request gets roughly the total fixed throughput divided by the number of concurrent requests. This results in a very slow response time for each user, which negatively impacts the overall user experience.

I would greatly appreciate any support and shared experiences from those who have handled similar deployment issues, as I'm trying to improve the system's performance. Thank you for your time and consideration.
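For context, here's a minimal sketch of how I measure the behaviour described above, assuming an OpenAI-compatible vLLM-style server on localhost:8000; the endpoint, model name, and prompt are placeholders, not my real deployment:

```python
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

# Placeholder endpoint and model name -- adjust to your own deployment.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "my-llm"


async def one_request(prompt: str) -> float:
    """Send one completion request and return its rough tokens/second."""
    start = time.perf_counter()
    resp = await client.completions.create(
        model=MODEL, prompt=prompt, max_tokens=256
    )
    elapsed = time.perf_counter() - start
    return resp.usage.completion_tokens / elapsed


async def main(concurrency: int) -> None:
    # Fire N identical requests at once and compare per-request rates.
    rates = await asyncio.gather(
        *[one_request("Write a short story about a robot.") for _ in range(concurrency)]
    )
    print(f"concurrency={concurrency}")
    print(f"  per-request tok/s: {[round(r, 1) for r in rates]}")
    print(f"  rough aggregate:   {sum(rates):.1f} tok/s")


if __name__ == "__main__":
    for n in (1, 2, 4):
        asyncio.run(main(n))
```

The "aggregate" printed here is only a rough sum of the per-request rates, but it is enough to see the pattern: the total stays pinned near ~200 tok/s while each per-request rate shrinks roughly as 1/N.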
I have a server with 10+ GPUs and over 1 TB of RAM. I'm currently running Mistral Small Instruct with vLLM v0.8.0 in Docker.
In v0.8.0 a new command for running benchmarks was added; when I run these benchmarks I get pretty much the same performance from running:
The only difference I see is that the logs report a much bigger KV cache size. Throughput-wise, I'm getting the same number of generated tokens and the same req/s.
These are the flags I'm using:
What am I missing? Why is the throughput not increasing? Is this a vLLM issue or a vllm bench serve issue? Is there any way of using some of that RAM to help vLLM? Right now the system is barely using 10 GB of RAM.
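For reference, here's a minimal sketch of the engine knobs I've been looking at for using more GPUs and some of that CPU RAM. The argument names are taken from vLLM's EngineArgs as I understand them around 0.8.x, and the model id and values are placeholders rather than my actual flags, so please double-check them against vllm serve --help:

```python
from vllm import LLM, SamplingParams

# A minimal sketch, not my exact setup: all values below are placeholders.
llm = LLM(
    model="mistralai/Mistral-Small-Instruct-2409",  # assumed model id
    tensor_parallel_size=8,       # shard the model across 8 GPUs
    gpu_memory_utilization=0.90,  # fraction of each GPU reserved for vLLM
    swap_space=16,                # GiB of CPU RAM per GPU for KV-cache swapping
    cpu_offload_gb=0,             # optionally offload part of the weights to CPU RAM
    max_num_seqs=256,             # cap on concurrently batched sequences
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

My understanding is that swap_space and cpu_offload_gb mostly buy memory headroom (a bigger KV cache, or room for weights) rather than raw generation speed, so on this hardware any throughput gain would have to come from tensor parallelism and from actually batching more concurrent sequences.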