llama.cpp vs vllm performance comparison #15180
-
What is FlashInfer and why wasn't it used?
-
vLLM used to have a better production readiness level because paged attention and parallel processing simply worked better. It shouldn't be that different now that we have the "high throughput" mode. But installing vLLM is always a pain (it takes a long time) because it internally compiles its own FA2 build, which contains special attention modes like Dual Chunk Attention, sparse attention, etc. If you run the benchmark on longer prompts (60k+) with DCA, vLLM will easily beat llama.cpp. You can find the Python+CUDA implementation of Dual Chunk Attention here: Now it would be interesting to use FlashInfer or Dual Chunk Attention.
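As a rough sketch of what such a long-prompt measurement could look like against either server's OpenAI-compatible API (endpoint, model name, and prompt construction are placeholders; enabling DCA is server/model configuration and is not shown here):

```python
# Minimal sketch: time a single long-prompt completion against an
# OpenAI-compatible endpoint (llama.cpp server or vllm). The URL, model name,
# and prompt construction are placeholders; enabling Dual Chunk Attention is
# a matter of server/model configuration and is not shown here.
import time

import requests

URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
long_prompt = "lorem ipsum dolor sit amet " * 10000  # stand-in for a ~60k token prompt

t_start = time.time()
r = requests.post(URL, json={
    "model": "qwen2.5-3b-instruct",  # placeholder model name
    "prompt": long_prompt,
    "max_tokens": 128,
})
r.raise_for_status()
print(f"completed in {time.time() - t_start:.1f} s")
print(r.json()["usage"])  # prompt/completion token counts reported by the server
```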
-
I benchmarked llama.cpp vs. vllm. The TL;DR is that, in the space I tested, llama.cpp needed 93.6-100.2% of the time that vllm needed to finish a request for a single parallel request, or 99.2-125.6% of the time for 16 parallel requests.
Methodology
Using the newly updated `scripts/server-bench.py` I sent parallel requests to the llama.cpp server (with `LLAMA_SET_ROWS=1`) or the vllm server (without FlashInfer) using the OAI-compatible APIs. Both servers served Qwen 2.5 Instruct 3b, vllm with BF16, llama.cpp with FP16 (because the support for BF16 still has some issues; vllm seems to be the same speed for FP16 and BF16). The hardware was a single RTX 4090 frequency-limited to 1350 MHz. Each server received a fixed number of concurrent requests with a fixed number of prompt tokens and generation tokens. For 1/16 requests a maximum context of 31744/25600 tokens was probed. The number of requests per run was 32 x the number of parallel requests, and for each datapoint 6 independent runs were averaged. To separate the effects of the prompt length and the generation length, the following function was fit to the data:

$$\mathrm{rt}(n_p, n_g) = n_p \left( p_0 + p_c \frac{n_p}{2} \right) + n_g \left( g_0 + g_c \left( n_p + \frac{n_g}{2} \right) \right)$$

where $\mathrm{rt}$ is the runtime in seconds, $n_p$ / $n_g$ are the numbers of prompt/generation tokens, $p_0$ / $g_0$ are the base runtimes per prompt/generation token, and $p_c$ / $g_c$ are the runtimes per prompt/generation token and per context depth. In effect this function fits a runtime with a constant part per token (weights) and a runtime proportional to context depth (attention). Fits were done using kafe2.
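For illustration, a fit of the same functional form can be sketched with `scipy.optimize.curve_fit` instead of kafe2; the measurements below are made-up placeholders, not the data behind the numbers in this post:

```python
# Rough sketch: fit the runtime model above with scipy.optimize.curve_fit.
# The data points are made-up placeholders, not the actual measurements.
import numpy as np
from scipy.optimize import curve_fit

def runtime_model(x, p_0, p_c, g_0, g_c):
    """Constant cost per token plus a cost proportional to the average
    context depth seen by prompt and generated tokens."""
    n_p, n_g = x
    return (n_p * (p_0 + p_c * n_p / 2)
            + n_g * (g_0 + g_c * (n_p + n_g / 2)))

# Placeholder data: (prompt tokens, generated tokens) -> runtime in seconds.
n_prompt = np.array([512, 2048, 8192, 16384, 512, 2048, 8192, 16384], dtype=float)
n_gen    = np.array([128, 128, 128, 128, 1024, 1024, 1024, 1024], dtype=float)
runtimes = np.array([1.9, 3.1, 8.7, 19.5, 11.2, 12.9, 19.4, 31.0])

popt, pcov = curve_fit(runtime_model, (n_prompt, n_gen), runtimes,
                       p0=[1e-3, 1e-7, 1e-2, 1e-6])
perr = np.sqrt(np.diag(pcov))
for name, value, err in zip(["p_0", "p_c", "g_0", "g_c"], popt, perr):
    print(f"{name} = {value:.3e} +- {err:.3e}")
```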
Commands
Results
Data
Fit
1 concurrent request:
16 concurrent requests:
The runtimes are overall relatively close. I think the llama.cpp performance for 16 parallel requests could be improved by reducing the constant runtime per generated token. One thing that could be done is to move some of the samplers like top-k, top-p, and min-p into the ggml graph in order to cut down the number of token candidates before passing them to the rest of the sampler chain. More operation fusion and using FP16/BF16 for the ggml graphs would probably also help. I'm not sure how reliable the estimates of the runtime per token and per context depth are, since the model is a poor fit to the vllm data; sadly I was not able to probe vllm at very deep contexts because the CUDA backend would crash.
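As an illustration of that pruning step, here is a minimal sketch of top-k/top-p/min-p filtering on a logit vector; the thresholds and vocabulary size are placeholders, and this is not llama.cpp's actual sampler code:

```python
# Sketch of pruning token candidates with top-k / top-p / min-p before handing
# them to the rest of a sampler chain. Thresholds and vocabulary size are
# illustrative placeholders, not llama.cpp's actual sampler implementation.
import numpy as np

def prune_candidates(logits, top_k=40, top_p=0.95, min_p=0.05):
    """Return (token_ids, probs) of the candidates that survive pruning."""
    # top-k: keep only the k most likely tokens, sorted by descending logit
    ids = np.argsort(logits)[::-1][:top_k]
    probs = np.exp(logits[ids] - logits[ids].max())
    probs /= probs.sum()
    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    cutoff = np.searchsorted(np.cumsum(probs), top_p) + 1
    ids, probs = ids[:cutoff], probs[:cutoff]
    # min-p: drop tokens whose probability is below min_p times the top probability
    keep = probs >= min_p * probs[0]
    ids, probs = ids[keep], probs[keep]
    return ids, probs / probs.sum()

vocab_size = 151936  # on the order of Qwen 2.5's vocabulary
logits = np.random.randn(vocab_size).astype(np.float32)
ids, probs = prune_candidates(logits)
print(f"candidates passed on to the rest of the sampler chain: {len(ids)}")
```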