llama.cpp vs vllm performance comparison #15180
-
What is FlashInfer and why wasn't it used?
-
vLLM used to have a better production readiness level because paged attention and parallel processing simply worked better. It shouldn't be that different now that we have the "high throughput" mode. But installing vLLM is always a pain (it takes a long time) because it internally compiles its own FA2 build, which contains special attention modes like Dual Chunk Attention, sparse attention, etc. If you run the benchmark on longer prompts (60k+) with DCA, vLLM will easily beat llama.cpp. You can find the Python+CUDA implementation of Dual Chunk Attention here: Now it would be interesting to use FlashInfer or Dual Chunk Attention.
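As a rough sketch of what such a long-prompt measurement could look like against either server's OpenAI-compatible API (endpoint, model name, and prompt construction are placeholders; enabling DCA is server/model configuration and is not shown here):

```python
# Minimal sketch: time a single long-prompt completion against an
# OpenAI-compatible endpoint (llama.cpp server or vllm). The URL, model name,
# and prompt construction are placeholders; enabling Dual Chunk Attention is
# a matter of server/model configuration and is not shown here.
import time

import requests

URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
long_prompt = "lorem ipsum dolor sit amet " * 10000  # stand-in for a ~60k token prompt

t_start = time.time()
r = requests.post(URL, json={
    "model": "qwen2.5-3b-instruct",  # placeholder model name
    "prompt": long_prompt,
    "max_tokens": 128,
})
r.raise_for_status()
print(f"completed in {time.time() - t_start:.1f} s")
print(r.json()["usage"])  # prompt/completion token counts reported by the server
```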
-
I benchmarked llama.cpp vs. vllm. The TL;DR is that, in the space I tested, llama.cpp needed 93.6-100.2% of the time that vllm needed to finish a request for a single parallel request, or 99.2-125.6% of the time for 16 parallel requests.
Methodology
Using the newly updated `scripts/server-bench.py` I sent parallel requests to the llama.cpp server (with `LLAMA_SET_ROWS=1`) or the vllm server (without FlashInfer) using the OAI-compatible APIs. Both servers served Qwen 2.5 Instruct 3b, vllm with BF16, llama.cpp with FP16 (because the support for BF16 still has some issues; vllm seems to be the same speed for FP16 and BF16). The hardware was a single RTX 4090 frequency-limited to 1350 MHz. Each server received a fixed number of concurrent requests with a fixed number of prompt tokens and generation tokens. For 1/16 requests a maximum context of 31744/25600 tokens was probed. The number of requests per run was 32 x the number of parallel requests, and for each datapoint 6 independent runs were averaged. To separate the effects of the prompt length and the generation length, the following function was fit to the data:

$$\mathrm{rt}(n_p, n_g) = n_p \left( p_0 + p_c \frac{n_p}{2} \right) + n_g \left( g_0 + g_c \left( n_p + \frac{n_g}{2} \right) \right)$$

where $\mathrm{rt}$ is the runtime in seconds, $n_p$ / $n_g$ are the numbers of prompt/generation tokens, $p_0$ / $g_0$ are the base runtimes per prompt/generation token, and $p_c$ / $g_c$ are the runtimes per prompt/generation token and per context depth. In effect this function fits a runtime with a constant part per token (weights) and a runtime proportional to context depth (attention). Fits were done using kafe2.
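For illustration, a fit of the same functional form can be sketched with `scipy.optimize.curve_fit` instead of kafe2; the measurements below are made-up placeholders, not the data behind the numbers in this post:

```python
# Rough sketch: fit the runtime model above with scipy.optimize.curve_fit.
# The data points are made-up placeholders, not the actual measurements.
import numpy as np
from scipy.optimize import curve_fit

def runtime_model(x, p_0, p_c, g_0, g_c):
    """Constant cost per token plus a cost proportional to the average
    context depth seen by prompt and generated tokens."""
    n_p, n_g = x
    return (n_p * (p_0 + p_c * n_p / 2)
            + n_g * (g_0 + g_c * (n_p + n_g / 2)))

# Placeholder data: (prompt tokens, generated tokens) -> runtime in seconds.
n_prompt = np.array([512, 2048, 8192, 16384, 512, 2048, 8192, 16384], dtype=float)
n_gen    = np.array([128, 128, 128, 128, 1024, 1024, 1024, 1024], dtype=float)
runtimes = np.array([1.9, 3.1, 8.7, 19.5, 11.2, 12.9, 19.4, 31.0])

popt, pcov = curve_fit(runtime_model, (n_prompt, n_gen), runtimes,
                       p0=[1e-3, 1e-7, 1e-2, 1e-6])
perr = np.sqrt(np.diag(pcov))
for name, value, err in zip(["p_0", "p_c", "g_0", "g_c"], popt, perr):
    print(f"{name} = {value:.3e} +- {err:.3e}")
```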
Commands
Results
Data
Fit
1 concurrent request:
16 concurrent requests:
The runtimes are overall relatively close. I think the llama.cpp performance for 16 parallel requests could be improved by reducing the constant runtime per generated token. One thing that could be done is to move some of the samplers like top-k, top-p, and min-p into the ggml graph in order to cut down the number of token candidates before passing them to the rest of the sampler chain. More operation fusion and using FP16/BF16 for the ggml graphs would probably also help. I'm not sure how reliable the estimates of the runtime per token and per context depth are, since the model is a poor fit to the vllm data; sadly I was not able to probe vllm at very deep contexts because the CUDA backend would crash.
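As an illustration of that pruning step, here is a minimal sketch of top-k/top-p/min-p filtering on a logit vector; the thresholds and vocabulary size are placeholders, and this is not llama.cpp's actual sampler code:

```python
# Sketch of pruning token candidates with top-k / top-p / min-p before handing
# them to the rest of a sampler chain. Thresholds and vocabulary size are
# illustrative placeholders, not llama.cpp's actual sampler implementation.
import numpy as np

def prune_candidates(logits, top_k=40, top_p=0.95, min_p=0.05):
    """Return (token_ids, probs) of the candidates that survive pruning."""
    # top-k: keep only the k most likely tokens, sorted by descending logit
    ids = np.argsort(logits)[::-1][:top_k]
    probs = np.exp(logits[ids] - logits[ids].max())
    probs /= probs.sum()
    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    cutoff = np.searchsorted(np.cumsum(probs), top_p) + 1
    ids, probs = ids[:cutoff], probs[:cutoff]
    # min-p: drop tokens whose probability is below min_p times the top probability
    keep = probs >= min_p * probs[0]
    ids, probs = ids[keep], probs[keep]
    return ids, probs / probs.sum()

vocab_size = 151936  # on the order of Qwen 2.5's vocabulary
logits = np.random.randn(vocab_size).astype(np.float32)
ids, probs = prune_candidates(logits)
print(f"candidates passed on to the rest of the sampler chain: {len(ids)}")
```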