I see that currently 2 of the 3 pinned items on the mainline llama.cpp discussion page are devoted to GPT-OSS, so I thought it would be useful to have one here too.
Support
Support for the GPT-OSS models was added to ik_llama.cpp in PR #689. Support for MXFP4, the fp4 variant used in these models, was added in #682. Improvements to the Harmony chat template and tool-call support followed in #677 and #723.
Performance for these models was improved via SWA-specific optimizations in #692, #754 and #757. Hybrid inference PP performance was increased in #698 by offloading only the activated experts to the GPU (a feature that appears to be useful only for the GPT-OSS models).
Performance
There are two main advantages of ik_llama.cpp over llama.cpp: availability of higher quality quantization types and better performance. Hence, it makes sense to compare ik_llama.cpp and llama.cpp performance for GPT-OSS. I'll only provide results for the native MXFP4 quantization type. All results reported here are for the 20B variant.
The GPT-OSS models use sliding window attention (SWA) for every second layer, with a fairly small window size of 128 tokens. There is a significant difference between ik_llama.cpp and llama.cpp in how SWA layers are handled. In mainline, by default only the tokens required to compute the last u-batch are kept in the cache. This has the distinct advantage of reducing the KV-cache size by nearly a factor of 2. The downside is that the KV-cache is not reusable. To be able to reuse the KV-cache, one needs to add --swa-full to the llama.cpp command line. This keeps the full KV-cache for the SWA layers as well, and corresponds to what is done in ik_llama.cpp. Hence, when comparing performance, I'll provide llama.cpp results with and without --swa-full, the former being the fairer comparison (due to the usability limitations one gets without --swa-full).
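As a rough sketch of where the near-factor-of-2 saving comes from (a generic KV-cache size estimate, not code from either project):

$$
\text{KV bytes per layer} \;\approx\; 2 \cdot n_{\text{kv\_heads}} \cdot d_{\text{head}} \cdot n_{\text{tokens}} \cdot \text{bytes per element}.
$$

With --swa-full (and in ik_llama.cpp) every layer keeps $n_{\text{tokens}} = n_{\text{ctx}}$. In mainline's default mode the SWA layers, i.e. every second layer, only keep on the order of the 128-token window plus the u-batch, which is negligible next to a 32k context, so roughly half of the cache goes away.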
CUDA performance, GPT-OSS-20B-MXFP4
Tests are run on an RTX-4080 GPU in a Ryzen-7950X rig. The command line is given below (the arguments in square brackets only apply to ik_llama.cpp).
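A representative invocation, as a sketch only: the binary name, model filename and exact flag set here are assumptions, not necessarily the command that produced the numbers below.

```bash
# Hypothetical sweep-bench style run (filenames and flags are assumptions).
# The flag in square brackets applies only to ik_llama.cpp.
./build/bin/llama-sweep-bench -m gpt-oss-20b-mxfp4.gguf -c 32768 -ngl 100 -fa [-fmoe]
```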
PP performance
Here ik_llama.cpp is ~40% faster at small N_KV, and 65% faster at 30k tokens compared to the corresponding calculation with llama.cpp using --swa-full.
TG performance
Here performance is similar at low context length, but ik_llama.cpp is ~15% faster at 30k tokens compared to --swa-full. llama.cpp manages to outperform ik_llama.cpp by a small margin (4.5% at 30k tokens) for large N_KV when not using --swa-full. I think this is simply due to the much smaller KQ mask for the SWA layers that has to be copied to the GPU for each token prediction. The ik_llama.cpp mask is 4 MiB at 32k tokens, which takes ~0.27 ms over a 15 GiB/s PCIe link, i.e., ~3.5% of the time taken per generated token at 130 t/s. Without --swa-full the llama.cpp KQ SWA mask is ~15X smaller, so the time to copy it to the GPU is negligible.
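For reference, the arithmetic behind that estimate (assuming the 4 MiB mask is copied once per generated token over a ~15 GiB/s PCIe link):

$$
t_{\text{copy}} \approx \frac{4\,\text{MiB}}{15\,\text{GiB/s}} \approx 0.27\,\text{ms},
\qquad
\frac{t_{\text{copy}}}{t_{\text{token}}} \approx \frac{0.27\,\text{ms}}{(1/130)\,\text{s}} \approx \frac{0.27}{7.7} \approx 3.5\%.
$$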
llama.cpp with --swa-full

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 0.314 | 6530.99 | 2.751 | 186.09 |
| 2048 | 512 | 2048 | 0.300 | 6827.90 | 2.988 | 171.33 |
| 2048 | 512 | 4096 | 0.326 | 6275.32 | 3.203 | 159.87 |
| 2048 | 512 | 6144 | 0.353 | 5799.25 | 3.421 | 149.65 |
| 2048 | 512 | 8192 | 0.385 | 5322.84 | 3.316 | 154.43 |
| 2048 | 512 | 10240 | 0.415 | 4931.17 | 3.444 | 148.67 |
| 2048 | 512 | 12288 | 0.449 | 4560.65 | 3.599 | 142.27 |
| 2048 | 512 | 14336 | 0.473 | 4327.45 | 3.693 | 138.63 |
| 2048 | 512 | 16384 | 0.506 | 4044.18 | 3.798 | 134.81 |
| 2048 | 512 | 18432 | 0.534 | 3832.20 | 3.907 | 131.05 |
| 2048 | 512 | 20480 | 0.563 | 3634.44 | 4.018 | 127.44 |
| 2048 | 512 | 22528 | 0.587 | 3485.98 | 4.131 | 123.94 |
| 2048 | 512 | 24576 | 0.614 | 3334.87 | 4.261 | 120.15 |
| 2048 | 512 | 26624 | 0.652 | 3142.10 | 4.351 | 117.69 |
| 2048 | 512 | 28672 | 0.677 | 3023.10 | 4.463 | 114.71 |
llama.cpp without --swa-full

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 0.312 | 6560.28 | 2.751 | 186.13 |
| 2048 | 512 | 2048 | 0.285 | 7181.66 | 2.984 | 171.59 |
| 2048 | 512 | 4096 | 0.303 | 6762.65 | 3.091 | 165.66 |
| 2048 | 512 | 6144 | 0.315 | 6504.91 | 3.204 | 159.80 |
| 2048 | 512 | 8192 | 0.335 | 6111.65 | 3.155 | 162.28 |
| 2048 | 512 | 10240 | 0.347 | 5910.30 | 3.210 | 159.48 |
| 2048 | 512 | 12288 | 0.365 | 5616.79 | 3.288 | 155.74 |
| 2048 | 512 | 14336 | 0.385 | 5316.97 | 3.330 | 153.73 |
| 2048 | 512 | 16384 | 0.402 | 5092.58 | 3.388 | 151.10 |
| 2048 | 512 | 18432 | 0.411 | 4977.99 | 3.449 | 148.44 |
| 2048 | 512 | 20480 | 0.434 | 4724.01 | 3.499 | 146.32 |
| 2048 | 512 | 22528 | 0.450 | 4553.29 | 3.559 | 143.86 |
| 2048 | 512 | 24576 | 0.462 | 4434.42 | 3.623 | 141.31 |
| 2048 | 512 | 26624 | 0.482 | 4252.86 | 3.664 | 139.76 |
| 2048 | 512 | 28672 | 0.498 | 4109.87 | 3.724 | 137.49 |
ik_llama.cpp

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 0.205 | 10003.96 | 2.871 | 178.33 |
| 2048 | 512 | 2048 | 0.208 | 9826.36 | 2.964 | 172.75 |
| 2048 | 512 | 4096 | 0.230 | 8917.26 | 3.033 | 168.80 |
| 2048 | 512 | 6144 | 0.246 | 8330.35 | 3.122 | 164.01 |
| 2048 | 512 | 8192 | 0.263 | 7787.72 | 3.196 | 160.22 |
| 2048 | 512 | 10240 | 0.276 | 7409.15 | 3.274 | 156.40 |
| 2048 | 512 | 12288 | 0.292 | 7014.40 | 3.345 | 153.09 |
| 2048 | 512 | 14336 | 0.307 | 6661.16 | 3.412 | 150.08 |
| 2048 | 512 | 16384 | 0.322 | 6361.24 | 3.473 | 147.41 |
| 2048 | 512 | 18432 | 0.336 | 6102.96 | 3.543 | 144.53 |
| 2048 | 512 | 20480 | 0.351 | 5832.50 | 3.615 | 141.63 |
| 2048 | 512 | 22528 | 0.367 | 5586.18 | 3.685 | 138.95 |
| 2048 | 512 | 24576 | 0.381 | 5373.61 | 3.762 | 136.11 |
| 2048 | 512 | 26624 | 0.396 | 5174.20 | 3.833 | 133.59 |
| 2048 | 512 | 28672 | 0.410 | 4991.07 | 3.889 | 131.67 |
CPU performance, GPT-OSS-20B-MXFP4
Tests are run on a Ryzen-7950X CPU. The command line used is given below.
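A sketch of what such a CPU-only run might look like: the binary name, model filename, thread count and flags are assumptions, not necessarily the command that was actually used.

```bash
# Hypothetical CPU-only sweep-bench run (filenames, thread count and flags are assumptions).
# The flag in square brackets applies only to ik_llama.cpp; -ngl 0 keeps everything on the CPU.
./build/bin/llama-sweep-bench -m gpt-oss-20b-mxfp4.gguf -c 32768 -t 16 -ngl 0 -fa [-fmoe]
```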
PP performance
Not much comment is needed here. ik_llama.cpp is 3.6X faster at zero context, and ~8.8X faster at 25k tokens (I did not have the patience to wait for the llama.cpp results up to 32k tokens). There is no difference with and without --swa-full when running llama.cpp on the CPU.
TG performance
Here performance is about the same at zero context length. At 25k tokens ik_llama.cpp is about 1.5X faster.