So no GPT-OSS 120B support? At least I can't make it output coherent text, only gibberish. Otherwise, please provide a to-the-point recipe: which quant and how to run it.
I see that currently 2 of the 3 pinned items on the mainline llama.cpp discussion page are devoted to GPT-OSS, so I thought it would be useful to have one here too.
Support
Support for the GPT-OSS models was added to ik_llama.cpp in PR #689. Support for MXFP4, the fp4 variant used in these models, was added in #682. Improvements for the Harmony chat template and tool call support followed in #677 and #723.
Performance for these models was improved via SWA-specific optimizations in #692, #754 and #757. Hybrid inference PP performance was increased in #698 by offloading only the activated experts to the GPU (a feature that appears to be only useful for the GPT-OSS models).
Performance
There are two main advantages of ik_llama.cpp over llama.cpp: availability of higher quality quantization types and better performance. Hence, it makes sense to compare ik_llama.cpp and llama.cpp performance for GPT-OSS. I'll only provide results for the native MXFP4 quantization type. All results reported here are for the 20B variant.
The GPT-OSS models use sliding window attention (SWA) for every second layer, with a fairly small window size of 128 tokens. There is a significant difference between ik_llama.cpp and llama.cpp in how SWA layers are handled. In mainline, by default only the tokens required to compute the last u-batch are kept in the cache for the SWA layers. This has the distinct advantage of reducing KV-cache size by nearly a factor of 2. The downside is that the KV-cache is not reusable. To be able to reuse the KV-cache, one needs to add --swa-full to the llama.cpp command line. This keeps the full KV-cache also for SWA layers, and corresponds to what is done in ik_llama.cpp. Hence, when comparing performance, I'll provide llama.cpp results with and without --swa-full, the former being the fairer comparison (due to the usability limitations one gets without --swa-full).
CUDA performance, GPT-OSS-20B-MXFP4
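To make the --swa-full trade-off concrete before looking at the numbers, here is a rough Python sketch of the "nearly a factor of 2" KV-cache saving. It is only a back-of-the-envelope model, not how either code base sizes its cache. The function name `kv_cache_slots` is hypothetical, and the layer count (24), the u-batch size (512), and the simplification that an SWA layer keeps about window + u-batch tokens are all assumptions that do not come from the post.

```python
def kv_cache_slots(n_ctx, swa_full, n_layers=24, window=128, n_ubatch=512):
    """Count cached token slots summed over all layers (toy model).

    Assumptions (not from the post): n_layers=24, n_ubatch=512, and an SWA
    layer without the full cache keeps roughly window + n_ubatch tokens.
    """
    n_swa = n_layers // 2        # every second layer uses SWA
    n_full = n_layers - n_swa    # remaining layers always cache the full context
    swa_len = n_ctx if swa_full else min(n_ctx, window + n_ubatch)
    return n_full * n_ctx + n_swa * swa_len


n_ctx = 32768
# Ratio of the reduced cache to the full (--swa-full style) cache.
ratio = kv_cache_slots(n_ctx, swa_full=False) / kv_cache_slots(n_ctx, swa_full=True)
print(f"reduced / full KV-cache size at {n_ctx} tokens: {ratio:.3f}")
```

Under these assumptions the ratio approaches one half as the context grows, because the SWA layers stop contributing a context-length-dependent term.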
Tests are run on an RTX-4080 GPU in a Ryzen-7950X rig.
Command line is
(the arguments in square brackets only apply to ik_llama.cpp).
PP performance
Here ik_llama.cpp is ~40% faster at small N_KV, and 65% faster at 30k tokens compared to the corresponding calculation with llama.cpp using --swa-full.
TG performance
Here performance is similar at low context length, but ik_llama.cpp is ~15% faster at 30k tokens compared to llama.cpp with --swa-full. llama.cpp manages to outperform ik_llama.cpp by a small margin (4.5% at 30k tokens) for large N_KV when not using --swa-full. I think this is simply due to the much smaller KQ mask for the SWA layers that has to be copied to the GPU for each token prediction. The ik_llama.cpp mask is 4 MiB at 32k tokens, which takes 0.27 ms over a 15 GiB/s PCI-E link, or ~3.5% of the time per generated token at 130 t/s. Without --swa-full the llama.cpp KQ SWA mask is ~15X smaller, so the time to copy it to the GPU is negligible.
llama.cpp with --swa-full
llama.cpp without --swa-full
ik_llama.cpp
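The mask-copy estimate in the TG discussion above is simple arithmetic, and a short Python sketch makes it easy to redo for a different GPU link or generation speed. The function name `mask_copy_overhead` is hypothetical. The inputs (4 MiB mask, 15 GiB/s PCI-E, 130 t/s) are the figures quoted in the post, and the result should be read as a rough estimate that ignores transfer latency and any overlap with compute.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def mask_copy_overhead(mask_bytes, link_bytes_per_s, tokens_per_s):
    """Return (copy time in ms, fraction of per-token time spent copying)."""
    copy_ms = mask_bytes / link_bytes_per_s * 1e3   # time to move the mask once
    token_ms = 1e3 / tokens_per_s                   # time budget per generated token
    return copy_ms, copy_ms / token_ms


# Figures quoted in the post: 4 MiB mask, 15 GiB/s PCI-E, 130 t/s.
copy_ms, fraction = mask_copy_overhead(4 * MIB, 15 * GIB, 130.0)
print(f"copy time: {copy_ms:.2f} ms, share of token time: {fraction:.1%}")
```

The result is in line with the roughly 0.27 ms and ~3.5% quoted above; small differences come from rounding and from whether the bandwidth is taken in GiB/s or GB/s.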
CPU performance, GPT-OSS-20B-MXFP4
Tests are run on a Ryzen-7950X CPU. The command line used is
PP performance
Not much comment is needed here. ik_llama.cpp is 3.6X faster at zero context, and ~8.8X faster at 25k tokens (I did not have the patience to wait for the llama.cpp results up to 32k tokens). There is no difference with and without --swa-full when running llama.cpp on the CPU.
TG performance
Here performance is about the same at zero context length. At 25k tokens ik_llama.cpp is about 1.5X faster.
llama.cpp with --swa-full
llama.cpp without --swa-full
ik_llama.cpp