Replies: 1 comment 4 replies
So no GPT-OSS 120B support? At least I can't make it output coherent text, only gibberish. Otherwise, please provide a to-the-point recipe: which quant to use and how to run it.
I see currently 2 of the 3 pinned items on the mainline llama.cpp discussion page are devoted to GPT-OSS, so I thought it would be useful to have one here too.

Support
Support for the GPT-OSS models was added to ik_llama.cpp in PR #689. Support for MXFP4, the fp4 variant used in these models, was added in #682. Improvements for the Harmony chat template and tool-call support followed in #677 and #723. Performance for these models was improved via SWA-specific optimizations in #692, #754 and #757. Hybrid-inference PP performance was increased in #698 by offloading only the activated experts to the GPU (a feature that appears to be useful only for the GPT-OSS models).
Performance
There are two main advantages of ik_llama.cpp over llama.cpp: availability of higher-quality quantization types and better performance. Hence, it makes sense to compare ik_llama.cpp and llama.cpp performance for GPT-OSS. I'll only provide results for the native MXFP4 quantization type. All results reported here are for the 20B variant.

The GPT-OSS models use sliding window attention (SWA) for every second layer, with a fairly small window size of 128 tokens. There is a significant difference in how ik_llama.cpp and llama.cpp handle SWA layers. In mainline, by default only the tokens required to compute the last u-batch are kept in the cache. This has the distinct advantage of reducing KV-cache size by nearly a factor of 2. The downside is that the KV-cache is not reusable. To be able to reuse the KV-cache, one needs to add --swa-full to the llama.cpp command line. This keeps the full KV-cache also for the SWA layers, and corresponds to what is done in ik_llama.cpp. Hence, when comparing performance, I'll provide llama.cpp results both with and without --swa-full; the --swa-full runs are the fairer comparison (due to the usability limitations one gets without --swa-full).
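For reference, a minimal sketch of what this looks like on mainline; the model file name and the other arguments are placeholders, not taken from this post:

```
# llama.cpp: keep the full KV cache also for the SWA layers so it can be reused,
# at the cost of roughly doubling the KV-cache size for those layers.
./llama-server -m gpt-oss-20b-mxfp4.gguf -ngl 100 --swa-full
```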
CUDA performance, GPT-OSS-20B-MXFP4

Tests are run on an RTX-4080 GPU in a Ryzen-7950X rig.
The command line is given below (the arguments in square brackets only apply to ik_llama.cpp).
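The following is a representative sketch of a sweep-bench run rather than the exact invocation; the tool name, model file name, and flag values are assumptions:

```
# Benchmark sketch for GPT-OSS-20B MXFP4, fully offloaded to the RTX-4080 (values assumed).
# -c 32768 : sweep the context up to 32k tokens
# -ngl 100 : offload all layers to the GPU
# -fa      : enable flash attention
# [-fmoe]  : ik_llama.cpp-only fused-MoE option (hence the square brackets)
./llama-sweep-bench -m gpt-oss-20b-mxfp4.gguf -c 32768 -ngl 100 -fa [-fmoe]
```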
PP performance
Here ik_llama.cpp is ~40% faster at small N_KV, and 65% faster at 30k tokens, compared to the corresponding calculation with llama.cpp using --swa-full.

TG performance
Here performance is similar at low context length, but ik_llama.cpp is ~15% faster at 30k tokens compared to --swa-full. llama.cpp manages to outperform ik_llama.cpp by a small margin (4.5% at 30k tokens) at large N_KV when not using --swa-full. I think this is simply due to the much smaller KQ mask for the SWA layers that has to be copied to the GPU for each token prediction. The ik_llama.cpp mask is 4 MiB at 32k tokens, which takes 0.27 ms to transfer over a 15 GiB/s PCI-E link, i.e. ~3.5% of the time taken per generated token at 130 t/s. Without --swa-full the llama.cpp KQ SWA mask is ~15X smaller, so the time to copy it to the GPU is negligible.
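As a rough check of that estimate, using the numbers quoted above:

$$
\frac{4\ \mathrm{MiB}}{15\ \mathrm{GiB/s}} \approx 0.27\ \mathrm{ms}, \qquad \frac{1}{130\ \mathrm{t/s}} \approx 7.7\ \mathrm{ms/token}, \qquad \frac{0.27\ \mathrm{ms}}{7.7\ \mathrm{ms}} \approx 3.5\%.
$$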
[Detailed results: llama.cpp with --swa-full, llama.cpp without --swa-full, ik_llama.cpp]
CPU performance, GPT-OSS-20B-MXFP4
Tests are run on a Ryzen-7950X CPU. The command line used is given below.
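Again, a sketch rather than the exact command; the flag values (and the ik_llama.cpp-only options in square brackets) are assumptions:

```
# CPU-only benchmark sketch on the Ryzen-7950X (values assumed).
# -t 16   : use the 16 physical cores
# -ngl 0  : no GPU offload
# [-rtr]  : ik_llama.cpp-only run-time repacking of the weights
# [-fmoe] : ik_llama.cpp-only fused-MoE option
./llama-sweep-bench -m gpt-oss-20b-mxfp4.gguf -c 32768 -t 16 -ngl 0 [-rtr -fmoe]
```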
PP performance
Not much comment is needed here. ik_llama.cpp is 3.6X faster at zero context, and ~8.8X faster at 25k tokens (I did not have the patience to wait for the llama.cpp results up to 32k tokens). There is no difference with and without --swa-full when running llama.cpp on the CPU.

TG performance
Here performance is about the same at zero context length. At 25k tokens ik_llama.cpp is about 1.5X faster.
[Detailed results: llama.cpp with --swa-full, llama.cpp without --swa-full, ik_llama.cpp]