Add GPT-OSS from OpenAI - closed in favor of 689 #683
Conversation
Model loads and runs (CPU only), but PPL is much too high (~1500 for 1st batch vs ~200 in mainline). Is it because of SWA, because of vocab, or did I introduce a bug somewhere?
It was the SWA that was missing in the previous commit. There are issues with EOG tokens, so this still needs to be added.
Just a copy from mainline
Haven't turned it on yet, but observe slightly better PP and slightly worse TG performance with that.
I would take prompt processing wins any day. While messing with these models on mainline through Open WebUI, it has become clear they are extremely sensitive to any deviations from the Harmony template. Tool calls can be really tricky if the model is not already familiar with the tools, at least for the 120B variant. I have a lot more notes here; I'll get more into the weeds as things progress.
Thanks for the heads up. It looks like there will be bigger changes required in
For tools it was not trained for, or tools that look familiar but not quite right, I have seen the model get a bit confused at times. It can be hard to decipher what the "caveman" simplified CoT reasoning is hinting at. A proper tool call looks like this:
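(Illustrative only; the function name and arguments here are made up, but the shape follows the Harmony format:)

```
<|start|>assistant<|channel|>commentary to=functions.get_current_weather <|constrain|>json<|message|>{"location": "Oslo"}<|call|>
```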
The tool result should then be given back in a new tool message, as such:
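(Again illustrative, with a made-up result payload:)

```
<|start|>functions.get_current_weather to=assistant<|channel|>commentary<|message|>{"temperature": 12, "condition": "rain"}<|end|>
```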
It has to be in this format: it expects the tool call from the model to be one complete message, and the result must then come back in a new message. Before a few things got better through some unmerged PRs and template fixes, I would see the model get confused while waiting for the tool result to return. It knew it had the data, but it kept waiting and was generally confused. I was using a simple forecast tool, and it would remark on how it was waiting for the tool function to return; somehow the tool call had commentary and probably an example result, but it still acted as if the tool had not returned. Sometimes it would return the data properly to me, sometimes it would issue more tool calls because it did not understand what was going on. It appears that the way tools are registered might leave it confused at times.

Without OAI's own jinja template and some PRs that have helped a lot, errors like this occur: (screenshots)

With the jinja template by OAI: (screenshot)

Without the updated template, but with some other fixes: (screenshots)

Here is something it said back when I first tried using the model:
LM Studio's implementation seems to work. They, however, appear to be bypassing jinja completely at all times for this model, and have added a library that can be updated separately to parse the template. That variant seems to work fine from my testing so far, but there's no source code as far as I can tell. For debugging at least, I think it will be important to log what is happening on the Harmony level. This is the PR over on mainline that is unmerged for now, but appears to fix the main bulk of the issues, and the official OAI jinja appears to help with the rest, or at least most of it.

Edit: it appears that the ggml-org variant has been updated with corrections to the template, as of... a day ago or so.

Edit 2: it did not. I still get better results from using a specified jinja template.
The Wikitext2 perplexity (PPL) computed with the mainline
Well, let's compute PPL as a function of context length. We all know PPL decreases with context length (at least up to the original training context), instruction tuned or not (or strange spaces at unexpected locations or not). Here is what we get (using mainline llama.cpp):

(graph: PPL vs. context length)

The black circles are for GPT-OSS-20B computed with mainline llama.cpp. As expected, the PPL of the DeepSeek model decreases with context length. We also see that the PPL of the GPT-OSS-20B model decreases with context length up to 64 tokens, but then goes completely crazy. Interestingly enough, the model has a head size of 64 and there are 64 attention heads, so I'm wondering if something went wrong with the MHA implementation/GGUF attention tensor preparation? But that's just speculation. In any case, I wouldn't know why the model's desperation to output

For the green data points, I have turned on usage of this FA implementation in ik_llama.cpp. But given these results, it is not unlikely that mainline's implementation is incorrect.
Given this appears to have been a rush job with a lot of new technologies, I would not be surprised. These models do appear to be a bit odd at times. I am not familiar enough with perplexity measurements to know if that could be related. Any way to test the perplexity of the other implementations?
You want to test perplexity with other inference frameworks such as vLLM? I'm not familiar with other toolkits, so perhaps someone else can chime in (or do the calculations)? Before someone else says that perplexity values computed with other toolkits are not directly comparable to
I will give it a shot. We are looking for major errors here. That graph you showed does not look right to me; I do not like how it goes up like that. I have had issues getting vLLM to work in the past, but I will try to get whatever data I can.
Transformers has a doc about it here.
Turning it off for now as performance becomes more variable, so perhaps I'm running into thermal throttling more often because of making the CPU work too hard.
I implemented, with some help from Claude, what is probably a mediocre variant of the perplexity measurement tool llama.cpp uses. I would not vouch for its accuracy or the numbers, but hopefully it will do fine. @ikawrakow predicted, and demonstrated, that IT models do just fine and follow this nice little graph.

(graph)

I think that is enough to demonstrate that my script isn't completely off the mark.

(graphs)

Next up I am adding support for using llama.cpp with this perplexity calculation script, to verify that it largely replicates what @ikawrakow showed. However, so far the vLLM results show the trendline we want to see; I would not focus on the numbers as they might be inflated.
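The core of the calculation is nothing fancy; a rough transformers-based sketch of the same idea (not my actual script, which talks to vLLM / a llama.cpp server; the model id and chunk limit are just placeholders) looks like this:

```python
# Sketch: perplexity as a function of context length over WikiText-2.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # placeholder, use whatever model you are testing
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids[0]

def ppl_at_context(n_ctx: int, max_chunks: int = 32) -> float:
    """Mean perplexity over non-overlapping chunks of length n_ctx."""
    total_nll, n_scored = 0.0, 0
    limit = min(len(ids), max_chunks * n_ctx)
    for start in range(0, limit - n_ctx + 1, n_ctx):
        chunk = ids[start:start + n_ctx].unsqueeze(0).to(model.device)
        with torch.no_grad():
            # labels == input_ids -> the model returns the mean NLL over the
            # n_ctx - 1 predicted tokens of this chunk
            loss = model(chunk, labels=chunk).loss
        total_nll += loss.item() * (n_ctx - 1)
        n_scored += n_ctx - 1
    return math.exp(total_nll / n_scored)

for n in (64, 128, 256, 512, 1024, 2048, 4096):
    print(f"n_ctx={n:5d}  PPL={ppl_at_context(n):.2f}")
```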
So, the first attempt at running it with llama.cpp as the backend showed the same trendline as in vLLM. However, some errors showed up during evaluation; the error was of this type:

Context Length 2048:

So while we did not get to replicate and confirm the upwards trend we were looking for, we still saw signs of something going wrong.
ah my bad, Linux |
OK, I was using some outdated settings! With the updated settings we are now at this for IK:

prompt eval time = 96.98 ms / 65 tokens ( 1.49 ms per token, 670.26 tokens per second)

and mainline is getting this with the same settings:

prompt eval time = 114.47 ms / 65 tokens ( 1.76 ms per token, 567.84 tokens per second)

I had

Still on Linux, the 20B model.
I would try and get rid of
Ah, good catch! I didn't think about the threads; the setup was mainly copied over from the 120B model. Also, DRY? I'm not sure what this means in this context.
I don't think it would be useful for this model (I feel like repetition penalty should be off with newer models), but it is a sampler with support added here in #513.
Try
I do think that DRY is better as a repetition penalty than the repetition penalty sampler, but it still only handles literal repetition, not semantic repetition, and it has a downside in that it produces "false positives" (I use quotes because "correct" is subjective).
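Roughly, the idea behind DRY is this (a toy sketch, not the implementation from #513; parameter names and defaults are only illustrative):

```python
# Toy sketch of the DRY idea: for each candidate next token, find how long a
# run it would repeat from earlier in the context, and penalize its logit
# exponentially in that length.
def dry_penalties(tokens, multiplier=0.8, base=1.75, allowed_length=2):
    """Return {candidate_token_id: logit_penalty} for the next sampling step."""
    penalties = {}
    n = len(tokens)
    for i in range(n - 1):
        # Length of the common suffix of tokens[:i+1] and the full context.
        match_len = 0
        while match_len <= i and tokens[i - match_len] == tokens[n - 1 - match_len]:
            match_len += 1
        if match_len >= allowed_length:
            candidate = tokens[i + 1]  # picking this token would extend the repeat
            penalty = multiplier * base ** (match_len - allowed_length)
            penalties[candidate] = max(penalties.get(candidate, 0.0), penalty)
    return penalties

# Usage: subtract the penalties from the raw logits before sampling, e.g.
# for tok_id, pen in dry_penalties(context_ids).items(): logits[tok_id] -= pen
```

That is also where the "false positives" come from: any long enough literal recurrence gets penalized, even when repeating it would be the correct thing to do.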
Yep, given the size and architecture, but it still looks a lot worse than I'd have expected.
Thank you for testing. A prompt of 65 tokens does not fully show the PP performance advantage of ik_llama.cpp. It is interesting that TG performance for the 20B model on your 3090 is lower than TG performance on my RTX-4080, which has a lower memory bandwidth than the 3090. This would indicate that TG for this model is at least partially compute/latency bound.
No problem, glad to help! In any case, it's not building on Windows.
I'm in the process of submitting a PR that enables CUDA graphs; I'll look into the Windows build failure when I'm done with that.
@espen96 Does the last commit fix the Windows build issue?
not quite
And now?
That seems to have done it! We have a build.
We are testing the 120B model today, on Windows. I gave it a bunch of GLSL code to explain, and here on IK that gives us:
and on mainline:
For a simple request for info on Napoleon, IK gives:
and mainline:
I will note that my rig is not ideal. To me, these numbers feel like they are within run-to-run variance on my rig. I should note that mainline just added SWA checkpoints; I don't have a build with that yet.
Thanks for testing. Can you also post the commands that you used for the tests? Thanks.
I haven't dealt with SWA at all in ik_llama.cpp.
@espen96 Can you try the 120B model with
That I can! On IK this bumps us up a bit; we are now closer to 19 TG/s on the simple Napoleon query. Similar results with mainline. And my GLSL query is up to around 18-19 TG/s on IK and mainline.
I have noticed that mainline has become better compared to last time I tested, but it is also model specific. In the tests I ran for PR #689
The changes in this PR were added to #689, so closing.
This PR adds support for OpenAI's GPT-OSS models.
Metal and Vulkan are missing, but I want to still declare it ready for testing to get feedback on the main platforms where ik_llama.cpp is being used.

I had to do a fairly significant refactoring to accommodate the required vocabulary changes, so please test with other models as well to make sure that I have not broken something.
In terms of performance, the PR is significantly faster than mainline CPU-only (see graphs below). On CUDA, I get ~40% better PP performance than mainline, but ~10% lower TG performance (RTX-4080, fully offloading the 20B variant).
Original description
Not ready, only CPU works (but there are issues with EOG tokens that need to be resolved).
The model uses biases for the MoE ops, so -fmoe is not functional (and if specified, it will be ignored). The optimized CPU flash attention is also not used as I need to add attention sinks there.
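For readers unfamiliar with attention sinks: the gist, as I understand these models, is a learned per-head logit that joins the softmax but has no value vector, so it only absorbs attention mass. A toy sketch (illustrative PyTorch, not the actual kernel code):

```python
# Toy illustration of an attention "sink": a learned per-head scalar logit that
# participates in the softmax but has no value vector, so it only soaks up
# probability mass. Shapes are simplified to a single query and head.
import torch

def attention_with_sink(q, k, v, sink_logit):
    # q: (d,), k: (n, d), v: (n, d_v), sink_logit: learned scalar for this head
    scores = (k @ q) / q.shape[-1] ** 0.5                 # (n,)
    logits = torch.cat([scores, sink_logit.reshape(1)])   # append the sink slot
    probs = torch.softmax(logits, dim=-1)                 # (n + 1,)
    return probs[:-1] @ v                                 # sink contributes no value
```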
Still opening a draft PR for visibility.
Getting the same (very high) PPL as mainline llama.cpp.

Quick performance comparison on a Ryzen-7950X CPU using the 20B MXFP4 model from ggml-org:

(table: ik_llama.cpp vs. llama.cpp)

So, TG is about the same, PP is ~3X faster.
Update 1
CUDA seems to be working now. Haven't added attention sinks to all FA implementations, so at this point FA will only work on Turing or newer.
It seems the mainline FA implementation is better for this model (head size is 64, a head size I haven't paid much attention to). Hence, PP is better than mainline up to a context of 8k tokens, but then falls behind. TG is about the same as mainline without FA, but is ~20% lower with FA, so clearly something is not quite right there. This model has a surprisingly large performance difference between FA and no FA even for small contexts.
Update 2
Added the ability to use -fmoe (CUDA only for now). With this, PP performance is now ~40% better than mainline for short contexts, and still better at 16k tokens (I cannot go beyond that with my 16 GB GPU and the MXFP4 model from ggml-org). TG performance also improves but is still lower than mainline. Hence I decided to test mainline without CUDA graphs, and, to my surprise, observed ~10% lower performance (surprise because when I last tested the benefit of CUDA graphs in mainline on Linux it was very minor, a 2-3% sort of thing).

There is another very interesting observation. While working on -fmoe I had a bug, which caused a crash when trying to run llama-bench or llama-sweep-bench, but perplexity was working fine for the default context of 512. So, I decided to see if I'd get the crash when running perplexity with a larger context. I used a context of 4096, it still worked, and I got a PPL of over 2000. I checked with mainline and, lo and behold, the same PPL also there (2155!!! to be precise). Based on that, I'm now almost 100% sure the implementation in mainline is incorrect. There is issue 15155 in mainline discussing the high PPL, and concerned users are being told that a high perplexity is normal for instruction tuned models. Ha ha, seriously? A PPL of 2000 after looking at a context of 4000 English-only tokens means the model has absolutely no idea how to continue the given text.
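For the record, with the usual definition over the N scored tokens:

$$
\mathrm{PPL} \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln p(t_i \mid t_{<i})\right),
\qquad
\mathrm{PPL}\approx 2000 \;\Rightarrow\; \Big(\textstyle\prod_i p(t_i\mid t_{<i})\Big)^{1/N} \approx \tfrac{1}{2000} = 0.05\%,
$$

i.e. the model assigns, on geometric average, a 1-in-2000 probability to the actual next token of ordinary English text.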
Update 3

Added attention sinks to the optimized CPU flash attention implementation. Here is what we get with sweep-bench for the 20B GPT-OSS model with Q8_0 KV cache running on a Ryzen-7950X CPU.

Prompt processing:
Token generation:
ik_llama.cpp PP is 3.4X faster at zero context and 5.5X faster for 8k tokens in the cache.
ik_llama.cpp TG is about the same at zero context and 19% faster for 8k tokens in the cache.

This is before having added a -fmoe CPU implementation.