Model: Add support for Seed-OSS #15490
Conversation
Is this functional yet? I'd love to know!
Fixed it up to the point where quantization works, gonna run it and let you know :)
Not yet, unfortunately.
Okay, we're at the "coherent output" phase, so I guess I'll move this from draft status now and let some people with possibly more VRAM test it; running a Q2_K_S quant at 0.5 t/s is pretty frustrating :P
@CISC If you had a moment, I'd be grateful for a look. There might be some issues with the tokenizer; also, I have no idea if it's so slow because it's a big model for my potato PC or because I messed something up. (edit: nvm, I had an embedding LLM running on CPU the whole time during inference... after I turned it off, the output jumped to a somewhat respectable 4-5 t/s)
I prompted it with a simple "hi". This was the result:
@mahmoodsh36 Yeah, looks pretty good. Your version even has the thinking tags, my low quant didn't bother with those 😃 Remember to never just say "hi" to an LLM, they might go on a long tangent about their most-trained operation mode ;)
I'm not sure about the coherence of the output; in the previous response it closed `seed:think` multiple times and some of the text doesn't make much sense.
Okay, what quantization level is this? And what generation parameters are you using? The output seems coherent for me, but as I said, I only tested on Q2_K_S quants - I absolutely expect the models at that quant level to get into infinite thinking loops and so on.
I used your branch to quantize it to Q4_K_M. I am running with the command:

```bash
llama-server --host 0.0.0.0 --port 5000 -m final-ByteDance-Seed--Seed-OSS-36B-Instruct.gguf --host 0.0.0.0 --n-gpu-layers 100 --flash-attn -c $((2 ** 15)) --jinja --seed 2 --no-kv-offload
```

I tried temp 1.1 and 0.6 in my client (aichat); both gave weird results. I didn't change other params.
They recommend temp = 1.1 and top_p = 0.95. Try that with a repeat penalty of, say, 1.1 and tell me if it helps.
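For reference, those samplers can be set directly on the llama-server command line; a minimal sketch, reusing the GGUF file name from above (the flag values just mirror the recommendation):

```bash
llama-server -m final-ByteDance-Seed--Seed-OSS-36B-Instruct.gguf --jinja \
  --temp 1.1 --top-p 0.95 --repeat-penalty 1.1
```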
Also, make sure you convert the model itself with `--outtype bf16`.
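That is, something along these lines, using llama.cpp's `convert_hf_to_gguf.py` script (the paths are illustrative):

```bash
# Convert the HF checkpoint to a bf16 GGUF before quantizing
python convert_hf_to_gguf.py ./Seed-OSS-36B-Instruct \
  --outtype bf16 --outfile seed-oss-36b-instruct-bf16.gguf
```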
Ah, didn't realize it checks that template as well (not sure why either), but just add
No, most of them are because the original template is like that. Some probably are because they were working around double-BOS issues at the time (fixed now).
Aight, remade it.
I guess it doubles as a "does the template detection code work" test :)
You committed too much. :)
Eh,
Will there be a way to set the reasoning budget to a specific number through the command line?
There already is;
Yeah, should work via `--chat-template-kwargs`.
So I also found out how to ignore files without adding them to `.gitignore`.
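(Presumably via `.git/info/exclude`, which takes `.gitignore` syntax but stays local to your clone:)

```bash
# Local-only ignore: same patterns as .gitignore, never committed
echo "my-scratch-files/" >> .git/info/exclude
```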
@pwilkin From https://github.com/ggml-org/llama.cpp/tree/master/tools/server it seems it's
@blakkd You don't look at the code, since the code doesn't provide the variables. You look at the chat template (hence the parameter name `chat-template-kwargs`). Works perfectly for me:
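A minimal sketch of such an invocation, assuming the Seed-OSS template exposes a `thinking_budget` variable (the model file name is illustrative):

```bash
# Pass template variables as a JSON object; the key must match the template's variable name
llama-server -m seed-oss-36b-instruct-q4_k_m.gguf --jinja \
  --chat-template-kwargs '{"thinking_budget": 128}'
```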
Or would've worked perfectly if the model didn't lose its ability to count, lol:

```
<seed:think>
Got it, let's see. The user asked for a shooter in PyGame. First, I need to outline the basic components of a top-down or side-scroller shooter. Top-down is simpler for a beginner, maybe. So, let's go with top-down: player can move, shoot bullets, enemies move towards player, collision detection for hits.
First, I'll need to set up PyGame. Remember to initialize it, set up the display, handle events, update game state, draw everything.
<seed:cot_budget_reflect>I have used 133 tokens, and there are 95 tokens remaining for use.</seed:cot_budget_reflect>
Player: A rectangle, maybe with a sprite, but for simplicity, use a rectangle. Controls: W/A/S/D for movement, left click to shoot. Need to limit bullet speed and cooldown so player can't spam bullets.
```

Anyways, the limit is properly reflected in the chat template:

> You are an intelligent assistant with reflective ability. In the process of thinking and reasoning, you need to strictly follow the thinking budget, which is 128. That is, you need to complete your thinking within 128 tokens and start answering the user's questions. You will reflect on your thinking process every 128 tokens, stating how many tokens have been used and how many are left.
@pwilkin omg I'm so sorry:
Now I fully get it! Thanks for taking the time! But why passing

I still have a LOT to learn, I'm even new to coding, so if my question is dumb just ignore it! I'll find a way to figure it out by myself ;) In any case, really thanks!
* First draft
* Fix linter errors
* Added missing sinks nullptr
* Don't forget the llama-arch!
* We're through to the generation stage.
* Fix post-attention norm
* Apply suggestions from code review
* Fix RoPE type
* Fix tensor name and reorder llm_types
* Update gguf-py/gguf/constants.py: remove nonexistent FFN_POST_NORM tensor
* Update src/llama-model.h
* Add basic chat template
* Add chat template tests
* Remake chat template test
* Apply suggestions from code review
* Update src/llama-chat.cpp
* Reorder llm type descriptions
* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>
First attempt at Seed-OSS; adding it as a draft for now since the conversion takes ages for me. Maybe someone will spot something while I'm trying to test it.
Would close #15483