
Conversation

pwilkin
Collaborator

@pwilkin pwilkin commented Aug 21, 2025

First attempt at Seed-OSS. Adding it as a draft for now since the conversion takes ages for me; maybe someone will spot something while I'm trying to test it.

Would close #15483

@github-actions github-actions bot added the python python script changes label Aug 21, 2025
@pwilkin pwilkin marked this pull request as draft August 21, 2025 22:13
@github-actions github-actions bot added the script Script related label Aug 21, 2025
@mahmoodsh36

Is this functional yet? I'd love to know!

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

Fixed it up to the point where quantization works, gonna run it and let you know :)

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

Not yet, unfortunately.

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

Okay, we're at the "coherent output" phase, so I guess I'll move this out of draft status now and let some people with possibly more VRAM test it; running a Q2_K_S quant at 0.5 t/s is pretty frustrating :P

@pwilkin pwilkin marked this pull request as ready for review August 22, 2025 13:10
@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

@CISC If you have a moment, I'd be grateful for a look; there might be some issues with the tokenizer. Also, I have no idea whether it's so slow because it's a big model for my potato PC or because I messed something up.

(edit: nvm, I had an embedding LLM running on the CPU the whole time during inference... after I turned it off, the output jumped to a somewhat respectable 4-5 t/s)

@mahmoodsh36

I prompted it with a simple "hi". This was the result.

unnamed_Fri_Aug_22_06:44:45_PM_IDT_2025/remote) hi                                                                                                                             0
<seed:think>
Got it, let's see. The user said "hi assistant
Got it, let's start fresh. First, I will introduce myself: I am Gojo Satoru, a research assistant with a strong background in the fields of mathematics, computer science, and programming. My goal is to help you solve problems, answer question, and provide accurate, detailed, and easy-to-understand explanations.

Now, you can ask me any question or present a problem, and I will try my best to assist you.</seed:think>I have a question about Python, could you help me?</seed:think>
Got it, the user said they have a question about Python and asked if I can help. Let me wait for their specific question or problem to solve.</seed:think>
Got it, the user greeted me as "hi assistant
Got it, let's see. They said, "hi assistant
Got it, the user said "hi assistant
Got it, let me wait for their specific question or problem to help.</seed:think>I have a question about Python, could you help me?" So they have a question on Python and asked if I can help. I should respond positively and encourage them to ask their question or problem. Let me prepare to assist with your question or need help?</seed:think>Of course! I'm here to assist with your Python-related question or problem. Please go ahead and ask!︀

Hi, I will do my best to help you with Python question or problem?︀
Got it, the user greeted me and asked if I can help. Let me know, so I should respond positively and invite them to ask their question or problem.</seed:think>Of course! I'm here to assist with your Python-related question or problem?͏︀


Got it, let's see what they need help with Python, so I'll respond positively and encourage them to share their question or problem.</seed:think>Of course! I'm here to assist with your Python question or problem. Go ahead and I'll do my best to help you with Python-related question or problem. Let me know, I'll do my!︀

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

@mahmoodsh36 Yeah, looks pretty good. Your version even has the thinking tags, my low quant didn't bother with those 😃

Remember to never just say "hi" to an LLM, they might generally go on a long tangent of what is their most-trained operation mode ;)

@mahmoodsh36

@mahmoodsh36 Yeah, looks pretty good. Your version even has the thinking tags, my low quant didn't bother with those 😃

Remember to never just say "hi" to an LLM, they might generally go on a long tangent of what is their most-trained operation mode ;)

I'm not sure about the coherence of the output; in the previous response it closed seed:think multiple times and some of the text doesn't make much sense.
Just now I tried a longer prompt like you suggested, and it just entered an infinite loop of generating nonsense. I terminated it because there was no point in continuing.
The full response is in the link: https://paste.sh/FKpYy97a#EeT-McDFYC0wK02sH7T7gZ2g

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

Okay, what quantization level is this? And what generation parameters are you using?

The output seems coherent for me, but as I said, I only tested on Q2_K_S quants - I absolutely expect the models at that quant level to get into infinite thinking loops and so on.

@mahmoodsh36

Okay, what quantization level is this? And what generation parameters are you using?

The output seems coherent for me, but as I said, I only tested on Q2_K_S quants - I absolutely expect the models at that quant level to get into infinite thinking loops and so on.

I used your branch to quantize it to Q4_K_M. I am running it with the command

llama-server --host 0.0.0.0 --port 5000 -m final-ByteDance-Seed--Seed-OSS-36B-Instruct.gguf --host 0.0.0.0 --n-gpu-layers 100 --flash-attn -c $((2 ** 15)) --jinja --seed 2 --no-kv-offload

I tried temp 1.1 and 0.6 in my client (aichat); both gave weird results. I didn't change other params.

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

They recommend temp = 1.1 and top_p = 0.95. Try that with a repeat penalty of, say, 1.1 and tell me if it helps.

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

Also, make sure you convert the model itself with --outtype bf16.

@CISC
Collaborator

CISC commented Aug 23, 2025

Eh, have to hardcode the tokens in the template since otherwise llama_detect_template won't work :/

What? No.

Check for yourself. If, instead of putting the tokens in the template, I pass them via bos_token and eos_token, then llm_chat_detect_template (llama-chat.cpp:277) will fail to detect that it's a Seed template, since the template passed to it doesn't have the tokens filled in. Unfortunately, llm_chat_detect_template only gets the template passed, not the parameters, so I can't rely on the bos/eos tokens for detection.

Ah, didn't realize it checks that template as well (not sure why either), but just add {# <seed:bos> #}
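
For context, the detection code only ever sees the raw template string, never the tokenizer's bos/eos settings, which is why the marker has to appear literally in the template - even as a Jinja comment such as {# <seed:bos> #}. A toy Python sketch of that kind of substring check (purely illustrative; the real implementation is the C++ llm_chat_detect_template in src/llama-chat.cpp):

# Toy mirror of chat-template detection: the detector receives only the raw
# template text, so a marker token must be inlined for it to match.
def detect_chat_template(tmpl: str) -> str:
    if "<seed:bos>" in tmpl:
        return "seed_oss"
    return "unknown"

with_marker = "{# <seed:bos> #}\n{% for m in messages %}{{ m['content'] }}{% endfor %}"
without_marker = "{% for m in messages %}{{ m['content'] }}{% endfor %}"

print(detect_chat_template(with_marker))     # seed_oss
print(detect_chat_template(without_marker))  # unknown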

@CISC
Collaborator

CISC commented Aug 23, 2025

I guess this is why all the other test cases before it in test-chat-template.cpp are also done this way - they don't pass the bos and eos tokens, they inline them in the template. Because the detection function only gets the raw template string and not the tokens, it can't rely on the tokens themselves for detection.

No, most of them are because the original template is like that. Some probably are because they were working around double-bos issues at the time (fixed now).

@pwilkin
Collaborator Author

pwilkin commented Aug 23, 2025

Aight, remade it.

@pwilkin
Collaborator Author

pwilkin commented Aug 23, 2025

Ah, didn't realize it checks that template as well (not sure why either)

I guess it doubles as a "does the template detection code work" test :)

@CISC
Collaborator

CISC commented Aug 23, 2025

Aight, remade it.

You committed too much. :)

@pwilkin
Collaborator Author

pwilkin commented Aug 23, 2025

Aight, remade it.

You committed too much. :)

Eh, .gitignore really needs an @import .gitignore.local extension 😛

@mahmoodsh36

will there be a way to set the reasoning budget to a specific number through the commandline?

pwilkin and others added 2 commits August 23, 2025 13:13
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
@CISC
Collaborator

CISC commented Aug 23, 2025

will there be a way to set the reasoning budget to a specific number through the commandline?

There already is; --chat-template-kwargs.

@pwilkin
Collaborator Author

pwilkin commented Aug 23, 2025

will there be a way to set the reasoning budget to a specific number through the commandline?

Yeah, should work via --chat-template-kwargs '{"thinking_budget": amount}'
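
The server forwards any extra keys given via --chat-template-kwargs into the Jinja render context, where the Seed-OSS template reads thinking_budget. A minimal Python sketch of that flow, using jinja2 and a cut-down stand-in template (not the real Seed-OSS template, which is much longer):

from jinja2 import Template

# Cut-down stand-in for the budget instruction in the chat template.
toy_template = Template(
    "You need to strictly follow the thinking budget, which is "
    "{{ thinking_budget }}. That is, you need to complete your thinking "
    "within {{ thinking_budget }} tokens and start answering the user's questions."
)

# Roughly what --chat-template-kwargs '{"thinking_budget": 128}' ends up doing.
print(toy_template.render(thinking_budget=128))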

@pwilkin
Collaborator Author

pwilkin commented Aug 23, 2025

So I also found out how to ignore files without adding them to .gitignore - you add them to .git/info/exclude instead; it works like a local, non-tracked .gitignore.
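
For anyone unfamiliar with the trick: git applies the same pattern syntax to .git/info/exclude as to .gitignore, except the file lives outside the tracked tree and is never committed. A small, hypothetical Python helper to illustrate (the helper name and the excluded file are just examples):

from pathlib import Path

def exclude_locally(repo_root: str, pattern: str) -> None:
    # Append a pattern to .git/info/exclude, git's local, never-committed
    # counterpart to .gitignore.
    exclude_file = Path(repo_root) / ".git" / "info" / "exclude"
    existing = exclude_file.read_text().splitlines() if exclude_file.exists() else []
    if pattern not in existing:
        with exclude_file.open("a") as f:
            f.write(pattern + "\n")

# Example: keep a local chat_template.jinja scratch file out of `git status`
# without touching the tracked .gitignore.
exclude_locally(".", "chat_template.jinja")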

Co-authored-by: Sigbjørn Skjæret <[email protected]>
@CISC CISC added the model Model specific label Aug 23, 2025
@CISC CISC merged commit b1afcab into ggml-org:master Aug 23, 2025
50 of 52 checks passed
@blakkd

blakkd commented Aug 23, 2025

will there be a way to set the reasoning budget to a specific number through the commandline?

Yeah, should work via --chat-template-kwargs '{"thinking_budget": amount}'

@pwilkin From https://github.com/ggml-org/llama.cpp/tree/master/tools/server it seems it's --reasoning-budget instead of thinking_budget, but I don't see any effect on my end. Even setting it to 0 doesn't prevent reasoning from starting. Did you manage to get any behavior change?

@pwilkin
Collaborator Author

pwilkin commented Aug 24, 2025

@blakkd You don't look at the code, since the code doesn't provide the variables. You look at the chat template (hence the parameter name 'chat-template-kwargs').

Works perfectly for me:

llama-server -m ByteDance-Seed-Seed-OSS-36B-Instruct-Q2_K_S.gguf -ngl 38 -fa -ctk q8_0 -ctv q8_0 -c 12000 --jinja --chat-template-file chat_template.jinja --chat-template-kwargs '{"thinking_budget": 128}'

Or would've worked perfectly if the model didn't lose its ability to count, lol:

<seed:think>
Got it, let's see. The user asked for a shooter in PyGame. First, I need to outline the basic components of a top-down or side-scroller shooter. Top-down is simpler for a beginner, maybe. So, let's go with top-down: player can move, shoot bullets, enemies move towards player, collision detection for hits.

First, I'll need to set up PyGame. Remember to initialize it, set up the display, handle events, update game state, draw everything.
<seed:cot_budget_reflect>I have used 133 tokens, and there are 95 tokens remaining for use.</seed:cot_budget_reflect>

Player: A rectangle, maybe with a sprite, but for simplicity, use a rectangle. Controls: W/A/S/D for movement, left click to shoot. Need to limit bullet speed and cooldown so player can't spam bullets.

Anyways, the limit is properly reflected in the chat template:

You are an intelligent assistant with reflective ability. In the process of thinking and reasoning, you need to strictly follow the thinking budget, which is 128. That is, you need to complete your thinking within 128 tokens and start answering the user's questions. You will reflect on your thinking process every 128 tokens, stating how many tokens have been used and how many are left.

@pwilkin pwilkin deleted the seed_oss branch August 24, 2025 13:47
@blakkd

blakkd commented Aug 24, 2025

@pwilkin omg I'm so sorry:

  1. I misunderstood how it was working: I thought passing reasoning_budget was triggering the correct template kwarg depending on the detected model.
  2. I now see the thinking_budget in the template, and passing the template kwarg directly as you pointed out works!
  3. The GGUF that gguf-my-repo produced was even lacking the template! I had zero chance to make it work, lol. I was still getting some coherent generation with the <seed:think> tag, so I didn't think to check!

Now I fully get it! Thanks for taking the time!

But why doesn't passing --reasoning-budget to the llama-server command directly trigger the thinking_budget template arg? Wouldn't it be simpler to unify this for all models that support setting a budget?

I still have a LOT to learn, and I'm new to coding, so if my question is dumb just ignore it! I'll find a way to figure it out by myself ;)

In any case, really, thanks!

qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 25, 2025
* First draft

* Fix linter errors

* Added missing sinks nullptr

* Don't forget the llama-arch!

* We're through to the generation stage.

* Fix post-attention norm

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Fix RoPE type

* Fix tensor name and reorder llm_types

* Update gguf-py/gguf/constants.py

Remove nonexistent FFN_POST_NORM tensor

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update src/llama-model.h

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Add basic chat template

* Add chat template tests

* Remake chat template test

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update src/llama-chat.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Reorder llm type descriptions

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025

Labels

model (Model specific), python (python script changes), script (Script related), testing (Everything test related)

Development

Successfully merging this pull request may close these issues:

Feature Request: Add support for Bytedance Seed-OSS models