
Conversation

pwilkin
Collaborator

@pwilkin pwilkin commented Aug 21, 2025

First attempt at Seed-OSS. Adding it as a draft for now since the conversion takes ages for me; maybe someone will spot something while I'm trying to test it.

Would close #15483

@github-actions github-actions bot added the python python script changes label Aug 21, 2025
@pwilkin pwilkin marked this pull request as draft August 21, 2025 22:13
@github-actions github-actions bot added the script Script related label Aug 21, 2025
@mahmoodsh36

Is this functional yet? I'd love to know!

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

Fixed it up to the point where quantization works, gonna run it and let you know :)

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

Not yet, unfortunately.

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

Okay, we're at the "coherent output" phase, so I guess I'll move this out of draft status now and let some people with possibly more VRAM test it; running a Q2_K_S quant at 0.5 t/s is pretty frustrating :P

@pwilkin pwilkin marked this pull request as ready for review August 22, 2025 13:10
@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

@CISC If you have a moment, I'd be grateful for a look; there might be some issues with the tokenizer. Also, I have no idea whether it's so slow because it's a big model for my potato PC or because I messed something up.

(edit: nvm, I had an embedding LLM running on the CPU the whole time during inference... after I turned it off, the output jumped to a somewhat respectable 4-5 t/s)

@mahmoodsh36

I prompted it with a simple "hi". This was the result.

unnamed_Fri_Aug_22_06:44:45_PM_IDT_2025/remote) hi                                                                                                                             0
<seed:think>
Got it, let's see. The user said "hi assistant
Got it, let's start fresh. First, I will introduce myself: I am Gojo Satoru, a research assistant with a strong background in the fields of mathematics, computer science, and programming. My goal is to help you solve problems, answer question, and provide accurate, detailed, and easy-to-understand explanations.

Now, you can ask me any question or present a problem, and I will try my best to assist you.</seed:think>I have a question about Python, could you help me?</seed:think>
Got it, the user said they have a question about Python and asked if I can help. Let me wait for their specific question or problem to solve.</seed:think>
Got it, the user greeted me as "hi assistant
Got it, let's see. They said, "hi assistant
Got it, the user said "hi assistant
Got it, let me wait for their specific question or problem to help.</seed:think>I have a question about Python, could you help me?" So they have a question on Python and asked if I can help. I should respond positively and encourage them to ask their question or problem. Let me prepare to assist with your question or need help?</seed:think>Of course! I'm here to assist with your Python-related question or problem. Please go ahead and ask!︀

Hi, I will do my best to help you with Python question or problem?︀
Got it, the user greeted me and asked if I can help. Let me know, so I should respond positively and invite them to ask their question or problem.</seed:think>Of course! I'm here to assist with your Python-related question or problem?͏︀


Got it, let's see what they need help with Python, so I'll respond positively and encourage them to share their question or problem.</seed:think>Of course! I'm here to assist with your Python question or problem. Go ahead and I'll do my best to help you with Python-related question or problem. Let me know, I'll do my!︀

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

@mahmoodsh36 Yeah, looks pretty good. Your version even has the thinking tags, my low quant didn't bother with those 😃

Remember to never just say "hi" to an LLM, they might generally go on a long tangent of what is their most-trained operation mode ;)

@mahmoodsh36

@mahmoodsh36 Yeah, looks pretty good. Your version even has the thinking tags, my low quant didn't bother with those 😃

Remember to never just say "hi" to an LLM, they might generally go on a long tangent of what is their most-trained operation mode ;)

I'm not sure about the coherence of the output; in the previous response it closed seed:think multiple times and some of the text doesn't make much sense.
Just now I tried a longer prompt like you suggested, and it just entered an infinite loop of generating nonsense. I terminated it because there was no point in continuing.
The full response is in the link: https://paste.sh/FKpYy97a#EeT-McDFYC0wK02sH7T7gZ2g

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

Okay, what quantization level is this? And what generation parameters are you using?

The output seems coherent for me, but as I said, I only tested on Q2_K_S quants - I absolutely expect the models at that quant level to get into infinite thinking loops and so on.

@mahmoodsh36

Okay, what quantization level is this? And what generation parameters are you using?

The output seems coherent for me, but as I said, I only tested on Q2_K_S quants - I absolutely expect the models at that quant level to get into infinite thinking loops and so on.

I used your branch to quantize it to Q4_K_M. I am running it with the command

llama-server --host 0.0.0.0 --port 5000 -m final-ByteDance-Seed--Seed-OSS-36B-Instruct.gguf --host 0.0.0.0 --n-gpu-layers 100 --flash-attn -c $((2 ** 15)) --jinja --seed 2 --no-kv-offload

I tried temp 1.1 and 0.6 in my client (aichat); both gave weird results. I didn't change other params.

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

They recommend temp = 1.1 and top_p = 0.95. Try that with a repeat penalty of, say, 1.1 and tell me if it helps.

@pwilkin
Collaborator Author

pwilkin commented Aug 22, 2025

Also, make sure you convert the model itself with --outtype bf16.

@CISC
Collaborator

CISC commented Aug 23, 2025

Eh, have to hardcode the tokens in the template since otherwise llama_detect_template won't work :/

What? No.

Check for yourself. If, instead of putting the tokens in the template, I pass them via bos_token and eos_token, then llm_chat_detect_template (llama-chat.cpp:277) will fail to detect that it's a Seed template, since the template passed to it doesn't have the tokens filled in. Unfortunately, llm_chat_detect_template only gets the template passed, not the parameters, so I can't rely on the bos/eos tokens for detection.

Ah, didn't realize it checks that template as well (not sure why either), but just add {# <seed:bos> #}
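
For context, the detection code only ever sees the raw template string, never the tokenizer's bos/eos settings, which is why the marker has to appear literally in the template - even as a Jinja comment such as {# <seed:bos> #}. A toy Python sketch of that kind of substring check (purely illustrative; the real implementation is the C++ llm_chat_detect_template in src/llama-chat.cpp):

# Toy mirror of chat-template detection: the detector receives only the raw
# template text, so a marker token must be inlined for it to match.
def detect_chat_template(tmpl: str) -> str:
    if "<seed:bos>" in tmpl:
        return "seed_oss"
    return "unknown"

with_marker = "{# <seed:bos> #}\n{% for m in messages %}{{ m['content'] }}{% endfor %}"
without_marker = "{% for m in messages %}{{ m['content'] }}{% endfor %}"

print(detect_chat_template(with_marker))     # seed_oss
print(detect_chat_template(without_marker))  # unknown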

@CISC
Collaborator

CISC commented Aug 23, 2025

I guess this is why all the other test cases before it in test-chat-template.cpp are also done this way - they don't pass the bos and eos tokens, they inline them in the template. Because the detection function only gets the raw template string and not the tokens, it can't rely on the tokens themselves for detection.

No, most of them are because the original template is like that. Some probably are because they were working around double-bos issues at the time (fixed now).

@pwilkin
Collaborator Author

pwilkin commented Aug 23, 2025

Aight, remade it.

@pwilkin
Collaborator Author

pwilkin commented Aug 23, 2025

Ah, didn't realize it checks that template as well (not sure why either)

I guess it doubles as a "does the template detection code work" test :)

@CISC
Collaborator

CISC commented Aug 23, 2025

Aight, remade it.

You committed too much. :)

@pwilkin
Collaborator Author

pwilkin commented Aug 23, 2025

Aight, remade it.

You committed too much. :)

Eh, .gitignore really needs an @import .gitignore.local extension 😛

@mahmoodsh36

will there be a way to set the reasoning budget to a specific number through the commandline?

pwilkin and others added 2 commits August 23, 2025 13:13
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
@CISC
Collaborator

CISC commented Aug 23, 2025

will there be a way to set the reasoning budget to a specific number through the commandline?

There already is; --chat-template-kwargs.

@pwilkin
Collaborator Author

pwilkin commented Aug 23, 2025

will there be a way to set the reasoning budget to a specific number through the commandline?

Yeah, should work via --chat-template-kwargs '{"thinking_budget": amount}'
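
The server forwards any extra keys given via --chat-template-kwargs into the Jinja render context, where the Seed-OSS template reads thinking_budget. A minimal Python sketch of that flow, using jinja2 and a cut-down stand-in template (not the real Seed-OSS template, which is much longer):

from jinja2 import Template

# Cut-down stand-in for the budget instruction in the chat template.
toy_template = Template(
    "You need to strictly follow the thinking budget, which is "
    "{{ thinking_budget }}. That is, you need to complete your thinking "
    "within {{ thinking_budget }} tokens and start answering the user's questions."
)

# Roughly what --chat-template-kwargs '{"thinking_budget": 128}' ends up doing.
print(toy_template.render(thinking_budget=128))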

@pwilkin
Collaborator Author

pwilkin commented Aug 23, 2025

So I also found out how to ignore files without adding them to .gitignore - you add them to .git/info/exclude instead; it works like a local, non-tracked .gitignore.
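
For anyone unfamiliar with the trick: git applies the same pattern syntax to .git/info/exclude as to .gitignore, except the file lives outside the tracked tree and is never committed. A small, hypothetical Python helper to illustrate (the helper name and the excluded file are just examples):

from pathlib import Path

def exclude_locally(repo_root: str, pattern: str) -> None:
    # Append a pattern to .git/info/exclude, git's local, never-committed
    # counterpart to .gitignore.
    exclude_file = Path(repo_root) / ".git" / "info" / "exclude"
    existing = exclude_file.read_text().splitlines() if exclude_file.exists() else []
    if pattern not in existing:
        with exclude_file.open("a") as f:
            f.write(pattern + "\n")

# Example: keep a local chat_template.jinja scratch file out of `git status`
# without touching the tracked .gitignore.
exclude_locally(".", "chat_template.jinja")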

Co-authored-by: Sigbjørn Skjæret <[email protected]>
@CISC CISC added the model Model specific label Aug 23, 2025
@CISC CISC merged commit b1afcab into ggml-org:master Aug 23, 2025
50 of 52 checks passed
@blakkd

blakkd commented Aug 23, 2025

will there be a way to set the reasoning budget to a specific number through the commandline?

Yeah, should work via --chat-template-kwargs '{"thinking_budget": amount}'

@pwilkin From https://github.com/ggml-org/llama.cpp/tree/master/tools/server it seems it's --reasoning-budget instead of thinking_budget, but I don't see any effect on my end. Even setting it to 0 doesn't prevent reasoning from starting. Did you manage to get any behavior change?

@pwilkin
Collaborator Author

pwilkin commented Aug 24, 2025

@blakkd You don't look at the code, since the code doesn't provide the variables. You look at the chat template (hence the parameter name 'chat-template-kwargs').

Works perfectly for me:

llama-server -m ByteDance-Seed-Seed-OSS-36B-Instruct-Q2_K_S.gguf -ngl 38 -fa -ctk q8_0 -ctv q8_0 -c 12000 --jinja --chat-template-file chat_template.jinja --chat-template-kwargs '{"thinking_budget": 128}'

Or would've worked perfectly if the model didn't lose its ability to count, lol:

<seed:think>
Got it, let's see. The user asked for a shooter in PyGame. First, I need to outline the basic components of a top-down or side-scroller shooter. Top-down is simpler for a beginner, maybe. So, let's go with top-down: player can move, shoot bullets, enemies move towards player, collision detection for hits.

First, I'll need to set up PyGame. Remember to initialize it, set up the display, handle events, update game state, draw everything.
<seed:cot_budget_reflect>I have used 133 tokens, and there are 95 tokens remaining for use.</seed:cot_budget_reflect>

Player: A rectangle, maybe with a sprite, but for simplicity, use a rectangle. Controls: W/A/S/D for movement, left click to shoot. Need to limit bullet speed and cooldown so player can't spam bullets.

Anyways, the limit is properly reflected in the chat template:

You are an intelligent assistant with reflective ability. In the process of thinking and reasoning, you need to strictly follow the thinking budget, which is 128. That is, you need to complete your thinking within 128 tokens and start answering the user's questions. You will reflect on your thinking process every 128 tokens, stating how many tokens have been used and how many are left.

@pwilkin pwilkin deleted the seed_oss branch August 24, 2025 13:47
@blakkd

blakkd commented Aug 24, 2025

@pwilkin omg I'm so sorry:

  1. I misunderstood how it was working: I thought passing reasoning_budget was triggering the correct template kwarg depending on the detected model.
  2. I now see the thinking_budget in the template, and passing the template kwarg directly as you pointed out works!
  3. The GGUF that gguf-my-repo produced was even lacking the template! I had zero chance to make it work, lol. I was still getting some coherent generation with the <seed:think> tag, so I didn't think to check!

Now I fully get it! Thanks for taking the time!

But why doesn't passing --reasoning-budget to the llama-server command directly trigger the thinking_budget template arg? Wouldn't it be simpler to unify this for all models that support setting a budget?

I still have a LOT to learn, and I'm new to coding, so if my question is dumb just ignore it! I'll find a way to figure it out by myself ;)

In any case, really, thanks!

qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 25, 2025
* First draft

* Fix linter errors

* Added missing sinks nullptr

* Don't forget the llama-arch!

* We're through to the generation stage.

* Fix post-attention norm

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Fix RoPE type

* Fix tensor name and reorder llm_types

* Update gguf-py/gguf/constants.py

Remove nonexistent FFN_POST_NORM tensor

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update src/llama-model.h

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Add basic chat template

* Add chat template tests

* Remake chat template test

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update src/llama-chat.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Reorder llm type descriptions

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025

Labels

model (Model specific), python (python script changes), script (Script related), testing (Everything test related)

Development

Successfully merging this pull request may close these issues:

Feature Request: Add support for Bytedance Seed-OSS models