| 1 | +# Seed-OSS-36B Usage Guide |
| 2 | + |
| 3 | +This guide describes how to run Seed-OSS-36B models with vLLM and native BF16 precision. Seed-OSS features unique "thinking budget" functionality for controlled reasoning and supports up to 512K context length. |
| 4 | + |
| 5 | +## Installing vLLM |
| 6 | + |
| 7 | +Seed-OSS support was recently added to the vLLM main branch and is not yet available in any official release, so install vLLM from source: |
| 8 | + |
| 9 | +```bash |
| 10 | +uv venv |
| 11 | +source .venv/bin/activate |
| 12 | +uv pip install git+https://github.com/vllm-project/vllm.git |
| 13 | +``` |
| 14 | + |
| 15 | +You may also need to install `transformers` from source, pinned to the commit below, for compatibility: |
| 16 | + |
| 17 | +```bash |
| 18 | +uv pip install git+https://github.com/huggingface/transformers.git@56d68c6706ee052b445e1e476056ed92ac5eb383 |
| 19 | +``` |
| 20 | + |
| 21 | +## Running Seed-OSS-36B with BF16 |
| 22 | + |
| 23 | +There are two ways to parallelize the model over multiple GPUs: (1) tensor parallelism and (2) data parallelism. Tensor parallelism usually suits low-latency, low-load scenarios, while data parallelism works better for high-throughput workloads under heavy load. |
| 24 | + |
| 25 | +Run tensor-parallel like this: |
| 26 | + |
| 27 | +```bash |
| 28 | +vllm serve ByteDance-Seed/Seed-OSS-36B-Instruct \ |
| 29 | + --host localhost \ |
| 30 | + --port 8000 \ |
| 31 | + --tensor-parallel-size 8 \ |
| 32 | + --enable-auto-tool-choice \ |
| 33 | + --tool-call-parser seed_oss |
| 34 | +``` |
| 35 | + |
| 36 | +* You can set `--max-model-len` to save memory. `--max-model-len=65536` is usually sufficient for most scenarios; the maximum is 512K. |
| 37 | +* You can set `--max-num-batched-tokens` to balance throughput and latency: higher values give higher throughput but also higher latency. `--max-num-batched-tokens=32768` usually works well for prompt-heavy workloads, and you can reduce it to 16K or 8K to lower activation memory usage and latency. |
| 38 | +* By default, vLLM conservatively uses 90% of GPU memory. You can set `--gpu-memory-utilization=0.95` to leave more room for the KV cache. |
| 39 | +* Keep the `--enable-auto-tool-choice` and `--tool-call-parser seed_oss` flags from the command above so that tool calling is properly enabled. |
| 40 | + |
| 41 | +## Thinking Budget Feature |
| 42 | + |
| 43 | +Users can flexibly specify the model's thinking budget. For simpler tasks (such as IFEval), the model's chain of thought (CoT) is shorter, and the score exhibits fluctuations as the thinking budget increases. For more challenging tasks (such as AIME and LiveCodeBench), the model's CoT is longer, and the score improves with an increase in the thinking budget. |
| 44 | + |
| 45 | +If no thinking budget is set (default mode), Seed-OSS will initiate thinking with unlimited length. If a thinking budget is specified, users are advised to prioritize values that are integer multiples of 512 (e.g., 512, 1K, 2K, 4K, 8K, or 16K), as the model has been extensively trained on these intervals. Models are instructed to output a direct response when the thinking budget is 0, and we recommend setting any budget below 512 to this value. |
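The recommendation above can be sketched as a small client-side helper. This is only an illustration, not part of vLLM or Seed-OSS: the names `normalize_thinking_budget` and `build_extra_body` are hypothetical, rounding down to the nearest multiple of 512 is one possible policy, and rejecting negative values is an assumption of this sketch.

```python
from typing import Optional


def normalize_thinking_budget(budget: Optional[int], step: int = 512) -> Optional[int]:
    """Map a requested budget onto the recommended values (hypothetical helper).

    None means "no budget set", i.e. the default mode with unlimited thinking.
    Any other value is rounded down to a multiple of `step`, so budgets
    below 512 become 0 (direct response without thinking).
    """
    if budget is None:
        return None
    if budget < 0:
        # Assumption of this sketch: negative budgets are treated as invalid.
        raise ValueError("thinking budget must be non-negative")
    return (budget // step) * step


def build_extra_body(budget: Optional[int]) -> dict:
    """Build the `extra_body` dict for the OpenAI client (hypothetical helper)."""
    normalized = normalize_thinking_budget(budget)
    if normalized is None:
        return {}  # omit the kwarg entirely to keep the default mode
    return {"chat_template_kwargs": {"thinking_budget": normalized}}
```

For example, `build_extra_body(1000)` yields a budget of 512, and `build_extra_body(300)` yields a budget of 0.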
| 46 | + |
| 47 | +## Usage Examples |
| 48 | + |
| 49 | +### OpenAI Client Usage |
| 50 | + |
| 51 | +You can query the server with the OpenAI client. To control the thinking budget, pass `thinking_budget` inside `chat_template_kwargs` via `extra_body`: |
| 52 | + |
| 53 | +```python |
| 54 | +from openai import OpenAI |
| 55 | + |
| 56 | +openai_api_key = "EMPTY" |
| 57 | +openai_api_base = "http://localhost:8000/v1" |
| 58 | + |
| 59 | +client = OpenAI( |
| 60 | + api_key=openai_api_key, |
| 61 | + base_url=openai_api_base, |
| 62 | +) |
| 63 | + |
| 64 | +models = client.models.list() |
| 65 | +model = models.data[0].id |
| 66 | + |
| 67 | +messages = [ |
| 68 | + {"role": "system", "content": "You are a helpful assistant"}, |
| 69 | + {"role": "user", "content": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"} |
| 70 | +] |
| 71 | +extra_body = {"chat_template_kwargs": {"thinking_budget": 512}} |
| 72 | +response = client.chat.completions.create( |
| 73 | + model=model, messages=messages, extra_body=extra_body |
| 74 | +) |
| 75 | +content = response.choices[0].message.content |
| 76 | +print("content:\n", content) |
| 77 | +``` |
| 78 | + |
| 79 | +### Example Outputs |
| 80 | + |
| 81 | +**thinking_budget = 512**: |
| 82 | +``` |
| 83 | +content: |
| 84 | + <seed:think> |
| 85 | +Got it, let's try to figure out this problem step by step. First, the question is about Janet's ducks laying eggs, and we need to find out how much money she makes at the farmers' market each day. |
| 86 | +<seed:cot_budget_reflect>I have used 138 tokens, and there are 374 tokens remaining for use.</seed:cot_budget_reflect> |
| 87 | + Let's start by listing out the information given. |
| 88 | +
|
| 89 | +First, her ducks lay 16 eggs per day. That's the total number of eggs she has each day, right? Then, she does a few things with these eggs: she eats three for breakfast every morning, bakes muffins with four every day, and sells the remainder at the farmers' market. Each of those sold eggs is $2, so we need to find the remainder first and then multiply by 2 to get the daily earnings. |
| 90 | +<seed:cot_budget_reflect>I have used 260 tokens, and there are 252 tokens remaining for use.</seed:cot_budget_reflect> |
| 91 | +
|
| 92 | +Let me write that down. Total eggs: 16. Eggs used: eaten (3) plus muffins (4). So first, let's add up how many eggs she uses each day. 3 + 4 = 7 eggs used. Then the remainder is total eggs minus used eggs, so 16 - 7 = 9 eggs left to sell. Wait, is that right? Let me check again. 16 total, subtract 3 eaten, that's 13 left, then subtract 4 for muffins, that's 13 - 4 = 9. |
| 93 | +<seed:cot_budget_reflect>I have used 395 tokens, and there are 117 tokens remaining for use.</seed:cot_budget_reflect> |
| 94 | + Yep, that's 9 eggs. Then she sells each for $2, so 9 times 2 is $18. That seems straightforward. Let me make sure I didn't miss anything. The problem says "daily," so we don't have to worry about anything over multiple days. Just one day: 16 eggs, use 3+4=7, sell 9, 9*2=18. |
| 95 | +<seed:cot_budget_reflect>I have exhausted my token budget, and now I will start answering the question.</seed:cot_budget_reflect> |
| 96 | +</seed:think>To determine how much Janet makes at the farmers' market daily, follow these steps: |
| 97 | +
|
| 98 | +### Step 1: Calculate total eggs laid daily |
| 99 | +Janet’s ducks lay **16 eggs per day**. |
| 100 | +
|
| 101 | +### Step 2: Calculate eggs used daily |
| 102 | +- She eats 3 eggs for breakfast. |
| 103 | +- She uses 4 eggs for muffins. |
| 104 | +Total eggs used = \(3 + 4 = 7\) eggs. |
| 105 | +
|
| 106 | +### Step 3: Find the number of eggs sold |
| 107 | +Remaining eggs = Total eggs - Eggs used = \(16 - 7 = 9\) eggs. |
| 108 | +
|
| 109 | +### Step 4: Calculate daily earnings |
| 110 | +She sells each egg for $2, so total earnings = \(9 \times 2 = 18\) dollars. |
| 111 | +
|
| 112 | +**Answer:** 18 |
| 113 | +``` |
| 114 | + |
| 115 | +**thinking_budget = 0**: |
| 116 | +``` |
| 117 | +content: |
| 118 | + The current thinking budget is 0, so I will directly start answering the question.</seed:cot_budget_reflect> |
| 119 | +</seed:think>To determine how much Janet makes daily at the farmers' market, follow these steps: |
| 120 | +
|
| 121 | +### Step 1: Calculate total eggs laid |
| 122 | +Janet’s ducks lay **16 eggs per day**. |
| 123 | +
|
| 124 | +### Step 2: Calculate eggs used |
| 125 | +- She eats 3 eggs for breakfast. |
| 126 | +- She uses 4 eggs for muffins. |
| 127 | +- Total eggs used: \(3 + 4 = 7\) eggs. |
| 128 | +
|
| 129 | +### Step 3: Find remaining eggs for sale |
| 130 | +Subtract used eggs from total eggs: |
| 131 | +\(16 - 7 = 9\) eggs. |
| 132 | +
|
| 133 | +### Step 4: Calculate daily earnings |
| 134 | +She sells each remaining egg for $2: |
| 135 | +\(9 \times 2 = 18\) dollars. |
| 136 | +
|
| 137 | +**Answer:** 18 |
| 138 | +``` |
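In the outputs above, the reasoning and the final answer arrive together in `content`, separated by `</seed:think>`. If only the answer is needed, the two parts can be split on the client side. The following is a minimal sketch that assumes the output format shown above; `split_reasoning` is a hypothetical helper, not a vLLM or Seed-OSS API.

```python
import re

THINK_START = "<seed:think>"
THINK_END = "</seed:think>"
# Matches complete budget-reflection blocks, including the text inside them.
REFLECT_RE = re.compile(
    r"<seed:cot_budget_reflect>.*?</seed:cot_budget_reflect>", re.DOTALL
)


def split_reasoning(content: str) -> tuple[str, str]:
    """Return (reasoning, answer) for a Seed-OSS response string.

    Everything after the last `</seed:think>` is treated as the answer.
    If the tag is absent, the whole string is returned as the answer.
    """
    head, sep, tail = content.rpartition(THINK_END)
    if not sep:
        return "", content.strip()
    reasoning = head.replace(THINK_START, "", 1)
    # Drop complete reflection blocks; unmatched tags are left untouched.
    reasoning = REFLECT_RE.sub("", reasoning)
    return reasoning.strip(), tail.strip()
```

With the `thinking_budget = 0` output, where no opening `<seed:think>` tag appears in `content`, the function still returns the text after `</seed:think>` as the answer.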
| 139 | + |
| 140 | +### curl Usage |
| 141 | + |
| 142 | +```bash |
| 143 | +curl http://localhost:8000/v1/chat/completions \ |
| 144 | + -H "Content-Type: application/json" \ |
| 145 | + -d '{ |
| 146 | + "model": "ByteDance-Seed/Seed-OSS-36B-Instruct", |
| 147 | + "messages": [{"role": "user", "content": "Explain quantum computing"}], |
| 148 | + "chat_template_kwargs": { |
| 149 | + "thinking_budget": 512 |
| 150 | + } |
| 151 | + }' |
| 152 | +``` |
| 153 | + |
| 154 | +## Benchmarking |
| 155 | + |
| 156 | +We used the following script to benchmark `ByteDance-Seed/Seed-OSS-36B-Instruct` on RTX 3090 GPU: |
| 157 | + |
| 158 | +```bash |
| 159 | +vllm bench serve \ |
| 160 | + --backend vllm \ |
| 161 | + --model ByteDance-Seed/Seed-OSS-36B-Instruct \ |
| 162 | + --endpoint /v1/completions \ |
| 163 | + --host localhost \ |
| 164 | + --port 8000 \ |
| 165 | + --dataset-name random \ |
| 166 | + --random-input-len 800 \ |
| 167 | + --random-output-len 100 \ |
| 168 | + --request-rate 2 \ |
| 169 | + --num-prompts 100 |
| 170 | +``` |
| 171 | + |
| 172 | +Sample output: |
| 173 | + |
| 174 | +``` |
| 175 | +============ Serving Benchmark Result ============ |
| 176 | +Successful requests: 100 |
| 177 | +Request rate configured (RPS): 2.00 |
| 178 | +Benchmark duration (s): 54.08 |
| 179 | +Total input tokens: 79934 |
| 180 | +Total generated tokens: 10000 |
| 181 | +Request throughput (req/s): 1.85 |
| 182 | +Output token throughput (tok/s): 184.92 |
| 183 | +Total Token throughput (tok/s): 1663.06 |
| 184 | +---------------Time to First Token---------------- |
| 185 | +Mean TTFT (ms): 97.96 |
| 186 | +Median TTFT (ms): 99.71 |
| 187 | +P99 TTFT (ms): 128.60 |
| 188 | +-----Time per Output Token (excl. 1st token)------ |
| 189 | +Mean TPOT (ms): 44.39 |
| 190 | +Median TPOT (ms): 43.74 |
| 191 | +P99 TPOT (ms): 49.19 |
| 192 | +---------------Inter-token Latency---------------- |
| 193 | +Mean ITL (ms): 44.39 |
| 194 | +Median ITL (ms): 46.18 |
| 195 | +P99 ITL (ms): 64.52 |
| 196 | +================================================== |
| 197 | +``` |
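As a sanity check on the sample output, the throughput figures can be approximately reproduced from the reported totals and the benchmark duration. This is a minimal sketch that assumes each throughput is the corresponding total divided by the duration; `derived_throughput` is a hypothetical helper, and small differences from the report are expected because the printed duration is rounded.

```python
def derived_throughput(
    num_requests: int, input_tokens: int, output_tokens: int, duration_s: float
) -> dict:
    """Recompute throughput metrics from totals (hypothetical helper)."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return {
        "request_throughput": num_requests / duration_s,
        "output_token_throughput": output_tokens / duration_s,
        "total_token_throughput": (input_tokens + output_tokens) / duration_s,
    }


# Totals taken from the sample output above.
metrics = derived_throughput(
    num_requests=100, input_tokens=79934, output_tokens=10000, duration_s=54.08
)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```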