
Eval bug: Official gpt-oss-120b model output has dropped/missing tokens, can't count to 100 #16263

@woof-dog

Description


Name and Version

```
version: 6586 (835b2b9)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
```

Operating systems

Linux

GGML backends

CUDA

Hardware

Ryzen + RTX 3090. Also tested on a friend's machine with similar results.

Models

gpt-oss-120b-GGUF in mxfp4 straight from ggml-org: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF
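
For reference, a minimal sketch of pulling the same quantization with the `huggingface_hub` Python package; the repo id comes from the link above, everything else is just an illustration:

```python
# Sketch: download only the mxfp4 GGUF shards of the ggml-org model repo.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ggml-org/gpt-oss-120b-GGUF",
    allow_patterns=["*mxfp4*.gguf"],  # keep only the mxfp4 split files
)
print("shards downloaded to:", local_dir)
```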

Problem description & steps to reproduce

When I run llama-server with standard arguments (see below), tokens are dropped from the output, which severely affects coding tasks:

```bash
./build/bin/llama-server --model models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -c 32768 -ngl 99 -ot ".ffn_(up|down)_exps.=CPU" --jinja
```
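
To rule out the webui, the issue can also be reproduced directly against the server's OpenAI-compatible endpoint; a rough sketch (assumptions: the default 127.0.0.1:8080 host/port, the `requests` package, and the counting prompt used below):

```python
# Sketch: send the counting prompt to llama-server's /v1/chat/completions endpoint.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "gpt-oss-120b",  # placeholder name for the single loaded model
        "messages": [
            {
                "role": "user",
                "content": "Count from 1 to 100 and output in a code block in groups of 10.",
            }
        ],
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```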

Output for the prompt `Count from 1 to 100 and output in a code block in groups of 10.`:

```
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
2122 23  25   28  30
31  33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
1 2  4 5 6 7 8 9 10 12 1314 15 16  18  20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 4243 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81  83 84 85 86 87 88  90
91  93 94 95 96  98 99 
1 2 34 5  7 8 9 10
11 12 13 1415 1617 1819 20
21 22 23 24 25 26 27 28 29 3031 3233 3435 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
```
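
As a rough, hypothetical way to quantify the damage, a small helper can diff the expected range against whatever numbers actually survived in the response (fused runs like `2122` count as a single number, so both 21 and 22 are reported missing):

```python
# Sketch: list which numbers in [lo, hi] never appear as standalone digit runs.
import re

def missing_numbers(text: str, lo: int = 1, hi: int = 100) -> list[int]:
    seen = {int(m) for m in re.findall(r"\d+", text)}
    return [n for n in range(lo, hi + 1) if n not in seen]

# Excerpt from the first broken run above.
excerpt = "1 2 3 4 5 6 7 8 9 10\n11 12 13 14 15 16 17 18 19 20\n2122 23  25   28  30"
print(missing_numbers(excerpt, 1, 30))  # -> [21, 22, 24, 26, 27, 29]
```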

Prompt: `Write a python hello world script`

Response after reasoning (notice the dropped tokens in the module docstring):

Here’s the classic “Hello, World!” program in Python.  
Save the code below to a file (e.g., `hello_world.py`) and run it with `python hello_world.py`.

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
hello_world.py

A minimal Python script prints "Hello, World!" to the.
"""

def main() -> None:
    """Entry point of the script."""
    print("Hello, World!")


if __name__ == "__main__":
    main()
```

**How to run**

```bash
$ python hello_world.py
Hello, World!
```

The shebang line (`#!/usr/bin/env python3`) lets you execute the script directly on Unix‑like systems if you make it executable:

```bash
$ chmod +x hello_world.py
$ ./hello_world.py
Hello, World!
```

This happens via the server API (I use streaming) and also in the webui.
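
For reference, this is roughly how I consume the stream (a sketch that assumes llama-server's OpenAI-compatible SSE format on the default 127.0.0.1:8080 and uses only the `requests` package):

```python
# Sketch: stream a chat completion from llama-server and print content deltas as they arrive.
import json
import requests

with requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Write a python hello world script"}],
        "stream": True,
    },
    stream=True,
    timeout=600,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        print(delta.get("content") or "", end="", flush=True)
    print()
```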


Possibly related to vllm-project/vllm#23335

First Bad Commit

No response

Relevant log output

```
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  params_from_: Chat format: GPT-OSS
slot get_availabl: id  0 | task 2067 | selected slot by lcs similarity, lcs_len = 64, similarity = 0.165 (> 0.100 thold)
slot launch_slot_: id  0 | task 2371 | processing task
slot update_slots: id  0 | task 2371 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 73
slot update_slots: id  0 | task 2371 | kv cache rm [64, end)
slot update_slots: id  0 | task 2371 | prompt processing progress, n_past = 73, n_tokens = 9, progress = 0.123288
slot update_slots: id  0 | task 2371 | prompt done, n_past = 73, n_tokens = 9
slot update_slots: id  0 | task 2371 | SWA checkpoint erase, pos_min = 0, pos_max = 85, size = 3.025 MiB
slot update_slots: id  0 | task 2371 | SWA checkpoint create, pos_min = 0, pos_max = 72, size = 2.568 MiB, total = 3/3 (8.617 MiB)
slot      release: id  0 | task 2371 | stop processing: n_past = 299, truncated = 0
slot print_timing: id  0 | task 2371 | 
prompt eval time =     164.57 ms /     9 tokens (   18.29 ms per token,    54.69 tokens per second)
       eval time =    6640.77 ms /   227 tokens (   29.25 ms per token,    34.18 tokens per second)
      total time =    6805.34 ms /   236 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
```

Labels

bug, critical severity, server/webui
