Labels
bug, critical severity, server/webui
Description
Name and Version
version: 6586 (835b2b9)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen + RTX 3090. Also tested on a friend's machine with similar results.
Models
gpt-oss-120b-GGUF in mxfp4 straight from ggml-org: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF
Problem description & steps to reproduce
When I run llama-server with the standard arguments below, tokens are dropped from the output, which severely affects coding tasks:
```bash
./build/bin/llama-server --model models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -c 32768 -ngl 99 -ot ".ffn_(up|down)_exps.=CPU" --jinja
```
Output for the prompt `Count from 1 to 100 and output in a code block in groups of 10.`:
```
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
2122 23 25 28 30
31 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
1 2 4 5 6 7 8 9 10 12 1314 15 16 18 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 4243 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 83 84 85 86 87 88 90
91 93 94 95 96 98 99
1 2 34 5 7 8 9 10
11 12 13 1415 1617 1819 20
21 22 23 24 25 26 27 28 29 3031 3233 3435 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
```
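To make the drops easier to spot, a response can be checked automatically for missing numbers. The snippet below is a rough sketch, not part of the original report: it assumes the server is running on the default 127.0.0.1:8080 and uses the OpenAI-compatible `/v1/chat/completions` endpoint.

```python
# Rough sketch: query the local llama-server and report which numbers
# from 1..100 never appear in the reply. Assumes the default host/port
# and the OpenAI-compatible chat completions endpoint.
import re
import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumed default address

resp = requests.post(URL, json={
    "messages": [{"role": "user", "content":
                  "Count from 1 to 100 and output in a code block in groups of 10."}],
    "stream": False,
})
resp.raise_for_status()
text = resp.json()["choices"][0]["message"]["content"]

seen = {int(n) for n in re.findall(r"\d+", text)}
missing = [n for n in range(1, 101) if n not in seen]
print("missing numbers:", missing or "none")
```

Note that fused outputs such as `2122` parse as a single number, so both 21 and 22 are reported as missing, which matches the dropped-separator behaviour shown above.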
Prompt: `Write a python hello world script`
Response after reasoning (notice the missing token in the comment):
Here’s the classic “Hello, World!” program in Python.
Save the code below to a file (e.g., `hello_world.py`) and run it with `python hello_world.py`.
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
hello_world.py
A minimal Python script prints "Hello, World!" to the.
"""
def main() -> None:
    """Entry point of the script."""
    print("Hello, World!")


if __name__ == "__main__":
    main()
```
**How to run**
```bash
$ python hello_world.py
Hello, World!
```
The shebang line (`#!/usr/bin/env python3`) lets you execute the script directly on Unix‑like systems if you make it executable:
```bash
$ chmod +x hello_world.py
$ ./hello_world.py
Hello, World!
```
This happens via the server API (with streaming enabled) and also in the web UI.
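The streaming case can be reproduced outside the web UI roughly as follows. This is a sketch that assumes the default 127.0.0.1:8080 address and the OpenAI-compatible SSE format (`data: {...}` lines terminated by `data: [DONE]`); any dropped tokens can be seen directly as the deltas arrive.

```python
# Rough sketch: stream a chat completion from the local llama-server and
# print each delta verbatim as it arrives.
import json
import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumed default address

with requests.post(URL, json={
    "messages": [{"role": "user", "content": "Write a python hello world script"}],
    "stream": True,
}, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        # Some chunks carry no content (e.g. role or finish markers), so
        # fall back to an empty string.
        print(delta.get("content") or "", end="", flush=True)
    print()
```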
Possibly related to vllm-project/vllm#23335
First Bad Commit
No response
Relevant log output
```
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  params_from_: Chat format: GPT-OSS
slot get_availabl: id 0 | task 2067 | selected slot by lcs similarity, lcs_len = 64, similarity = 0.165 (> 0.100 thold)
slot launch_slot_: id 0 | task 2371 | processing task
slot update_slots: id 0 | task 2371 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 73
slot update_slots: id 0 | task 2371 | kv cache rm [64, end)
slot update_slots: id 0 | task 2371 | prompt processing progress, n_past = 73, n_tokens = 9, progress = 0.123288
slot update_slots: id 0 | task 2371 | prompt done, n_past = 73, n_tokens = 9
slot update_slots: id 0 | task 2371 | SWA checkpoint erase, pos_min = 0, pos_max = 85, size = 3.025 MiB
slot update_slots: id 0 | task 2371 | SWA checkpoint create, pos_min = 0, pos_max = 72, size = 2.568 MiB, total = 3/3 (8.617 MiB)
slot release: id 0 | task 2371 | stop processing: n_past = 299, truncated = 0
slot print_timing: id 0 | task 2371 |
prompt eval time =     164.57 ms /     9 tokens (   18.29 ms per token,    54.69 tokens per second)
       eval time =    6640.77 ms /   227 tokens (   29.25 ms per token,    34.18 tokens per second)
      total time =    6805.34 ms /   236 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
```