Labels
bug, critical severity, server/webui
Description
Name and Version
version: 6586 (835b2b9)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen + RTX 3090. Also tested on a friend's machine with similar results.
Models
gpt-oss-120b-GGUF in mxfp4 straight from ggml-org: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF
Problem description & steps to reproduce
When I run llama-server with the standard arguments below, tokens are dropped from the output, which severely affects coding tasks:
```bash
./build/bin/llama-server --model models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -c 32768 -ngl 99 -ot ".ffn_(up|down)_exps.=CPU" --jinja
```
Output for the prompt `Count from 1 to 100 and output in a code block in groups of 10.`:
```
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
2122 23 25 28 30
31 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
1 2 4 5 6 7 8 9 10 12 1314 15 16 18 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 4243 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 83 84 85 86 87 88 90
91 93 94 95 96 98 99
1 2 34 5 7 8 9 10
11 12 13 1415 1617 1819 20
21 22 23 24 25 26 27 28 29 3031 3233 3435 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
```
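To make the drops easier to spot, a response can be checked automatically for missing numbers. The snippet below is a rough sketch, not part of the original report: it assumes the server is running on the default 127.0.0.1:8080 and uses the OpenAI-compatible `/v1/chat/completions` endpoint.

```python
# Rough sketch: query the local llama-server and report which numbers
# from 1..100 never appear in the reply. Assumes the default host/port
# and the OpenAI-compatible chat completions endpoint.
import re
import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumed default address

resp = requests.post(URL, json={
    "messages": [{"role": "user", "content":
                  "Count from 1 to 100 and output in a code block in groups of 10."}],
    "stream": False,
})
resp.raise_for_status()
text = resp.json()["choices"][0]["message"]["content"]

seen = {int(n) for n in re.findall(r"\d+", text)}
missing = [n for n in range(1, 101) if n not in seen]
print("missing numbers:", missing or "none")
```

Note that fused outputs such as `2122` parse as a single number, so both 21 and 22 are reported as missing, which matches the dropped-separator behaviour shown above.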
Prompt: `Write a python hello world script`
Response after reasoning (notice the missing token in the comment):
Here’s the classic “Hello, World!” program in Python.
Save the code below to a file (e.g., `hello_world.py`) and run it with `python hello_world.py`.
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
hello_world.py
A minimal Python script prints "Hello, World!" to the.
"""
def main() -> None:
    """Entry point of the script."""
    print("Hello, World!")


if __name__ == "__main__":
    main()
```
**How to run**
```bash
$ python hello_world.py
Hello, World!
```
The shebang line (`#!/usr/bin/env python3`) lets you execute the script directly on Unix‑like systems if you make it executable:
```bash
$ chmod +x hello_world.py
$ ./hello_world.py
Hello, World!
```
This happens via the server API (with streaming enabled) and also in the web UI.
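The streaming case can be reproduced outside the web UI roughly as follows. This is a sketch that assumes the default 127.0.0.1:8080 address and the OpenAI-compatible SSE format (`data: {...}` lines terminated by `data: [DONE]`); any dropped tokens can be seen directly as the deltas arrive.

```python
# Rough sketch: stream a chat completion from the local llama-server and
# print each delta verbatim as it arrives.
import json
import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumed default address

with requests.post(URL, json={
    "messages": [{"role": "user", "content": "Write a python hello world script"}],
    "stream": True,
}, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        # Some chunks carry no content (e.g. role or finish markers), so
        # fall back to an empty string.
        print(delta.get("content") or "", end="", flush=True)
    print()
```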
Possibly related to vllm-project/vllm#23335
First Bad Commit
No response
Relevant log output
```
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  params_from_: Chat format: GPT-OSS
slot get_availabl: id 0 | task 2067 | selected slot by lcs similarity, lcs_len = 64, similarity = 0.165 (> 0.100 thold)
slot launch_slot_: id 0 | task 2371 | processing task
slot update_slots: id 0 | task 2371 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 73
slot update_slots: id 0 | task 2371 | kv cache rm [64, end)
slot update_slots: id 0 | task 2371 | prompt processing progress, n_past = 73, n_tokens = 9, progress = 0.123288
slot update_slots: id 0 | task 2371 | prompt done, n_past = 73, n_tokens = 9
slot update_slots: id 0 | task 2371 | SWA checkpoint erase, pos_min = 0, pos_max = 85, size = 3.025 MiB
slot update_slots: id 0 | task 2371 | SWA checkpoint create, pos_min = 0, pos_max = 72, size = 2.568 MiB, total = 3/3 (8.617 MiB)
slot release: id 0 | task 2371 | stop processing: n_past = 299, truncated = 0
slot print_timing: id 0 | task 2371 |
prompt eval time =     164.57 ms /     9 tokens (   18.29 ms per token,    54.69 tokens per second)
       eval time =    6640.77 ms /   227 tokens (   29.25 ms per token,    34.18 tokens per second)
      total time =    6805.34 ms /   236 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
```