Description
Name and Version
llama-server --version
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
version: 5630 (4c763c8d)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
/app/llama-server --port 3080 -m /data/model/gemma-3-1b-it-q4_0.gguf
Problem description & steps to reproduce
When using the OpenAI-compatible HTTP interface with streaming enabled, the server returns HTTP/1.1 200 OK
even for invalid inputs, and only reports the error later as part of the streamed response.
This makes automating clients tricky (see the client-side sketch below) and, more importantly, does not match the behavior of other inference servers.
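For illustration, here is a minimal sketch of what a streaming client currently has to do to detect such failures. The requests-based parsing below is illustrative, not part of llama.cpp; the error: line format is taken from the curl output further down:

import json
import requests

resp = requests.post(
    "http://127.0.0.1:3080/chat/completions",
    json={
        "model": "gemma",
        "stream": True,
        "messages": [{"role": "user", "content": "hi " * 100000}],
    },
    stream=True,
)
print(resp.status_code)  # 200, even though the prompt is far too large

for raw in resp.iter_lines(decode_unicode=True):
    if not raw:
        continue
    if raw.startswith("error: "):
        # The error arrives as a non-standard "error:" field inside the 200
        # stream, so the client must sniff the body instead of checking the
        # HTTP status code.
        err = json.loads(raw[len("error: "):])
        raise RuntimeError(f"server error {err['code']}: {err['message']}")
    if raw == "data: [DONE]":
        break
    if raw.startswith("data: "):
        chunk = json.loads(raw[len("data: "):])
        # normal token-delta handling would go here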
A simple way to reproduce this is to load a model that has a limited context window (e.g., gemma):
/app/llama-server --port 3080 -m /data/model/gemma-3-1b-it-q4_0.gguf
then generate a rather large prompt using a chat completion with stream=True:
import json

d = {
    "temperature": 0.0,
    "n": 1,
    "stop": ["End"],
    "stream": True,
    "model": "gemma",
    "messages": [
        {
            "role": "user",
            # Repeat the sentence to blow well past the model's context window.
            "content": "The quick brown fox shows that llama.cpp's OpenAI interface does something weird. \n"
            * 10000,
        }
    ],
}

# Write to /tmp/data.json so the path matches the curl command below.
with open("/tmp/data.json", encoding="ascii", mode="w") as fp:
    json.dump(d, fp=fp)
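The repeated sentence yields roughly 190k prompt tokens (n_prompt_tokens = 190008 in the log below), far beyond the 4096-token slot context (n_ctx_slot = 4096).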
Then send the generated body to the server:
curl --location 'http://127.0.0.1:3080/chat/completions' --header 'Content-Type: application/json' --header 'Accept: application/json' --data @/tmp/data.json -v
Expected behavior: the server returns HTTP 400,
saying that the request exceeds the available context size.
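For example, an illustrative (hypothetical) 400 response, reusing the error object the server already emits in the stream, could look like:

< HTTP/1.1 400 Bad Request
< Content-Type: application/json
<
{"code":400,"message":"the request exceeds the available context size. try increasing the context size or enable context shift","type":"invalid_request_error"}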
Actual behavior: the server returns HTTP 200
but then later streams the error as a bare error: line inside the event stream:
curl ...
[...]
< HTTP/1.1 200 OK
< Keep-Alive: timeout=5, max=100
< Content-Type: text/event-stream
< Server: llama.cpp
< Transfer-Encoding: chunked
< Access-Control-Allow-Origin:
<
error: {"code":400,"message":"the request exceeds the available context size. try increasing the context size or enable context shift","type":"invalid_request_error"}
data: [DONE]
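Note that the server's own access log (below) also records the request as 200 (request: POST /chat/completions 127.0.0.1 200), even though send_error fired for the task.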
First Bad Commit
No response
Relevant log output
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 201 | processing task
slot update_slots: id 0 | task 201 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 190008
slot release: id 0 | task 201 | stop processing: n_past = 0, truncated = 0
srv send_error: task id = 201, error: the request exceeds the available context size. try increasing the context size or enable context shift
srv update_slots: no tokens to decode
srv update_slots: all slots are idle
srv cancel_tasks: cancel task, id_task = 201
srv log_server_r: request: POST /chat/completions 127.0.0.1 200
srv update_slots: all slots are idle