Description
Name and Version
llama-server --version
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
version: 5630 (4c763c8d)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
/app/llama-server --port 3080 -m /data/model/gemma-3-1b-it-q4_0.gguf
Problem description & steps to reproduce
When using the OpenAI-compatible HTTP interface with streaming enabled, the server returns HTTP/1.1 200 OK
even for invalid inputs, and only reports the error later as part of the streamed response.
This makes automating clients tricky (see the client-side sketch below) and, more importantly, does not match the behavior of other inference servers.
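For illustration, here is a minimal sketch of what a streaming client currently has to do to detect such failures. The requests-based parsing below is illustrative, not part of llama.cpp; the error: line format is taken from the curl output further down:

import json
import requests

resp = requests.post(
    "http://127.0.0.1:3080/chat/completions",
    json={
        "model": "gemma",
        "stream": True,
        "messages": [{"role": "user", "content": "hi " * 100000}],
    },
    stream=True,
)
print(resp.status_code)  # 200, even though the prompt is far too large

for raw in resp.iter_lines(decode_unicode=True):
    if not raw:
        continue
    if raw.startswith("error: "):
        # The error arrives as a non-standard "error:" field inside the 200
        # stream, so the client must sniff the body instead of checking the
        # HTTP status code.
        err = json.loads(raw[len("error: "):])
        raise RuntimeError(f"server error {err['code']}: {err['message']}")
    if raw == "data: [DONE]":
        break
    if raw.startswith("data: "):
        chunk = json.loads(raw[len("data: "):])
        # normal token-delta handling would go here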
A simple way to reproduce this is to load a model that has a limited context window (e.g., gemma):
/app/llama-server --port 3080 -m /data/model/gemma-3-1b-it-q4_0.gguf
then generate a rather large prompt using a chat completion with stream=True:
import json

d = {
    "temperature": 0.0,
    "n": 1,
    "stop": ["End"],
    "stream": True,
    "model": "gemma",
    "messages": [
        {
            "role": "user",
            # Repeat the sentence to blow well past the model's context window.
            "content": "The quick brown fox shows that llama.cpp's OpenAI interface does something weird. \n"
            * 10000,
        }
    ],
}

# Write to /tmp/data.json so the path matches the curl command below.
with open("/tmp/data.json", encoding="ascii", mode="w") as fp:
    json.dump(d, fp=fp)
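The repeated sentence yields roughly 190k prompt tokens (n_prompt_tokens = 190008 in the log below), far beyond the 4096-token slot context (n_ctx_slot = 4096).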
Then send the generated body to the server:
curl --location 'http://127.0.0.1:3080/chat/completions' --header 'Content-Type: application/json' --header 'Accept: application/json' --data @/tmp/data.json -v
Expected behavior: the server returns HTTP 400,
saying that the request exceeds the available context size.
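For example, an illustrative (hypothetical) 400 response, reusing the error object the server already emits in the stream, could look like:

< HTTP/1.1 400 Bad Request
< Content-Type: application/json
<
{"code":400,"message":"the request exceeds the available context size. try increasing the context size or enable context shift","type":"invalid_request_error"}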
Actual behavior: the server returns HTTP 200
but then later streams the error as a bare error: line inside the event stream:
curl ...
[...]
< HTTP/1.1 200 OK
< Keep-Alive: timeout=5, max=100
< Content-Type: text/event-stream
< Server: llama.cpp
< Transfer-Encoding: chunked
< Access-Control-Allow-Origin:
<
error: {"code":400,"message":"the request exceeds the available context size. try increasing the context size or enable context shift","type":"invalid_request_error"}
data: [DONE]
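Note that the server's own access log (below) also records the request as 200 (request: POST /chat/completions 127.0.0.1 200), even though send_error fired for the task.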
First Bad Commit
No response
Relevant log output
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 201 | processing task
slot update_slots: id 0 | task 201 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 190008
slot release: id 0 | task 201 | stop processing: n_past = 0, truncated = 0
srv send_error: task id = 201, error: the request exceeds the available context size. try increasing the context size or enable context shift
srv update_slots: no tokens to decode
srv update_slots: all slots are idle
srv cancel_tasks: cancel task, id_task = 201
srv log_server_r: request: POST /chat/completions 127.0.0.1 200
srv update_slots: all slots are idle