rgerganov
Collaborator

In streaming mode, when the prompt exceeds the context length, the server returns an HTTP 200 status code with a JSON error in the body. This is very confusing and inconsistent with all other inference engines, which return an HTTP 4xx error in this case.

This patch fixes this problem and makes the server return HTTP 400 in such cases.
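
To illustrate why the old behavior is awkward for clients, here is a minimal client-side sketch. The helper name `extract_error` and the body shape (a JSON object with a top-level `error` key) are assumptions made for illustration; only the status codes come from the description above.

```python
import json
from typing import Optional


def extract_error(status_code: int, body_text: str) -> Optional[dict]:
    """Return the error object of a response, or None if it looks successful.

    Hypothetical helper: assumes errors are a JSON object with an "error" key.
    """
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        payload = None  # e.g. a normal event stream, not one JSON document

    error = payload.get("error") if isinstance(payload, dict) else None

    if status_code >= 400:
        # With a 4xx status the failure is visible without parsing the body.
        return error or {"message": body_text}

    # With a 200 status the client only notices the failure if it also
    # inspects the body, which is the confusing case described above.
    return error
```

With a 4xx status, a client can fail fast on the status line alone; with 200 it has to parse the body before it can trust the response.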

@github-actions github-actions bot added examples python python script changes server labels Oct 9, 2025
@ngxson
Collaborator

ngxson commented Oct 9, 2025

Hmm, that's strange. We have a specific error type for this, ERROR_TYPE_EXCEED_CONTEXT_SIZE, and its error code is 400:

    case ERROR_TYPE_EXCEED_CONTEXT_SIZE:
        type_str = "exceed_context_size_error";
        code = 400;
        break;

We also have this test case:

    def test_context_size_exceeded():
        global server
        server.start()
        res = server.make_request("POST", "/chat/completions", data={
            "messages": [
                {"role": "system", "content": "Book"},
                {"role": "user", "content": "What is the best book"},
            ] * 100,  # make the prompt too long
        })
        assert res.status_code == 400
        assert "error" in res.body
        assert res.body["error"]["type"] == "exceed_context_size_error"
        assert res.body["error"]["n_prompt_tokens"] > 0
        assert server.n_ctx is not None
        assert server.n_slots is not None
        assert res.body["error"]["n_ctx"] == server.n_ctx // server.n_slots
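
The last assertion implies that the limit reported in the error is the per-slot context, not the total. A small sketch of that arithmetic follows. The function names and the sample numbers are hypothetical, and the `>=` comparison is taken from the review snippet quoted later in this thread rather than from the test itself.

```python
from typing import Optional


def slot_context_size(n_ctx: int, n_slots: int) -> int:
    """Context available to one slot, using the same integer division as the test."""
    return n_ctx // n_slots


def context_error_fields(n_prompt_tokens: int, n_ctx: int, n_slots: int) -> Optional[dict]:
    """Return the error fields the test checks, or None if the prompt fits.

    Illustrative only: the real server builds its error in C++.
    """
    n_ctx_slot = slot_context_size(n_ctx, n_slots)
    if n_prompt_tokens < n_ctx_slot:
        return None
    return {
        "type": "exceed_context_size_error",
        "n_prompt_tokens": n_prompt_tokens,
        "n_ctx": n_ctx_slot,  # per-slot limit, not the total context
    }


# Hypothetical numbers: a 256-token context shared by 2 slots gives 128 per slot.
print(context_error_fields(n_prompt_tokens=200, n_ctx=256, n_slots=2))
```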

I'm wondering which input leads to the 200 code that you mentioned?

@rgerganov
Collaborator Author

The issue occurs only in streaming mode. In non-streaming mode, the server correctly returns 400.
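
One general way to make both modes agree is to run the length check before any part of the response is committed, since an HTTP status line cannot be changed once the body has started streaming. The sketch below illustrates that ordering only. It is an assumption about the approach, not the code from this patch, and `handle_completion` and the chunk contents are made up.

```python
from typing import Iterator, Tuple


def handle_completion(n_prompt_tokens: int, n_ctx_slot: int, stream: bool) -> Tuple[int, Iterator[str]]:
    """Toy handler returning (status_code, body_chunks).

    The context check runs first, so streaming and non-streaming requests
    fail with the same status code.
    """
    if n_prompt_tokens >= n_ctx_slot:
        error = '{"error": {"type": "exceed_context_size_error"}}'
        return 400, iter([error])

    if stream:
        # Only now is the 200 status committed and the stream started.
        return 200, (f"data: chunk {i}\n\n" for i in range(3))

    return 200, iter(['{"content": "..."}'])


status, body = handle_completion(n_prompt_tokens=200, n_ctx_slot=128, stream=True)
print(status, list(body))
```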

@rgerganov rgerganov force-pushed the srv-ctx-exceed branch 2 times, most recently from aac559d to 1d8b16c Compare October 9, 2025 15:41
@rgerganov
Collaborator Author

I have added a new test which covers exceeding the context in streaming mode.

Comment on lines +4649 to +4650
    if (!ctx_server.params_base.ctx_shift && n_prompt_tokens >= n_ctx_slot) {
        json error_data = format_error_response("the request exceeds the available context size. try increasing the context size or enable context shift", ERROR_TYPE_EXCEED_CONTEXT_SIZE);
Member

The prompt truncation functionality is being removed in #16391:

image

So there is no longer any need to check ctx_shift here and, accordingly, no need to suggest enabling it in the error message.
