server : return HTTP 400 if prompt exceeds context length #16486
base: master
Conversation
Hmm, that's strange, we have a specific error type for this: llama.cpp/tools/server/server.cpp, Lines 1268 to 1271 in 56b4795.
We also have this test case: llama.cpp/tools/server/tests/unit/test_chat_completion.py, Lines 393 to 408 in 56b4795.
I'm wondering which input leads to the 200 code that you mentioned?
The issue occurs only in streaming mode. In non-streaming mode it correctly returns 400.
In streaming mode, when the prompt exceeds the context length, the server returns an HTTP 200 status code with a JSON error in the body. This is very confusing and inconsistent with all other inference engines, which return an HTTP 4xx error in this case. This patch fixes the problem and makes the server return HTTP 400 in such cases.
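To make the difference concrete, here is a minimal, hypothetical client-side reproduction (not part of this PR). It assumes a llama.cpp server listening on localhost:8080 that was started with a deliberately small context window, and it uses the OpenAI-compatible /v1/chat/completions endpoint via the Python requests package:

```python
# Hypothetical reproduction sketch; assumes a llama.cpp server on localhost:8080
# started with a small context size (e.g. --ctx-size 256) and no context shift.
import requests

URL = "http://localhost:8080/v1/chat/completions"
# Oversized prompt: long enough to exceed the server's context window.
messages = [{"role": "user", "content": "word " * 100000}]

# Non-streaming: the request is rejected up front with HTTP 400.
r = requests.post(URL, json={"messages": messages, "stream": False})
print("non-streaming status:", r.status_code)  # expected: 400

# Streaming: before this patch the status is 200 and the error only appears
# as a JSON object inside the streamed response body; after the patch it is 400.
r = requests.post(URL, json={"messages": messages, "stream": True}, stream=True)
print("streaming status:", r.status_code)
for line in r.iter_lines():
    if line:
        print(line.decode())
```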
aac559d to 1d8b16c (Compare)
I have added a new test which covers exceeding the context in streaming mode.
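For reference, a self-contained sketch of what such a test asserts (the actual test in tools/server/tests uses the project's own fixtures and helpers, so this is only an approximation over plain HTTP):

```python
# Approximate, standalone version of the streaming exceed-context test.
# Assumes a llama.cpp server started with a small context size on localhost:8080.
import requests

def test_exceed_context_size_streaming():
    url = "http://localhost:8080/v1/chat/completions"
    oversized = [{"role": "user", "content": "token " * 10000}]

    res = requests.post(url, json={"messages": oversized, "stream": True}, stream=True)

    # With this patch the failure is reported via the HTTP status code,
    # matching the non-streaming behaviour (previously: 200 + JSON error body).
    assert res.status_code == 400
```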
if (!ctx_server.params_base.ctx_shift && n_prompt_tokens >= n_ctx_slot) {
    json error_data = format_error_response("the request exceeds the available context size. try increasing the context size or enable context shift", ERROR_TYPE_EXCEED_CONTEXT_SIZE);
The prompt truncation functionality is being removed in #16391:

So there is no longer a need to check ctx_shift here, and accordingly no need to suggest enabling it in the error message.