server: fix OpenAI Streaming API compatibility for usage statistics in chat streams #15444
Conversation
force-pushed from 9d92f7b to 6c37034
// OpenAI API spec for chat.completion.chunks specifies an empty `choices` array for the last chunk when including usage
// https://platform.openai.com/docs/api-reference/chat_streaming/streaming#chat_streaming/streaming-choices
deltas.push_back({
    {"choices", json::array()},
    {"created", t},
    {"id", oaicompat_cmpl_id},
    {"model", oaicompat_model},
    {"system_fingerprint", build_info},
    {"object", "chat.completion.chunk"},
    {"usage", json {
        {"completion_tokens", n_decoded},
        {"prompt_tokens", n_prompt_tokens},
The only (non-test) PR change: adding an extra chunk with an empty choices array and setting usage stats there.
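For illustration only, a minimal client-side sketch of what this means on the wire, assuming the official openai Python client pointed at a local llama-server instance (the base URL, API key, and model name below are placeholders, not taken from this PR): the usage statistics arrive on a trailing chunk whose choices array is empty, so a client should not index into choices unconditionally.

from openai import OpenAI

# Placeholder endpoint and credentials for a local llama-server OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

stream = client.chat.completions.create(
    model="placeholder-model",  # placeholder, not from this PR
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    stream_options={"include_usage": True},  # OpenAI API opt-in; the PR notes usage is on by default in llama-server
)

for chunk in stream:
    if chunk.choices:  # normal delta chunks carry content in choices
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage is not None:  # final chunk: empty choices, usage attached
        print(f"\nprompt_tokens={chunk.usage.prompt_tokens} completion_tokens={chunk.usage.completion_tokens}")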
Signposting that this change looks to be backwards-compatible with the bench script, which checks whether a chunk contains a usage field independent of the choices content:
llama.cpp/tools/server/bench/script.js
Lines 128 to 136 in 1d36b36
if (chunk.usage) {
    prompt_tokens = chunk.usage.prompt_tokens
    llamacpp_prompt_tokens.add(prompt_tokens)
    llamacpp_prompt_tokens_total_counter.add(prompt_tokens)

    completions_tokens = chunk.usage.completion_tokens
    llamacpp_completion_tokens.add(completions_tokens)
    llamacpp_completion_tokens_total_counter.add(completions_tokens)
}
force-pushed from 6c37034 to d4cca6b
force-pushed from c4f2bc1 to ba37940
force-pushed from ba37940 to 3ec1bc7
Hello, I am getting
Could it be related to this commit? Thanks.
Most likely! My fault for not catching what other compatibility was affected beyond the tests and direct calls to the model. It is probably due to this line:
I can make a PR later today (assuming no one else gets to it first).
The "choices" in the last chunk can be empty, so save the last non-empty in order to record the streaming response properly. Without this patch we don't properly record a streaming response after llama.cpp has been bumped to include ggml-org/llama.cpp#15444. Signed-off-by: Dorin Geman <[email protected]>
Fixes OpenAI Streaming API spec compatibility for chat streams which include usage statistics (the default in llama-server).
Closes: server: Usage statistics in chat streams added to slightly different chunk from OpenAI Streaming API #15443