
Conversation

@TeoZosa (Contributor) commented Aug 20, 2025

Fixes OpenAI Streaming API spec compatibility for chat completion streams that include usage statistics (the default in llama-server).
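
For illustration, with this change the tail of a streamed chat completion should look roughly like the following (field values here are made up; the shape follows the OpenAI spec and the hunk discussed below):

    data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"created":1755676800,"id":"chatcmpl-abc123","model":"gemma-3n-E4B-it","system_fingerprint":"b1234-abcdef0","object":"chat.completion.chunk"}
    data: {"choices":[],"created":1755676800,"id":"chatcmpl-abc123","model":"gemma-3n-E4B-it","system_fingerprint":"b1234-abcdef0","object":"chat.completion.chunk","usage":{"completion_tokens":12,"prompt_tokens":34,"total_tokens":46}}
    data: [DONE]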

Closes:

@TeoZosa requested a review from ngxson as a code owner on August 20, 2025 07:58
@TeoZosa force-pushed the server/openai-api-spec-compatibility/chat-completion-chunk-usage-statistics-chunk branch from 9d92f7b to 6c37034 on August 20, 2025 07:59
Comment on lines +915 to 927

// OpenAI API spec for chat.completion.chunks specifies an empty `choices` array for the last chunk when including usage
// https://platform.openai.com/docs/api-reference/chat_streaming/streaming#chat_streaming/streaming-choices
deltas.push_back({
{"choices", json::array()},
{"created", t},
{"id", oaicompat_cmpl_id},
{"model", oaicompat_model},
{"system_fingerprint", build_info},
{"object", "chat.completion.chunk"},
{"usage", json {
{"completion_tokens", n_decoded},
{"prompt_tokens", n_prompt_tokens},
@TeoZosa (Contributor, Author) commented Aug 20, 2025

The only (non-test) PR change: adding an extra chunk with an empty choices array and setting usage stats there.

@TeoZosa (Contributor, Author) commented Aug 20, 2025

Signposting that this change looks to be backwards-compatible with the bench script, which checks whether a chunk contains a usage field independently of the choices content:

    if (chunk.usage) {
        prompt_tokens = chunk.usage.prompt_tokens
        llamacpp_prompt_tokens.add(prompt_tokens)
        llamacpp_prompt_tokens_total_counter.add(prompt_tokens)
        completions_tokens = chunk.usage.completion_tokens
        llamacpp_completion_tokens.add(completions_tokens)
        llamacpp_completion_tokens_total_counter.add(completions_tokens)
    }

@TeoZosa force-pushed the server/openai-api-spec-compatibility/chat-completion-chunk-usage-statistics-chunk branch from 6c37034 to d4cca6b on August 20, 2025 08:15
The github-actions bot added the python (python script changes) label on Aug 20, 2025
@TeoZosa force-pushed the server/openai-api-spec-compatibility/chat-completion-chunk-usage-statistics-chunk branch 3 times, most recently from c4f2bc1 to ba37940 on August 20, 2025 09:59
@TeoZosa force-pushed the server/openai-api-spec-compatibility/chat-completion-chunk-usage-statistics-chunk branch from ba37940 to 3ec1bc7 on August 20, 2025 11:58
@ngxson merged commit 1bc664a into ggml-org:master on Aug 20, 2025
49 checks passed
@h9j6k commented Aug 21, 2025

Hello,

I am getting the error can't access property "delta", ie.choices[0] is undefined in a browser pop-up message when using llama-server as before.

./llama-server -m ~/llm/models/google_gemma-3n-E4B-it-Q4_0.gguf -ot per_layer_token_embd.weight=CPU --host HOSTNAME --port PORT -c 8192 -b 8192 -e -ngl 99 -t 8 -n -1 --no-mmap -fa --jinja

Could it be related to this commit? Thanks.

@TeoZosa (Contributor, Author) commented Aug 21, 2025

Most likely! My fault for not catching what other compatibility was affected beyond the tests and direct model calls. The culprit is probably this line:

I can make a PR later today (assuming no one else gets to it first).
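
For anyone hitting the webui error above before a follow-up lands, the client-side fix amounts to a guard like the sketch below (TypeScript, hypothetical names only; this is not the actual webui code):

    // Hedged sketch: skip delta handling for the new usage-only chunk,
    // whose choices array is empty.
    interface Delta { content?: string }
    interface Choice { delta?: Delta }
    interface Chunk {
      choices: Choice[]
      usage?: { prompt_tokens: number; completion_tokens: number; total_tokens: number }
    }

    function handleChunk(chunk: Chunk, append: (text: string) => void) {
      if (chunk.choices.length > 0) { // guard: choices may now be empty
        const content = chunk.choices[0].delta?.content
        if (content) append(content)
      }
      // usage (if present) can still be read regardless of choices
      if (chunk.usage) console.debug("usage:", chunk.usage)
    }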

qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 22, 2025
doringeman added a commit to doringeman/model-runner that referenced this pull request Sep 11, 2025
The "choices" in the last chunk can be empty, so save the last non-empty in order to record the streaming response properly. Without this patch we don't properly record a streaming response after llama.cpp has been bumped to include ggml-org/llama.cpp#15444.

Signed-off-by: Dorin Geman <[email protected]>
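
The pattern described in the commit message, sketched in TypeScript for illustration (hypothetical names; the actual model-runner change is in Go):

    // Hedged sketch: remember the last chunk whose choices array is non-empty,
    // since the final usage-only chunk now arrives with an empty one.
    interface Choice { index: number; finish_reason?: string | null }
    interface Chunk { choices: Choice[]; usage?: { total_tokens: number } }

    async function lastNonEmptyChoices(stream: AsyncIterable<Chunk>): Promise<Choice[]> {
      let last: Choice[] = []
      for await (const chunk of stream) {
        if (chunk.choices.length > 0) last = chunk.choices
      }
      return last
    }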

Labels

examples, python (python script changes), server
