
Conversation

@mostlygeek
Contributor

@mostlygeek mostlygeek commented Mar 27, 2025

New fields added to the timings object:

  • draft_n : number of draft tokens generated
  • draft_n_accepted : number of draft tokens accepted
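
On the server side these are carried in the timings structure that gets serialized into the response. A minimal sketch of the shape, where the struct name and the surrounding fields are assumptions based on the sample JSON below, and only the two draft_* fields are what this PR adds:

// illustrative sketch only: struct name and existing fields assumed from the
// sample JSON below; the two draft_* fields are the ones added by this PR
struct result_timings {
    int32_t prompt_n     = -1;
    double  prompt_ms    = 0.0;

    int32_t predicted_n  = -1;
    double  predicted_ms = 0.0;

    // speculative decoding stats
    int32_t draft_n          = 0; // draft tokens generated by the draft model
    int32_t draft_n_accepted = 0; // draft tokens accepted by the target model
};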

Sample output

These runs were done on an M1 Pro with 32GB of RAM.

Server:

$ ./build/bin/llama-server --no-mmap --no-warmup \
  --model /Volumes/devdisk/llama-swap/models/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --model-draft /Volumes/devdisk/llama-swap/models/qwen2.5-0.5b-instruct-q8_0.gguf \
  -ngl 99 -ngld 99 \
  --draft-max 16 --draft-min 4 --draft-p-min 0.4

streaming off

$ curl -s http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model":"qwen","max_tokens":200, "timings_per_token":true, "messages": [{"role": "user","content": "write a story about dogs"}]}' | jq .timings

{
  "prompt_n": 1,
  "prompt_ms": 228.522,
  "prompt_per_token_ms": 228.522,
  "prompt_per_second": 4.375946298387026,
  "predicted_n": 200,
  "predicted_ms": 11082.606,
  "predicted_per_token_ms": 55.41303,
  "predicted_per_second": 18.0462970532382,
  "draft_n": 347,
  "draft_n_accepted": 65
}

streaming on

$ curl -sn http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen","max_tokens":200, "stream":true, "timings_per_token":true, "messages": [{"role": "user","content": "write a story about dogs"}]}' | jq -cR 'sub("^data: "; "") | fromjson? | .timings | {predicted_n, predicted_per_second, draft_n, draft_n_accepted, draft_accept_ratio: (if .draft_n > 0 then .draft_n_accepted / .draft_n else null end)}'

{"predicted_n":1,"predicted_per_second":null,"draft_n":null,"draft_n_accepted":null,"draft_accept_ratio":null}
{"predicted_n":8,"predicted_per_second":null,"draft_n":16,"draft_n_accepted":6,"draft_accept_ratio":0.375}
{"predicted_n":8,"predicted_per_second":null,"draft_n":16,"draft_n_accepted":6,"draft_accept_ratio":0.375}
{"predicted_n":8,"predicted_per_second":null,"draft_n":16,"draft_n_accepted":6,"draft_accept_ratio":0.375}
{"predicted_n":8,"predicted_per_second":null,"draft_n":16,"draft_n_accepted":6,"draft_accept_ratio":0.375}
{"predicted_n":8,"predicted_per_second":null,"draft_n":16,"draft_n_accepted":6,"draft_accept_ratio":0.375}
... 
{"predicted_n":197,"predicted_per_second":17.71554376197935,"draft_n":335,"draft_n_accepted":72,"draft_accept_ratio":0.21492537313432836}
{"predicted_n":198,"predicted_per_second":17.740989615697014,"draft_n":335,"draft_n_accepted":72,"draft_accept_ratio":0.21492537313432836}
{"predicted_n":199,"predicted_per_second":17.76501315459052,"draft_n":335,"draft_n_accepted":72,"draft_accept_ratio":0.21492537313432836}
{"predicted_n":200,"predicted_per_second":17.79046198182722,"draft_n":335,"draft_n_accepted":72,"draft_accept_ratio":0.21492537313432836}
{"predicted_n":200,"predicted_per_second":17.790439826817185,"draft_n":335,"draft_n_accepted":72,"draft_accept_ratio":0.21492537313432836}

@ngxson
Collaborator

ngxson commented Mar 27, 2025

The code looks good to me, but I have too little knowledge about speculative decoding to know if the logic is correct. Would you please have a quick look, @ggerganov? Thanks!

slot.n_draft_accepted += ids.size() - 1; // exclude last sampled token
if (slot.n_draft_total > 0) {
    slot.draft_accept_ratio = (float)slot.n_draft_accepted / slot.n_draft_total;
}
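
For readers unfamiliar with the speculative path: ids is the accepted prefix of the drafted tokens plus one token the target model sampled itself, which is why one is subtracted. A self-contained toy sketch of the bookkeeping; the types and helper below are hypothetical, only the two counter names come from this PR:

#include <cstdint>
#include <vector>

// toy slot carrying just the two counters added by this PR
struct toy_slot {
    int32_t n_draft_total    = 0; // draft tokens generated
    int32_t n_draft_accepted = 0; // draft tokens accepted by the target model
};

// hypothetical helper: `draft` holds the tokens proposed by the draft model,
// `ids` holds the accepted prefix of `draft` plus one freshly sampled token
void account_draft(toy_slot & slot,
                   const std::vector<int> & draft,
                   const std::vector<int> & ids) {
    slot.n_draft_total    += draft.size();
    slot.n_draft_accepted += ids.size() - 1; // exclude the last sampled token
}

With a 16-token draft of which 6 tokens are accepted (ids of length 7), this adds 16 and 6 respectively, matching the 6/16 = 0.375 ratio visible in the streaming sample above.
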
@ggerganov
Member

Should this be moved above in the loop so that send_final_response() is correct?

@mostlygeek
Contributor Author

The slot.draft_accept_ratio field was removed along with the redundant variable.

I think slot.n_draft_accepted is in the right place. I moved it up a bit in the code so it is closer to where ids is declared.

@jukofyork
Collaborator

Is there any way this could be added to the stats printed by llama-server in the console after completing the generation?

It currently prints the timings and token counts in the console, but I couldn't see a way to get the acceptance rate printed when somebody asked me about it in this HF thread.

@mostlygeek
Contributor Author

@ggerganov thanks for the review. I updated the PR with your suggestions: removed the redundant variable and made it clearer where the values are calculated.

@mostlygeek
Contributor Author

Is there any way this could be added to the stats printed by llama-server in the console after completing the generation?

Looks like it:

prompt eval time =     393.25 ms /    13 tokens (   30.25 ms per token,    33.06 tokens per second)
       eval time =    9840.04 ms /   200 tokens (   49.20 ms per token,    20.33 tokens per second)
      total time =   10233.29 ms /   213 tokens
slot print_timing: id  0 | task 0 |
draft acceptance rate = 0.23026 (   70 accepted /   304 generated)    <--- added this output

@ggerganov what do you think of that additional logging?
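
For context, a rough sketch of the kind of log call that could produce that extra line, assuming the server's per-slot SLT_INF logging macro; the exact call site and format string here are an assumption, not necessarily what the PR ends up doing:

// sketch: log the draft acceptance rate when generation finishes, guarded so
// nothing extra is printed when speculative decoding is not in use
if (n_draft_total > 0) {
    SLT_INF(slot, "\ndraft acceptance rate = %.5f (%5d accepted / %5d generated)\n",
            (double) n_draft_accepted / n_draft_total,
            n_draft_accepted, n_draft_total);
}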

@mostlygeek mostlygeek closed this Mar 27, 2025
@mostlygeek mostlygeek reopened this Mar 27, 2025
@ggerganov
Member

Is there any way this could be added to the stats printed by llama-server in the console after completing the generation?

Yes, we can add a LOG message for that. I think it's already present in the debug logs, but an info log would also be OK.

@mostlygeek mostlygeek requested a review from ggerganov March 28, 2025 08:03
@ggerganov ggerganov merged commit 5d01670 into ggml-org:master Mar 28, 2025
53 of 96 checks passed