
Conversation

@mostlygeek
Contributor

@mostlygeek mostlygeek commented Mar 27, 2025

New fields added to the timings object:

  • draft_n : number of draft tokens generated
  • draft_n_accepted : number of draft tokens accepted
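
On the server side these are carried in the timings structure that gets serialized into the response. A minimal sketch of the shape, where the struct name and the surrounding fields are assumptions based on the sample JSON below, and only the two draft_* fields are what this PR adds:

// illustrative sketch only: struct name and existing fields assumed from the
// sample JSON below; the two draft_* fields are the ones added by this PR
struct result_timings {
    int32_t prompt_n     = -1;
    double  prompt_ms    = 0.0;

    int32_t predicted_n  = -1;
    double  predicted_ms = 0.0;

    // speculative decoding stats
    int32_t draft_n          = 0; // draft tokens generated by the draft model
    int32_t draft_n_accepted = 0; // draft tokens accepted by the target model
};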

Sample output

These runs were done on an M1 Pro with 32GB of RAM.

Server:

$ ./build/bin/llama-server --no-mmap --no-warmup \
  --model /Volumes/devdisk/llama-swap/models/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --model-draft /Volumes/devdisk/llama-swap/models/qwen2.5-0.5b-instruct-q8_0.gguf \
  -ngl 99 -ngld 99 \
  --draft-max 16 --draft-min 4 --draft-p-min 0.4

streaming off

$ curl -s http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model":"qwen","max_tokens":200, "timings_per_token":true, "messages": [{"role": "user","content": "write a story about dogs"}]}' | jq .timings

{
  "prompt_n": 1,
  "prompt_ms": 228.522,
  "prompt_per_token_ms": 228.522,
  "prompt_per_second": 4.375946298387026,
  "predicted_n": 200,
  "predicted_ms": 11082.606,
  "predicted_per_token_ms": 55.41303,
  "predicted_per_second": 18.0462970532382,
  "draft_n": 347,
  "draft_n_accepted": 65
}

streaming on

$ curl -sn http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen","max_tokens":200, "stream":true, "timings_per_token":true, "messages": [{"role": "user","content": "write a story about dogs"}]}' | jq -cR 'sub("^data: "; "") | fromjson? | .timings | {predicted_n, predicted_per_second, draft_n, draft_n_accepted, draft_accept_ratio: (if .draft_n > 0 then .draft_n_accepted / .draft_n else null end)}'

{"predicted_n":1,"predicted_per_second":null,"draft_n":null,"draft_n_accepted":null,"draft_accept_ratio":null}
{"predicted_n":8,"predicted_per_second":null,"draft_n":16,"draft_n_accepted":6,"draft_accept_ratio":0.375}
{"predicted_n":8,"predicted_per_second":null,"draft_n":16,"draft_n_accepted":6,"draft_accept_ratio":0.375}
{"predicted_n":8,"predicted_per_second":null,"draft_n":16,"draft_n_accepted":6,"draft_accept_ratio":0.375}
{"predicted_n":8,"predicted_per_second":null,"draft_n":16,"draft_n_accepted":6,"draft_accept_ratio":0.375}
{"predicted_n":8,"predicted_per_second":null,"draft_n":16,"draft_n_accepted":6,"draft_accept_ratio":0.375}
... 
{"predicted_n":197,"predicted_per_second":17.71554376197935,"draft_n":335,"draft_n_accepted":72,"draft_accept_ratio":0.21492537313432836}
{"predicted_n":198,"predicted_per_second":17.740989615697014,"draft_n":335,"draft_n_accepted":72,"draft_accept_ratio":0.21492537313432836}
{"predicted_n":199,"predicted_per_second":17.76501315459052,"draft_n":335,"draft_n_accepted":72,"draft_accept_ratio":0.21492537313432836}
{"predicted_n":200,"predicted_per_second":17.79046198182722,"draft_n":335,"draft_n_accepted":72,"draft_accept_ratio":0.21492537313432836}
{"predicted_n":200,"predicted_per_second":17.790439826817185,"draft_n":335,"draft_n_accepted":72,"draft_accept_ratio":0.21492537313432836}

@ngxson
Collaborator

ngxson commented Mar 27, 2025

The code looks good to me, but I have too little knowledge about speculative decoding to know if the logic is correct. Would you please have a quick look, @ggerganov? Thanks!

slot.n_draft_accepted += ids.size() - 1; // exclude last sampled token
if (slot.n_draft_total > 0) {
    slot.draft_accept_ratio = (float)slot.n_draft_accepted / slot.n_draft_total;
}
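
For readers unfamiliar with the speculative path: ids is the accepted prefix of the drafted tokens plus one token the target model sampled itself, which is why one is subtracted. A self-contained toy sketch of the bookkeeping; the types and helper below are hypothetical, only the two counter names come from this PR:

#include <cstdint>
#include <vector>

// toy slot carrying just the two counters added by this PR
struct toy_slot {
    int32_t n_draft_total    = 0; // draft tokens generated
    int32_t n_draft_accepted = 0; // draft tokens accepted by the target model
};

// hypothetical helper: `draft` holds the tokens proposed by the draft model,
// `ids` holds the accepted prefix of `draft` plus one freshly sampled token
void account_draft(toy_slot & slot,
                   const std::vector<int> & draft,
                   const std::vector<int> & ids) {
    slot.n_draft_total    += draft.size();
    slot.n_draft_accepted += ids.size() - 1; // exclude the last sampled token
}

With a 16-token draft of which 6 tokens are accepted (ids of length 7), this adds 16 and 6 respectively, matching the 6/16 = 0.375 ratio visible in the streaming sample above.
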
@ggerganov
Member

Should this be moved above in the loop so that send_final_response() is correct?

@mostlygeek
Contributor Author

The slot.draft_accept_ratio field was removed along with the redundant variable.

I think slot.n_draft_accepted is in the right place. I moved it up a bit in the code so it is closer to where ids is declared.

@jukofyork
Collaborator

Is there any way this could be added to the stats printed by llama-server in the console after completing the generation?

It currently prints the timings and token counts in the console, but I couldn't see a way to get the acceptance rate printed when somebody asked me about it in this HF thread.

@mostlygeek
Contributor Author

@ggerganov thanks for the review. I updated the PR with your suggestions: removed the redundant variable and made it clearer where the values are calculated.

@mostlygeek
Contributor Author

Is there any way this could be added to the stats printed by llama-server in the console after completing the generation?

Looks like it:

prompt eval time =     393.25 ms /    13 tokens (   30.25 ms per token,    33.06 tokens per second)
       eval time =    9840.04 ms /   200 tokens (   49.20 ms per token,    20.33 tokens per second)
      total time =   10233.29 ms /   213 tokens
slot print_timing: id  0 | task 0 |
draft acceptance rate = 0.23026 (   70 accepted /   304 generated)    <--- added this output

@ggerganov what do you think of that additional logging?
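
For context, a rough sketch of the kind of log call that could produce that extra line, assuming the server's per-slot SLT_INF logging macro; the exact call site and format string here are an assumption, not necessarily what the PR ends up doing:

// sketch: log the draft acceptance rate when generation finishes, guarded so
// nothing extra is printed when speculative decoding is not in use
if (n_draft_total > 0) {
    SLT_INF(slot, "\ndraft acceptance rate = %.5f (%5d accepted / %5d generated)\n",
            (double) n_draft_accepted / n_draft_total,
            n_draft_accepted, n_draft_total);
}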

@mostlygeek mostlygeek closed this Mar 27, 2025
@mostlygeek mostlygeek reopened this Mar 27, 2025
@ggerganov
Member

Is there any way this could be added to the stats printed by llama-server in the console after completing the generation?

Yes, we can add a LOG message for that. I think it's already present in the debug logs, but an info log would also be OK.

@mostlygeek mostlygeek requested a review from ggerganov March 28, 2025 08:03
@ggerganov ggerganov merged commit 5d01670 into ggml-org:master Mar 28, 2025
53 of 96 checks passed