-
Notifications
You must be signed in to change notification settings - Fork 13.7k
Include speculative decoding stats when timings_per_token is enabled #12603
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
New fields added to the `timings` object: - draft_n : number of draft tokens generated - draft_accepted_n : number of draft tokens accepted - draft_accept_ratio: ratio of accepted/generated
|
The code looks good to me, but I have too little knowledge about speculative decoding to know if the logic is correct. Would you please have a quick look @ggerganov ? Thanks! |
examples/server/server.cpp
Outdated
| slot.n_draft_accepted += ids.size() - 1; // exclude last sampled token | ||
| if (slot.n_draft_total > 0) { | ||
| slot.draft_accept_ratio = (float)slot.n_draft_accepted / slot.n_draft_total; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be moved above in the loop so that send_final_response() is correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the slot.draft_accept_ratio was removed with the redundant var.
I think slot.n_draft_accepted looks like it is in the right place. I moved it up a bit in the code so it is closer to where ids is declared.
|
Is there any way this could be added to the stats printed by It currently prints the timings and token counts in the console, but I couldn't see a way to get the acceptance rate printed when somebody asked me about it in this HF thread. |
|
@ggerganov thanks for the review. I updated the PR with your suggestions; removing the redundant variable and making where the values are calculated a bit clearer. |
Looks like it: @ggerganov what do you think of that additional logging? |
Yes, we can add a LOG message for that. I think it's already present in the debug logs, but an info log would also be OK. |
New fields added to the
timingsobject:Sample of output
These were done on an M1 Pro, 32GB
Server:
streaming off
streaming on