
Conversation

@lhpqaq lhpqaq commented Nov 27, 2024

Implement #10502

(screenshots of the proposed tokens-per-second display in the Web UI)

ngxson commented Nov 27, 2024

The idea is good, but I'm not confident about the UI/UX part:

  1. Not all users want this, so it must be hidden by default (for a clean UI) and users can enable it via the Settings menu.

  2. The text takes up quite a lot of space; I would prefer to make it more subtle. Take the jan.ai app as an example (screenshot omitted).

  3. For the code, we can calculate these numbers in real time, on the frontend. This provides a better UX and allows showing the t/s speed in real time (see the sketch below).
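
A minimal sketch of what point 3 could look like on the frontend: count tokens as chunks arrive and divide by elapsed wall-clock time. The class and function names here are illustrative only, not taken from the actual Web UI code.

```ts
// Hypothetical helper for computing tokens-per-second on the client side.
class StreamSpeedTracker {
  private startMs: number | null = null;
  private tokenCount = 0;

  // Call once for every streamed token (or chunk).
  onToken(): void {
    if (this.startMs === null) this.startMs = performance.now();
    this.tokenCount += 1;
  }

  // Generation speed so far; 0 until at least two tokens have arrived.
  tokensPerSecond(): number {
    if (this.startMs === null || this.tokenCount < 2) return 0;
    const elapsedSec = (performance.now() - this.startMs) / 1000;
    return elapsedSec > 0 ? (this.tokenCount - 1) / elapsedSec : 0;
  }
}

// Usage inside a streaming loop (UI wiring omitted):
//   const tracker = new StreamSpeedTracker();
//   on each received chunk: tracker.onToken();
//   statsEl.textContent = `${tracker.tokensPerSecond().toFixed(1)} t/s`;
```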

lhpqaq commented Nov 28, 2024

@ngxson Thank you for your suggestion. I’m also not very confident about the UI/UX part.
I have reverted the frontend changes and added a real-time speed field to the backend instead. I hope it proves useful.

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" assist"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":29.860551225775627}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" you"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":29.295321955588292}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" today"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":28.80692518481443}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":28.234142607517185}

data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","usage":{"completion_tokens":10,"prompt_tokens":1561,"total_tokens":1571,"gen_second":28.003281984648602,"prompt_second":479.472377660826}}

data: [DONE]
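
For illustration, here is a small client-side sketch that consumes a stream shaped like the one above: it splits the SSE lines, parses each data: payload, and reads the per-chunk gen_second value (plus the prompt_second/gen_second pair in the final usage object). The field names follow this revision's example output, not the final merged API.

```ts
// Hypothetical client for the stream format shown above.
async function streamWithSpeed(url: string, body: unknown): Promise<void> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.body) throw new Error("response has no body");

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffered = "";

  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffered += decoder.decode(value, { stream: true });
    const lines = buffered.split("\n");
    buffered = lines.pop() ?? ""; // keep a partial trailing line for the next read
    for (const line of lines) {
      if (!line.startsWith("data: ") || line.trim() === "data: [DONE]") continue;
      const chunk = JSON.parse(line.slice("data: ".length));
      const text = chunk.choices?.[0]?.delta?.content ?? "";
      // Per-chunk speed during generation; the final chunk carries it inside "usage".
      const speed = chunk.gen_second ?? chunk.usage?.gen_second;
      if (speed !== undefined) console.log(`${JSON.stringify(text)} -> ${speed.toFixed(1)} t/s`);
    }
  }
}
```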

@lhpqaq lhpqaq changed the title from server: Add "tokens per second" information in the Web UI to server: Add "tokens per second" information in the backend Nov 28, 2024

ngxson commented Nov 28, 2024

I haven't had time to look deeper into this, but it seems like what you're doing is already handled by get_formated_timings(). Can you have a look to see whether it's a duplication?

lhpqaq commented Nov 28, 2024

I haven't had time to look deeper into this, but it seems like what you're doing is already handled by get_formated_timings(). Can you have a look to see whether it's a duplication?

Yes, but get_formated_timings() only calculates the final result and lacks real-time speed during the prediction process.

ngxson commented Nov 28, 2024

It doesn't get the correct value because slot.t_token_generation is not set during generation. You can simply set it.

What I'm thinking is:

  • This feature should not be enabled by default, because it can potentially impact overall performance and network bandwidth. You should only return per-token timings if the user asks for them (i.e. if the user sets "timing_per_token": true in the request).
  • It's better and more intuitive to reuse the "timings" object, which is provided by get_formated_timings(), so developers don't have to rewrite their code or memorize the difference between n_gen_second and timings (see the sketch after this list).
  • Also remember to update the documentation.
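
For concreteness, a rough sketch of the request shape being suggested here. It assumes the opt-in flag is literally named "timing_per_token" as written above, and that each streamed chunk would then carry the same "timings" object that get_formated_timings() builds; the predicted_per_second field used below is an assumption based on that helper, not something shown in this thread.

```ts
// Hypothetical opt-in request: per-token timings only when explicitly asked for.
async function chatWithTimings(baseUrl: string): Promise<void> {
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      stream: true,
      timing_per_token: true, // omit for the default behaviour (no per-token timings)
      messages: [{ role: "user", content: "Hello!" }],
    }),
  });
  // The SSE stream would then be read as in the earlier sketch, displaying
  // chunk.timings?.predicted_per_second instead of a custom gen_second field.
  void res;
}
```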

@ngxson ngxson left a comment

This code can be simplified further.

To pass the CI, you need to merge with the latest upstream master branch.

@github-actions github-actions bot added the python (python script changes) label Dec 2, 2024
@ngxson ngxson merged commit 64ed209 into ggml-org:master Dec 2, 2024
52 checks passed

lhpqaq commented Dec 2, 2024

@ngxson Thanks~

@lhpqaq lhpqaq deleted the token branch December 4, 2024 07:49
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Dec 7, 2024
…10548)

* add cmake rvv support

* add timings

* remove space

* update readme

* fix

* fix code

* remove empty line

* add test

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024