
Conversation

@lhpqaq lhpqaq commented Nov 27, 2024

Implement #10502

(screenshots of the proposed tokens-per-second display in the Web UI)

ngxson commented Nov 27, 2024

The idea is good, but I'm not confident about the UI/UX part:

  1. Not all users want this, so it must be hidden by default (for a clean UI) and users can enable it via the Settings menu.

  2. The text takes up quite a lot of space; I would prefer to make it more subtle. Take the jan.ai app as an example (screenshot omitted).

  3. For the code, we can calculate these numbers in real time, on the frontend. This provides a better UX and allows showing the t/s speed in real time (see the sketch below).
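
A minimal sketch of what point 3 could look like on the frontend: count tokens as chunks arrive and divide by elapsed wall-clock time. The class and function names here are illustrative only, not taken from the actual Web UI code.

```ts
// Hypothetical helper for computing tokens-per-second on the client side.
class StreamSpeedTracker {
  private startMs: number | null = null;
  private tokenCount = 0;

  // Call once for every streamed token (or chunk).
  onToken(): void {
    if (this.startMs === null) this.startMs = performance.now();
    this.tokenCount += 1;
  }

  // Generation speed so far; 0 until at least two tokens have arrived.
  tokensPerSecond(): number {
    if (this.startMs === null || this.tokenCount < 2) return 0;
    const elapsedSec = (performance.now() - this.startMs) / 1000;
    return elapsedSec > 0 ? (this.tokenCount - 1) / elapsedSec : 0;
  }
}

// Usage inside a streaming loop (UI wiring omitted):
//   const tracker = new StreamSpeedTracker();
//   on each received chunk: tracker.onToken();
//   statsEl.textContent = `${tracker.tokensPerSecond().toFixed(1)} t/s`;
```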

lhpqaq commented Nov 28, 2024

@ngxson Thank you for your suggestion. I’m also not very confident about the UI/UX part.
I have reverted the frontend changes and added a real-time speed field to the backend instead. I hope it proves useful.

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" assist"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":29.860551225775627}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" you"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":29.295321955588292}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" today"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":28.80692518481443}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":28.234142607517185}

data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","usage":{"completion_tokens":10,"prompt_tokens":1561,"total_tokens":1571,"gen_second":28.003281984648602,"prompt_second":479.472377660826}}

data: [DONE]
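
For illustration, here is a small client-side sketch that consumes a stream shaped like the one above: it splits the SSE lines, parses each data: payload, and reads the per-chunk gen_second value (plus the prompt_second/gen_second pair in the final usage object). The field names follow this revision's example output, not the final merged API.

```ts
// Hypothetical client for the stream format shown above.
async function streamWithSpeed(url: string, body: unknown): Promise<void> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.body) throw new Error("response has no body");

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffered = "";

  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffered += decoder.decode(value, { stream: true });
    const lines = buffered.split("\n");
    buffered = lines.pop() ?? ""; // keep a partial trailing line for the next read
    for (const line of lines) {
      if (!line.startsWith("data: ") || line.trim() === "data: [DONE]") continue;
      const chunk = JSON.parse(line.slice("data: ".length));
      const text = chunk.choices?.[0]?.delta?.content ?? "";
      // Per-chunk speed during generation; the final chunk carries it inside "usage".
      const speed = chunk.gen_second ?? chunk.usage?.gen_second;
      if (speed !== undefined) console.log(`${JSON.stringify(text)} -> ${speed.toFixed(1)} t/s`);
    }
  }
}
```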

@lhpqaq lhpqaq changed the title from server: Add "tokens per second" information in the Web UI to server: Add "tokens per second" information in the backend Nov 28, 2024

ngxson commented Nov 28, 2024

I haven't had time to look deeper into this, but it seems like what you're doing is already handled by get_formated_timings(). Can you have a look to see whether it's a duplication?

lhpqaq commented Nov 28, 2024

I haven't had time to look deeper into this, but it seems like what you're doing is already handled by get_formated_timings(). Can you have a look to see whether it's a duplication?

Yes, but get_formated_timings() only calculates the final result and lacks real-time speed during the prediction process.

ngxson commented Nov 28, 2024

It doesn't get the correct value because slot.t_token_generation is not set during generation. You can simply set it.

What I'm thinking is:

  • This feature should not be enabled by default, because it can potentially impact overall performance and network bandwidth. You should only return per-token timings if the user asks for them (i.e. if the user sets "timing_per_token": true in the request).
  • It's better and more intuitive to reuse the "timings" object, which is provided by get_formated_timings(), so developers don't have to rewrite their code or memorize the difference between n_gen_second and timings (see the sketch after this list).
  • Also remember to update the documentation.
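
For concreteness, a rough sketch of the request shape being suggested here. It assumes the opt-in flag is literally named "timing_per_token" as written above, and that each streamed chunk would then carry the same "timings" object that get_formated_timings() builds; the predicted_per_second field used below is an assumption based on that helper, not something shown in this thread.

```ts
// Hypothetical opt-in request: per-token timings only when explicitly asked for.
async function chatWithTimings(baseUrl: string): Promise<void> {
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      stream: true,
      timing_per_token: true, // omit for the default behaviour (no per-token timings)
      messages: [{ role: "user", content: "Hello!" }],
    }),
  });
  // The SSE stream would then be read as in the earlier sketch, displaying
  // chunk.timings?.predicted_per_second instead of a custom gen_second field.
  void res;
}
```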

@ngxson ngxson left a comment

This code can be simplified further.

To pass the CI, you need to merge with the latest upstream master branch.

@github-actions github-actions bot added the python (python script changes) label Dec 2, 2024
@ngxson ngxson merged commit 64ed209 into ggml-org:master Dec 2, 2024
52 checks passed

lhpqaq commented Dec 2, 2024

@ngxson Thanks~

@lhpqaq lhpqaq deleted the token branch December 4, 2024 07:49
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Dec 7, 2024
…10548)

* add cmake rvv support

* add timings

* remove space

* update readme

* fix

* fix code

* remove empty line

* add test

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024