Conversation

@taronaeo
Collaborator

Introduces the Time to First Token (TTFT), End-to-End Latency (E2E), and Inter-token Latency (ITL) metrics. Updates the README.md to explain the calculation as well.

@taronaeo
Collaborator Author

taronaeo commented Sep 2, 2025

Hi @ggerganov @slaren, any interest in having these metrics in llama-bench? :)

Member

@slaren left a comment


I am not convinced that this is necessary. It doesn't really fit all that well into the llama-bench model and will only produce meaningful results with some types of tests.

OTOH, you can already calculate all of these values if you formulate the tests properly, e.g. TTFT can be estimated with -pg <n_prompt>,1, E2E with any -pg test, and ITL with -n.
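As a rough sketch of that approach, the metrics can be derived from llama-bench's reported throughput figures. The function below is illustrative only: the names `pp_tps`/`tg_tps` and the mapping from the prompt-processing and token-generation speeds to these metrics are my assumptions, not part of llama-bench.

```python
def metrics_from_llama_bench(pp_tps, tg_tps, n_prompt, n_gen):
    """Estimate latency metrics from llama-bench throughput figures.

    pp_tps: prompt-processing speed in tokens/s
    tg_tps: token-generation speed in tokens/s
    n_prompt, n_gen: the -pg <n_prompt>,<n_gen> test parameters
    """
    # TTFT ~ time to process the prompt plus one generated token,
    # i.e. what a -pg <n_prompt>,1 run measures end to end.
    ttft_ms = (n_prompt / pp_tps + 1.0 / tg_tps) * 1000.0
    # ITL ~ mean gap between generated tokens, the inverse of tg speed.
    itl_ms = 1000.0 / tg_tps
    # E2E ~ total wall time of the full -pg run.
    e2e_ms = (n_prompt / pp_tps + n_gen / tg_tps) * 1000.0
    return ttft_ms, itl_ms, e2e_ms
```

For example, with 1000 tokens/s prompt processing and 100 tokens/s generation on a `-pg 512,128` test, this gives a TTFT of 522 ms, an ITL of 10 ms, and an E2E of 1792 ms.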

@slaren
Member

slaren commented Sep 3, 2025

It may be more appropriate to add a python or shell script that runs llama-bench with the right tests, and calculates these values.

@taronaeo
Collaborator Author

taronaeo commented Sep 4, 2025

It may be more appropriate to add a python or shell script that runs llama-bench with the right tests, and calculates these values.

Hmm, yeah, that's more reasonable. I can move the metrics into a Python script, but I would prefer that we at least still log the TTFT within llama-bench so that we don't have to do 2 benchmark runs (i.e., -pg 512,1 and -pg 512,128) to generate the metrics.

This is because, at least on IBM Z & LinuxONE, most of our users are running Type-2 virtualisation, where results can vary widely between benchmark runs. By the time the second run completes, the TTFT measured in the first run may no longer be accurate.

@taronaeo
Collaborator Author

taronaeo commented Sep 4, 2025

Let me know if it's okay to keep the samples_ttft_ns/samples_ttft_ms within llama-bench to avoid the data inaccuracy I mentioned above.

@slaren
Member

slaren commented Sep 4, 2025

Let me know if it's okay to keep the samples_ttft_ns/samples_ttft_ms within llama-bench to avoid the data inaccuracy I mentioned above.

This still seems too specific. However, having an option to export the timing of every generated token in the JSON format could be useful. It would also enable other use cases, such as generating very detailed graphs of performance vs context depth.
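If such an export existed, deriving the latency metrics from it would be simple post-processing. A minimal sketch, assuming a hypothetical export of nanosecond timestamps for each generated token, measured relative to the start of the request (the format is my assumption, not an existing llama-bench feature):

```python
def metrics_from_token_times(token_times_ns):
    """Derive latency metrics from per-token timestamps.

    token_times_ns: nanosecond timestamp of each generated token,
    relative to request start (hypothetical export format).
    """
    # Time to first token is simply the first timestamp.
    ttft_ms = token_times_ns[0] / 1e6
    # Inter-token latency is the mean gap between consecutive tokens.
    gaps = [b - a for a, b in zip(token_times_ns, token_times_ns[1:])]
    itl_ms = sum(gaps) / len(gaps) / 1e6
    # End-to-end latency is the timestamp of the last token.
    e2e_ms = token_times_ns[-1] / 1e6
    return ttft_ms, itl_ms, e2e_ms
```

The same per-token data could also be bucketed by position to plot performance against context depth, as suggested above.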
