
Commit 1c1970f

Di Xu (SWE) authored and facebook-github-bot committed
Support more breakdown of latency metrics/stats for Llama (pytorch#6072)
Summary: Support more breakdown of latency metrics/stats for Llama. This is needed when debugging the Frame-LLM project across teams.

Reviewed By: cccclai

Differential Revision: D64139460
1 parent 69c2c76 commit 1c1970f

File tree

1 file changed: +7 -0 lines changed


extension/llm/runner/stats.h

Lines changed: 7 additions & 0 deletions

@@ -29,7 +29,14 @@ struct Stats {
   long model_load_end_ms;
   // inference_start_ms: Immediately after the model is loaded (or we check
   // for model load), measure the inference time.
+  // NOTE: It's actually the tokenizer encode + model execution time.
   long inference_start_ms;
+  // End of the tokenizer encode time.
+  long token_encode_end_ms;
+  // Start of the model execution (forward function) time.
+  long model_execution_start_ms;
+  // End of the model execution (forward function) time.
+  long model_execution_end_ms;
   // prompt_eval_end_ms: Prompt array allocation and tokenization. Ends right
   // before the inference loop starts
   long prompt_eval_end_ms;
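
For context, here is a minimal sketch (not part of the commit) of how a runner could use the new fields to split the inference span into its two phases. The timestamp values and the Stats member defaults below are hypothetical; only the field names come from the diff.

#include <cstdio>

struct Stats {
  long model_load_start_ms = 0;
  long model_load_end_ms = 0;
  long inference_start_ms = 0;
  long token_encode_end_ms = 0;
  long model_execution_start_ms = 0;
  long model_execution_end_ms = 0;
  long prompt_eval_end_ms = 0;
};

int main() {
  Stats stats;
  // A runner would capture these with its own clock; the values are made up.
  stats.inference_start_ms = 1000;
  stats.token_encode_end_ms = 1012;      // tokenizer encode finished
  stats.model_execution_start_ms = 1012; // forward() begins
  stats.model_execution_end_ms = 1180;   // forward() returns

  // The new fields let "inference" be broken down into encode vs. execution:
  long encode_ms = stats.token_encode_end_ms - stats.inference_start_ms;
  long forward_ms =
      stats.model_execution_end_ms - stats.model_execution_start_ms;
  std::printf("tokenizer encode: %ld ms\n", encode_ms);
  std::printf("model execution (forward): %ld ms\n", forward_ms);
  return 0;
}

Before this change, only inference_start_ms and prompt_eval_end_ms bracketed that span; as the added NOTE says, it actually covers tokenizer encode plus model execution, which the three new timestamps now separate.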
