
Commit 1c1970f

Di Xu (SWE) authored and facebook-github-bot committed
Support more breakdown of latency metrics/stats for Llama (pytorch#6072)
Summary: Support more breakdown of latency metrics/stats for Llama. This is needed when debugging the Frame-LLM project across teams.

Reviewed By: cccclai

Differential Revision: D64139460
1 parent 69c2c76 commit 1c1970f

File tree

1 file changed: +7 -0 lines changed


extension/llm/runner/stats.h

Lines changed: 7 additions & 0 deletions

@@ -29,7 +29,14 @@ struct Stats {
   long model_load_end_ms;
   // inference_start_ms: Immediately after the model is loaded (or we check
   // for model load), measure the inference time.
+  // NOTE: It's actually the tokenizer encode + model execution time.
   long inference_start_ms;
+  // End of the tokenizer encode time.
+  long token_encode_end_ms;
+  // Start of the model execution (forward function) time.
+  long model_execution_start_ms;
+  // End of the model execution (forward function) time.
+  long model_execution_end_ms;
   // prompt_eval_end_ms: Prompt array allocation and tokenization. Ends right
   // before the inference loop starts
   long prompt_eval_end_ms;
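
For context, here is a minimal sketch (not part of the commit) of how a runner could use the new fields to split the inference span into its two phases. The timestamp values and the Stats member defaults below are hypothetical; only the field names come from the diff.

#include <cstdio>

struct Stats {
  long model_load_start_ms = 0;
  long model_load_end_ms = 0;
  long inference_start_ms = 0;
  long token_encode_end_ms = 0;
  long model_execution_start_ms = 0;
  long model_execution_end_ms = 0;
  long prompt_eval_end_ms = 0;
};

int main() {
  Stats stats;
  // A runner would capture these with its own clock; the values are made up.
  stats.inference_start_ms = 1000;
  stats.token_encode_end_ms = 1012;      // tokenizer encode finished
  stats.model_execution_start_ms = 1012; // forward() begins
  stats.model_execution_end_ms = 1180;   // forward() returns

  // The new fields let "inference" be broken down into encode vs. execution:
  long encode_ms = stats.token_encode_end_ms - stats.inference_start_ms;
  long forward_ms =
      stats.model_execution_end_ms - stats.model_execution_start_ms;
  std::printf("tokenizer encode: %ld ms\n", encode_ms);
  std::printf("model execution (forward): %ld ms\n", forward_ms);
  return 0;
}

Before this change, only inference_start_ms and prompt_eval_end_ms bracketed that span; as the added NOTE says, it actually covers tokenizer encode plus model execution, which the three new timestamps now separate.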
