@Amirresm Amirresm commented Nov 8, 2025

This PR introduces two fixes to the inter-token latency (ITL) metric:

  • Uses a consistent output token count to calculate ITL
    • fixes an issue where streamed chunks return more than one token (e.g. with speculative decoding)
  • Excludes the time taken to generate the first token (TTFT) from the ITL calculation

This fixes the metric in niche use cases (e.g. with vLLM's draft models or tool responses) and aligns the implementation with other analyzers such as NVIDIA GenAI-Perf.
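The two fixes above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation; the names (`request_start`, `chunk_times`, `total_output_tokens`) are hypothetical, and it assumes the total output token count comes from the response's usage stats rather than from counting streamed chunks:

```python
def inter_token_latency(request_start, chunk_times, total_output_tokens):
    """Sketch of ITL = (total latency - TTFT) / (output tokens - 1).

    chunk_times: arrival timestamps of streamed chunks, in order.
    total_output_tokens: authoritative token count (not the chunk count,
    since a chunk may carry several tokens, e.g. with speculative decoding).
    """
    if total_output_tokens <= 1 or not chunk_times:
        return 0.0  # no inter-token gaps to measure
    ttft = chunk_times[0] - request_start          # time to first token
    total_latency = chunk_times[-1] - request_start
    # Exclude TTFT and average over the (n - 1) gaps between tokens.
    return (total_latency - ttft) / (total_output_tokens - 1)
```

For example, with a request starting at t=0, chunks arriving at t=1, 2, and 3 seconds, and 5 total output tokens, the 2 seconds after the first token are spread over 4 inter-token gaps, giving 0.5 s per token.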
