Conversation

@pancak3
Contributor

@pancak3 pancak3 commented Aug 24, 2025

resolve #137

  • new params:
    • prefill-overhead: The base overhead in milliseconds for prefilling a single token. This value, in conjunction with prefill-complexity and prefill-overhead-std-dev, determines the overall time taken to prefill the entire context. It's an optional parameter with a default of 0 and is ignored if time-to-first-token is not 0.
    • prefill-complexity: Defines how the prefill time scales with the number of prompt tokens. This is required if prefill-overhead is used. Options are "n^2" and "nlog(n)", with a default of "n^2".
    • prefill-overhead-std-dev: The standard deviation in milliseconds for the time taken before the first token is returned. This is required if prefill-overhead is used, with a default of 0.
    • kv-cache-transfer-overhead: The base overhead in milliseconds for transferring the KV-cache of a single token from another vLLM instance when P/D is activated. Along with kv-cache-transfer-complexity and kv-cache-transfer-overhead-std-dev, it defines the total time for the KV-cache transfer of the entire context. This parameter is optional with a default of 0 and is ignored if kv-cache-transfer-latency is not 0.
    • kv-cache-transfer-complexity: The complexity of the KV-cache transfer relative to the number of prompt tokens. This is required if kv-cache-transfer-overhead is used. Options are "linear" and "in-place", with a default of "linear".
    • kv-cache-transfer-overhead-std-dev: The standard deviation in milliseconds for the time taken to transfer the KV-cache. This is required if kv-cache-transfer-overhead is used, with a default of 0.
  • time-to-first-token and prefill-overhead coexist
  • time-to-first-token overrides prefill-overhead (forward compatibility)
  • test cases
  • docs and comments
  • ? consider kv-cache-size < number of prompt tokens in remote prefill scenarios
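To make the intended behavior concrete, here is a minimal, free-standing sketch of how these parameters could combine (illustrative only; the names and the exact scaling formula may differ from the PR's implementation):

package main

import (
	"fmt"
	"math"
	"math/rand"
)

// prefillTimeMs sketches the total prefill time for nPromptTokens.
// overheadMs corresponds to prefill-overhead, complexity to prefill-complexity
// ("n^2" or "nlog(n)"), and stdDevMs to prefill-overhead-std-dev.
func prefillTimeMs(nPromptTokens int, overheadMs float64, complexity string, stdDevMs float64) float64 {
	n := float64(nPromptTokens)
	var mean float64
	switch complexity {
	case "nlog(n)":
		mean = overheadMs * n * math.Log2(n+1)
	default: // "n^2"
		mean = overheadMs * n * n
	}
	// Add normally distributed jitter around the computed mean.
	return mean + rand.NormFloat64()*stdDevMs
}

func main() {
	fmt.Printf("prefill ~ %.1f ms\n", prefillTimeMs(512, 0.05, "n^2", 10))
}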

@pancak3 pancak3 marked this pull request as draft August 24, 2025 14:47
@pancak3 pancak3 changed the base branch from main to dev August 25, 2025 05:20
@pancak3 pancak3 changed the base branch from dev to main August 25, 2025 05:21
@pancak3 pancak3 force-pushed the dev/prefill-overhead branch from 87ae0a3 to 9da55d3 on August 25, 2025 11:53
@pancak3 pancak3 marked this pull request as ready for review August 25, 2025 11:58
@mayabar
Collaborator

mayabar commented Aug 25, 2025

@pancak3 thanks for working on this issue.

I have some general questions:

  • Dependency between prefill time and the number of input tokens. In the sources I read, for requests that are not too long the dependency is "near-linear": prefill-time = C + n*t, where C is a fixed overhead, t is the prefill time per token, and n is the input token count. Do you have evidence for a quadratic or n·log(n) dependency?

  • Regarding the kv-cache time per token: currently it is used only when prefill is executed on another vLLM pod, and it mimics the time for transferring the kv-cache to the current pod. We recently added the ability to cache kv-blocks (done to be able to send kv events properly); in the future we want to check, for each request, whether part of its prompt is already in the kv-cache and calculate the prefill time accordingly, but that is a separate stage.

  • In the issue associated with this PR we talked about prefill-time-per-token (PTPT); in addition, we talked about prefill-overhead, the constant time that does not depend on the number of input tokens. If the time dependency is linear, the formula is: total_prefill_time = prefill-overhead + num-of-tokens * PTPT.
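For concreteness (illustrative numbers, not measurements), the linear model works out as, for example:

$\text{total\_prefill\_time} = C + n \cdot t_{\text{PTPT}} = 40\,\text{ms} + 1000 \cdot 0.5\,\text{ms} = 540\,\text{ms} \quad (C = 40\,\text{ms},\ t_{\text{PTPT}} = 0.5\,\text{ms/token},\ n = 1000)$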

@pancak3
Contributor Author

pancak3 commented Aug 28, 2025

@mayabar sorry for the late reply.

  • Dependency between prefill time and the number of input tokens. In the sources I read, for requests that are not too long the dependency is "near-linear": prefill-time = C + n*t, where C is a fixed overhead, t is the prefill time per token, and n is the input token count. Do you have evidence for a quadratic or n·log(n) dependency?
  • In the issue associated with this PR we talked about prefill-time-per-token (PTPT); in addition, we talked about prefill-overhead, the constant time that does not depend on the number of input tokens. If the time dependency is linear, the formula is: total_prefill_time = prefill-overhead + num-of-tokens * PTPT.

I have to admit I hadn't read the sources deeply and implemented the previous code from my general impression of LLMs. I spent some time reading the paper Attention Is All You Need:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)V$

Converting this to computational complexity, my understanding is that it is $O(n^2d/k)$, where $n$ is the length of the input prompt, $d$ can simply stand for the model complexity, and $k$ is the GPU parallelism (number of parallel units). When $n$ is less than $k$, the complexity can be simplified from the quadratic $O(n^2d/k)$ to "near-linear" $O(nd)$.
As we are simulating, at this stage (after reading the roadmap P1 plan) I agree with the simple implementation (prefill-time = C + n*t).
But I am open to working on features for a more flexible configuration ($d=f(d_{tokenization}, d_{embedding}, d_k, d_v)$, taking $k$ into account). Let me know if further discussion is needed.
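To spell out the rough accounting behind that estimate (my own simplification, not the paper's exact numbers): computing the $QK^T$ scores and the attention-weighted sum costs about $O(n^2 d)$ operations, the $Q$/$K$/$V$/output projections add $O(n d^2)$, and dividing the dominant term by an idealized parallelism factor $k$ gives

$T_{\text{prefill}}(n) \approx \frac{n^2 d}{k} \le n d \quad \text{when } n \le k$

which is why the dependency looks "near-linear" for prompts short enough to fit the available parallelism.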

  • Regarding the kv-cache time per token: currently it is used only when prefill is executed on another vLLM pod, and it mimics the time for transferring the kv-cache to the current pod. We recently added the ability to cache kv-blocks (done to be able to send kv events properly); in the future we want to check, for each request, whether part of its prompt is already in the kv-cache and calculate the prefill time accordingly, but that is a separate stage.

Lovely, I will change the kv-cache transfer implementation, removing the "complexity" part. The calculation becomes KV-cache-transfer-time = n * transfer-overhead, and transfer-overhead will be the argument. Let me know if kv-cache-transfer-overhead, transfer-overhead, or any other naming is preferred.

Also, I am not sure whether the network quality between pods is considered in the following phases (I did not find any clue in the public roadmap doc), but I personally think it is good to take networking into account in the simulator behavior.

One more thing to confirm: both prefill-time and KV-cache-transfer-time will be used as means, each with its own std-dev, to draw normally distributed times for "P1 - Timing randomization using statistical parameters". Thank you!
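In code, the intent is roughly the following (a sketch only: the function and parameter names are placeholders, and common.RandomNorm is the normal-distribution helper already used elsewhere in the simulator):

// Sketch: linear KV-cache transfer time with normally distributed jitter.
// The mean is nPromptTokens * transferOverheadMs; stdDevMs controls the spread.
func kvCacheTransferTimeMs(nPromptTokens int, transferOverheadMs, stdDevMs float64) int {
	mean := float64(nPromptTokens) * transferOverheadMs
	return int(common.RandomNorm(mean, stdDevMs))
}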

@pancak3 pancak3 marked this pull request as draft August 29, 2025 09:25
@pancak3 pancak3 force-pushed the dev/prefill-overhead branch from 9b1b5ab to 1e8f33d on August 29, 2025 09:43
@pancak3 pancak3 marked this pull request as ready for review August 29, 2025 14:20
@pancak3
Contributor Author

pancak3 commented Aug 29, 2025

Hi @mayabar, sorry for blocking other issues; please review the latest change, thanks!

@pancak3 pancak3 force-pushed the dev/prefill-overhead branch from 64830dc to 4078dbd on August 29, 2025 14:38
if !doRemotePrefill {
	mean := float64(s.config.TimeToFirstToken)
	stddev := float64(s.config.TimeToFirstTokenStdDev)
	return int(common.RandomNorm(mean, stddev))
Collaborator

It may be better to keep the old code, where mean and stddev are calculated according to doRemotePrefill and randomInt is called from one place.

Contributor Author

This is resolved in the change below.

// calc the prefill overhead against number of tokens
func (s *VllmSimulator) calcPrefillOverhead(nPromptTokens int, doRemotePrefill bool) int {
	if !doRemotePrefill {
		constOverhead := s.config.PrefillOverhead
Collaborator

Additional variables are not required; we can use the config properties directly in the calculation or function call.

Contributor Author

done

return int(common.RandomNorm(float64(prefillTime), float64(stdDev)))
}

if s.config.KVCacheTransferLatency != 0 || s.config.KVCacheTransferLatencyStdDev != 0 {
Collaborator

This check should be outside this function, since the function is called only if time-to-first-token is defined.
With the current logic it's not possible to use a constant prefill time (time-to-first-token not 0) for requests without disaggregated P/D while, in the P/D case, calculating the prefill time based on the number of tokens.

I suggest checking doRemotePrefill first and taking care of each time calculation separately.
Like:

  • if P/D
    • if the time depends on the input length ---> prefill-time = kvcache-transfer * num-of-tokens
    • if the time does not depend on the input length ---> prefill-time = kvcache-transfer-time
  • if not P/D
    • if the time depends on the input length ---> prefill-time = overhead + ptpt * num-of-tokens
    • if the time does not depend on the input length ---> prefill-time = time-to-first-token
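
A minimal sketch of this branching (the PrefillTimePerToken and KVCacheTransferTimePerToken config field names are hypothetical; the other fields appear in the diff above, and the std-dev randomization is left out for brevity):

// prefillTimeMs picks the prefill time according to the suggested branching.
func (s *VllmSimulator) prefillTimeMs(nPromptTokens int, doRemotePrefill bool) int {
	if doRemotePrefill {
		// P/D: prefill was done remotely; only the kv-cache transfer is paid here.
		if s.config.KVCacheTransferTimePerToken != 0 {
			return s.config.KVCacheTransferTimePerToken * nPromptTokens
		}
		return s.config.KVCacheTransferLatency
	}
	// Local prefill.
	if s.config.PrefillTimePerToken != 0 {
		return s.config.PrefillOverhead + s.config.PrefillTimePerToken*nPromptTokens
	}
	return s.config.TimeToFirstToken
}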

Contributor Author

Thanks for this clear suggestion! The logic is now pretty intuitive, please review.

// GetMaxCompletionTokens returns the maximum completion tokens requested
GetMaxCompletionTokens() *int64
// IsDoRemoteDecode() returns true if do_remote_decode field is true in the request, this means that this is prefill request
// IsDoRemoteDecode() returns true if do_remote_decode field is true in the request, this means that this is decode request
Collaborator

do_remote_prefill=true in the request's payload means that for this request the prefill stage was done on a remote pod and the current simulator instance should process only the decode part (the kv-cache is copied from another pod)

do_remote_decode=true in the request's payload means that for this request the decode stage will be done on a remote pod and the current simulator instance should process only the prefill part

if both properties are false - both prefill and decode are executed locally

This is a little bit confusing, but the comment follows the NIXL documentation ;) Maybe it is worth adding a longer explanation.
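
For illustration (a hypothetical helper, not code from the PR), the mapping could be summarized as:

// localRole summarizes what the current simulator instance does for a request,
// following the NIXL-style flag semantics described above.
func localRole(doRemotePrefill, doRemoteDecode bool) string {
	switch {
	case doRemotePrefill:
		return "decode only: prefill was done on a remote pod, its kv-cache is copied here"
	case doRemoteDecode:
		return "prefill only: decode will be done on a remote pod"
	default:
		return "both prefill and decode are executed locally"
	}
}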

Contributor Author

I was caught : D
I added some words, please review, thanks!

@pancak3 pancak3 requested a review from mayabar August 31, 2025 14:43
@mayabar
Collaborator

mayabar commented Sep 1, 2025

/lgtm

/approve

@github-actions github-actions bot added the lgtm label Sep 1, 2025
@mayabar mayabar merged commit 08d4613 into llm-d:main Sep 1, 2025
4 checks passed
Successfully merging this pull request may close these issues.

Change time-to-first-token parameter to be based on number of request tokens
