Conversation

@pancak3
Contributor

@pancak3 pancak3 commented Aug 24, 2025

resolve #137

  • new params:
    • prefill-overhead: The base overhead in milliseconds for prefilling a single token. This value, in conjunction with prefill-complexity and prefill-overhead-std-dev, determines the overall time taken to prefill the entire context. It's an optional parameter with a default of 0 and is ignored if time-to-first-token is not 0.
    • prefill-complexity: Defines how the prefill time scales with the number of prompt tokens. This is required if prefill-overhead is used. Options are "n^2" and "nlog(n)", with a default of "n^2".
    • prefill-overhead-std-dev: The standard deviation in milliseconds for the time taken before the first token is returned. This is required if prefill-overhead is used, with a default of 0.
    • kv-cache-transfer-overhead: The base overhead in milliseconds for transferring the KV-cache of a single token from another vLLM instance when P/D is activated. Along with kv-cache-transfer-complexity and kv-cache-transfer-overhead-std-dev, it defines the total time for the KV-cache transfer of the entire context. This parameter is optional with a default of 0 and is ignored if kv-cache-transfer-latency is not 0.
    • kv-cache-transfer-complexity: The complexity of the KV-cache transfer relative to the number of prompt tokens. This is required if kv-cache-transfer-overhead is used. Options are "linear" and "in-place", with a default of "linear".
    • kv-cache-transfer-overhead-std-dev: The standard deviation in milliseconds for the time taken to transfer the KV-cache. This is required if kv-cache-transfer-overhead is used, with a default of 0.
  • time-to-first-token and prefill-overhead coexist
  • time-to-first-token overrides prefill-overhead (forward compatibility)
  • test cases
  • docs and comments
  • ? consider kv-cache-size < number of prompt tokens in remote prefill scenarios
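To make the intended behavior concrete, here is a minimal, free-standing sketch of how these parameters could combine (illustrative only; the names and the exact scaling formula may differ from the PR's implementation):

package main

import (
	"fmt"
	"math"
	"math/rand"
)

// prefillTimeMs sketches the total prefill time for nPromptTokens.
// overheadMs corresponds to prefill-overhead, complexity to prefill-complexity
// ("n^2" or "nlog(n)"), and stdDevMs to prefill-overhead-std-dev.
func prefillTimeMs(nPromptTokens int, overheadMs float64, complexity string, stdDevMs float64) float64 {
	n := float64(nPromptTokens)
	var mean float64
	switch complexity {
	case "nlog(n)":
		mean = overheadMs * n * math.Log2(n+1)
	default: // "n^2"
		mean = overheadMs * n * n
	}
	// Add normally distributed jitter around the computed mean.
	return mean + rand.NormFloat64()*stdDevMs
}

func main() {
	fmt.Printf("prefill ~ %.1f ms\n", prefillTimeMs(512, 0.05, "n^2", 10))
}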

@pancak3 pancak3 marked this pull request as draft August 24, 2025 14:47
@pancak3 pancak3 changed the base branch from main to dev August 25, 2025 05:20
@pancak3 pancak3 changed the base branch from dev to main August 25, 2025 05:21
@pancak3 pancak3 force-pushed the dev/prefill-overhead branch from 87ae0a3 to 9da55d3 on August 25, 2025 11:53
@pancak3 pancak3 marked this pull request as ready for review August 25, 2025 11:58
@mayabar
Collaborator

mayabar commented Aug 25, 2025

@pancak3 thanks for working on this issue.

I have some general questions:

  • Dependency between prefill time and the number of input tokens. In the sources I read, for requests that are not too long the dependency is "near-linear": prefill-time = C + n*t, where C is a fixed overhead, t is the prefill time per token, and n is the input token count. Do you have evidence for a quadratic or n·log(n) dependency?

  • Regarding the kv-cache time per token: currently it is used only when prefill is executed on another vLLM pod, and it mimics the time for transferring the kv-cache to the current pod. We recently added the ability to cache kv-blocks (done to be able to send kv events properly); in the future we want to check, for each request, whether part of its prompt is already in the kv-cache and calculate the prefill time accordingly, but that is a separate stage.

  • In the issue associated with this PR we talked about prefill-time-per-token (PTPT); in addition, we talked about prefill-overhead, the constant time that does not depend on the number of input tokens. If the time dependency is linear, the formula is: total_prefill_time = prefill-overhead + num-of-tokens * PTPT.
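For concreteness (illustrative numbers, not measurements), the linear model works out as, for example:

$\text{total\_prefill\_time} = C + n \cdot t_{\text{PTPT}} = 40\,\text{ms} + 1000 \cdot 0.5\,\text{ms} = 540\,\text{ms} \quad (C = 40\,\text{ms},\ t_{\text{PTPT}} = 0.5\,\text{ms/token},\ n = 1000)$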

@pancak3
Contributor Author

pancak3 commented Aug 28, 2025

@mayabar sorry for the late reply.

  • Dependency between prefill time and the number of input tokens. In the sources I read, for requests that are not too long the dependency is "near-linear": prefill-time = C + n*t, where C is a fixed overhead, t is the prefill time per token, and n is the input token count. Do you have evidence for a quadratic or n·log(n) dependency?
  • In the issue associated with this PR we talked about prefill-time-per-token (PTPT); in addition, we talked about prefill-overhead, the constant time that does not depend on the number of input tokens. If the time dependency is linear, the formula is: total_prefill_time = prefill-overhead + num-of-tokens * PTPT.

I have to admit I hadn't read the sources deeply and implemented the previous code from my general impression of LLMs. I spent some time reading the paper Attention Is All You Need:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)V$

Converting this to computational complexity, my understanding is that it is $O(n^2d/k)$, where $n$ is the length of the input prompt, $d$ can simply stand for the model complexity, and $k$ is the GPU parallelism (number of parallel units). When $n$ is less than $k$, the complexity can be simplified from the quadratic $O(n^2d/k)$ to "near-linear" $O(nd)$.
As we are simulating, at this stage (after reading the roadmap P1 plan) I agree with the simple implementation (prefill-time = C + n*t).
But I am open to working on features for a more flexible configuration ($d=f(d_{tokenization}, d_{embedding}, d_k, d_v)$, taking $k$ into account). Let me know if further discussion is needed.
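To spell out the rough accounting behind that estimate (my own simplification, not the paper's exact numbers): computing the $QK^T$ scores and the attention-weighted sum costs about $O(n^2 d)$ operations, the $Q$/$K$/$V$/output projections add $O(n d^2)$, and dividing the dominant term by an idealized parallelism factor $k$ gives

$T_{\text{prefill}}(n) \approx \frac{n^2 d}{k} \le n d \quad \text{when } n \le k$

which is why the dependency looks "near-linear" for prompts short enough to fit the available parallelism.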

  • Regarding the kv-cache time per token: currently it is used only when prefill is executed on another vLLM pod, and it mimics the time for transferring the kv-cache to the current pod. We recently added the ability to cache kv-blocks (done to be able to send kv events properly); in the future we want to check, for each request, whether part of its prompt is already in the kv-cache and calculate the prefill time accordingly, but that is a separate stage.

Lovely, I will change the kv-cache transfer implementation, removing the "complexity" part. The calculation becomes KV-cache-transfer-time = n * transfer-overhead, and transfer-overhead will be the argument. Let me know if kv-cache-transfer-overhead, transfer-overhead, or any other naming is preferred.

Also, I am not sure whether the network quality between pods is considered in the following phases (I did not find any clue in the public roadmap doc), but I personally think it is good to take networking into account in the simulator behavior.

One more thing to confirm: both prefill-time and KV-cache-transfer-time will be used as means, each with its own std-dev, to draw normally distributed times for "P1 - Timing randomization using statistical parameters". Thank you!
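In code, the intent is roughly the following (a sketch only: the function and parameter names are placeholders, and common.RandomNorm is the normal-distribution helper already used elsewhere in the simulator):

// Sketch: linear KV-cache transfer time with normally distributed jitter.
// The mean is nPromptTokens * transferOverheadMs; stdDevMs controls the spread.
func kvCacheTransferTimeMs(nPromptTokens int, transferOverheadMs, stdDevMs float64) int {
	mean := float64(nPromptTokens) * transferOverheadMs
	return int(common.RandomNorm(mean, stdDevMs))
}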

@pancak3 pancak3 marked this pull request as draft August 29, 2025 09:25
@pancak3 pancak3 force-pushed the dev/prefill-overhead branch from 9b1b5ab to 1e8f33d on August 29, 2025 09:43
@pancak3 pancak3 marked this pull request as ready for review August 29, 2025 14:20
@pancak3
Contributor Author

pancak3 commented Aug 29, 2025

Hi @mayabar, sorry for blocking other issues; please review the latest change, thanks!

@pancak3 pancak3 force-pushed the dev/prefill-overhead branch from 64830dc to 4078dbd on August 29, 2025 14:38
if !doRemotePrefill {
	mean := float64(s.config.TimeToFirstToken)
	stddev := float64(s.config.TimeToFirstTokenStdDev)
	return int(common.RandomNorm(mean, stddev))
Collaborator

It may be better to keep the old code, where mean and stddev are calculated according to doRemotePrefill and randomInt is called from one place.

Contributor Author

This is resolved in the change below.

// calc the prefill overhead against number of tokens
func (s *VllmSimulator) calcPrefillOverhead(nPromptTokens int, doRemotePrefill bool) int {
	if !doRemotePrefill {
		constOverhead := s.config.PrefillOverhead
Collaborator

Additional variables are not required; we can use the config properties directly in the calculation or function call.

Contributor Author

done

return int(common.RandomNorm(float64(prefillTime), float64(stdDev)))
}

if s.config.KVCacheTransferLatency != 0 || s.config.KVCacheTransferLatencyStdDev != 0 {
Collaborator

This check should be outside this function, since the function is called only if time-to-first-token is defined.
With the current logic it's not possible to use a constant prefill time (time-to-first-token not 0) for requests without disaggregated P/D while, in the P/D case, calculating the prefill time based on the number of tokens.

I suggest checking doRemotePrefill first and taking care of each time calculation separately.
Like:

  • if P/D
    • if the time depends on the input length ---> prefill-time = kvcache-transfer * num-of-tokens
    • if the time does not depend on the input length ---> prefill-time = kvcache-transfer-time
  • if not P/D
    • if the time depends on the input length ---> prefill-time = overhead + ptpt * num-of-tokens
    • if the time does not depend on the input length ---> prefill-time = time-to-first-token
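
A minimal sketch of this branching (the PrefillTimePerToken and KVCacheTransferTimePerToken config field names are hypothetical; the other fields appear in the diff above, and the std-dev randomization is left out for brevity):

// prefillTimeMs picks the prefill time according to the suggested branching.
func (s *VllmSimulator) prefillTimeMs(nPromptTokens int, doRemotePrefill bool) int {
	if doRemotePrefill {
		// P/D: prefill was done remotely; only the kv-cache transfer is paid here.
		if s.config.KVCacheTransferTimePerToken != 0 {
			return s.config.KVCacheTransferTimePerToken * nPromptTokens
		}
		return s.config.KVCacheTransferLatency
	}
	// Local prefill.
	if s.config.PrefillTimePerToken != 0 {
		return s.config.PrefillOverhead + s.config.PrefillTimePerToken*nPromptTokens
	}
	return s.config.TimeToFirstToken
}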

Contributor Author

Thanks for this clear suggestion! The logic is now pretty intuitive, please review.

// GetMaxCompletionTokens returns the maximum completion tokens requested
GetMaxCompletionTokens() *int64
// IsDoRemoteDecode() returns true if do_remote_decode field is true in the request, this means that this is prefill request
// IsDoRemoteDecode() returns true if do_remote_decode field is true in the request, this means that this is decode request
Collaborator

do_remote_prefill=true in the request's payload means that for this request the prefill stage was done on a remote pod and the current simulator instance should process only the decode part (the kv-cache is copied from another pod)

do_remote_decode=true in the request's payload means that for this request the decode stage will be done on a remote pod and the current simulator instance should process only the prefill part

if both properties are false - both prefill and decode are executed locally

This is a little bit confusing, but the comment follows the NIXL documentation ;) Maybe it is worth adding a longer explanation.
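
For illustration (a hypothetical helper, not code from the PR), the mapping could be summarized as:

// localRole summarizes what the current simulator instance does for a request,
// following the NIXL-style flag semantics described above.
func localRole(doRemotePrefill, doRemoteDecode bool) string {
	switch {
	case doRemotePrefill:
		return "decode only: prefill was done on a remote pod, its kv-cache is copied here"
	case doRemoteDecode:
		return "prefill only: decode will be done on a remote pod"
	default:
		return "both prefill and decode are executed locally"
	}
}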

Contributor Author

I was caught : D
I added some words, please review, thanks!

@pancak3 pancak3 requested a review from mayabar August 31, 2025 14:43
@mayabar
Collaborator

mayabar commented Sep 1, 2025

/lgtm

/approve

@github-actions github-actions bot added the lgtm label Sep 1, 2025
@mayabar mayabar merged commit 08d4613 into llm-d:main Sep 1, 2025
4 checks passed
Successfully merging this pull request may close these issues.

Change time-to-first-token parameter to be based on number of request tokens
