Change time-to-first-token parameter to be based on number of request tokens #137 #165
Conversation
force-pushed from 87ae0a3 to 9da55d3 (Compare)
  
@pancak3 thanks for working on this issue. I have some general questions: …

@mayabar sorry for getting back to you late.
I have to admit I haven't read the source deeply, but implemented the previous code from my impression of LLMs. Having spent some time reading the paper Attention Is All You Need and converting it to computational complexity, my understanding is that it has …

Lovely, I will change the KV-cache transfer implementation, removing the "complexity" part. The calculation is … Also, I am not sure whether the network quality between pods is considered in the following phases (I did not find any clue in the public roadmap doc), but I personally think it is good to consider networking in the sim's behavior. One more thing to confirm: for both …
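For context, the standard self-attention cost analysis behind the "n^2" option (a background sketch, not a quote from the thread): during prefill, every one of the $n$ prompt tokens attends to every other prompt token, so with hidden size $d$ the attention term per layer scales as

```latex
T_{\text{prefill}}(n) \;\propto\; n^{2} \cdot d
```

The "nlog(n)" option would then approximate sub-quadratic attention variants rather than the vanilla transformer.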
Signed-off-by: Qifan Deng <[email protected]>
…overhead Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
…r-overhead Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
…used Signed-off-by: Qifan Deng <[email protected]>
force-pushed from 9b1b5ab to 1e8f33d (Compare)
  
    Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
Hi @mayabar, sorry to block other issues; please review the latest change, thanks!
Signed-off-by: Qifan Deng <[email protected]>
This reverts commit 18d3075. Signed-off-by: Qifan Deng <[email protected]>
force-pushed from 64830dc to 4078dbd (Compare)
  
            
          
pkg/llm-d-inference-sim/simulator.go (Outdated)
          
        
```go
if !doRemotePrefill {
	mean := float64(s.config.TimeToFirstToken)
	stddev := float64(s.config.TimeToFirstTokenStdDev)
	return int(common.RandomNorm(mean, stddev))
}
```
It may be better to keep the old code, where mean and stddev are calculated according to doRemotePrefill and randomInt is called from one place.
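For illustration, a minimal runnable sketch of that suggestion; `randomNorm` stands in for the repo's `common.RandomNorm`, and the parameter names are hypothetical:

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomNorm mimics the assumed behavior of common.RandomNorm:
// a normal draw with the given mean and standard deviation.
func randomNorm(mean, stddev float64) float64 {
	return rand.NormFloat64()*stddev + mean
}

// timeToFirstToken picks mean/stddev according to doRemotePrefill,
// so the random draw happens in exactly one place.
func timeToFirstToken(ttft, ttftStdDev, kvLatency, kvStdDev float64, doRemotePrefill bool) int {
	mean, stddev := ttft, ttftStdDev
	if doRemotePrefill {
		mean, stddev = kvLatency, kvStdDev
	}
	return int(randomNorm(mean, stddev))
}

func main() {
	fmt.Println(timeToFirstToken(2000, 200, 100, 10, false)) // local prefill path
}
```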
this is resolved in the below change
        
          
pkg/llm-d-inference-sim/simulator.go (Outdated)
          
        
```go
// calc the prefill overhead against number of tokens
func (s *VllmSimulator) calcPrefillOverhead(nPromptTokens int, doRemotePrefill bool) int {
	if !doRemotePrefill {
		constOverhead := s.config.PrefillOverhead
```
Additional variables are not required; we can use the config properties directly in the calculation or function call.
done
        
          
pkg/llm-d-inference-sim/simulator.go (Outdated)
          
        
```go
	return int(common.RandomNorm(float64(prefillTime), float64(stdDev)))
}

if s.config.KVCacheTransferLatency != 0 || s.config.KVCacheTransferLatencyStdDev != 0 {
```
This check should be outside this function, since the function is called only if time-to-first-token is defined.
With the current logic it is not possible to use a constant prefill time for requests without disaggregated P/D (time-to-first-token is not 0) while, in the P/D case, calculating the prefill time based on the number of tokens.
I suggest checking doRemotePrefill first and taking care of each time calculation separately, like this (see the sketch after this list):

- if P/D:
  - if the time depends on input length → prefill-time = kvcache-transfer * num-of-tokens
  - if the time does not depend on input length → prefill-time = kv-cache-transfer-time
- if not P/D:
  - if the time depends on input length → prefill-time = overhead + ptpt * num-of-tokens
  - if the time does not depend on input length → prefill-time = time-to-first-token
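A runnable sketch of that branching; the per-token fields stand in for "ptpt" and "kvcache-transfer" above and are assumed names, not the merged code:

```go
package main

import "fmt"

// Config mirrors the simulator settings discussed in this PR; the
// per-token fields are hypothetical stand-ins for "ptpt" and
// "kvcache-transfer" in the suggestion above.
type Config struct {
	TimeToFirstToken       int // ms, constant local prefill time
	PrefillOverhead        int // ms, base overhead when scaling with tokens
	TimePerPromptToken     int // ms per prompt token ("ptpt")
	KVCacheTransferLatency int // ms, constant KV-cache transfer time
	KVCacheTransferPerTok  int // ms per token when the transfer scales
}

// prefillTimeMs implements the two-level branching: first on P/D
// (doRemotePrefill), then on whether the time depends on input length.
func prefillTimeMs(cfg Config, nPromptTokens int, doRemotePrefill bool) int {
	if doRemotePrefill { // P/D: prefill was done remotely
		if cfg.KVCacheTransferPerTok > 0 {
			return cfg.KVCacheTransferPerTok * nPromptTokens
		}
		return cfg.KVCacheTransferLatency
	}
	// not P/D: local prefill
	if cfg.TimePerPromptToken > 0 {
		return cfg.PrefillOverhead + cfg.TimePerPromptToken*nPromptTokens
	}
	return cfg.TimeToFirstToken
}

func main() {
	cfg := Config{PrefillOverhead: 20, TimePerPromptToken: 2}
	fmt.Println(prefillTimeMs(cfg, 512, false)) // 20 + 2*512 = 1044
}
```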
 
thanks for this clear suggestion! the logic is now pretty intuitive, please review
        
          
pkg/openai-server-api/request.go (Outdated)
          
        
```diff
 // GetMaxCompletionTokens returns the maximum completion tokens requested
 GetMaxCompletionTokens() *int64
-// IsDoRemoteDecode() returns true if do_remote_decode field is true in the request, this means that this is prefill request
+// IsDoRemoteDecode() returns true if do_remote_decode field is true in the request, this means that this is decode request
```
`do_remote_prefill=true` in the request's payload means that for this request the prefill stage was done on a remote pod, and the current simulator instance should process only the decode part (the KV-cache is copied from another pod).
`do_remote_decode=true` in the request's payload means that for this request the decode stage will be done on a remote pod, and the current simulator instance should process only the prefill part.
If both properties are false, both prefill and decode are executed locally.
This is a little bit confusing, but the comment follows the NIXL documentation ;) Maybe it is worth adding a longer explanation.
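A minimal sketch of these three cases (a hypothetical helper, just to pin down the semantics, not the repo's API):

```go
package main

import "fmt"

// localStages reports which stages this simulator instance executes,
// per the payload semantics explained above.
func localStages(doRemotePrefill, doRemoteDecode bool) string {
	switch {
	case doRemotePrefill: // prefill ran on a remote pod; KV-cache is copied in
		return "decode only"
	case doRemoteDecode: // decode will run on a remote pod
		return "prefill only"
	default:
		return "prefill and decode, both local"
	}
}

func main() {
	fmt.Println(localStages(true, false))  // decode only
	fmt.Println(localStages(false, true))  // prefill only
	fmt.Println(localStages(false, false)) // prefill and decode, both local
}
```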
I was caught :D
I added some words, please review, thanks!
Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
/lgtm /approve
resolve #137
New params:
- `prefill-overhead`: The base overhead in milliseconds for prefilling a single token. This value, in conjunction with `prefill-complexity` and `prefill-overhead-std-dev`, determines the overall time taken to prefill the entire context. It's an optional parameter with a default of `0` and is ignored if `time-to-first-token` is not `0`.
- `prefill-complexity`: Defines how the prefill time scales with the number of prompt tokens. This is required if `prefill-overhead` is used. Options are `"n^2"` and `"nlog(n)"`, with a default of `"n^2"`.
- `prefill-overhead-std-dev`: The standard deviation in milliseconds for the time taken before the first token is returned. This is required if `prefill-overhead` is used, with a default of `0`.
- `kv-cache-transfer-overhead`: The base overhead in milliseconds for transferring the KV-cache of a single token from another vLLM instance when P/D is activated. Along with `kv-cache-transfer-complexity` and `kv-cache-transfer-overhead-std-dev`, it defines the total time for the KV-cache transfer of the entire context. This parameter is optional with a default of `0` and is ignored if `kv-cache-transfer-latency` is not `0`.
- `kv-cache-transfer-complexity`: The complexity of the KV-cache transfer relative to the number of prompt tokens. This is required if `kv-cache-transfer-overhead` is used. Options are `"linear"` and `"in-place"`, with a default of `"linear"`.
- `kv-cache-transfer-overhead-std-dev`: The standard deviation in milliseconds for the time taken to transfer the KV-cache. This is required if `kv-cache-transfer-overhead` is used, with a default of `0`.

- [x] `time-to-first-token` and `prefill-overhead` coexist: `time-to-first-token` overrides `prefill-overhead` (forward compatibility)
- test cases
- docs and comments
- ? consider `kv-cache-size` < number of prompt tokens in remote prefill scenarios
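For illustration, a minimal sketch of how these parameters could combine into a prefill time under the semantics described above (the function and scaling rule are assumptions, not the simulator's actual API, and the complexity options were later dropped during review):

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// prefillMs sketches the described semantics: a per-token base overhead
// scaled by the chosen complexity, plus normally distributed jitter.
// All names here are illustrative.
func prefillMs(overheadMs float64, nPromptTokens int, complexity string, stdDevMs float64) int {
	n := float64(nPromptTokens)
	scale := n * n // default "n^2"
	if complexity == "nlog(n)" {
		scale = n * math.Log2(n)
	}
	return int(rand.NormFloat64()*stdDevMs + overheadMs*scale)
}

func main() {
	// e.g. prefill-overhead=1, 128 prompt tokens, "n^2" scaling, no std-dev
	fmt.Println(prefillMs(1, 128, "n^2", 0)) // 16384
}
```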