Current state:
Request processing time is composed of the prefill and decode parts.
If prefill is executed on the same instance, its time can be defined by one of the following options:
- `time-to-first-token` (TTFT) parameter: represents the prefill time, independent of the prompt size
- `prefill-overhead` + `prefill-time-per-token` (PTPT) * `number-of-prompt-tokens`: prefill time that takes the prompt length into account

In case of disaggregated prefill (prefill executed on a different pod), the time is calculated as `kv-cache-transfer-latency` * `number-of-prompt-tokens`.
Decode time is calculated as:
`inter-token-latency` * `number-of-output-tokens`
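To make the current timing model concrete, here is a minimal Go sketch of the calculation described above, assuming the parameters map onto duration-typed config fields; the type, field, and function names are illustrative, not the simulator's actual identifiers.

```go
package main

import (
	"fmt"
	"time"
)

// Config mirrors the timing parameters described above; field names are
// illustrative, not the simulator's actual configuration keys.
type Config struct {
	TimeToFirstToken       time.Duration // fixed prefill time, independent of prompt size
	PrefillOverhead        time.Duration // base prefill cost
	PrefillTimePerToken    time.Duration // per-prompt-token prefill cost (PTPT)
	KVCacheTransferLatency time.Duration // per-prompt-token cost when prefill is disaggregated
	InterTokenLatency      time.Duration // per-output-token decode cost
}

// prefillTime returns the simulated prefill delay for a single request.
func prefillTime(c Config, promptTokens int, disaggregated bool) time.Duration {
	if disaggregated {
		// Prefill ran on a different pod; only the KV-cache transfer is simulated here.
		return c.KVCacheTransferLatency * time.Duration(promptTokens)
	}
	if c.TimeToFirstToken > 0 {
		// Option 1: fixed TTFT, independent of the prompt length.
		return c.TimeToFirstToken
	}
	// Option 2: prompt-length-aware prefill model.
	return c.PrefillOverhead + c.PrefillTimePerToken*time.Duration(promptTokens)
}

// decodeTime returns the simulated decode delay for a single request.
func decodeTime(c Config, outputTokens int) time.Duration {
	return c.InterTokenLatency * time.Duration(outputTokens)
}

func main() {
	c := Config{
		PrefillOverhead:     5 * time.Millisecond,
		PrefillTimePerToken: 200 * time.Microsecond,
		InterTokenLatency:   20 * time.Millisecond,
	}
	fmt.Println(prefillTime(c, 512, false)) // 5ms + 512*0.2ms = 107.4ms
	fmt.Println(decodeTime(c, 128))         // 128 * 20ms = 2.56s
}
```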
What would you like to be added:
Add support for adjusting all of the above time delays based on the current load (i.e., the number of requests currently running in parallel).
Introduce a new configuration parameter, `load-factor`, which defines the multiplier applied to all time delays when the load is at its maximum.
All time configuration values should be calculated using linear interpolation: the number of running requests in the range [1, `max-num-seq`] maps linearly onto the time range [time-defined-in-config, time-defined-in-config * `load-factor`].
The formula:
`time = time_defined_in_config * [1 + (load_factor - 1) * (current_requests_num - 1) / (max_num_seq - 1)]`
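A minimal sketch of the proposed scaling, assuming the new `load-factor` parameter and the `max-num-seq` limit are available to the delay calculation; the helper below is hypothetical and only illustrates the interpolation formula.

```go
package main

import (
	"fmt"
	"time"
)

// scaledDelay applies the proposed load-based interpolation to any configured
// delay. The parameter names mirror the proposed load-factor and existing
// max-num-seq settings; the function itself is an illustrative sketch.
func scaledDelay(configured time.Duration, currentRequests, maxNumSeq int, loadFactor float64) time.Duration {
	if maxNumSeq <= 1 || currentRequests <= 1 {
		return configured // a single running request uses the configured value as-is
	}
	if currentRequests > maxNumSeq {
		currentRequests = maxNumSeq // clamp to the configured maximum
	}
	// Linear interpolation: 1x at one running request, load-factor x at max-num-seq requests.
	scale := 1 + (loadFactor-1)*float64(currentRequests-1)/float64(maxNumSeq-1)
	return time.Duration(float64(configured) * scale)
}

func main() {
	itl := 20 * time.Millisecond
	fmt.Println(scaledDelay(itl, 1, 8, 3.0)) // 20ms: minimum load, no scaling
	fmt.Println(scaledDelay(itl, 8, 8, 3.0)) // 60ms: maximum load, scaled by load-factor
}
```

The same scaling would apply uniformly to every configured delay (TTFT, prefill-time-per-token, inter-token-latency, kv-cache-transfer-latency), so a single helper of this kind would keep the behavior consistent across all of them.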
Why is this needed:
Real vLLM performance depends on the current load; the simulator should mimic this behavior to produce realistic results.