
Adjust request "processing time" to current load #157

@mayabar

Description


Current state:
Request processing time is composed of a prefill part and a decode part.
If prefill is executed on the same instance, its time can be defined by one of the following options (a sketch combining both parts follows the decode formula below):

  • time-to-first-token (TTFT) parameter: represents the prefill time independently of the prompt size
  • prefill-overhead + prefill-time-per-token (PTPT) * number-of-prompt-tokens: a prefill time that takes the prompt length into account

In the case of disaggregated prefill (prefill executed on a different pod), the time is calculated as kv-cache-transfer-latency * number-of-prompt-tokens.

Decode time is calculated as:
inter-token-latency * number-of-output-tokens
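
As a rough illustration, the following Go sketch puts these definitions together. All names here (the Config struct, the mode flags, the helper functions) are hypothetical stand-ins that mirror the configuration parameters named above; they are not the simulator's actual API.

```go
package main

import "fmt"

// Config mirrors the timing parameters described in this issue
// (names are illustrative, not the real configuration keys).
type Config struct {
	TimeToFirstToken       float64 // TTFT in ms, independent of prompt size
	PrefillOverhead        float64 // ms
	PrefillTimePerToken    float64 // PTPT, ms per prompt token
	KVCacheTransferLatency float64 // ms per prompt token (disaggregated prefill)
	InterTokenLatency      float64 // ms per output token
}

// prefillTime returns the prefill delay for one request.
func prefillTime(c Config, promptTokens int, disaggregated, usePTPT bool) float64 {
	if disaggregated {
		// Prefill ran on a different pod; only the KV-cache transfer is simulated.
		return c.KVCacheTransferLatency * float64(promptTokens)
	}
	if usePTPT {
		// Prompt-length-aware option.
		return c.PrefillOverhead + c.PrefillTimePerToken*float64(promptTokens)
	}
	// Fixed TTFT, independent of prompt size.
	return c.TimeToFirstToken
}

// decodeTime returns the decode delay for one request.
func decodeTime(c Config, outputTokens int) float64 {
	return c.InterTokenLatency * float64(outputTokens)
}

func main() {
	c := Config{PrefillOverhead: 20, PrefillTimePerToken: 0.5, InterTokenLatency: 10}
	total := prefillTime(c, 512, false, true) + decodeTime(c, 128)
	fmt.Printf("total processing time: %.1f ms\n", total)
}
```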

What would you like to be added:
Add support for adjusting all of the above delays based on the current load (i.e., the number of requests currently running in parallel).
Introduce a new configuration parameter, load-factor, which defines the multiplier applied to all time delays when the load is at its maximum.
All time configuration values should be calculated using linear interpolation between the number of running requests [1, max-num-seq] and the time range [time-defined-in-config, time-defined-in-config * load-factor].

The formula:
time = time_defined_in_config * (1 + (load_factor - 1) * (current_requests_num - 1) / (max_num_seq - 1))
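
A minimal sketch of this interpolation, again with hypothetical names (loadAdjusted is not an existing function; maxNumSeq mirrors max-num-seq):

```go
// loadAdjusted applies the formula above: it scales a configured delay
// linearly from 1x (a single running request) up to load-factor x
// (max-num-seq running requests).
func loadAdjusted(configured float64, currentRequests, maxNumSeq int, loadFactor float64) float64 {
	if maxNumSeq <= 1 {
		return configured // degenerate range: nothing to interpolate over
	}
	frac := float64(currentRequests-1) / float64(maxNumSeq-1)
	return configured * (1 + (loadFactor-1)*frac)
}
```

For example, with load-factor = 2 and max-num-seq = 10, a delay configured as 100 ms stays 100 ms with one running request, becomes 100 * (1 + 1 * 4/9) ≈ 144 ms with five, and doubles to 200 ms at the full ten.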

Why is this needed:
Real vLLM performance depends on the current load; the simulator should mimic this behavior to produce realistic results.
