Current state:
Request processing time is composed of the prefill and decode parts.
If prefill is executed on the same instance, its time can be defined by one of the following options:
- `time-to-first-token` (TTFT) parameter: represents the prefill time, independent of the prompt size
- `prefill-overhead` + `prefill-time-per-token` (PTPT) * `number-of-prompt-tokens`: prefill time that takes the prompt length into account

In case of disaggregated prefill (prefill executed on a different pod), the time is calculated as `kv-cache-transfer-latency` * `number-of-prompt-tokens`.
Decode time is calculated as:
`inter-token-latency` * `number-of-output-tokens`
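To make the current timing model concrete, here is a minimal Go sketch of the calculation described above, assuming the parameters map onto duration-typed config fields; the type, field, and function names are illustrative, not the simulator's actual identifiers.

```go
package main

import (
	"fmt"
	"time"
)

// Config mirrors the timing parameters described above; field names are
// illustrative, not the simulator's actual configuration keys.
type Config struct {
	TimeToFirstToken       time.Duration // fixed prefill time, independent of prompt size
	PrefillOverhead        time.Duration // base prefill cost
	PrefillTimePerToken    time.Duration // per-prompt-token prefill cost (PTPT)
	KVCacheTransferLatency time.Duration // per-prompt-token cost when prefill is disaggregated
	InterTokenLatency      time.Duration // per-output-token decode cost
}

// prefillTime returns the simulated prefill delay for a single request.
func prefillTime(c Config, promptTokens int, disaggregated bool) time.Duration {
	if disaggregated {
		// Prefill ran on a different pod; only the KV-cache transfer is simulated here.
		return c.KVCacheTransferLatency * time.Duration(promptTokens)
	}
	if c.TimeToFirstToken > 0 {
		// Option 1: fixed TTFT, independent of the prompt length.
		return c.TimeToFirstToken
	}
	// Option 2: prompt-length-aware prefill model.
	return c.PrefillOverhead + c.PrefillTimePerToken*time.Duration(promptTokens)
}

// decodeTime returns the simulated decode delay for a single request.
func decodeTime(c Config, outputTokens int) time.Duration {
	return c.InterTokenLatency * time.Duration(outputTokens)
}

func main() {
	c := Config{
		PrefillOverhead:     5 * time.Millisecond,
		PrefillTimePerToken: 200 * time.Microsecond,
		InterTokenLatency:   20 * time.Millisecond,
	}
	fmt.Println(prefillTime(c, 512, false)) // 5ms + 512*0.2ms = 107.4ms
	fmt.Println(decodeTime(c, 128))         // 128 * 20ms = 2.56s
}
```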
What would you like to be added:
Add support for adjusting all of the above time delays based on the current load (i.e., the number of requests currently running in parallel).
Introduce a new configuration parameter, `load-factor`, which defines the multiplier applied to all time delays when the load is at its maximum.
All time configuration values should be calculated using linear interpolation: the number of running requests in the range [1, `max-num-seq`] maps linearly onto the time range [time-defined-in-config, time-defined-in-config * `load-factor`].
The formula:
`time = time_defined_in_config * [1 + (load_factor - 1) * (current_requests_num - 1) / (max_num_seq - 1)]`
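A minimal sketch of the proposed scaling, assuming the new `load-factor` parameter and the `max-num-seq` limit are available to the delay calculation; the helper below is hypothetical and only illustrates the interpolation formula.

```go
package main

import (
	"fmt"
	"time"
)

// scaledDelay applies the proposed load-based interpolation to any configured
// delay. The parameter names mirror the proposed load-factor and existing
// max-num-seq settings; the function itself is an illustrative sketch.
func scaledDelay(configured time.Duration, currentRequests, maxNumSeq int, loadFactor float64) time.Duration {
	if maxNumSeq <= 1 || currentRequests <= 1 {
		return configured // a single running request uses the configured value as-is
	}
	if currentRequests > maxNumSeq {
		currentRequests = maxNumSeq // clamp to the configured maximum
	}
	// Linear interpolation: 1x at one running request, load-factor x at max-num-seq requests.
	scale := 1 + (loadFactor-1)*float64(currentRequests-1)/float64(maxNumSeq-1)
	return time.Duration(float64(configured) * scale)
}

func main() {
	itl := 20 * time.Millisecond
	fmt.Println(scaledDelay(itl, 1, 8, 3.0)) // 20ms: minimum load, no scaling
	fmt.Println(scaledDelay(itl, 8, 8, 3.0)) // 60ms: maximum load, scaled by load-factor
}
```

The same scaling would apply uniformly to every configured delay (TTFT, prefill-time-per-token, inter-token-latency, kv-cache-transfer-latency), so a single helper of this kind would keep the behavior consistent across all of them.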
Why is this needed:
Real vLLM performance depends on the current load; the simulator should mimic this behavior to produce realistic results.