Improve time delays calculation based on load #159

@mayabar

Description

Start work on this after implementing issue #157

What would you like to be added:
Request processing time is currently calculated once, when the request arrives (for non-streaming requests), and is then used without adjustment while the request is being "processed".
For example, if a request arrives when no other requests are in progress, its processing time is calculated as ttft + inter-token-latency * num-of-tokens. If more requests arrive during that "processing time", the inter-token latency should increase.
Convert all sleep commands to an "active wait": wait for each token independently, and for each token compute the timeout based on the current load.
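
A minimal sketch of the per-token wait described above, written in Python for illustration only. The names `inter_token_delay`, `emit_tokens`, `get_active_requests`, and `slowdown_per_request` are hypothetical, and the linear load model (each extra concurrent request adds a fixed fraction to the base latency) is an assumption, not the simulator's actual formula. It also assumes the first token waits for ttft and each later token waits for one inter-token latency.

```python
import time
from typing import Callable, List


def inter_token_delay(base_itl: float, active_requests: int,
                      slowdown_per_request: float = 0.1) -> float:
    """Return the delay for one token under the current load.

    Assumed (hypothetical) model: every concurrent request beyond the
    first adds `slowdown_per_request` of the base latency.
    """
    extra = max(active_requests - 1, 0)
    return base_itl * (1.0 + slowdown_per_request * extra)


def emit_tokens(num_tokens: int, ttft: float, base_itl: float,
                get_active_requests: Callable[[], int],
                sleep: Callable[[float], None] = time.sleep) -> List[float]:
    """Wait for each token separately instead of sleeping once up front.

    The load is re-read before every token, so requests that arrive
    mid-generation slow down the remaining tokens.
    Returns the list of delays that were applied.
    """
    delays = [ttft]
    sleep(ttft)  # time to first token
    for _ in range(num_tokens - 1):
        # Recompute the delay from the load at this moment.
        delay = inter_token_delay(base_itl, get_active_requests())
        sleep(delay)
        delays.append(delay)
    return delays
```

Passing `sleep` and `get_active_requests` as parameters keeps the sketch testable without real waiting; in a real server the load getter would read a shared counter of in-flight requests.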

Why is this needed:
To mimic vLLM behavior more realistically.
