Closed
Description
Start work on this after implementing issue #157
What would you like to be added:
Currently, request processing time is calculated once, when the request arrives (for non-streaming requests), and is then used without adjustment while the request is being processed.
For example, if a request arrives when no other requests are in process, its processing time is calculated as ttft + inter-token-latency * num-of-tokens. If more requests arrive during that processing time, the inter-token latency should increase.
Convert all sleep commands to an "active wait": wait for each token independently, and for each token use a timeout based on the current load.
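
A minimal sketch of the per-token wait described above, written in Python for illustration only. All names here (`LoadTracker`, `inter_token_latency`, `process_request`, `load_factor`) are hypothetical, and the linear load model is an assumption; the actual relationship between load and inter-token latency is still to be decided.

```python
import threading
import time
from typing import Callable


class LoadTracker:
    """Thread-safe count of requests currently being processed (hypothetical helper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._running = 0

    def start(self) -> None:
        with self._lock:
            self._running += 1

    def finish(self) -> None:
        with self._lock:
            self._running -= 1

    def running(self) -> int:
        with self._lock:
            return self._running


def inter_token_latency(base_itl: float, running: int, load_factor: float) -> float:
    """Assumed load model: latency grows linearly with each extra concurrent request."""
    extra = max(running - 1, 0)
    return base_itl * (1.0 + load_factor * extra)


def process_request(
    tracker: LoadTracker,
    ttft: float,
    base_itl: float,
    num_tokens: int,
    load_factor: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Simulate one request and return the total simulated processing time in seconds.

    Instead of one sleep of ttft + base_itl * num_tokens computed on arrival,
    the delay for each token is recomputed from the load at that moment.
    """
    tracker.start()
    total = 0.0
    try:
        sleep(ttft)  # time to first token
        total += ttft
        for _ in range(num_tokens):
            # Re-read the load before every token, so requests that arrived
            # in the meantime slow down the remaining tokens.
            delay = inter_token_latency(base_itl, tracker.running(), load_factor)
            sleep(delay)
            total += delay
    finally:
        tracker.finish()
    return total
```

With a single request in flight this reduces to the current formula, ttft + inter-token-latency * num-of-tokens. The `sleep` parameter is injectable so the timing logic can be tested without real delays.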
 
Why is this needed:
To mimic vLLM behavior more realistically.