Adjust request "processing time" to current load (#189)
* Validate max-num-seqs
Signed-off-by: Qifan Deng <[email protected]>
* Validate PrefillTimeStdDev
Signed-off-by: Qifan Deng <[email protected]>
* Add param time-factor-under-load
Signed-off-by: Qifan Deng <[email protected]>
* The factor applies on time-to-first-token
Signed-off-by: Qifan Deng <[email protected]>
* Test TTFT when partially loaded
Signed-off-by: Qifan Deng <[email protected]>
* Apply time factor under load to prefill and inter token latency
Signed-off-by: Qifan Deng <[email protected]>
* Improve param desc
Signed-off-by: Qifan Deng <[email protected]>
* Use nRunningReqs instead of runReqChan
Signed-off-by: Qifan Deng <[email protected]>
* unstage manifests/dev-config.yaml
Signed-off-by: Qifan Deng <[email protected]>
* Update readme
Signed-off-by: Qifan Deng <[email protected]>
* Restore changes for inter token latency (lost due to conflicts resolve)
Signed-off-by: Qifan Deng <[email protected]>
* Calc inter token latency based on load instead of one-calc-for-whole request
Signed-off-by: Qifan Deng <[email protected]>
* Move methods to simulator
Signed-off-by: Qifan Deng <[email protected]>
* Rename helper func
Signed-off-by: Qifan Deng <[email protected]>
* Fix inter token latency test
Signed-off-by: Qifan Deng <[email protected]>
---------
Signed-off-by: Qifan Deng <[email protected]>
README.md (1 addition, 0 deletions)
@@ -115,6 +115,7 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
 - `kv-cache-transfer-time-per-token`: time taken to transfer cache for each token in case P/D is enabled (in milliseconds), optional, by default zero, this will be ignored if `kv-cache-transfer-latency` is not `0`
 - `kv-cache-transfer-time-std-dev`: similar to `time-to-first-token-std-dev`, but is applied on the final kv cache transfer time in case P/D is enabled (in milliseconds), which is calculated by `kv-cache-transfer-time-per-token` and number of prompt tokens, this will be ignored if `kv-cache-transfer-latency` is not `0`
 ---
+- `time-factor-under-load`: a multiplicative factor applied to the overall time taken for requests when parallel requests are being processed. The value must be >= 1.0; the default is 1.0, which adds no extra time. When the factor is x (where x > 1.0) and `max-num-seqs` requests are running, the total time is multiplied by x. When fewer than `max-num-seqs` requests are running, the multiplier decreases toward 1.0.
 - `seed`: random seed for operations (if not set, current Unix time in nanoseconds is used)
 ---
 - `max-tool-call-integer-param`: the maximum possible value of integer parameters in a tool call, optional, defaults to 100
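The scaling described for `time-factor-under-load` can be sketched as follows. This is a minimal illustration, not the code from this PR: `loadFactor` is a hypothetical helper, and the interpolation endpoints (a single running request maps to 1.0, `max-num-seqs` running requests map to the configured factor, linear in between) are assumptions based on the parameter description and the test names in this change.

```go
package main

import "fmt"

// loadFactor returns the multiplier applied to a latency value for the
// current number of running requests.
// Hypothetical helper for illustration; the name and signature are not
// taken from the simulator.
// Assumption: 1 running request -> 1.0, maxNumSeqs running requests ->
// timeFactor, with linear interpolation in between.
func loadFactor(timeFactor float64, running, maxNumSeqs int) float64 {
	// With at most one slot or one request there is no parallel load.
	if maxNumSeqs <= 1 || running <= 1 {
		return 1.0
	}
	// Clamp so the multiplier never exceeds the configured factor.
	if running > maxNumSeqs {
		running = maxNumSeqs
	}
	return 1.0 + (timeFactor-1.0)*float64(running-1)/float64(maxNumSeqs-1)
}

func main() {
	const ttftMs = 100.0 // illustrative time-to-first-token value
	for _, running := range []int{1, 3, 5} {
		f := loadFactor(2.0, running, 5)
		fmt.Printf("running=%d factor=%.2f ttft=%.0fms\n", running, f, ttftMs*f)
	}
}
```

Under these assumptions, a factor of 2.0 with `max-num-seqs` set to 5 gives a multiplier of 1.0 for one running request, 1.5 for three, and 2.0 for five.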
 f.IntVar(&config.TimeToFirstTokenStdDev, "time-to-first-token-std-dev", config.TimeToFirstTokenStdDev, "Standard deviation for time before the first token will be returned (in milliseconds)")
 f.IntVar(&config.KVCacheTransferLatencyStdDev, "kv-cache-transfer-latency-std-dev", config.KVCacheTransferLatencyStdDev, "Standard deviation for time for KV-cache transfer from a remote vLLM (in milliseconds)")
 f.Int64Var(&config.Seed, "seed", config.Seed, "Random seed for operations (if not set, current Unix time in nanoseconds is used)")
+f.Float64Var(&config.TimeFactorUnderLoad, "time-factor-under-load", config.TimeFactorUnderLoad, "Time factor under load (must be >= 1.0)")

 f.IntVar(&config.MaxToolCallIntegerParam, "max-tool-call-integer-param", config.MaxToolCallIntegerParam, "Maximum possible value of integer parameters in a tool call")
 f.IntVar(&config.MinToolCallIntegerParam, "min-tool-call-integer-param", config.MinToolCallIntegerParam, "Minimum possible value of integer parameters in a tool call")
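The flag description states that `time-factor-under-load` must be >= 1.0, and the commit message mentions validating `max-num-seqs`. A rough sketch of such checks is shown below. The `simConfig` type, the `validate` function, the error messages, and the lower bound of 1 for `max-num-seqs` are assumptions made for illustration and are not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// simConfig is a hypothetical subset of the simulator configuration,
// holding only the two fields this sketch validates.
type simConfig struct {
	MaxNumSeqs          int
	TimeFactorUnderLoad float64
}

// validate rejects configurations that would make load scaling ill-defined.
// Assumption: max-num-seqs must be at least 1 (the exact bound used by the
// PR is not visible in this diff).
func validate(c simConfig) error {
	if c.MaxNumSeqs < 1 {
		return errors.New("max-num-seqs must be at least 1")
	}
	// Stated in the flag description: the factor must be >= 1.0.
	if c.TimeFactorUnderLoad < 1.0 {
		return errors.New("time-factor-under-load must be >= 1.0")
	}
	return nil
}

func main() {
	// A factor below 1.0 is reported as an error.
	fmt.Println(validate(simConfig{MaxNumSeqs: 5, TimeFactorUnderLoad: 0.5}))
}
```

Rejecting values below 1.0 up front means the multiplier can only lengthen processing time, never shorten it.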
// createModelsResponse creates and returns ModelResponse for the current state, returned array of models contains the base model + LoRA adapters if exist
DescribeTable("when time-factor-under-load is > 1, and the sim is fully loaded, the time to first token should be time-factor-under-load * time-to-first-token",
DescribeTable("when time-factor-under-load is > 1, and the sim is partially loaded, the time to first token should be linear interpolation between time-to-first-token and time-factor-under-load * time-to-first-token",
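The commit message also notes that inter-token latency is calculated from the load at each token rather than once for the whole request. The sketch below illustrates that idea under stated assumptions: `scaledLatency` and `totalDecodeTime` are hypothetical names, the linear interpolation is the same assumption as in the test descriptions above, and truncating to whole milliseconds is a simplification of this sketch.

```go
package main

import "fmt"

// scaledLatency scales a base latency (in milliseconds) by the load
// multiplier for the given number of running requests.
// Hypothetical helper; assumes linear interpolation between 1.0 (one
// running request) and timeFactor (maxNumSeqs running requests).
func scaledLatency(baseMs int, timeFactor float64, running, maxNumSeqs int) int {
	factor := 1.0
	if maxNumSeqs > 1 && running > 1 {
		if running > maxNumSeqs {
			running = maxNumSeqs
		}
		factor = 1.0 + (timeFactor-1.0)*float64(running-1)/float64(maxNumSeqs-1)
	}
	// Truncate to whole milliseconds (a simplification of this sketch).
	return int(float64(baseMs) * factor)
}

// totalDecodeTime sums the delay of each generated token, re-evaluating
// the load at every token instead of fixing it once per request.
// loadPerToken[i] is the number of running requests when token i is emitted.
func totalDecodeTime(baseMs int, timeFactor float64, loadPerToken []int, maxNumSeqs int) int {
	total := 0
	for _, running := range loadPerToken {
		total += scaledLatency(baseMs, timeFactor, running, maxNumSeqs)
	}
	return total
}

func main() {
	// Load drops from 5 to 1 running requests while four tokens are generated.
	fmt.Println(totalDecodeTime(10, 2.0, []int{5, 5, 3, 1}, 5))
}
```

Sampling the load per token lets a request speed up as concurrent requests finish, which a single calculation at request start would miss.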