Change time-to-first-token parameter to be based on number of request tokens #137 (#165)
* Fix comments on prefill arg in completion request interface
Signed-off-by: Qifan Deng <[email protected]>
* Add feature of calc ttft by prefill overhead. TODO: kvcache transfer overhead
Signed-off-by: Qifan Deng <[email protected]>
* Rename prefill-overhead-complexity to prefill-complexity
Signed-off-by: Qifan Deng <[email protected]>
* Calc kv cache transfer overhead based on prompt length
Signed-off-by: Qifan Deng <[email protected]>
* Add invalid test cases for args prefill-overhead and kv-cache-transfer-overhead
Signed-off-by: Qifan Deng <[email protected]>
* Add standard deviation in utils
Signed-off-by: Qifan Deng <[email protected]>
* Add stddev for prefill overhead and kvcache trans overhead
Signed-off-by: Qifan Deng <[email protected]>
* Fix test condition when remove p/d is enabled and in-place policy is used
Signed-off-by: Qifan Deng <[email protected]>
* Use simplified implementation of ttft
Signed-off-by: Qifan Deng <[email protected]>
* Add sep lines in readme params
Signed-off-by: Qifan Deng <[email protected]>
* Update readme with explanation of new ttft
Signed-off-by: Qifan Deng <[email protected]>
* Fix ttft new params tests
Signed-off-by: Qifan Deng <[email protected]>
* Fix kv cache transfer tests and impl
Signed-off-by: Qifan Deng <[email protected]>
* Fix invalid config test of new ttft params
Signed-off-by: Qifan Deng <[email protected]>
* Revert "Add standard deviation in utils"
This reverts commit 18d3075.
Signed-off-by: Qifan Deng <[email protected]>
* Remove additional variables in prefill time calculation
Signed-off-by: Qifan Deng <[email protected]>
* Improve is remote prefill/decode interface doc
Signed-off-by: Qifan Deng <[email protected]>
* Improve implementation of ttft calc
Signed-off-by: Qifan Deng <[email protected]>
* Remove unnecessary variable
Signed-off-by: Qifan Deng <[email protected]>
---------
Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
README.md (12 additions, 0 deletions)
@@ -101,13 +101,22 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
 - `mode`: the simulator mode, optional, by default `random`
     - `echo`: returns the same text that was sent in the request
     - `random`: returns a sentence chosen at random from a set of pre-defined sentences
+---
 - `time-to-first-token`: the time to the first token (in milliseconds), optional, by default zero
 - `time-to-first-token-std-dev`: standard deviation for time before the first token will be returned, in milliseconds, optional, default is 0, can't be more than 30% of `time-to-first-token`, will not cause the actual time to first token to differ by more than 70% from `time-to-first-token`
 - `inter-token-latency`: the time to 'generate' each additional token (in milliseconds), optional, by default zero
 - `inter-token-latency-std-dev`: standard deviation for time between generated tokens, in milliseconds, optional, default is 0, can't be more than 30% of `inter-token-latency`, will not cause the actual inter token latency to differ by more than 70% from `inter-token-latency`
 - `kv-cache-transfer-latency`: time for KV-cache transfer from a remote vLLM (in milliseconds), by default zero. Usually much shorter than `time-to-first-token`
 - `kv-cache-transfer-latency-std-dev`: standard deviation for time to "transfer" kv-cache from another vLLM instance in case P/D is activated, in milliseconds, optional, default is 0, can't be more than 30% of `kv-cache-transfer-latency`, will not cause the actual latency to differ by more than 70% from `kv-cache-transfer-latency`
+---
+- `prefill-overhead`: constant overhead time for prefill (in milliseconds), optional, by default zero, used in calculating time to first token, this will be ignored if `time-to-first-token` is not `0`
+- `prefill-time-per-token`: time taken to generate each token during prefill (in milliseconds), optional, by default zero, this will be ignored if `time-to-first-token` is not `0`
+- `prefill-time-std-dev`: similar to `time-to-first-token-std-dev`, but is applied on the final prefill time, which is calculated by `prefill-overhead`, `prefill-time-per-token`, and number of prompt tokens, this will be ignored if `time-to-first-token` is not `0`
+- `kv-cache-transfer-time-per-token`: time taken to transfer cache for each token in case P/D is enabled (in milliseconds), optional, by default zero, this will be ignored if `kv-cache-transfer-latency` is not `0`
+- `kv-cache-transfer-time-std-dev`: similar to `time-to-first-token-std-dev`, but is applied on the final kv cache transfer time in case P/D is enabled (in milliseconds), which is calculated by `kv-cache-transfer-time-per-token` and number of prompt tokens, this will be ignored if `kv-cache-transfer-latency` is not `0`
+---
 - `seed`: random seed for operations (if not set, current Unix time in nanoseconds is used)
+---
 - `max-tool-call-integer-param`: the maximum possible value of integer parameters in a tool call, optional, defaults to 100
 - `min-tool-call-integer-param`: the minimum possible value of integer parameters in a tool call, optional, defaults to 0
 - `max-tool-call-number-param`: the maximum possible value of number (float) parameters in a tool call, optional, defaults to 100
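Reviewer note: the parameter descriptions in this hunk imply a simple precedence rule for the mean time to first token. The sketch below is one reading of that rule, not the simulator's actual code. The `latencyConfig` struct, the `baseTTFT` function, and the `remotePrefill` switch are hypothetical names introduced for illustration, and the assumption that the KV-cache transfer time stands in for the prefill time when P/D is active is inferred from the README wording.

```go
package main

import "fmt"

// latencyConfig is a hypothetical struct holding only the fields this sketch
// needs. The field names mirror the flags in the diff; all values are in ms.
type latencyConfig struct {
	TimeToFirstToken            int // fixed TTFT; when non-zero, the prefill estimate is ignored
	PrefillOverhead             int // constant part of the prefill time
	PrefillTimePerToken         int // prefill time per prompt token
	KVCacheTransferLatency      int // fixed transfer time; when non-zero, the per-token estimate is ignored
	KVCacheTransferTimePerToken int // transfer time per prompt token
}

// baseTTFT returns the mean time to first token in milliseconds, before any
// std-dev jitter is applied. remotePrefill is an assumed switch for the P/D case.
func baseTTFT(c latencyConfig, promptTokens int, remotePrefill bool) int {
	if remotePrefill {
		if c.KVCacheTransferLatency != 0 {
			return c.KVCacheTransferLatency
		}
		return c.KVCacheTransferTimePerToken * promptTokens
	}
	if c.TimeToFirstToken != 0 {
		return c.TimeToFirstToken
	}
	return c.PrefillOverhead + c.PrefillTimePerToken*promptTokens
}

func main() {
	c := latencyConfig{PrefillOverhead: 50, PrefillTimePerToken: 2, KVCacheTransferTimePerToken: 1}
	fmt.Println(baseTTFT(c, 100, false)) // 50 + 2*100 = 250
	fmt.Println(baseTTFT(c, 100, true))  // 1*100 = 100

	c.TimeToFirstToken = 30
	fmt.Println(baseTTFT(c, 100, false)) // the fixed value wins: 30
}
```

Under this reading, leaving `time-to-first-token` at zero is what switches the simulator to the prompt-length-based estimate, so existing configurations that set a fixed value keep their behavior.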
@@ -116,6 +125,7 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
 - `min-tool-call-array-param-length`: the minimum possible length of array parameters in a tool call, optional, defaults to 1
 - `tool-call-not-required-param-probability`: the probability to add a parameter, that is not required, in a tool call, optional, defaults to 50
 - `object-tool-call-not-required-field-probability`: the probability to add a field, that is not required, in an object in a tool call, optional, defaults to 50
+---
 - `enable-kvcache`: if true, the KV cache support will be enabled in the simulator. In this case, the KV cache will be simulated, and ZQM events will be published when a KV cache block is added or evicted.
 - `kv-cache-size`: the maximum number of token blocks in kv cache
 - `block-size`: token block size for contiguous chunks of tokens, possible values: 8,16,32,64,128
@@ -124,8 +134,10 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
 - `zmq-endpoint`: ZMQ address to publish events
 - `zmq-max-connect-attempts`: the maximum number of ZMQ connection attempts, defaults to 0, maximum: 10
 - `event-batch-size`: the maximum number of kv-cache events to be sent together, defaults to 16
+---
 - `failure-injection-rate`: probability (0-100) of injecting failures, optional, default is 0
 - `failure-types`: list of specific failure types to inject (rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found), optional, if empty all types are used
+---
 - `fake-metrics`: represents a predefined set of metrics to be sent to Prometheus as a substitute for the real metrics. When specified, only these fake metrics will be reported — real metrics and fake metrics will never be reported together. The set should include values for
 	f.StringVar(&config.Mode, "mode", config.Mode, "Simulator mode: echo - returns the same text that was sent in the request, for chat completion returns the last message; random - returns random sentence from a bank of pre-defined sentences")
 	f.IntVar(&config.InterTokenLatency, "inter-token-latency", config.InterTokenLatency, "Time to generate one token (in milliseconds)")
 	f.IntVar(&config.TimeToFirstToken, "time-to-first-token", config.TimeToFirstToken, "Time to first token (in milliseconds)")
+
+	f.IntVar(&config.PrefillOverhead, "prefill-overhead", config.PrefillOverhead, "Time to prefill in milliseconds. This argument is ignored if <time-to-first-token> is not 0.")
+	f.IntVar(&config.PrefillTimePerToken, "prefill-time-per-token", config.PrefillTimePerToken, "Time to prefill per token (in milliseconds)")
+	f.IntVar(&config.PrefillTimeStdDev, "prefill-time-std-dev", config.PrefillTimeStdDev, "Standard deviation for time to prefill (in milliseconds)")
+	f.IntVar(&config.KVCacheTransferTimePerToken, "kv-cache-transfer-time-per-token", config.KVCacheTransferTimePerToken, "Time for KV-cache transfer per token from a remote vLLM (in milliseconds)")
+	f.IntVar(&config.KVCacheTransferTimeStdDev, "kv-cache-transfer-time-std-dev", config.KVCacheTransferTimeStdDev, "Standard deviation for time for KV-cache transfer per token from a remote vLLM (in milliseconds)")
+
 	f.IntVar(&config.KVCacheTransferLatency, "kv-cache-transfer-latency", config.KVCacheTransferLatency, "Time for KV-cache transfer from a remote vLLM (in milliseconds)")
 	f.IntVar(&config.InterTokenLatencyStdDev, "inter-token-latency-std-dev", config.InterTokenLatencyStdDev, "Standard deviation for time between generated tokens (in milliseconds)")
 	f.IntVar(&config.TimeToFirstTokenStdDev, "time-to-first-token-std-dev", config.TimeToFirstTokenStdDev, "Standard deviation for time before the first token will be returned (in milliseconds)")
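Reviewer note: the README describes the std-dev flags registered here with two limits: the configured deviation may be at most 30% of the mean, and a sampled value may differ from the mean by at most 70%. The sketch below shows one way such a rule could be expressed. The `validStdDev` and `jitter` helpers are hypothetical and are not taken from the simulator's source. Rounding to the nearest millisecond is also an assumption.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// validStdDev reports whether stdDev respects the documented limit of at most
// 30% of the mean. Integer arithmetic avoids floating-point edge cases.
func validStdDev(mean, stdDev int) bool {
	return stdDev >= 0 && stdDev*10 <= mean*3
}

// jitter turns a standard normal draw into a latency in milliseconds, clamped
// so that the result never differs from mean by more than 70%.
func jitter(normDraw float64, mean, stdDev int) int {
	if stdDev == 0 {
		return mean
	}
	v := float64(mean) + normDraw*float64(stdDev)
	maxDelta := 0.7 * float64(mean)
	v = math.Max(float64(mean)-maxDelta, math.Min(float64(mean)+maxDelta, v))
	return int(math.Round(v))
}

func main() {
	fmt.Println(validStdDev(100, 30), validStdDev(100, 31)) // true false

	r := rand.New(rand.NewSource(1)) // fixed seed so repeated runs match
	for i := 0; i < 3; i++ {
		fmt.Println(jitter(r.NormFloat64(), 100, 30))
	}
}
```

Passing the normal draw in as an argument, rather than sampling inside `jitter`, keeps the clamping logic deterministic and easy to test independently of the random source.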