- `port`: the port the simulator listens on, default is 8000
- `model`: the currently 'loaded' model, mandatory
- `served-model-name`: model names exposed by the API (a list of space-separated strings)

---
- `lora-modules`: a list of LoRA adapters (a list of space-separated JSON strings): '{"name": "name", "path": "lora_path", "base_model_name": "id"}', optional, empty by default
- `max-loras`: maximum number of LoRAs in a single batch, optional, default is one
- `max-cpu-loras`: maximum number of LoRAs to store in CPU memory, optional, must be >= `max-loras`, default is `max-loras`

---
- `max-model-len`: the model's context window, i.e. the maximum number of tokens in a single request including input and output, optional, default is 1024
- `max-num-seqs`: maximum number of sequences per iteration (the maximum number of inference requests that can be processed at the same time), default is 5
- `mode`: the simulator mode, optional, by default `random`
  - `echo`: returns the same text that was sent in the request
  - `random`: returns a sentence chosen at random from a set of pre-defined sentences

---
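The two modes above can be sketched as follows; this is a minimal illustration, assuming a canned sentence list — `CANNED_SENTENCES` and `respond` are hypothetical names, not the simulator's actual implementation:

```python
import random

# Illustrative stand-in for the simulator's pre-defined sentence set
# (the actual sentences are not documented here).
CANNED_SENTENCES = [
    "Hello from the simulator.",
    "This is a simulated response.",
]

def respond(prompt, mode="random", rng=None):
    rng = rng or random.Random()
    if mode == "echo":
        return prompt  # echo mode: return the request text unchanged
    if mode == "random":
        return rng.choice(CANNED_SENTENCES)  # pick a pre-defined sentence
    raise ValueError(f"unknown mode: {mode}")

assert respond("hi there", mode="echo") == "hi there"
```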
- `time-to-first-token`: the time to the first token (in milliseconds), optional, zero by default
- `time-to-first-token-std-dev`: standard deviation of the time before the first token is returned, in milliseconds, optional, default is 0; cannot exceed 30% of `time-to-first-token`, and the actual time to first token will never differ from `time-to-first-token` by more than 70%
- `inter-token-latency`: the time to 'generate' each additional token (in milliseconds), optional, zero by default
- `inter-token-latency-std-dev`: standard deviation of the time between generated tokens, in milliseconds, optional, default is 0; cannot exceed 30% of `inter-token-latency`, and the actual inter-token latency will never differ from `inter-token-latency` by more than 70%
- `kv-cache-transfer-latency`: time to transfer the KV-cache from a remote vLLM instance (in milliseconds), zero by default; usually much shorter than `time-to-first-token`
- `kv-cache-transfer-latency-std-dev`: standard deviation of the time to 'transfer' the KV-cache from another vLLM instance when P/D is activated, in milliseconds, optional, default is 0; cannot exceed 30% of `kv-cache-transfer-latency`, and the actual latency will never differ from `kv-cache-transfer-latency` by more than 70%

---
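One plausible reading of the constraints above, as a sketch: draw each delay from a normal distribution, cap the standard deviation at 30% of the mean, and clamp the result to within 70% of the mean. The sampling strategy and the `sample_latency` name are assumptions for illustration, not the simulator's code:

```python
import random

def sample_latency(mean_ms, std_dev_ms):
    # Assumed model of the documented constraints: the std dev is capped
    # at 30% of the mean, and the sampled delay is clamped so it never
    # differs from the mean by more than 70%.
    std_dev_ms = min(std_dev_ms, 0.3 * mean_ms)
    delay = random.gauss(mean_ms, std_dev_ms)
    return max(0.3 * mean_ms, min(delay, 1.7 * mean_ms))

# With a 100 ms mean, every sampled delay stays within [30 ms, 170 ms].
delay = sample_latency(100.0, 25.0)
assert 30.0 <= delay <= 170.0
```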
- `prefill-overhead`: the base overhead, in milliseconds, for prefilling a single token. Together with `prefill-complexity` and `prefill-overhead-std-dev` it determines the overall time taken to prefill the entire context. Optional, default is `0`; ignored if `time-to-first-token` is not `0`
- `prefill-complexity`: defines how the prefill time scales with the number of prompt tokens; required if `prefill-overhead` is used. Options are `"n^2"` and `"nlog(n)"`, default is `"n^2"`
- `prefill-overhead-std-dev`: the standard deviation, in milliseconds, of the time taken before the first token is returned; required if `prefill-overhead` is used, default is `0`
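The two complexity options can be sketched as below; the exact formula the simulator applies is an assumption here, only the `"n^2"` vs `"nlog(n)"` growth classes come from the documentation:

```python
import math

def prefill_time_ms(num_prompt_tokens, overhead_ms, complexity="n^2"):
    # Assumed scaling model: per-token overhead multiplied by the chosen
    # growth function of the prompt length.
    n = num_prompt_tokens
    if complexity == "n^2":
        return overhead_ms * n * n
    if complexity == "nlog(n)":
        return overhead_ms * n * math.log2(max(n, 2))
    raise ValueError(f"unknown complexity: {complexity}")

# Quadratic scaling dominates n*log(n) for long prompts.
assert prefill_time_ms(1024, 0.01, "n^2") > prefill_time_ms(1024, 0.01, "nlog(n)")
```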

---
- `kv-cache-transfer-overhead`: the base overhead, in milliseconds, for transferring the KV-cache of a single token from another vLLM instance when P/D is activated. Together with `kv-cache-transfer-complexity` and `kv-cache-transfer-overhead-std-dev` it defines the total time for the KV-cache transfer of the entire context. Optional, default is `0`; ignored if `kv-cache-transfer-latency` is not `0`
- `kv-cache-transfer-complexity`: the complexity of the KV-cache transfer relative to the number of prompt tokens; required if `kv-cache-transfer-overhead` is used. Options are `"linear"` and `"in-place"`, default is `"linear"`
- `kv-cache-transfer-overhead-std-dev`: the standard deviation, in milliseconds, of the time taken to transfer the KV-cache; required if `kv-cache-transfer-overhead` is used, default is `0`

---
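A sketch of the two transfer complexity options; the formulas are assumptions (in particular, `"in-place"` is read here as a fixed cost independent of prompt length), not the simulator's implementation:

```python
def kv_cache_transfer_time_ms(num_prompt_tokens, overhead_ms,
                              complexity="linear"):
    # Assumed model: "linear" grows with the number of prompt tokens,
    # "in-place" costs only the fixed per-transfer overhead.
    if complexity == "linear":
        return overhead_ms * num_prompt_tokens
    if complexity == "in-place":
        return overhead_ms
    raise ValueError(f"unknown complexity: {complexity}")

# For a 512-token prompt, linear transfer scales with the token count.
assert kv_cache_transfer_time_ms(512, 0.05, "linear") == 0.05 * 512
```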
- `seed`: random seed for operations (if not set, the current Unix time in nanoseconds is used)

---
- `max-tool-call-integer-param`: the maximum possible value of integer parameters in a tool call, optional, defaults to 100
- `min-tool-call-integer-param`: the minimum possible value of integer parameters in a tool call, optional, defaults to 0
- `max-tool-call-number-param`: the maximum possible value of number (float) parameters in a tool call, optional, defaults to 100
- `min-tool-call-array-param-length`: the minimum possible length of array parameters in a tool call, optional, defaults to 1
- `tool-call-not-required-param-probability`: the probability of adding a non-required parameter to a tool call, optional, defaults to 50
- `object-tool-call-not-required-field-probability`: the probability of adding a non-required field to an object in a tool call, optional, defaults to 50

---
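The bounds and probabilities above can be sketched like this; the helper names are hypothetical and the probabilities are read as percentages, which is an assumption:

```python
import random

def fake_integer_param(rng, min_value=0, max_value=100):
    # Fabricate an integer argument within the configured bounds
    # (mirrors min/max-tool-call-integer-param).
    return rng.randint(min_value, max_value)

def include_optional_param(rng, probability_percent=50):
    # Decide whether a non-required parameter is added; a value of 50
    # is assumed to mean the parameter appears half the time.
    return rng.random() * 100 < probability_percent

rng = random.Random(42)
value = fake_integer_param(rng)
assert 0 <= value <= 100
```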
- `enable-kvcache`: if true, KV cache support is enabled in the simulator. In this case the KV cache is simulated, and ZMQ events are published when a KV cache block is added or evicted
- `kv-cache-size`: the maximum number of token blocks in the KV cache
- `block-size`: token block size for contiguous chunks of tokens, possible values: 8, 16, 32, 64, 128
- `zmq-endpoint`: ZMQ address to publish events to
- `zmq-max-connect-attempts`: the maximum number of ZMQ connection attempts, defaults to 0, maximum: 10
- `event-batch-size`: the maximum number of KV-cache events to be sent together, defaults to 16

---
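As a quick worked example of how `block-size` relates to cache occupancy (the helper is illustrative, not the simulator's code): with a block size of 16, a 100-token prompt occupies ceil(100 / 16) = 7 blocks.

```python
def blocks_needed(num_tokens, block_size=16):
    # Number of KV cache blocks a sequence of num_tokens occupies,
    # assuming each block holds exactly block_size contiguous tokens.
    if block_size not in (8, 16, 32, 64, 128):
        raise ValueError("block-size must be one of 8, 16, 32, 64, 128")
    return -(-num_tokens // block_size)  # ceiling division

assert blocks_needed(100, 16) == 7
```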
- `fake-metrics`: represents a predefined set of metrics to be sent to Prometheus as a substitute for the real metrics. When specified, only these fake metrics will be reported; real metrics and fake metrics will never be reported together. The set should include values for