- `port`: the port the simulator listens on, default is 8000
- `model`: the currently 'loaded' model, mandatory
- `served-model-name`: model names exposed by the API (a list of space-separated strings)

---
- `lora-modules`: a list of LoRA adapters (a list of space-separated JSON strings): '{"name": "name", "path": "lora_path", "base_model_name": "id"}', optional, empty by default
- `max-loras`: maximum number of LoRAs in a single batch, optional, default is one
- `max-cpu-loras`: maximum number of LoRAs to store in CPU memory, optional, must be >= max-loras, default is max-loras

---
- `max-model-len`: the model's context window, the maximum number of tokens in a single request including input and output, optional, default is 1024
- `max-num-seqs`: maximum number of sequences per iteration (maximum number of inference requests that can be processed at the same time), default is 5
- `mode`: the simulator mode, optional, by default `random`
  - `echo`: returns the same text that was sent in the request
  - `random`: returns a sentence chosen at random from a set of pre-defined sentences

---
- `time-to-first-token`: the time to the first token (in milliseconds), optional, by default zero
- `time-to-first-token-std-dev`: standard deviation of the time before the first token is returned, in milliseconds, optional, default is 0, can't be more than 30% of `time-to-first-token`, and will not cause the actual time to first token to differ by more than 70% from `time-to-first-token`
- `inter-token-latency`: the time to 'generate' each additional token (in milliseconds), optional, by default zero
- `inter-token-latency-std-dev`: standard deviation of the time between generated tokens, in milliseconds, optional, default is 0, can't be more than 30% of `inter-token-latency`, and will not cause the actual inter-token latency to differ by more than 70% from `inter-token-latency`
- `kv-cache-transfer-latency`: time for KV-cache transfer from a remote vLLM (in milliseconds), by default zero, usually much shorter than `time-to-first-token`
- `kv-cache-transfer-latency-std-dev`: standard deviation of the time to "transfer" KV-cache from another vLLM instance when P/D is activated, in milliseconds, optional, default is 0, can't be more than 30% of `kv-cache-transfer-latency`, and will not cause the actual latency to differ by more than 70% from `kv-cache-transfer-latency`

---
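The std-dev rules above (capped at 30% of the mean, with the sampled value never differing from the mean by more than 70%) can be sketched as follows. This mirrors the README's description, not the simulator's actual sampling code; the distribution choice (Gaussian) is an assumption.

```python
import random

# Sketch of sampling a latency under the documented constraints:
# std-dev must be <= 30% of the mean, and the sampled value is kept
# within +/-70% of the mean. Gaussian noise is an assumption here.
def sample_latency_ms(mean: float, std_dev: float, rng: random.Random) -> float:
    if std_dev > 0.3 * mean:
        raise ValueError("std-dev can't be more than 30% of the mean latency")
    value = rng.gauss(mean, std_dev)
    low, high = 0.3 * mean, 1.7 * mean      # differ by at most 70% from the mean
    return min(max(value, low), high)

rng = random.Random(42)
print(sample_latency_ms(1000, 300, rng))   # some value in [300, 1700]
```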
- `prefill-overhead`: the base overhead in milliseconds for prefilling a single token. This value, in conjunction with `prefill-complexity` and `prefill-overhead-std-dev`, determines the overall time taken to prefill the entire context. Optional, default is `0`, ignored if `time-to-first-token` is not `0`
- `prefill-complexity`: defines how the prefill time scales with the number of prompt tokens. Required if `prefill-overhead` is used. Options are `"n^2"` and `"nlog(n)"`, default is `"n^2"`
- `prefill-overhead-std-dev`: the standard deviation in milliseconds of the time taken before the first token is returned. Required if `prefill-overhead` is used, default is `0`
- `kv-cache-transfer-overhead`: the base overhead in milliseconds for transferring the KV-cache of a single token from another vLLM instance when P/D is activated. Along with `kv-cache-transfer-complexity` and `kv-cache-transfer-overhead-std-dev`, it defines the total time for the KV-cache transfer of the entire context. Optional, default is `0`, ignored if `kv-cache-transfer-latency` is not `0`
- `kv-cache-transfer-complexity`: the complexity of the KV-cache transfer relative to the number of prompt tokens. Required if `kv-cache-transfer-overhead` is used. Options are `"linear"` and `"in-place"`, default is `"linear"`
- `kv-cache-transfer-overhead-std-dev`: the standard deviation in milliseconds of the time taken to transfer the KV-cache. Required if `kv-cache-transfer-overhead` is used, default is `0`

---
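To make the scaling options concrete, here is a sketch of how a per-token overhead might grow into a total prefill time under the two documented complexities. The exact formula the simulator applies is not stated in this README, so this only illustrates the intended asymptotic behavior; the function name and the `log2` base are assumptions.

```python
import math

# Illustrative computation of total prefill time from a per-token overhead
# under the documented complexity options; not taken from the simulator's source.
def prefill_time_ms(n_tokens: int, overhead_ms: float, complexity: str = "n^2") -> float:
    if complexity == "n^2":
        return overhead_ms * n_tokens ** 2
    if complexity == "nlog(n)":
        return overhead_ms * n_tokens * math.log2(max(n_tokens, 2))
    raise ValueError(f"unknown prefill-complexity: {complexity}")

print(prefill_time_ms(1024, 0.001, "n^2"))      # quadratic growth
print(prefill_time_ms(1024, 0.001, "nlog(n)"))  # much slower growth
```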
- `seed`: random seed for operations (if not set, the current Unix time in nanoseconds is used)

---
- `max-tool-call-integer-param`: the maximum possible value of integer parameters in a tool call, optional, defaults to 100
- `min-tool-call-integer-param`: the minimum possible value of integer parameters in a tool call, optional, defaults to 0
- `max-tool-call-number-param`: the maximum possible value of number (float) parameters in a tool call, optional, defaults to 100
- `min-tool-call-array-param-length`: the minimum possible length of array parameters in a tool call, optional, defaults to 1
- `tool-call-not-required-param-probability`: the probability of adding a non-required parameter in a tool call, optional, defaults to 50
- `object-tool-call-not-required-field-probability`: the probability of adding a non-required field in an object in a tool call, optional, defaults to 50

---
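A small sketch of how these bounds and probabilities might combine when generating a fake tool-call argument. This is purely illustrative; the helper names are invented and the simulator's generation logic is not shown in this README.

```python
import random

# Sketch: sampling a tool-call integer argument within the documented
# min/max bounds, and deciding whether to include a non-required
# parameter given a percent probability (default 50).
def sample_int_param(rng: random.Random, low: int = 0, high: int = 100) -> int:
    return rng.randint(low, high)   # bounds from min/max-tool-call-integer-param

def include_optional(rng: random.Random, probability: int = 50) -> bool:
    return rng.randint(1, 100) <= probability

rng = random.Random(7)
print(sample_int_param(rng), include_optional(rng))
```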
- `enable-kvcache`: if true, KV cache support is enabled in the simulator. In this case, the KV cache will be simulated, and ZMQ events will be published when a KV cache block is added or evicted
- `kv-cache-size`: the maximum number of token blocks in the KV cache
- `block-size`: token block size for contiguous chunks of tokens, possible values: 8, 16, 32, 64, 128
- `zmq-endpoint`: ZMQ address to publish events
- `zmq-max-connect-attempts`: the maximum number of ZMQ connection attempts, defaults to 0, maximum: 10
- `event-batch-size`: the maximum number of KV-cache events to be sent together, defaults to 16

---
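The relationship between request length, `block-size`, and KV cache occupancy can be estimated with a back-of-the-envelope sketch. This is not the simulator's code; it only shows how many token blocks a request of a given length would occupy under the documented block sizes.

```python
import math

# Sketch: number of KV cache token blocks a request occupies,
# given the documented set of valid block sizes.
VALID_BLOCK_SIZES = {8, 16, 32, 64, 128}

def blocks_needed(n_tokens: int, block_size: int = 16) -> int:
    if block_size not in VALID_BLOCK_SIZES:
        raise ValueError("block-size must be one of 8, 16, 32, 64, 128")
    return math.ceil(n_tokens / block_size)

print(blocks_needed(100, 16))  # → 7
```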
- `fake-metrics`: a predefined set of metrics to be sent to Prometheus as a substitute for the real metrics. When specified, only these fake metrics will be reported (real metrics and fake metrics will never be reported together). The set should include values for
  - `running-requests`
  - `waiting-requests`
To run the vLLM Simulator image under Docker, run:
**Note:** To run the vLLM Simulator with the latest release version, in the above docker command replace `dev` with the current release, which can be found on [GitHub](https://github.com/llm-d/llm-d-inference-sim/pkgs/container/llm-d-inference-sim).

**Note:** The above command exposes the simulator on port 8000 and serves the Qwen/Qwen2.5-1.5B-Instruct model.
// KVCacheTransferOverhead time taken to transfer kv-cache from another vLLM instance in case P/D is activated, for one token,
// in milliseconds; in conjunction with KVCacheTransferComplexity and KVCacheTransferOverheadStdDev, defines the time taken to transfer the kv-cache for the whole context