Changes to `README.md`: 2 additions, 7 deletions
@@ -33,8 +33,6 @@ The simulator supports two modes of operation:

- `echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions`, the last message with role `user` is used.
- `random` mode: the response is randomly chosen from a set of pre-defined sentences.

Additionally, the simulator can inject OpenAI-API-compatible error responses for testing error handling, using the `failure-injection-rate` parameter.

Timing of the response is defined by the `time-to-first-token` and `inter-token-latency` parameters. If P/D is enabled for a request, `kv-cache-transfer-latency` is used instead of `time-to-first-token`.

For a request with `stream=true`: `time-to-first-token` (or `kv-cache-transfer-latency`) defines the delay before the first token is returned, and `inter-token-latency` defines the delay between subsequent tokens in the stream.
@@ -118,16 +116,13 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started

- `min-tool-call-array-param-length`: the minimum possible length of array parameters in a tool call, optional, defaults to 1
- `tool-call-not-required-param-probability`: the probability of adding a non-required parameter in a tool call, optional, defaults to 50
- `object-tool-call-not-required-field-probability`: the probability of adding a non-required field to an object in a tool call, optional, defaults to 50
- `enable-kvcache`: if true, KV cache support is enabled in the simulator. In this case, the KV cache is simulated, and ZMQ events are published when a KV cache block is added or evicted.
- `kv-cache-size`: the maximum number of token blocks in the KV cache
- `block-size`: token block size for contiguous chunks of tokens, possible values: 8, 16, 32, 64, 128
- `tokenizers-cache-dir`: the directory for caching tokenizers
- `hash-seed`: seed for hash generation (if not set, it is read from the PYTHONHASHSEED environment variable)
- `zmq-endpoint`: ZMQ address to publish events
- `failure-injection-rate`: probability (0-100) of injecting failures, optional, defaults to 0
- `failure-types`: list of specific failure types to inject (rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found), optional; if empty, all types are used
- `event-batch-size`: the maximum number of KV-cache events to be sent together, defaults to 16
In addition, as we are using klog, the following parameters are available:

- `add_dir_header`: if true, adds the file directory to the header of the log messages
- `alsologtostderr`: log to standard error as well as files (no effect when -logtostderr=true)
```go
f.IntVar(&config.MaxCPULoras, "max-cpu-loras", config.MaxCPULoras, "Maximum number of LoRAs to store in CPU memory")
f.IntVar(&config.MaxModelLen, "max-model-len", config.MaxModelLen, "Model's context window, maximum number of tokens in a single request including input and output")

f.StringVar(&config.Mode, "mode", config.Mode, "Simulator mode: echo - returns the same text that was sent in the request, for chat completion returns the last message; random - returns random sentence from a bank of pre-defined sentences")
f.IntVar(&config.InterTokenLatency, "inter-token-latency", config.InterTokenLatency, "Time to generate one token (in milliseconds)")
f.IntVar(&config.TimeToFirstToken, "time-to-first-token", config.TimeToFirstToken, "Time to first token (in milliseconds)")
f.IntVar(&config.KVCacheTransferLatency, "kv-cache-transfer-latency", config.KVCacheTransferLatency, "Time for KV-cache transfer from a remote vLLM (in milliseconds)")
```
```go
f.Var(&dummyFailureTypes, "failure-types", "List of specific failure types to inject (rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found)")
f.Lookup("failure-types").NoOptDefVal = "dummy"

// These values were manually parsed above in getParamValueFromArgs, we leave this in order to get these flags in --help
```