You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Removed redundant lines and updated comments and help text to clarify that 'failure-injection-rate' is the probability of injecting failures, not specifically tied to failure mode.
Signed-off-by: Sergey Marunich <[email protected]>
Copy file name to clipboardExpand all lines: README.md
+4-1Lines changed: 4 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -33,6 +33,8 @@ The simulator supports two modes of operation:
33
33
-`echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions` the last message for the role=`user` is used.
34
34
-`random` mode: the response is randomly chosen from a set of pre-defined sentences.
35
35
36
+
Additionally, the simulator can inject OpenAI API compatible error responses for testing error handling using the `failure-injection-rate` parameter.
37
+
36
38
Timing of the response is defined by the `time-to-first-token` and `inter-token-latency` parameters. In case P/D is enabled for a request, `kv-cache-transfer-latency` will be used instead of `time-to-first-token`.
37
39
38
40
For a request with `stream=true`: `time-to-first-token` or `kv-cache-transfer-latency` defines the delay before the first token is returned, `inter-token-latency` defines the delay between subsequent tokens in the stream.
@@ -116,13 +118,14 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
116
118
-`min-tool-call-array-param-length`: the minimum possible length of array parameters in a tool call, optional, defaults to 1
117
119
-`tool-call-not-required-param-probability`: the probability to add a parameter, that is not required, in a tool call, optional, defaults to 50
118
120
-`object-tool-call-not-required-field-probability`: the probability to add a field, that is not required, in an object in a tool call, optional, defaults to 50
119
-
<!--
120
121
-`enable-kvcache`: if true, the KV cache support will be enabled in the simulator. In this case, the KV cache will be simulated, and ZQM events will be published when a KV cache block is added or evicted.
121
122
-`kv-cache-size`: the maximum number of token blocks in kv cache
122
123
-`block-size`: token block size for contiguous chunks of tokens, possible values: 8,16,32,64,128
123
124
-`tokenizers-cache-dir`: the directory for caching tokenizers
124
125
-`hash-seed`: seed for hash generation (if not set, is read from PYTHONHASHSEED environment variable)
125
126
-`zmq-endpoint`: ZMQ address to publish events
127
+
-`failure-injection-rate`: probability (0-100) of injecting failures, optional, default is 10
128
+
-`failure-types`: list of specific failure types to inject (rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found), optional, if empty all types are used
126
129
-`event-batch-size`: the maximum number of kv-cache events to be sent together, defaults to 16
127
130
-->
128
131
In addition, as we are using klog, the following parameters are available:
f.IntVar(&config.MaxCPULoras, "max-cpu-loras", config.MaxCPULoras, "Maximum number of LoRAs to store in CPU memory")
327
361
f.IntVar(&config.MaxModelLen, "max-model-len", config.MaxModelLen, "Model's context window, maximum number of tokens in a single request including input and output")
328
362
329
-
f.StringVar(&config.Mode, "mode", config.Mode, "Simulator mode, echo - returns the same text that was sent in the request, for chat completion returns the last message, random - returns random sentence from a bank of pre-defined sentences")
363
+
f.StringVar(&config.Mode, "mode", config.Mode, "Simulator mode: echo - returns the same text that was sent in the request, for chat completion returns the last message; random - returns random sentence from a bank of pre-defined sentences")
330
364
f.IntVar(&config.InterTokenLatency, "inter-token-latency", config.InterTokenLatency, "Time to generate one token (in milliseconds)")
331
365
f.IntVar(&config.TimeToFirstToken, "time-to-first-token", config.TimeToFirstToken, "Time to first token (in milliseconds)")
332
366
f.IntVar(&config.KVCacheTransferLatency, "kv-cache-transfer-latency", config.KVCacheTransferLatency, "Time for KV-cache transfer from a remote vLLM (in milliseconds)")
f.Var(&dummyFailureTypes, "failure-types", "List of specific failure types to inject (rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found)")
394
+
f.Lookup("failure-types").NoOptDefVal="dummy"
354
395
355
396
// These values were manually parsed above in getParamValueFromArgs, we leave this in order to get these flags in --help
0 commit comments