Introduces a 'failure' mode to the simulator, allowing random injection of OpenAI API-compatible error responses for testing error handling. Adds configuration options for failure injection rate and specific failure types, implements error response logic, and updates documentation and tests to cover the new functionality.
Signed-off-by: Sergey Marunich <[email protected]>
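Note: the sketch below is only an illustration of the behaviour described in this summary, using hypothetical helper names (`shouldInjectFailure`, `pickFailureType`) rather than the PR's actual functions.

```go
package failuresketch

import "math/rand"

// Failure types supported by the simulator, as listed in the README changes below.
var allFailureTypes = []string{
	"rate_limit", "invalid_api_key", "context_length",
	"server_error", "invalid_request", "model_not_found",
}

// shouldInjectFailure rolls against the failure-injection-rate percentage
// (0-100); failures are only injected when the simulator runs in failure mode.
func shouldInjectFailure(mode string, rate int) bool {
	return mode == "failure" && rate > 0 && rand.Intn(100) < rate
}

// pickFailureType selects one of the configured failure types;
// an empty configuration means any supported type may be chosen.
func pickFailureType(configured []string) string {
	if len(configured) == 0 {
		configured = allFailureTypes
	}
	return configured[rand.Intn(len(configured))]
}
```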
README.md: 5 additions & 1 deletion
@@ -29,9 +29,10 @@ In addition, it supports a subset of vLLM's Prometheus metrics. These metrics ar
The simulated inference has no connection with the model and LoRA adapters specified in the command line parameters or via the /v1/load_lora_adapter HTTP REST endpoint. The /v1/models endpoint returns simulated results based on those same command line parameters and those loaded via the /v1/load_lora_adapter HTTP REST endpoint.

-The simulator supports two modes of operation:
+The simulator supports three modes of operation:
- `echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions` the last message for the role=`user` is used.
- `random` mode: the response is randomly chosen from a set of pre-defined sentences.
+- `failure` mode: randomly injects OpenAI API compatible error responses for testing error handling.

Timing of the response is defined by the `time-to-first-token` and `inter-token-latency` parameters. In case P/D is enabled for a request, `kv-cache-transfer-latency` will be used instead of `time-to-first-token`.
@@ -101,6 +102,7 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
- `mode`: the simulator mode, optional, by default `random`
  - `echo`: returns the same text that was sent in the request
  - `random`: returns a sentence chosen at random from a set of pre-defined sentences
+  - `failure`: randomly injects OpenAI API compatible error responses
- `time-to-first-token`: the time to the first token (in milliseconds), optional, by default zero
- `time-to-first-token-std-dev`: standard deviation for time before the first token will be returned, in milliseconds, optional, default is 0, can't be more than 30% of `time-to-first-token`, will not cause the actual time to first token to differ by more than 70% from `time-to-first-token`
- `inter-token-latency`: the time to 'generate' each additional token (in milliseconds), optional, by default zero
@@ -122,6 +124,8 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
- `tokenizers-cache-dir`: the directory for caching tokenizers
- `hash-seed`: seed for hash generation (if not set, is read from PYTHONHASHSEED environment variable)
- `zmq-endpoint`: ZMQ address to publish events
+- `failure-injection-rate`: probability (0-100) of injecting failures when in failure mode, optional, default is 10
+- `failure-types`: list of specific failure types to inject (rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found), optional, if empty all types are used

In addition, as we are using klog, the following parameters are available:
- `add_dir_header`: if true, adds the file directory to the header of the log messages
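As a rough sketch (not taken from the PR), these parameters would back configuration fields along these lines; only `FailureInjectionRate` is visible in the flag code below, the other field names are assumptions:

```go
package failuresketch

// Sketch of the configuration fields behind the new flags.
type Configuration struct {
	// Mode selects the simulator behaviour: "echo", "random", or "failure".
	Mode string
	// FailureInjectionRate is the probability (0-100) that a request is
	// answered with an injected error when Mode is "failure".
	FailureInjectionRate int
	// FailureTypes limits injection to the listed types; an empty list
	// means all supported types can be injected.
	FailureTypes []string
}
```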
f.IntVar(&config.MaxCPULoras, "max-cpu-loras", config.MaxCPULoras, "Maximum number of LoRAs to store in CPU memory")
f.IntVar(&config.MaxModelLen, "max-model-len", config.MaxModelLen, "Model's context window, maximum number of tokens in a single request including input and output")
-f.StringVar(&config.Mode, "mode", config.Mode, "Simulator mode, echo - returns the same text that was sent in the request, for chat completion returns the last message, random - returns random sentence from a bank of pre-defined sentences")
+f.StringVar(&config.Mode, "mode", config.Mode, "Simulator mode: echo - returns the same text that was sent in the request, for chat completion returns the last message; random - returns random sentence from a bank of pre-defined sentences; failure - randomly injects API errors")
f.IntVar(&config.InterTokenLatency, "inter-token-latency", config.InterTokenLatency, "Time to generate one token (in milliseconds)")
f.IntVar(&config.TimeToFirstToken, "time-to-first-token", config.TimeToFirstToken, "Time to first token (in milliseconds)")
f.IntVar(&config.KVCacheTransferLatency, "kv-cache-transfer-latency", config.KVCacheTransferLatency, "Time for KV-cache transfer from a remote vLLM (in milliseconds)")
f.StringVar(&config.HashSeed, "hash-seed", config.HashSeed, "Seed for hash generation (if not set, is read from PYTHONHASHSEED environment variable)")
f.StringVar(&config.ZMQEndpoint, "zmq-endpoint", config.ZMQEndpoint, "ZMQ address to publish events")
f.IntVar(&config.EventBatchSize, "event-batch-size", config.EventBatchSize, "Maximum number of kv-cache events to be sent together")
+
+f.IntVar(&config.FailureInjectionRate, "failure-injection-rate", config.FailureInjectionRate, "Probability (0-100) of injecting failures when in failure mode")
+f.Var(&dummyFailureTypes, "failure-types", "List of specific failure types to inject (rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found)")
+f.Lookup("failure-types").NoOptDefVal = "dummy"
// These values were manually parsed above in getParamValueFromArgs, we leave this in order to get these flags in --help
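A side note on the `NoOptDefVal` line above: in spf13/pflag, `NoOptDefVal` is the value a flag assumes when it is named on the command line without an explicit value. A standalone illustration of that library behaviour (not part of the PR):

```go
package main

import (
	"fmt"

	"github.com/spf13/pflag"
)

func main() {
	fs := pflag.NewFlagSet("demo", pflag.ContinueOnError)
	fs.String("failure-types", "", "comma-separated failure types")
	// With NoOptDefVal set, a bare `--failure-types` parses as "dummy"
	// instead of failing with "flag needs an argument".
	fs.Lookup("failure-types").NoOptDefVal = "dummy"

	_ = fs.Parse([]string{"--failure-types"})
	value, _ := fs.GetString("failure-types")
	fmt.Println(value) // prints "dummy"
}
```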
failure.Message = fmt.Sprintf("Rate limit reached for %s in organization org-xxx on requests per min (RPM): Limit 3, Used 3, Requested 1.", config.Model)
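For context, a message like the one built above would be wrapped in an OpenAI-compatible error body. The envelope below follows the publicly documented `{"error": {...}}` shape; the concrete `type`/`code` strings and the function name are illustrative, not taken from the PR:

```go
package failuresketch

import (
	"encoding/json"
	"fmt"
)

// openAIError mirrors the error object returned by OpenAI-compatible APIs:
// {"error": {"message": ..., "type": ..., "param": ..., "code": ...}}.
type openAIError struct {
	Message string  `json:"message"`
	Type    string  `json:"type"`
	Param   *string `json:"param"`
	Code    any     `json:"code"`
}

type errorResponse struct {
	Error openAIError `json:"error"`
}

// rateLimitBody renders an illustrative 429-style body for the given model.
func rateLimitBody(model string) ([]byte, error) {
	msg := fmt.Sprintf("Rate limit reached for %s in organization org-xxx on requests per min (RPM): Limit 3, Used 3, Requested 1.", model)
	return json.Marshal(errorResponse{Error: openAIError{
		Message: msg,
		Type:    "requests",
		Code:    "rate_limit_exceeded",
	}})
}
```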