Signed-off-by: Sergey Marunich <[email protected]>
KV cache and tokenization related configuration (#125)
Signed-off-by: Ira <[email protected]>
Publish kv-cache events (#126)
* Publish kv-cache events
Signed-off-by: Ira <[email protected]>
* Fix lint errors
Signed-off-by: Ira <[email protected]>
* Review fixes
Signed-off-by: Ira <[email protected]>
* Sleep to allow previous sub to close
Signed-off-by: Ira <[email protected]>
---------
Signed-off-by: Ira <[email protected]>
Signed-off-by: Sergey Marunich <[email protected]>
Use same version of tokenizer in both Dockerfile and Makefile (#132)
* - Use same version of tokenizer in both Dockerfile and Makefile
- Fixes in readme file
Signed-off-by: Maya Barnea <[email protected]>
* Updates according to PR review
Signed-off-by: Maya Barnea <[email protected]>
---------
Signed-off-by: Maya Barnea <[email protected]>
Signed-off-by: Sergey Marunich <[email protected]>
Replace param.NewOpt with openai.Int for MaxTokens, and openai.Bool with param.NewOpt for IncludeUsage, in simulator_test.go to align with updated API usage.
Signed-off-by: Sergey Marunich <[email protected]>
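The optional-parameter helpers swapped in that change follow a common value-plus-presence-flag pattern. Below is a toy generic sketch of the pattern only; this `Opt` type and these constructors are illustrative, not the openai-go SDK's actual definitions:

```go
package main

import "fmt"

// Opt wraps a value together with a flag recording whether it was set,
// so an unset option can be distinguished from a zero value.
type Opt[T any] struct {
	Value T
	Set   bool
}

// NewOpt marks any value as explicitly set.
func NewOpt[T any](v T) Opt[T] { return Opt[T]{Value: v, Set: true} }

// Int and Bool mirror the idea of typed convenience constructors.
func Int(v int64) Opt[int64] { return NewOpt(v) }
func Bool(v bool) Opt[bool]  { return NewOpt(v) }

func main() {
	maxTokens := Int(128)
	includeUsage := Bool(true)
	fmt.Println(maxTokens.Set, maxTokens.Value, includeUsage.Value)
}
```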
README.md: 1 addition, 4 deletions
@@ -33,8 +33,6 @@ The simulator supports two modes of operation:
-`echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions` the last message for the role=`user` is used.
-`random` mode: the response is randomly chosen from a set of pre-defined sentences.

Additionally, the simulator can inject OpenAI API compatible error responses for testing error handling using the `failure-injection-rate` parameter.

Timing of the response is defined by the `time-to-first-token` and `inter-token-latency` parameters. In case P/D is enabled for a request, `kv-cache-transfer-latency` will be used instead of `time-to-first-token`.

For a request with `stream=true`: `time-to-first-token` or `kv-cache-transfer-latency` defines the delay before the first token is returned, `inter-token-latency` defines the delay between subsequent tokens in the stream.
@@ -118,14 +116,13 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
-`min-tool-call-array-param-length`: the minimum possible length of array parameters in a tool call, optional, defaults to 1
-`tool-call-not-required-param-probability`: the probability of adding a parameter that is not required in a tool call, optional, defaults to 50
-`object-tool-call-not-required-field-probability`: the probability of adding a field that is not required in an object in a tool call, optional, defaults to 50

<!--
- `enable-kvcache`: if true, KV cache support will be enabled in the simulator. In this case the KV cache will be simulated, and ZMQ events will be published when a KV cache block is added or evicted.
- `kv-cache-size`: the maximum number of token blocks in the KV cache
- `block-size`: token block size for contiguous chunks of tokens, possible values: 8, 16, 32, 64, 128
- `tokenizers-cache-dir`: the directory for caching tokenizers
- `hash-seed`: seed for hash generation (if not set, it is read from the PYTHONHASHSEED environment variable)
- `zmq-endpoint`: ZMQ address to publish events
- `event-batch-size`: the maximum number of kv-cache events to be sent together, defaults to 16
-->

-`failure-injection-rate`: probability (0-100) of injecting failures, optional, default is 0
-`failure-types`: list of specific failure types to inject (rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found), optional; if empty, all types are used
In addition, as we are using klog, the following parameters are available:
```go
f.IntVar(&config.MaxCPULoras, "max-cpu-loras", config.MaxCPULoras, "Maximum number of LoRAs to store in CPU memory")
f.IntVar(&config.MaxModelLen, "max-model-len", config.MaxModelLen, "Model's context window, maximum number of tokens in a single request including input and output")
f.StringVar(&config.Mode, "mode", config.Mode, "Simulator mode, echo - returns the same text that was sent in the request, for chat completion returns the last message, random - returns random sentence from a bank of pre-defined sentences")
f.IntVar(&config.InterTokenLatency, "inter-token-latency", config.InterTokenLatency, "Time to generate one token (in milliseconds)")
f.IntVar(&config.TimeToFirstToken, "time-to-first-token", config.TimeToFirstToken, "Time to first token (in milliseconds)")
f.IntVar(&config.KVCacheTransferLatency, "kv-cache-transfer-latency", config.KVCacheTransferLatency, "Time for KV-cache transfer from a remote vLLM (in milliseconds)")

f.Var(&dummyFailureTypes, "failure-types", "List of specific failure types to inject (rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found)")
f.Lookup("failure-types").NoOptDefVal = "dummy"

// These values were manually parsed above in getParamValueFromArgs, we leave this in order to get these flags in --help
```
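The `f.Var(&dummyFailureTypes, ...)` registration above relies on a custom flag value type. A minimal sketch of such a comma-separated list value is shown below; the real `dummyFailureTypes` implementation may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// failureTypesValue is a toy flag.Value-style holder for a comma-separated
// --failure-types flag. Set may be called repeatedly for repeated flags.
type failureTypesValue []string

// String renders the accumulated values, as required by the flag.Value interface.
func (f *failureTypesValue) String() string { return strings.Join(*f, ",") }

// Set parses one flag occurrence, splitting on commas and trimming whitespace.
func (f *failureTypesValue) Set(s string) error {
	for _, t := range strings.Split(s, ",") {
		if t = strings.TrimSpace(t); t != "" {
			*f = append(*f, t)
		}
	}
	return nil
}

func main() {
	var v failureTypesValue
	_ = v.Set("rate_limit, server_error")
	fmt.Println(v.String()) // rate_limit,server_error
}
```

With spf13/pflag-style flags, setting `NoOptDefVal` (as the code above does) lets the flag be passed with no argument, in which case `Set` receives that default string instead.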