README.md (3 additions, 1 deletion)
@@ -135,6 +135,9 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
 - `zmq-max-connect-attempts`: the maximum number of ZMQ connection attempts, defaults to 0, maximum: 10
 - `event-batch-size`: the maximum number of kv-cache events to be sent together, defaults to 16
+- `failure-injection-rate`: probability (0-100) of injecting failures, optional, default is 0
+- `failure-types`: list of specific failure types to inject (rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found), optional, if empty all types are used
+
 - `fake-metrics`: represents a predefined set of metrics to be sent to Prometheus as a substitute for the real metrics. When specified, only these fake metrics will be reported; real metrics and fake metrics will never be reported together. The set should include values for
   - `running-requests`
   - `waiting-requests`
@@ -143,7 +146,6 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
 f.IntVar(&config.MaxCPULoras, "max-cpu-loras", config.MaxCPULoras, "Maximum number of LoRAs to store in CPU memory")
 f.IntVar(&config.MaxModelLen, "max-model-len", config.MaxModelLen, "Model's context window, maximum number of tokens in a single request including input and output")
-f.StringVar(&config.Mode, "mode", config.Mode, "Simulator mode, echo - returns the same text that was sent in the request, for chat completion returns the last message, random - returns random sentence from a bank of pre-defined sentences")
+f.StringVar(&config.Mode, "mode", config.Mode, "Simulator mode: echo - returns the same text that was sent in the request, for chat completion returns the last message; random - returns random sentence from a bank of pre-defined sentences")
 f.IntVar(&config.InterTokenLatency, "inter-token-latency", config.InterTokenLatency, "Time to generate one token (in milliseconds)")
 f.IntVar(&config.TimeToFirstToken, "time-to-first-token", config.TimeToFirstToken, "Time to first token (in milliseconds)")
 f.Var(&dummyFailureTypes, "failure-types", "List of specific failure types to inject (rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found)")
+f.Lookup("failure-types").NoOptDefVal = dummy
 // These values were manually parsed above in getParamValueFromArgs, we leave this in order to get these flags in --help
 var dummyString string
 f.StringVar(&dummyString, "config", "", "The path to a yaml configuration file. The command line values overwrite the configuration file values")
 // if response should be created with maximum number of tokens - finish reason will be 'length'
+finishReason = LengthFinishReason
+}
 }

 text := GetRandomText(numOfTokens)
 return text, finishReason
 }
+
+// getResponseLengthByHistogram calculates the number of tokens to be returned in a response based on the max tokens value and the pre-defined buckets.
+// The response length is distributed according to the probabilities defined in respLenBucketsProbabilities.
+// The histogram contains equally sized buckets and a last special bucket, which contains only the maxTokens value.
+// The last element of respLenBucketsProbabilities defines the probability of a response with maxTokens tokens.
+// The other values define probabilities for the equally sized buckets.
+// If maxTokens is small (smaller than the number of buckets), the response length is randomly selected from the range [1, maxTokens].
+func getResponseLengthByHistogram(maxTokens int) int {
+	if maxTokens <= 1 {
+		return maxTokens
+	}
+	// maxTokens is small - no need to use the histogram of probabilities, just select a random value in the range [1, maxTokens]