You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: README.md
+91-8Lines changed: 91 additions & 8 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -65,32 +65,97 @@ API responses contains a subset of the fields provided by the OpenAI API.
65
65
- messages
66
66
- role
67
67
- content
68
+
- tool_calls
69
+
- function
70
+
- name
71
+
- arguments
72
+
- id
73
+
- type
74
+
- index
75
+
- max_tokens
76
+
- max_completion_tokens
77
+
- tools
78
+
- type
79
+
- function
80
+
- name
81
+
- arguments
82
+
- tool_choice
83
+
- logprobs
84
+
- top_logprobs
85
+
- stream_options
86
+
- include_usage
87
+
- do_remote_decode
88
+
- do_remote_prefill
89
+
- remote_block_ids
90
+
- remote_engine_id
91
+
- remote_host
92
+
- remote_port
93
+
- ignore_eos
68
94
- **response**
69
95
- id
70
96
- created
71
97
- model
72
98
- choices
73
-
- index
74
-
- finish_reason
75
-
- message
99
+
- index
100
+
- finish_reason
101
+
- message
102
+
- logprobs
103
+
- content
104
+
- token
105
+
- logprob
106
+
- bytes
107
+
- top_logprobs
108
+
- usage
109
+
- object
110
+
- do_remote_decode
111
+
- do_remote_prefill
112
+
- remote_block_ids
113
+
- remote_engine_id
114
+
- remote_host
115
+
- remote_port
76
116
-`/v1/completions`
77
117
-**request**
78
118
- stream
79
119
- model
80
120
- prompt
81
-
- max_tokens (for future usage)
121
+
- max_tokens
122
+
- stream_options
123
+
- include_usage
124
+
- do_remote_decode
125
+
- do_remote_prefill
126
+
- remote_block_ids
127
+
- remote_engine_id
128
+
- remote_host
129
+
- remote_port
130
+
- ignore_eos
131
+
- logprobs
82
132
-**response**
83
133
- id
84
134
- created
85
135
- model
86
136
- choices
87
-
- text
137
+
- index
138
+
- finish_reason
139
+
- text
140
+
- logprobs
141
+
- tokens
142
+
- token_logprobs
143
+
- top_logprobs
144
+
- text_offset
145
+
- usage
146
+
- object
147
+
- do_remote_decode
148
+
- do_remote_prefill
149
+
- remote_block_ids
150
+
- remote_engine_id
151
+
- remote_host
152
+
- remote_port
88
153
-`/v1/models`
89
154
-**response**
90
-
- object (list)
155
+
- object
91
156
- data
92
157
- id
93
-
- object (model)
158
+
- object
94
159
- created
95
160
- owned_by
96
161
- root
@@ -158,8 +223,22 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
158
223
-`loras` - an array containing LoRA information objects, each with the fields: `running` (a comma-separated list of LoRAs in use by running requests), `waiting` (a comma-separated list of LoRAs to be used by waiting requests), and `timestamp` (seconds since Jan 1 1970, the timestamp of this metric).
159
224
-`ttft-buckets-values` - array of values for time-to-first-token buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, 160.0, 640.0, 2560.0, +Inf.
160
225
-`tpot-buckets-values` - array of values for time-per-output-token buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, +Inf.
226
+
-`e2erl-buckets-values` - array of values for e2e request latency buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0,
227
+
960.0, 1920.0, 7680.0, +Inf.
228
+
-`queue-time-buckets-values` - array of values for request queue time buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0,
229
+
960.0, 1920.0, 7680.0, +Inf.
230
+
-`inf-time-buckets-values` - array of values for request inference time buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0,
231
+
960.0, 1920.0, 7680.0, +Inf.
232
+
-`prefill-time-buckets-values` - array of values for request prefill time buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0,
233
+
960.0, 1920.0, 7680.0, +Inf.
234
+
-`decode-time-buckets-values` - array of values for request decode time buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0,
235
+
960.0, 1920.0, 7680.0, +Inf.
236
+
-`request-prompt-tokens` - array of values for prompt-length buckets
237
+
-`request-generation-tokens` - array of values for generation-length buckets
238
+
-`request-params-max-tokens` - array of values for max_tokens parameter buckets
239
+
-`request-success-total` - number of successful requests per finish reason, key: finish-reason (stop, length, etc.).
-`data-parallel-size`: number of ranks to run in Data Parallel deployment, from 1 to 8, default is 1. The ports will be assigned as follows: rank 0 will run on the configured `port`, rank 1 on `port`+1, etc.
@@ -177,6 +256,10 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
177
256
- Example URL `https://huggingface.co/datasets/hf07397/inference-sim-datasets/resolve/91ffa7aafdfd6b3b1af228a517edc1e8f22cd274/huggingface/ShareGPT_Vicuna_unfiltered/conversations.sqlite3`
178
257
-`dataset-in-memory`: If true, the entire dataset will be loaded into memory for faster access. This may require significant memory depending on the size of the dataset. Default is false.
179
258
---
259
+
-`ssl-certfile`: Path to SSL certificate file for HTTPS (optional)
260
+
-`ssl-keyfile`: Path to SSL private key file for HTTPS (optional)
261
+
-`self-signed-certs`: Enable automatic generation of self-signed certificates for HTTPS
262
+
---
180
263
In addition, as we are using klog, the following parameters are available:
181
264
-`add_dir_header`: if true, adds the file directory to the header of the log messages
182
265
-`alsologtostderr`: log to standard error as well as files (no effect when -logtostderr=true)
0 commit comments