Skip to content

Commit 5a098a8

Browse files
authored
Readme updates: update structure of request and response, add missing command line parameters, add missing fake metrics fields (#242)
Signed-off-by: Maya Barnea <[email protected]>
1 parent 5672194 commit 5a098a8

File tree

1 file changed

+91
-8
lines changed

1 file changed

+91
-8
lines changed

README.md

Lines changed: 91 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -65,32 +65,97 @@ API responses contains a subset of the fields provided by the OpenAI API.
6565
- messages
6666
- role
6767
- content
68+
- tool_calls
69+
- function
70+
- name
71+
- arguments
72+
- id
73+
- type
74+
- index
75+
- max_tokens
76+
- max_completion_tokens
77+
- tools
78+
- type
79+
- function
80+
- name
81+
- arguments
82+
- tool_choice
83+
- logprobs
84+
- top_logprobs
85+
- stream_options
86+
- include_usage
87+
- do_remote_decode
88+
- do_remote_prefill
89+
- remote_block_ids
90+
- remote_engine_id
91+
- remote_host
92+
- remote_port
93+
- ignore_eos
6894
- **response**
6995
- id
7096
- created
7197
- model
7298
- choices
73-
- index
74-
- finish_reason
75-
- message
99+
- index
100+
- finish_reason
101+
- message
102+
- logprobs
103+
- content
104+
- token
105+
- logprob
106+
- bytes
107+
- top_logprobs
108+
- usage
109+
- object
110+
- do_remote_decode
111+
- do_remote_prefill
112+
- remote_block_ids
113+
- remote_engine_id
114+
- remote_host
115+
- remote_port
76116
- `/v1/completions`
77117
- **request**
78118
- stream
79119
- model
80120
- prompt
81-
- max_tokens (for future usage)
121+
- max_tokens
122+
- stream_options
123+
- include_usage
124+
- do_remote_decode
125+
- do_remote_prefill
126+
- remote_block_ids
127+
- remote_engine_id
128+
- remote_host
129+
- remote_port
130+
- ignore_eos
131+
- logprobs
82132
- **response**
83133
- id
84134
- created
85135
- model
86136
- choices
87-
- text
137+
- index
138+
- finish_reason
139+
- text
140+
- logprobs
141+
- tokens
142+
- token_logprobs
143+
- top_logprobs
144+
- text_offset
145+
- usage
146+
- object
147+
- do_remote_decode
148+
- do_remote_prefill
149+
- remote_block_ids
150+
- remote_engine_id
151+
- remote_host
152+
- remote_port
88153
- `/v1/models`
89154
- **response**
90-
- object (list)
155+
- object
91156
- data
92157
- id
93-
- object (model)
158+
- object
94159
- created
95160
- owned_by
96161
- root
@@ -158,8 +223,22 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
158223
- `loras` - an array containing LoRA information objects, each with the fields: `running` (a comma-separated list of LoRAs in use by running requests), `waiting` (a comma-separated list of LoRAs to be used by waiting requests), and `timestamp` (seconds since Jan 1 1970, the timestamp of this metric).
159224
- `ttft-buckets-values` - array of values for time-to-first-token buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, 160.0, 640.0, 2560.0, +Inf.
160225
- `tpot-buckets-values` - array of values for time-per-output-token buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, +Inf.
226+
- `e2erl-buckets-values` - array of values for e2e request latency buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0,
227+
960.0, 1920.0, 7680.0, +Inf.
228+
- `queue-time-buckets-values` - array of values for request queue time buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0,
229+
960.0, 1920.0, 7680.0, +Inf.
230+
- `inf-time-buckets-values` - array of values for request inference time buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0,
231+
960.0, 1920.0, 7680.0, +Inf.
232+
- `prefill-time-buckets-values` - array of values for request prefill time buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0,
233+
960.0, 1920.0, 7680.0, +Inf.
234+
- `decode-time-buckets-values` - array of values for request decode time buckets, each value in this array is a value for the corresponding bucket. Array may contain less values than number of buckets, all trailing missing values assumed as 0. Buckets upper boundaries are: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0,
235+
960.0, 1920.0, 7680.0, +Inf.
236+
- `request-prompt-tokens` - array of values for prompt-length buckets
237+
- `request-generation-tokens` - array of values for generation-length buckets
238+
- `request-params-max-tokens` - array of values for max_tokens parameter buckets
239+
- `request-success-total` - number of successful requests per finish reason, key: finish-reason (stop, length, etc.).
161240
<br>
162-
Example:<br>
241+
**Example:**<br>
163242
--fake-metrics '{"running-requests":10,"waiting-requests":30,"kv-cache-usage":0.4,"loras":[{"running":"lora4,lora2","waiting":"lora3","timestamp":1257894567},{"running":"lora4,lora3","waiting":"","timestamp":1257894569}]}'
164243
---
165244
- `data-parallel-size`: number of ranks to run in Data Parallel deployment, from 1 to 8, default is 1. The ports will be assigned as follows: rank 0 will run on the configured `port`, rank 1 on `port`+1, etc.
@@ -177,6 +256,10 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
177256
- Example URL `https://huggingface.co/datasets/hf07397/inference-sim-datasets/resolve/91ffa7aafdfd6b3b1af228a517edc1e8f22cd274/huggingface/ShareGPT_Vicuna_unfiltered/conversations.sqlite3`
178257
- `dataset-in-memory`: If true, the entire dataset will be loaded into memory for faster access. This may require significant memory depending on the size of the dataset. Default is false.
179258
---
259+
- `ssl-certfile`: Path to SSL certificate file for HTTPS (optional)
260+
- `ssl-keyfile`: Path to SSL private key file for HTTPS (optional)
261+
- `self-signed-certs`: Enable automatic generation of self-signed certificates for HTTPS
262+
---
180263
In addition, as we are using klog, the following parameters are available:
181264
- `add_dir_header`: if true, adds the file directory to the header of the log messages
182265
- `alsologtostderr`: log to standard error as well as files (no effect when -logtostderr=true)

0 commit comments

Comments
 (0)