
Commit 0219764

docs: Add example outputs to OpenAI Frontend docs (#7691)
Co-authored-by: Ryan McCormick <[email protected]>
Parent: 8c6657e

File tree: 1 file changed

python/openai/README.md (114 additions & 8 deletions)
@@ -70,6 +70,22 @@ pip install -r requirements.txt
# NOTE: Adjust the --tokenizer based on the model being used
python3 openai_frontend/main.py --model-repository tests/vllm_models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
```
+Once the server has successfully started, you should see something like this:
+```
+...
++-----------------------+---------+--------+
+| Model                 | Version | Status |
++-----------------------+---------+--------+
+| llama-3.1-8b-instruct | 1       | READY  | <- Correct Model Loaded in Triton
++-----------------------+---------+--------+
+...
+Found model: name='llama-3.1-8b-instruct', backend='vllm'
+[WARNING] Adding CORS for the following origins: ['http://localhost']
+INFO:     Started server process [126]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit) <- OpenAI Frontend Started Successfully
+```
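As a quick sanity check once the server is up, you can list the models the frontend is serving. This is a minimal sketch that assumes the frontend exposes the standard OpenAI `/v1/models` listing endpoint on the same port:

```bash
# List the models served by the OpenAI-compatible frontend (sketch; assumes
# the standard /v1/models endpoint is available on port 9000).
curl -s http://localhost:9000/v1/models | jq
```

The model names returned here are the values you can pass as `"model"` in the requests below.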

4. Send a `/v1/chat/completions` request:
- Note the use of `jq` is optional, but provides a nicely formatted output for JSON responses.
@@ -80,6 +96,31 @@ curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/
"messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
```
+which should provide output that looks like this:
+```json
+{
+  "id": "cmpl-6930b296-7ef8-11ef-bdd1-107c6149ca79",
+  "choices": [
+    {
+      "finish_reason": "stop",
+      "index": 0,
+      "message":
+      {
+        "content": "This is only a test.",
+        "tool_calls": null,
+        "role": "assistant",
+        "function_call": null
+      },
+      "logprobs": null
+    }
+  ],
+  "created": 1727679085,
+  "model": "llama-3.1-8b-instruct",
+  "system_fingerprint": null,
+  "object": "chat.completion",
+  "usage": null
+}
+```
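The same endpoint can also return a streamed response. A minimal sketch, assuming the served model supports streaming: the payload simply adds `"stream": true`, and the reply then arrives as server-sent `data:` chunks rather than a single JSON object, so it is not piped through `jq` here.

```bash
# Sketch: stream the chat completion as server-sent events instead of a single
# JSON response. Assumes the same model and port as the example above.
MODEL="llama-3.1-8b-instruct"
curl -s -N http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "'${MODEL}'",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "stream": true
}'
```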

5. Send a `/v1/completions` request:
- Note the use of `jq` is optional, but provides a nicely formatted output for JSON responses.
@@ -90,8 +131,30 @@ curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json'
"prompt": "Machine learning is"
}' | jq
```
+which should provide an output that looks like this:
+```json
+{
+  "id": "cmpl-d51df75c-7ef8-11ef-bdd1-107c6149ca79",
+  "choices": [
+    {
+      "finish_reason": "stop",
+      "index": 0,
+      "logprobs": null,
+      "text": " a field of computer science that focuses on developing algorithms that allow computers to learn from"
+    }
+  ],
+  "created": 1727679266,
+  "model": "llama-3.1-8b-instruct",
+  "system_fingerprint": null,
+  "object": "text_completion",
+  "usage": null
+}
+```
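Generation length and sampling can be adjusted through the usual OpenAI request fields. A minimal sketch, assuming standard fields such as `max_tokens` and `temperature` are forwarded to the backend (exact support may vary by model and backend):

```bash
# Sketch: request a longer, lower-temperature completion. Assumes max_tokens
# and temperature are honored by the serving backend.
MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "'${MODEL}'",
  "prompt": "Machine learning is",
  "max_tokens": 64,
  "temperature": 0.2
}' | jq -r '.choices[0].text'
```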

6. Benchmark with `genai-perf`:
+- To install genai-perf in this container, see the instructions [here](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf#install-perf-analyzer-ubuntu-python-38)
+- Or try using genai-perf from the [SDK container](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf#install-perf-analyzer-ubuntu-python-38)
+
```bash
MODEL="llama-3.1-8b-instruct"
TOKENIZER="meta-llama/Meta-Llama-3.1-8B-Instruct"
@@ -100,12 +163,28 @@ genai-perf \
--tokenizer ${TOKENIZER} \
--service-kind openai \
--endpoint-type chat \
---synthetic-input-tokens-mean 256 \
---synthetic-input-tokens-stddev 0 \
---output-tokens-mean 256 \
---output-tokens-stddev 0 \
+--url localhost:9000 \
--streaming
```
+which should provide an output that looks like:
+```
+2024-10-14 22:43 [INFO] genai_perf.parser:82 - Profiling these models: llama-3.1-8b-instruct
+2024-10-14 22:43 [INFO] genai_perf.wrapper:163 - Running Perf Analyzer : 'perf_analyzer -m llama-3.1-8b-instruct --async --input-data artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/inputs.json -i http --concurrency-range 1 --endpoint v1/chat/completions --service-kind openai -u localhost:9000 --measurement-interval 10000 --stability-percentage 999 --profile-export-file artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export.json'
+NVIDIA GenAI-Perf | LLM Metrics
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
+┃ Statistic                         ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
+│ Time to first token (ms)          │  71.66 │  64.32 │  86.52 │  76.13 │  74.92 │  73.26 │
+│ Inter token latency (ms)          │  18.47 │  18.25 │  18.72 │  18.67 │  18.61 │  18.53 │
+│ Request latency (ms)              │ 348.00 │ 274.60 │ 362.27 │ 355.41 │ 352.29 │ 350.66 │
+│ Output sequence length            │  15.96 │  12.00 │  16.00 │  16.00 │  16.00 │  16.00 │
+│ Input sequence length             │ 549.66 │ 548.00 │ 551.00 │ 550.00 │ 550.00 │ 550.00 │
+│ Output token throughput (per sec) │  45.84 │    N/A │    N/A │    N/A │    N/A │    N/A │
+│ Request throughput (per sec)      │   2.87 │    N/A │    N/A │    N/A │    N/A │    N/A │
+└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
+2024-10-14 22:44 [INFO] genai_perf.export_data.json_exporter:62 - Generating artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.json
+2024-10-14 22:44 [INFO] genai_perf.export_data.csv_exporter:71 - Generating artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.csv
+```
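The run also writes its results to disk under the artifact paths shown in the log lines above, so they can be inspected after the fact, for example:

```bash
# Sketch: inspect the artifacts written by the genai-perf run above.
# The directory name comes from the log output and depends on model/concurrency.
ls artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/
cat artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.csv
```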

7. Use the OpenAI python client directly:
```python
@@ -142,8 +221,9 @@ pytest -v tests/

## TensorRT-LLM

-0. Prepare your model repository for serving a TensorRT-LLM model:
-https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#quick-start
+0. Prepare your model repository for a TensorRT-LLM model, build the engine, etc. You can try any of the following options:
+- [Triton CLI](https://github.com/triton-inference-server/triton_cli/)
+- [TRT-LLM Backend Quickstart](https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#quick-start)
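For the Triton CLI route, the flow is roughly: import the model (which builds the TRT-LLM engine and generates a Triton model repository), then point the OpenAI frontend at that repository. A rough sketch only; command names and flags can differ between Triton CLI versions, so verify against the Triton CLI README:

```bash
# Illustrative sketch -- verify flags against the Triton CLI docs.
# Import a model with the TRT-LLM backend; this builds the engine and writes
# a Triton model repository.
triton import -m llama-3.1-8b-instruct --backend tensorrtllm

# Then launch the OpenAI frontend against the generated repository, e.g.:
# python3 openai_frontend/main.py --model-repository <generated repo> --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
```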

1. Launch the container:
- Mounts the `~/.huggingface/cache` for re-use of downloaded models across runs, containers, etc.
@@ -171,20 +251,46 @@ pip install -r requirements.txt
2. Launch the OpenAI server:
```bash
# NOTE: Adjust the --tokenizer based on the model being used
-python3 openai_frontend/main.py --model-repository tests/tensorrtllm_models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
+python3 openai_frontend/main.py --model-repository path/to/models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
```

3. Send a `/v1/chat/completions` request:
- Note the use of `jq` is optional, but provides a nicely formatted output for JSON responses.
```bash
+# MODEL should be the client-facing model name in your model repository for a pipeline like TRT-LLM.
+# For example, this could also be "ensemble", or something like "gpt2" if generated from Triton CLI
MODEL="tensorrt_llm_bls"
curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
```
+which should provide an output that looks like this:
+```json
+{
+  "id": "cmpl-704c758c-8a84-11ef-b106-107c6149ca79",
+  "choices": [
+    {
+      "finish_reason": "stop",
+      "index": 0,
+      "message": {
+        "content": "It looks like you're testing the system!",
+        "tool_calls": null,
+        "role": "assistant",
+        "function_call": null
+      },
+      "logprobs": null
+    }
+  ],
+  "created": 1728948689,
+  "model": "llama-3-8b-instruct",
+  "system_fingerprint": null,
+  "object": "chat.completion",
+  "usage": null
+}
+```
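If you are unsure which client-facing model name to use (see the `MODEL` comment above), you can list what the frontend is serving. A minimal sketch, assuming the standard `/v1/models` listing endpoint is exposed:

```bash
# Sketch: list served model names to find the client-facing one
# (e.g. "tensorrt_llm_bls" or "ensemble" for a TRT-LLM pipeline).
curl -s http://localhost:9000/v1/models | jq -r '.data[].id'
```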

-The other examples should be the same as vLLM, except that you should set `MODEL="tensorrt_llm_bls"`,
+The other examples should be the same as vLLM, except that you should set `MODEL="tensorrt_llm_bls"` or `MODEL="ensemble"`,
everywhere applicable as seen in the example request above.

## KServe Frontends
