@@ -70,6 +70,22 @@ pip install -r requirements.txt
 # NOTE: Adjust the --tokenizer based on the model being used
 python3 openai_frontend/main.py --model-repository tests/vllm_models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
 ```
+Once the server has successfully started, you should see something like this:
+```
+...
++-----------------------+---------+--------+
+| Model                 | Version | Status |
++-----------------------+---------+--------+
+| llama-3.1-8b-instruct | 1       | READY  | <- Correct Model Loaded in Triton
++-----------------------+---------+--------+
+...
+Found model: name='llama-3.1-8b-instruct', backend='vllm'
+[WARNING] Adding CORS for the following origins: ['http://localhost']
+INFO: Started server process [126]
+INFO: Waiting for application startup.
+INFO: Application startup complete.
+INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit) <- OpenAI Frontend Started Successfully
+```
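+To double-check what the frontend is serving, you can also list its models. This is a minimal
+sketch assuming the frontend exposes the standard OpenAI `/v1/models` endpoint on the same port
+shown above:
+```bash
+# List the models exposed by the OpenAI-compatible frontend (assumes port 9000 as above)
+curl -s http://localhost:9000/v1/models | jq
+```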
 
 4. Send a `/v1/chat/completions` request:
    - Note the use of `jq` is optional, but provides a nicely formatted output for JSON responses.
@@ -80,6 +96,31 @@ curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/
8096 "messages": [{"role": "user", "content": "Say this is a test!"}]
8197}' | jq
8298```
99+ which should provide output that looks like this:
100+ ``` json
101+ {
102+ "id" : " cmpl-6930b296-7ef8-11ef-bdd1-107c6149ca79" ,
103+ "choices" : [
104+ {
105+ "finish_reason" : " stop" ,
106+ "index" : 0 ,
107+ "message" :
108+ {
109+ "content" : " This is only a test." ,
110+ "tool_calls" : null ,
111+ "role" : " assistant" ,
112+ "function_call" : null
113+ },
114+ "logprobs" : null
115+ }
116+ ],
117+ "created" : 1727679085 ,
118+ "model" : " llama-3.1-8b-instruct" ,
119+ "system_fingerprint" : null ,
120+ "object" : " chat.completion" ,
121+ "usage" : null
122+ }
123+ ```
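+Streaming is also exercised by the `genai-perf --streaming` benchmark below. As a rough sketch of
+requesting it directly (assuming the standard OpenAI `stream` field is honored by your model), the
+response is then returned incrementally as server-sent events:
+```bash
+# Request a streamed chat completion; each SSE "data:" line carries a partial delta
+curl -s -N http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
+  "model": "llama-3.1-8b-instruct",
+  "messages": [{"role": "user", "content": "Say this is a test!"}],
+  "stream": true
+}'
+```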
 
 5. Send a `/v1/completions` request:
    - Note the use of `jq` is optional, but provides a nicely formatted output for JSON responses.
@@ -90,8 +131,30 @@ curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json'
90131 "prompt": "Machine learning is"
91132}' | jq
92133```
134+ which should provide an output that looks like this:
135+ ``` json
136+ {
137+ "id" : " cmpl-d51df75c-7ef8-11ef-bdd1-107c6149ca79" ,
138+ "choices" : [
139+ {
140+ "finish_reason" : " stop" ,
141+ "index" : 0 ,
142+ "logprobs" : null ,
143+ "text" : " a field of computer science that focuses on developing algorithms that allow computers to learn from"
144+ }
145+ ],
146+ "created" : 1727679266 ,
147+ "model" : " llama-3.1-8b-instruct" ,
148+ "system_fingerprint" : null ,
149+ "object" : " text_completion" ,
150+ "usage" : null
151+ }
152+ ```
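+The request body accepts the usual OpenAI sampling fields as well. As a sketch, assuming the
+standard `max_tokens` and `temperature` fields are supported for your backend and model, you can
+bound and steer the generation like this:
+```bash
+# Ask for a longer, lower-temperature completion (field support may vary by backend/model)
+curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json' -d '{
+  "model": "llama-3.1-8b-instruct",
+  "prompt": "Machine learning is",
+  "max_tokens": 64,
+  "temperature": 0.2
+}' | jq
+```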
 
 6. Benchmark with `genai-perf`:
+   - To install genai-perf in this container, see the instructions [here](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf#install-perf-analyzer-ubuntu-python-38), or see the install sketch below.
+   - Or try using genai-perf from the [SDK container](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf#install-perf-analyzer-ubuntu-python-38)
+
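+For reference, a minimal install inside this container might look like the following (this assumes
+genai-perf is available on PyPI for your environment; see the linked instructions for the full,
+supported steps):
+```bash
+# Install genai-perf from PyPI (version/compatibility caveats are covered in the linked docs)
+pip install genai-perf
+```
+With genai-perf available, run the benchmark: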
 ```bash
 MODEL="llama-3.1-8b-instruct"
 TOKENIZER="meta-llama/Meta-Llama-3.1-8B-Instruct"
@@ -100,12 +163,28 @@ genai-perf \
   --tokenizer ${TOKENIZER} \
   --service-kind openai \
   --endpoint-type chat \
-  --synthetic-input-tokens-mean 256 \
-  --synthetic-input-tokens-stddev 0 \
-  --output-tokens-mean 256 \
-  --output-tokens-stddev 0 \
+  --url localhost:9000 \
   --streaming
 ```
+which should provide an output that looks like:
+```
+2024-10-14 22:43 [INFO] genai_perf.parser:82 - Profiling these models: llama-3.1-8b-instruct
+2024-10-14 22:43 [INFO] genai_perf.wrapper:163 - Running Perf Analyzer : 'perf_analyzer -m llama-3.1-8b-instruct --async --input-data artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/inputs.json -i http --concurrency-range 1 --endpoint v1/chat/completions --service-kind openai -u localhost:9000 --measurement-interval 10000 --stability-percentage 999 --profile-export-file artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export.json'
+                              NVIDIA GenAI-Perf | LLM Metrics
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
+┃                         Statistic ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
+│          Time to first token (ms) │  71.66 │  64.32 │  86.52 │  76.13 │  74.92 │  73.26 │
+│          Inter token latency (ms) │  18.47 │  18.25 │  18.72 │  18.67 │  18.61 │  18.53 │
+│              Request latency (ms) │ 348.00 │ 274.60 │ 362.27 │ 355.41 │ 352.29 │ 350.66 │
+│            Output sequence length │  15.96 │  12.00 │  16.00 │  16.00 │  16.00 │  16.00 │
+│             Input sequence length │ 549.66 │ 548.00 │ 551.00 │ 550.00 │ 550.00 │ 550.00 │
+│ Output token throughput (per sec) │  45.84 │    N/A │    N/A │    N/A │    N/A │    N/A │
+│      Request throughput (per sec) │   2.87 │    N/A │    N/A │    N/A │    N/A │    N/A │
+└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
+2024-10-14 22:44 [INFO] genai_perf.export_data.json_exporter:62 - Generating artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.json
+2024-10-14 22:44 [INFO] genai_perf.export_data.csv_exporter:71 - Generating artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.csv
+```
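+To measure behavior under load rather than a single in-flight request, the same command can be
+re-run at a higher concurrency. This is a sketch that assumes your genai-perf version supports the
+`--concurrency` flag (check `genai-perf --help`); the other flags are unchanged from above:
+```bash
+# Re-run the benchmark with 8 concurrent requests instead of the single request above
+genai-perf \
+  -m ${MODEL} \
+  --tokenizer ${TOKENIZER} \
+  --service-kind openai \
+  --endpoint-type chat \
+  --url localhost:9000 \
+  --streaming \
+  --concurrency 8
+```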
 
 7. Use the OpenAI python client directly:
 ```python
@@ -142,8 +221,9 @@ pytest -v tests/
 
 ## TensorRT-LLM
 
-0. Prepare your model repository for serving a TensorRT-LLM model:
-   https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#quick-start
+0. Prepare your model repository for a TensorRT-LLM model, build the engine, etc. You can try any of the following options:
+   - [Triton CLI](https://github.com/triton-inference-server/triton_cli/) (see the sketch below)
+   - [TRT-LLM Backend Quickstart](https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#quick-start)
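+As a rough sketch of the Triton CLI route (the exact commands and supported model names depend on
+the CLI version, so treat this as illustrative and follow the Triton CLI README for details):
+```bash
+# Install the Triton CLI, then import a model to build the TRT-LLM engine and model repository,
+# which you can later point the OpenAI frontend at via --model-repository
+pip install git+https://github.com/triton-inference-server/triton_cli.git
+triton import -m llama-3.1-8b-instruct --backend tensorrtllm
+```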
 
 1. Launch the container:
    - Mounts the `~/.huggingface/cache` for re-use of downloaded models across runs, containers, etc.
@@ -171,20 +251,46 @@ pip install -r requirements.txt
 2. Launch the OpenAI server:
 ```bash
 # NOTE: Adjust the --tokenizer based on the model being used
-python3 openai_frontend/main.py --model-repository tests/tensorrtllm_models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
+python3 openai_frontend/main.py --model-repository path/to/models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
 ```
 
 3. Send a `/v1/chat/completions` request:
    - Note the use of `jq` is optional, but provides a nicely formatted output for JSON responses.
 ```bash
+# MODEL should be the client-facing model name in your model repository for a pipeline like TRT-LLM.
+# For example, this could also be "ensemble", or something like "gpt2" if generated from Triton CLI
 MODEL="tensorrt_llm_bls"
 curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
   "model": "'${MODEL}'",
   "messages": [{"role": "user", "content": "Say this is a test!"}]
 }' | jq
 ```
+which should provide an output that looks like this:
+```json
+{
+  "id": "cmpl-704c758c-8a84-11ef-b106-107c6149ca79",
+  "choices": [
+    {
+      "finish_reason": "stop",
+      "index": 0,
+      "message": {
+        "content": "It looks like you're testing the system!",
+        "tool_calls": null,
+        "role": "assistant",
+        "function_call": null
+      },
+      "logprobs": null
+    }
+  ],
+  "created": 1728948689,
+  "model": "llama-3-8b-instruct",
+  "system_fingerprint": null,
+  "object": "chat.completion",
+  "usage": null
+}
+```
 
-The other examples should be the same as vLLM, except that you should set `MODEL="tensorrt_llm_bls"`,
+The other examples should be the same as vLLM, except that you should set `MODEL="tensorrt_llm_bls"` or `MODEL="ensemble"`,
 everywhere applicable as seen in the example request above.
 
 ## KServe Frontends