Triton Model Analyzer is a CLI tool that can help you find an optimal configuration, on a given piece of hardware, for single, multiple, ensemble, or BLS models running on a [Triton Inference Server](https://github.com/triton-inference-server/server/). Model Analyzer also generates reports to help you better understand the trade-offs of the different configurations, along with their compute and memory requirements.
<br><br>
- [Multi-Model Search](docs/config_search.md#multi-model-search-mode): Model Analyzer can help you find the optimal settings when profiling multiple concurrent models, utilizing the [Quick Search](docs/config_search.md#quick-search-mode) algorithm
- [LLM Search](docs/config_search.md#llm-search-mode): Model Analyzer can help you find the optimal settings when profiling large language models, utilizing the [Quick Search](docs/config_search.md#quick-search-mode) algorithm
### Other Features
- [Detailed and summary reports](docs/report.md): Model Analyzer is able to generate detailed and summary reports to help you understand the trade-offs between the profiled configurations
---
## LLM Search
_This mode has the following limitations:_
- Summary/Detailed reports do not include the new metrics
In order to profile LLMs, you must tell Model Analyzer that the model type is LLM by setting `--model-type LLM` on the CLI or in the config file. You can specify CLI options for the GenAI-Perf tool using `genai_perf_flags`. See the [GenAI-Perf CLI](https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/genai-perf/README.md#cli) documentation for a list of the flags that can be specified.
LLMs can be optimized using either Quick or Brute search mode.
_An example Model Analyzer YAML config for an LLM:_
```yaml
model_repository: /path/to/model/repository/

model_type: LLM
client_protocol: grpc

genai_perf_flags:
  backend: vllm
  streaming: true
```
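The same config file also selects which search algorithm Model Analyzer runs. As a sketch, assuming the standard `run_config_search_mode` option from the config documentation applies unchanged to LLMs, Quick search can be requested explicitly:

```yaml
# Sketch: pick the search algorithm explicitly.
# Assumes the standard run_config_search_mode option
# (values such as quick or brute) applies to LLM profiling.
run_config_search_mode: quick
```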
For LLMs, three new metrics are reported: **Inter-token Latency**, **Time to First Token Latency**, and **Output Token Throughput**.
331
+
332
+
These new metrics can be specified as either objectives or constraints.
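As a sketch of what that looks like, the snippet below puts one new metric in `objectives` and bounds another with a `constraints` entry, following the usual Model Analyzer config shape. The exact metric tags (`output_token_throughput`, `inter_token_latency_p99`) and the model name are assumptions; check the metrics documentation for the real tags.

```yaml
# Sketch only: metric tags and model name are assumed, not confirmed.
profile_models:
  my_llm:                        # hypothetical model name
    objectives:
      - output_token_throughput  # maximize token throughput
    constraints:
      inter_token_latency_p99:   # cap p99 inter-token latency
        max: 50
```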
_**NOTE: In order to enable these new metrics, you must enable `streaming` in `genai_perf_flags` and set the client protocol to gRPC (`client_protocol: grpc`).**_
0 commit comments