
Commit c464187

Adding LLM Metrics to MA documentation (#856)
* Updating documentation for LLM metric support
* Rewriting LLM config search paragraph for clarity
1 parent 8298d83 commit c464187

3 files changed: +53 -11 lines changed

README.md

Lines changed: 5 additions & 1 deletion
@@ -19,12 +19,13 @@ limitations under the License.
 # Triton Model Analyzer

 > [!Warning]
+>
 > ##### LATEST RELEASE
+>
 > You are currently on the `main` branch which tracks under-development progress towards the next release. <br>
 > The latest release of the Triton Model Analyzer is 1.38.0 and is available on branch
 > [r24.03](https://github.com/triton-inference-server/model_analyzer/tree/r24.03).

-
 Triton Model Analyzer is a CLI tool which can help you find a more optimal configuration, on a given piece of hardware, for single, multiple, ensemble, or BLS models running on a [Triton Inference Server](https://github.com/triton-inference-server/server/). Model Analyzer will also generate reports to help you better understand the trade-offs of the different configurations along with their compute and memory requirements.

 <br><br>

@@ -55,6 +56,9 @@ Triton Model Analyzer is a CLI tool which can help you find a more optimal confi
 - [Multi-Model Search](docs/config_search.md#multi-model-search-mode): Model Analyzer can help you
   find the optimal settings when profiling multiple concurrent models, utilizing the [Quick Search](docs/config_search.md#quick-search-mode) algorithm

+- [LLM Search](docs/config_search.md#llm-search-mode): Model Analyzer can help you
+  find the optimal settings when profiling large language models, utilizing the [Quick Search](docs/config_search.md#quick-search-mode) algorithm
+
 ### Other Features

 - [Detailed and summary reports](docs/report.md): Model Analyzer is able to generate

docs/config.md

Lines changed: 16 additions & 10 deletions
@@ -303,6 +303,9 @@ cpu_only_composing_models: <comma-delimited-string-list>
 # Allows custom configuration of perf analyzer instances used by model analyzer
 [ perf_analyzer_flags: <dict> ]

+# Allows custom configuration of GenAI-Perf instances used by model analyzer
+[ genai_perf_flags: <dict> ]
+
 # Allows custom configuration of the environment variables for tritonserver instances
 # launched by model analyzer
 [ triton_server_environment: <dict> ]
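As a rough illustration of where `genai_perf_flags` sits in a config file, here is a minimal sketch that sets both flag dictionaries globally. The `percentile` flag and the model name `my_llm` are illustrative assumptions; the `genai_perf_flags` keys mirror the example added to config_search.md below:

```yaml
model_repository: /path/to/model/repository/

profile_models:
  - my_llm                # hypothetical model name

# Flags passed through to each perf_analyzer instance
perf_analyzer_flags:
  percentile: 95          # assumed flag; any valid perf_analyzer option

# Flags passed through to each GenAI-Perf instance
genai_perf_flags:
  backend: vllm
  streaming: true
```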
@@ -375,7 +378,7 @@ of the types of constraints allowed:
 | `perf_throughput`          | inf / sec | min | Specify minimum desired throughput.                     |
 | `perf_latency_p99`         | ms        | max | Specify maximum tolerable latency or latency budget.    |
 | `output_token_throughput`  | tok / sec | min | Specify minimum desired output token throughput.        |
-| `inter_token_latency_p99`  | ms        | max | Specify maximum tolerable input token latency.          |
+| `inter_token_latency_p99`  | ms        | max | Specify maximum tolerable inter token latency.          |
 | `time_to_first_token_p99`  | ms        | max | Specify maximum tolerable time to first token latency.  |
 | `gpu_used_memory`          | MB        | max | Specify maximum GPU memory used by model.               |
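As a hedged sketch of how the token-metric constraints above might be combined, assuming the same `min`/`max` shape as config.md's existing constraint examples (the model name `my_llm` and the threshold values are made up):

```yaml
profile_models:
  my_llm:                      # hypothetical model name
    constraints:
      time_to_first_token_p99:
        max: 200               # ms, upper bound on first-token latency
      inter_token_latency_p99:
        max: 50                # ms, upper bound on inter-token latency
      output_token_throughput:
        min: 1000              # tok/sec, lower bound on token throughput
```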

@@ -457,15 +460,18 @@ profile_models:
 Objectives specify the sorting criteria for the final results. The fields below
 are supported under this object type:

-| Option Name        | Description                                             |
-| :----------------- | :------------------------------------------------------ |
-| `perf_throughput`  | Use throughput as the objective.                        |
-| `perf_latency_p99` | Use latency as the objective.                           |
-| `gpu_used_memory`  | Use GPU memory used by the model as the objective.      |
-| `gpu_free_memory`  | Use GPU memory not used by the model as the objective.  |
-| `gpu_utilization`  | Use the GPU utilization as the objective.               |
-| `cpu_used_ram`     | Use RAM used by the model as the objective.             |
-| `cpu_free_ram`     | Use RAM not used by the model as the objective.         |
+| Option Name               | Description                                             |
+| :------------------------ | :------------------------------------------------------ |
+| `perf_throughput`         | Use throughput as the objective.                        |
+| `perf_latency_p99`        | Use latency as the objective.                           |
+| `gpu_used_memory`         | Use GPU memory used by the model as the objective.      |
+| `gpu_free_memory`         | Use GPU memory not used by the model as the objective.  |
+| `gpu_utilization`         | Use the GPU utilization as the objective.               |
+| `cpu_used_ram`            | Use RAM used by the model as the objective.             |
+| `cpu_free_ram`            | Use RAM not used by the model as the objective.         |
+| `output_token_throughput` | Use output token throughput as the objective.           |
+| `inter_token_latency_p99` | Use inter token latency as the objective.               |
+| `time_to_first_token_p99` | Use time to first token latency as the objective.       |

 An example `objectives` that will sort the results by throughput looks like
 below:
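For the newly added rows, a hedged sketch of an `objectives` list that sorts results by the token metrics instead, assuming the same list form as the throughput example referenced above (`my_llm` is a placeholder model name):

```yaml
profile_models:
  my_llm:                          # hypothetical model name
    objectives:
      - time_to_first_token_p99    # sort primarily by first-token latency
      - inter_token_latency_p99
```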

docs/config_search.md

Lines changed: 32 additions & 0 deletions
@@ -24,6 +24,7 @@ limitations under the License.
 - [Quick Search Mode](#quick-search-mode)
   - [Ensemble Model Search](#ensemble-model-search)
   - [BLS Model Search](#bls-model-search)
+  - [LLM Search](#llm-search)
 - [Multi-Model Search Mode](#multi-model-search-mode)

 <br>
@@ -303,6 +304,37 @@ After Model Analyzer has found the best config(s), it will then sweep the top-N

 ---

+## LLM Search
+
+_This mode has the following limitations:_
+
+- Summary/Detailed reports do not include the new metrics
+
+In order to profile LLMs, you must tell Model Analyzer that the model type is LLM by setting `--model-type LLM` in the CLI/config file. You can specify CLI options for the GenAI-Perf tool using `genai_perf_flags`. See the [GenAI-Perf CLI](https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/genai-perf/README.md#cli) documentation for a list of the flags that can be specified.
+
+LLMs can be optimized using either Quick or Brute search mode.
+
+_An example model analyzer YAML config for an LLM:_
+
+```yaml
+model_repository: /path/to/model/repository/
+
+model_type: LLM
+client_protocol: grpc
+
+genai_perf_flags:
+  backend: vllm
+  streaming: true
+```
+
+For LLMs, three new metrics are reported: **Inter-token Latency**, **Time to First Token Latency**, and **Output Token Throughput**.
+
+These new metrics can be specified as either objectives or constraints.
+
+_**NOTE: To enable these new metrics, you must enable `streaming` in `genai_perf_flags` and set `client_protocol` to `grpc`.**_
+
+---
+
 ## Multi-Model Search Mode

 _This mode has the following limitations:_
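Pulling the LLM Search additions together, a hedged end-to-end sketch of a config that satisfies the streaming/gRPC note and uses one of the new metrics as an objective (the model name is illustrative, and the available `genai_perf_flags` depend on your GenAI-Perf version):

```yaml
model_repository: /path/to/model/repository/

model_type: LLM
client_protocol: grpc            # gRPC is required for the new LLM metrics

profile_models:
  my_llm:                        # hypothetical model name
    objectives:
      - output_token_throughput

genai_perf_flags:
  backend: vllm
  streaming: true                # streaming must be enabled for the new metrics
```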
