
Commit c464187

Adding LLM Metrics to MA documentation (#856)
* Updating documentation for LLM metric support
* Rewriting LLM config search paragraph for clarity
1 parent 8298d83 commit c464187

3 files changed: +53 -11 lines changed

README.md

Lines changed: 5 additions & 1 deletion
@@ -19,12 +19,13 @@ limitations under the License.
 # Triton Model Analyzer

 > [!Warning]
+>
 > ##### LATEST RELEASE
+>
 > You are currently on the `main` branch which tracks under-development progress towards the next release. <br>
 > The latest release of the Triton Model Analyzer is 1.38.0 and is available on branch
 > [r24.03](https://github.com/triton-inference-server/model_analyzer/tree/r24.03).

-
 Triton Model Analyzer is a CLI tool which can help you find a more optimal configuration, on a given piece of hardware, for single, multiple, ensemble, or BLS models running on a [Triton Inference Server](https://github.com/triton-inference-server/server/). Model Analyzer will also generate reports to help you better understand the trade-offs of the different configurations along with their compute and memory requirements.

 <br><br>

@@ -55,6 +56,9 @@ Triton Model Analyzer is a CLI tool which can help you find a more optimal confi
 - [Multi-Model Search](docs/config_search.md#multi-model-search-mode): Model Analyzer can help you
   find the optimal settings when profiling multiple concurrent models, utilizing the [Quick Search](docs/config_search.md#quick-search-mode) algorithm

+- [LLM Search](docs/config_search.md#llm-search-mode): Model Analyzer can help you
+  find the optimal settings when profiling large language models, utilizing the [Quick Search](docs/config_search.md#quick-search-mode) algorithm
+
 ### Other Features

 - [Detailed and summary reports](docs/report.md): Model Analyzer is able to generate

docs/config.md

Lines changed: 16 additions & 10 deletions
@@ -303,6 +303,9 @@ cpu_only_composing_models: <comma-delimited-string-list>
 # Allows custom configuration of perf analyzer instances used by model analyzer
 [ perf_analyzer_flags: <dict> ]

+# Allows custom configuration of GenAI-Perf instances used by model analyzer
+[ genai_perf_flags: <dict> ]
+
 # Allows custom configuration of the environment variables for tritonserver instances
 # launched by model analyzer
 [ triton_server_environment: <dict> ]
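As a rough illustration of where `genai_perf_flags` sits in a config file, here is a minimal sketch that sets both flag dictionaries globally. The `percentile` flag and the model name `my_llm` are illustrative assumptions; the `genai_perf_flags` keys mirror the example added to config_search.md below:

```yaml
model_repository: /path/to/model/repository/

profile_models:
  - my_llm                # hypothetical model name

# Flags passed through to each perf_analyzer instance
perf_analyzer_flags:
  percentile: 95          # assumed flag; any valid perf_analyzer option

# Flags passed through to each GenAI-Perf instance
genai_perf_flags:
  backend: vllm
  streaming: true
```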
@@ -375,7 +378,7 @@ of the types of constraints allowed:
 | `perf_throughput`          | inf / sec | min | Specify minimum desired throughput.                     |
 | `perf_latency_p99`         | ms        | max | Specify maximum tolerable latency or latency budget.    |
 | `output_token_throughput`  | tok / sec | min | Specify minimum desired output token throughput.        |
-| `inter_token_latency_p99`  | ms        | max | Specify maximum tolerable input token latency.          |
+| `inter_token_latency_p99`  | ms        | max | Specify maximum tolerable inter token latency.          |
 | `time_to_first_token_p99`  | ms        | max | Specify maximum tolerable time to first token latency.  |
 | `gpu_used_memory`          | MB        | max | Specify maximum GPU memory used by model.               |
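As a hedged sketch of how the token-metric constraints above might be combined, assuming the same `min`/`max` shape as config.md's existing constraint examples (the model name `my_llm` and the threshold values are made up):

```yaml
profile_models:
  my_llm:                      # hypothetical model name
    constraints:
      time_to_first_token_p99:
        max: 200               # ms, upper bound on first-token latency
      inter_token_latency_p99:
        max: 50                # ms, upper bound on inter-token latency
      output_token_throughput:
        min: 1000              # tok/sec, lower bound on token throughput
```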

@@ -457,15 +460,18 @@ profile_models:
 Objectives specify the sorting criteria for the final results. The fields below
 are supported under this object type:

-| Option Name        | Description                                             |
-| :----------------- | :------------------------------------------------------ |
-| `perf_throughput`  | Use throughput as the objective.                        |
-| `perf_latency_p99` | Use latency as the objective.                           |
-| `gpu_used_memory`  | Use GPU memory used by the model as the objective.      |
-| `gpu_free_memory`  | Use GPU memory not used by the model as the objective.  |
-| `gpu_utilization`  | Use the GPU utilization as the objective.               |
-| `cpu_used_ram`     | Use RAM used by the model as the objective.             |
-| `cpu_free_ram`     | Use RAM not used by the model as the objective.         |
+| Option Name               | Description                                             |
+| :------------------------ | :------------------------------------------------------ |
+| `perf_throughput`         | Use throughput as the objective.                        |
+| `perf_latency_p99`        | Use latency as the objective.                           |
+| `gpu_used_memory`         | Use GPU memory used by the model as the objective.      |
+| `gpu_free_memory`         | Use GPU memory not used by the model as the objective.  |
+| `gpu_utilization`         | Use the GPU utilization as the objective.               |
+| `cpu_used_ram`            | Use RAM used by the model as the objective.             |
+| `cpu_free_ram`            | Use RAM not used by the model as the objective.         |
+| `output_token_throughput` | Use output token throughput as the objective.           |
+| `inter_token_latency_p99` | Use inter token latency as the objective.               |
+| `time_to_first_token_p99` | Use time to first token latency as the objective.       |

 An example `objectives` that will sort the results by throughput looks like
 below:
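For the newly added rows, a hedged sketch of an `objectives` list that sorts results by the token metrics instead, assuming the same list form as the throughput example referenced above (`my_llm` is a placeholder model name):

```yaml
profile_models:
  my_llm:                          # hypothetical model name
    objectives:
      - time_to_first_token_p99    # sort primarily by first-token latency
      - inter_token_latency_p99
```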

docs/config_search.md

Lines changed: 32 additions & 0 deletions
@@ -24,6 +24,7 @@ limitations under the License.
 - [Quick Search Mode](#quick-search-mode)
   - [Ensemble Model Search](#ensemble-model-search)
   - [BLS Model Search](#bls-model-search)
+  - [LLM Search](#llm-search)
 - [Multi-Model Search Mode](#multi-model-search-mode)

 <br>
@@ -303,6 +304,37 @@ After Model Analyzer has found the best config(s), it will then sweep the top-N

 ---

+## LLM Search
+
+_This mode has the following limitations:_
+
+- Summary/Detailed reports do not include the new metrics
+
+In order to profile LLMs, you must tell Model Analyzer that the model type is LLM by setting `--model-type LLM` in the CLI/config file. You can specify CLI options for the GenAI-Perf tool using `genai_perf_flags`. See the [GenAI-Perf CLI](https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/genai-perf/README.md#cli) documentation for a list of the flags that can be specified.
+
+LLMs can be optimized using either Quick or Brute search mode.
+
+_An example model analyzer YAML config for an LLM:_
+
+```yaml
+model_repository: /path/to/model/repository/
+
+model_type: LLM
+client_protocol: grpc
+
+genai_perf_flags:
+  backend: vllm
+  streaming: true
+```
+
+For LLMs, three new metrics are reported: **Inter-token Latency**, **Time to First Token Latency**, and **Output Token Throughput**.
+
+These new metrics can be specified as either objectives or constraints.
+
+_**NOTE: To enable these new metrics, you must enable `streaming` in `genai_perf_flags` and set `client_protocol` to `grpc`.**_
+
+---
+
 ## Multi-Model Search Mode

 _This mode has the following limitations:_
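Pulling the LLM Search additions together, a hedged end-to-end sketch of a config that satisfies the streaming/gRPC note and uses one of the new metrics as an objective (the model name is illustrative, and the available `genai_perf_flags` depend on your GenAI-Perf version):

```yaml
model_repository: /path/to/model/repository/

model_type: LLM
client_protocol: grpc            # gRPC is required for the new LLM metrics

profile_models:
  my_llm:                        # hypothetical model name
    objectives:
      - output_token_throughput

genai_perf_flags:
  backend: vllm
  streaming: true                # streaming must be enabled for the new metrics
```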
