Add early access documentation for multi-model (#541)
* First pass at adding multi-model documentation
* Further revisions
* Updated based on Tim's review comments
* Adding missing parameters in example
* Changed step 2 to indicate where the tritonserver container actually comes from
* Removing we
- [Brute and Quick search](docs/config_search.md): Model Analyzer can
help you automatically find the optimal settings for
to test the model with different concurrency and batch sizes of requests. Using
[Manual Config Search](docs/config_search.md#manual-brute-search), you can create manual sweeps for every parameter that can be specified in the model configuration.
- [Multi-Model Search](docs/config_search.md#multi-model-search-mode): **EARLY ACCESS** - Model Analyzer can help you
  find the optimal settings when profiling multiple concurrent models, utilizing our Quick Search algorithm.
- [Detailed and summary reports](docs/report.md): Model Analyzer is able to generate
  summarized and detailed reports that can help you better understand the trade-offs
  between different model configurations that can be used for your model.
- [QoS Constraints](docs/config.md#constraint): Constraints can help you
  filter out the Model Analyzer results based on your QoS requirements. For
  example, you can specify a latency budget to filter out model configurations
  that do not satisfy the specified latency threshold.
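As an illustrative sketch only, a latency budget could be expressed in a Model Analyzer YAML config along these lines (the metric key and the 100 ms value here are assumptions for illustration, not taken from this document):

```yaml
# Illustrative sketch: reject any model configuration whose
# p99 latency exceeds a 100 ms budget.
constraints:
  perf_latency_p99:
    max: 100
```

See [QoS Constraints](docs/config.md#constraint) for the authoritative constraint syntax.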
## Documentation
- [Installation](docs/install.md)
- [Quick Start](docs/quick_start.md)
- [Model Analyzer CLI](docs/cli.md)
- [Launch Modes](docs/launch_modes.md)
- [Configuring Model Analyzer](docs/config.md)
- [Model Analyzer Metrics](docs/metrics.md)
- [Model Config Search](docs/config_search.md)
- [Checkpointing](docs/checkpoints.md)
- [Model Analyzer Reports](docs/report.md)
- [Deployment with Kubernetes](docs/kubernetes_deploy.md)
# Reporting problems, asking questions
project. When help with code is needed, follow the process outlined in
the Stack Overflow (https://stackoverflow.com/help/mcve)
document. Ensure posted examples are:

- minimal – use as little code as possible that still produces the
  same problem
- complete – provide all parts needed to reproduce the problem. Check
  if you can strip external dependency and still show the problem. The
  less time we spend on reproducing problems the more time we have to
  fix it
- verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
| [`dynamic_batching`](https://github.com/triton-inference-server/server/blob/master/docs/user_guide/model_configuration.md#dynamic-batcher) | Dynamic batching is a feature of Triton that allows inference requests to be combined by the server, so that a batch is created dynamically. |
| [`max_batch_size`](https://github.com/triton-inference-server/server/blob/master/docs/user_guide/model_configuration.md#maximum-batch-size) | The max_batch_size property indicates the maximum batch size that the model supports for the [types of batching](https://github.com/triton-inference-server/server/blob/master/docs/user_guide/architecture.md#models-and-schedulers) that can be exploited by Triton. |
| [`instance_group`](https://github.com/triton-inference-server/server/blob/master/docs/user_guide/model_configuration.md#instance-groups) | Triton can provide multiple instances of a model so that multiple inference requests for that model can be handled simultaneously. The model configuration ModelInstanceGroup property is used to specify the number of execution instances that should be made available and what compute resource should be used for those instances. |
An example `<model-config-parameters>` block looks like this:
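As a sketch of the parameters described in the table above (the specific sweep values here are hypothetical, not taken from this document):

```yaml
# Hypothetical sweep values for the parameters described above;
# each list entry is one candidate value for the sweep.
model_config_parameters:
  max_batch_size: [4, 8]
  dynamic_batching:
    max_queue_delay_microseconds: [100, 200]
  instance_group:
    - kind: KIND_GPU
      count: [1, 2]
```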
perf_analyzer_flags:
#### Model-specific options for Perf Analyzer
In order to set flags only for a specific model, you can specify
the flags in the following way:
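A minimal sketch of per-model Perf Analyzer flags (the model name `model_1` and the `percentile` value are assumptions for illustration):

```yaml
# Hypothetical: flags under a specific model apply only to that model.
profile_models:
  model_1:
    perf_analyzer_flags:
      percentile: 95
```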
then the `shape` option of the `perf_analyzer_flags` option must be specified.
**docs/config_search.md** (+29 −5)
limitations under the License.
# Model Config Search
Model Analyzer's `profile` subcommand supports multiple modes when searching to find the best model configuration.

- [Brute](config_search.md#brute-search-mode) is the default, and will do a brute-force sweep of the cross product of all possible configurations
- [Quick](config_search.md#quick-search-mode) will use heuristics to try to find the optimal configuration much quicker than brute, and can be enabled via `--run-config-search-mode quick`

_This mode is in **EARLY ACCESS** and is limited in scope:_

- [Multi-model](config_search.md#multi-model-search-mode) will profile multiple models to find the optimal configurations for all models while they are running concurrently. This feature is enabled via `--run-config-profile-models-concurrently-enable`
## Brute Search Mode
Model Analyzer's brute search mode will do a brute-force sweep of the cross product of all possible configurations. You can [Manually](config_search.md#manual-brute-search) provide `model_config_parameters` to tell Model Analyzer what to sweep over, or you can
let it [Automatically](config_search.md#automatic-brute-search) sweep through configurations expected to have the highest impact on performance for Triton models.
The config described below will only sweep through different values for
the configuration that is generated is loadable by Triton.
You can also specify `concurrency` ranges to sweep through. If unspecified, it will
automatically sweep concurrency for every model configuration (unless `--run-config-search-disable`
is set, in which case it will only use the concurrency value of 1).
An example Model Analyzer config that performs manual sweeping looks like this:
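As an illustrative sketch of a manual sweep (the model name, repository path, and sweep values are hypothetical; see [Configuring Model Analyzer](config.md) for the authoritative schema):

```yaml
# Hypothetical manual sweep: disable automatic search and
# enumerate the model config values and concurrencies to try.
model_repository: /path/to/model/repository
run_config_search_disable: true

profile_models:
  model_1:
    model_config_parameters:
      instance_group:
        - kind: KIND_GPU
          count: [1, 2]
      dynamic_batching:
        max_queue_delay_microseconds: [100]
    parameters:
      concurrency:
        start: 2
        stop: 10
        step: 2
```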
the maximal objective value within the specified constraints. In the majority of cases
this will find greater than 95% of the maximum objective value (that could be found using a brute force search), while needing to search less than 10% of the configuration space.
After it has found the best config(s), it will then sweep the top-N configurations found (specified by `--num-configs-per-model`) over the default concurrency range before generation of the summary reports.

## Multi-Model Search Mode

_This mode is in EARLY ACCESS and has the following limitations:_

- Can only be run in `quick` search mode
- Cannot set limitations on min/max batch size, concurrency or instance count
- Does not support individual model constraints, only global constraints
- Does not support individual model weighting; all models are treated with equal priority when trying to maximize objective value
- Does not support detailed reporting, only summary reports

Multi-model concurrent search mode can be enabled by adding the parameter `--run-config-profile-models-concurrently-enable` to the CLI.

It uses Quick Search mode's hill climbing algorithm to search all models' configuration spaces in parallel, looking for the maximal objective value within the specified constraints. Model Analyzer has observed positive outcomes toward finding the maximum objective value, with runtimes of around 20-30 minutes (compared to the days it would take a brute force run to complete).

After it has found the best config(s), it will then sweep the top-N configurations found (specified by `--num-configs-per-model`) over the default concurrency range before generating the summary reports.

_Note:_ The algorithm attempts to find the most fair and optimal result for all models by evaluating each model objective's gain/loss. In many cases it will rank a configuration with a lower total combined throughput higher (if throughput was the objective) when that configuration better balances the throughputs of all the models.
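A sketch of enabling the mode from the CLI (the model names and repository path are hypothetical; the search-mode and concurrent-profiling flags are the ones described above):

```shell
# Hypothetical invocation: profile two models concurrently
# using the quick search algorithm.
model-analyzer profile \
    --model-repository /path/to/model/repository \
    --profile-models model_a,model_b \
    --run-config-search-mode quick \
    --run-config-profile-models-concurrently-enable
```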