
Commit 9818033

Merge branch 'main' of github.com:triton-inference-server/server into yinggeh-DLIS-7061-add-vllm-metrics

2 parents e3b8df0 + f284101

25 files changed (+317 -122 lines)

Dockerfile.sdk

Lines changed: 2 additions & 0 deletions
@@ -33,6 +33,7 @@ ARG BASE_IMAGE=nvcr.io/nvidia/tritonserver:24.07-py3-min
 
 ARG TRITON_CLIENT_REPO_SUBDIR=clientrepo
 ARG TRITON_PA_REPO_SUBDIR=perfanalyzerrepo
+ARG TRITON_REPO_ORGANIZATION=http://github.com/triton-inference-server
 ARG TRITON_COMMON_REPO_TAG=main
 ARG TRITON_CORE_REPO_TAG=main
 ARG TRITON_CLIENT_REPO_TAG=main

@@ -217,6 +218,7 @@ WORKDIR /workspace
 COPY TRITON_VERSION .
 COPY NVIDIA_Deep_Learning_Container_License.pdf .
 COPY --from=sdk_build /workspace/client/ client/
+COPY --from=sdk_build /workspace/perf_analyzer/ perf_analyzer/
 COPY --from=sdk_build /workspace/install/ install/
 RUN cd install && \
     export VERSION=`cat /workspace/TRITON_VERSION` && \
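
As a hedged aside (not part of this commit), the new and existing SDK build arguments can be overridden at image-build time with standard `--build-arg` flags; the fork URL below is a placeholder, and the `docker build` invocation mirrors the one shown later in docs/customization_guide/test.md:

```
# Illustrative only: override the SDK build arguments when building the image.
# The organization URL is a placeholder for a fork; perfanalyzerrepo matches
# the default subdirectory declared above.
docker build -t tritonserver_sdk -f Dockerfile.sdk \
    --build-arg TRITON_REPO_ORGANIZATION=https://github.com/my-fork-org \
    --build-arg TRITON_PA_REPO_SUBDIR=perfanalyzerrepo \
    .
```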

README.md

Lines changed: 12 additions & 1 deletion
@@ -28,6 +28,17 @@
 
 # Triton Inference Server
 
+📣 **vLLM x Triton Meetup at Fort Mason on Sept 9th 4:00 - 9:00 pm**
+
+We are excited to announce that we will be hosting our Triton user meetup with the vLLM team at
+[Fort Mason](https://maps.app.goo.gl/9Lr3fxRssrpQCGK58) on Sept 9th 4:00 - 9:00 pm. Join us for this
+exclusive event where you will learn about the newest vLLM and Triton features, get a
+glimpse into the roadmaps, and connect with fellow users, the NVIDIA Triton and vLLM teams. Seating is limited and registration confirmation
+is required to attend - please register [here](https://lu.ma/87q3nvnh) to join
+the meetup.
+
+___
+
 [![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause)
 
 [!WARNING]

@@ -179,7 +190,7 @@ configuration](docs/user_guide/model_configuration.md) for the model.
 [Backend-Platform Support Matrix](https://github.com/triton-inference-server/backend/blob/main/docs/backend_platform_support_matrix.md)
 to learn which backends are supported on your target platform.
 - Learn how to [optimize performance](docs/user_guide/optimization.md) using the
-[Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md)
+[Performance Analyzer](https://github.com/triton-inference-server/perf_analyzer/blob/main/README.md)
 and
 [Model Analyzer](https://github.com/triton-inference-server/model_analyzer)
 - Learn how to [manage loading and unloading models](docs/user_guide/model_management.md) in

build.py

Lines changed: 4 additions & 0 deletions
@@ -1647,6 +1647,10 @@ def core_build(
             os.path.join(repo_install_dir, "bin", "tritonserver.dll"),
             os.path.join(install_dir, "bin"),
         )
+        cmake_script.cp(
+            os.path.join(repo_install_dir, "lib", "tritonserver.lib"),
+            os.path.join(install_dir, "bin"),
+        )
     else:
         cmake_script.mkdir(os.path.join(install_dir, "bin"))
         cmake_script.cp(

deploy/gke-marketplace-app/README.md

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 <!--
-# Copyright (c) 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2021-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions

@@ -172,7 +172,7 @@ The client example push about ~650 QPS(Query per second) to Triton Server, and w
 ![Locust Client Chart](client.png)
 
 Alternatively, user can opt to use
-[Perf Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md)
+[Perf Analyzer](https://github.com/triton-inference-server/perf_analyzer/blob/main/README.md)
 to profile and study the performance of Triton Inference Server. Here we also
 provide a
 [client script](https://github.com/triton-inference-server/server/tree/master/deploy/gke-marketplace-app/client-sample/perf_analyzer_grpc.sh)

deploy/k8s-onprem/README.md

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 <!--
-# Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2018-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions

@@ -295,7 +295,7 @@ Image 'images/mug.jpg':
 After you have confirmed that your Triton cluster is operational and can perform inference,
 you can test the load balancing and autoscaling features by sending a heavy load of requests.
 One option for doing this is using the
-[perf_analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md)
+[perf_analyzer](https://github.com/triton-inference-server/perf_analyzer/blob/main/README.md)
 application.
 
 You can apply a progressively increasing load with a command like:

docs/README.md

Lines changed: 3 additions & 3 deletions
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2018-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions

@@ -173,7 +173,7 @@ Understanding Inference performance is key to better resource utilization. Use T
 - [Performance Tuning Guide](user_guide/performance_tuning.md)
 - [Optimization](user_guide/optimization.md)
 - [Model Analyzer](user_guide/model_analyzer.md)
-- [Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md)
+- [Performance Analyzer](https://github.com/triton-inference-server/perf_analyzer/blob/main/README.md)
 - [Inference Request Tracing](user_guide/trace.md)
 ### Jetson and JetPack
 Triton can be deployed on edge devices. Explore [resources](user_guide/jetson.md) and [examples](examples/jetson/README.md).

@@ -185,7 +185,7 @@ The following resources are recommended to explore the full suite of Triton Infe
 
 - **Configuring Deployment**: Triton comes with three tools which can be used to configure deployment setting, measure performance and recommend optimizations.
 - [Model Analyzer](https://github.com/triton-inference-server/model_analyzer) Model Analyzer is CLI tool built to recommend deployment configurations for Triton Inference Server based on user's Quality of Service Requirements. It also generates detailed reports about model performance to summarize the benefits and trade offs of different configurations.
-- [Perf Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md):
+- [Perf Analyzer](https://github.com/triton-inference-server/perf_analyzer/blob/main/README.md):
 Perf Analyzer is a CLI application built to generate inference requests and
 measures the latency of those requests and throughput of the model being
 served.

docs/contents.md

Lines changed: 19 additions & 12 deletions
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions

@@ -119,17 +119,24 @@ client/src/grpc_generated/java/README
 :maxdepth: 1
 :caption: Performance Analyzer
 
-client/src/c++/perf_analyzer/README
-client/src/c++/perf_analyzer/docs/README
-client/src/c++/perf_analyzer/docs/install
-client/src/c++/perf_analyzer/docs/quick_start
-client/src/c++/perf_analyzer/docs/cli
-client/src/c++/perf_analyzer/docs/inference_load_modes
-client/src/c++/perf_analyzer/docs/input_data
-client/src/c++/perf_analyzer/docs/measurements_metrics
-client/src/c++/perf_analyzer/docs/benchmarking
-client/src/c++/perf_analyzer/genai-perf/README
-client/src/c++/perf_analyzer/genai-perf/examples/tutorial
+perf_analyzer/README
+perf_analyzer/docs/README
+perf_analyzer/docs/install
+perf_analyzer/docs/quick_start
+perf_analyzer/docs/cli
+perf_analyzer/docs/inference_load_modes
+perf_analyzer/docs/input_data
+perf_analyzer/docs/measurements_metrics
+perf_analyzer/docs/benchmarking
+perf_analyzer/genai-perf/README
+perf_analyzer/genai-perf/docs/compare
+perf_analyzer/genai-perf/docs/embeddings
+perf_analyzer/genai-perf/docs/files
+perf_analyzer/genai-perf/docs/lora
+perf_analyzer/genai-perf/docs/multi_modal
+perf_analyzer/genai-perf/docs/rankings
+perf_analyzer/genai-perf/docs/tutorial
+perf_analyzer/genai-perf/examples/tutorial
 ```
 
 ```{toctree}

docs/customization_guide/build.md

Lines changed: 6 additions & 6 deletions
@@ -331,13 +331,13 @@ invocation builds all features and backends available on windows.
 python build.py --cmake-dir=<path/to/repo>/build --build-dir=/tmp/citritonbuild --no-container-pull --image=base,win10-py3-min --enable-logging --enable-stats --enable-tracing --enable-gpu --endpoint=grpc --endpoint=http --repo-tag=common:<container tag> --repo-tag=core:<container tag> --repo-tag=backend:<container tag> --repo-tag=thirdparty:<container tag> --backend=ensemble --backend=tensorrt:<container tag> --backend=onnxruntime:<container tag> --backend=openvino:<container tag>
 ```
 
-If you are building on *main* branch then '<container tag>' will
+If you are building on *main* branch then `<container tag>` will
 default to "main". If you are building on a release branch then
-'<container tag>' will default to the branch name. For example, if you
-are building on the r24.07 branch, '<container tag>' will default to
-r24.07. Therefore, you typically do not need to provide '<container
-tag>' at all (nor the preceding colon). You can use a different
-'<container tag>' for a component to instead use the corresponding
+`<container tag>` will default to the branch name. For example, if you
+are building on the r24.07 branch, `<container tag>` will default to
+r24.07. Therefore, you typically do not need to provide `<container
+tag>` at all (nor the preceding colon). You can use a different
+`<container tag>` for a component to instead use the corresponding
 branch/tag in the build. For example, if you have a branch called
 "mybranch" in the
 [onnxruntime_backend](https://github.com/triton-inference-server/onnxruntime_backend)
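
As a minimal sketch of the per-component tag behavior described above (assuming the onnxruntime example the paragraph introduces), a release-branch build could pin just that backend to a different branch while every other component keeps the branch default; the flags are copied from the full invocation shown earlier and trimmed for brevity:

```
# Sketch only: on the r24.07 branch, common/core/backend/thirdparty default to
# r24.07, while the onnxruntime backend is built from the hypothetical
# "mybranch" branch of the onnxruntime_backend repo.
python build.py --cmake-dir=<path/to/repo>/build --build-dir=/tmp/citritonbuild \
    --no-container-pull --enable-logging --enable-stats --enable-gpu \
    --endpoint=grpc --endpoint=http \
    --backend=ensemble --backend=onnxruntime:mybranch
```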

docs/customization_guide/test.md

Lines changed: 8 additions & 5 deletions
@@ -48,7 +48,7 @@ $ ./gen_qa_model_repository
 $ ./gen_qa_custom_ops
 ```
 
-This will create multiple model repositories in /tmp/<version>/qa_*
+This will create multiple model repositories in /tmp/\<version\>/qa_*
 (for example /tmp/24.07/qa_model_repository). The TensorRT models
 will be created for the GPU on the system that CUDA considers device 0
 (zero). If you have multiple GPUs on your system see the documentation

@@ -57,14 +57,17 @@ in the scripts for how to target a specific GPU.
 ## Build SDK Image
 
 Build the *tritonserver_sdk* image that contains the client
-libraries, model analyzer, and examples using the following
-commands. You must first checkout the <client branch> branch of the
-*client* repo into the clientrepo/ subdirectory. Typically you want to
-set <client branch> to be the same as your current server branch.
+libraries, model analyzer, perf analyzer and examples using the following
+commands. You must first checkout the `<client branch>` branch of the
+*client* repo into the clientrepo/ subdirectory and the `<perf analyzer branch>`
+branch of the *perf_analyzer* repo into the perfanalyzerrepo/ subdirectory
+respectively. Typically you want to set both `<client branch>` and `<perf analyzer branch>`
+to be the same as your current server branch.
 
 ```
 $ cd <server repo root>
 $ git clone --single-branch --depth=1 -b <client branch> https://github.com/triton-inference-server/client.git clientrepo
+$ git clone --single-branch --depth=1 -b <perf analyzer branch> https://github.com/triton-inference-server/perf_analyzer.git perfanalyzerrepo
 $ docker build -t tritonserver_sdk -f Dockerfile.sdk .
 ```
 
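As a concrete, purely illustrative instance of the placeholders above, a build from the *main* server branch would clone the *main* branch of both repos:

```
# Sketch assuming a main-branch server checkout; substitute your release
# branch (e.g. r24.07) for "main" on both clones if you are on one.
$ cd <server repo root>
$ git clone --single-branch --depth=1 -b main https://github.com/triton-inference-server/client.git clientrepo
$ git clone --single-branch --depth=1 -b main https://github.com/triton-inference-server/perf_analyzer.git perfanalyzerrepo
$ docker build -t tritonserver_sdk -f Dockerfile.sdk .
```
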
docs/examples/jetson/README.md

Lines changed: 3 additions & 3 deletions
@@ -1,5 +1,5 @@
 <!--
-# Copyright (c) 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2021-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions

@@ -53,7 +53,7 @@ Inference Server as a shared library.
 ## Part 2. Analyzing model performance with perf_analyzer
 
 To analyze model performance on Jetson,
-[perf_analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md)
+[perf_analyzer](https://github.com/triton-inference-server/perf_analyzer/blob/main/README.md)
 tool is used. The `perf_analyzer` is included in the release tar file or can be
 compiled from source.
 

@@ -65,4 +65,4 @@ From this directory of the repository, execute the following to evaluate model p
 
 In the example above we saved the results as a `.csv` file. To visualize these
 results, follow the steps described
-[here](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md).
+[here](https://github.com/triton-inference-server/perf_analyzer/blob/main/README.md).
