-->

[![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause)

> [!WARNING]
> You are currently on the `r25.07` branch, which tracks under-development progress towards the next release.

# Triton Inference Server

Triton Inference Server is open-source inference serving software that
streamlines AI inferencing. Triton enables teams to deploy any AI model from
multiple deep learning and machine learning frameworks, including TensorRT,
TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton
Inference Server supports inference across cloud, data center, edge, and embedded
devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia. Triton Inference
Server delivers optimized performance for many query types, including real-time,
batched, ensembles, and audio/video streaming. Triton Inference Server is part of
[NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/),
a software platform that accelerates the data science pipeline and streamlines
the development and deployment of production AI.

Major features include:

- [Supports multiple deep learning
  frameworks](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton)
- [Supports multiple machine learning
  frameworks](https://github.com/triton-inference-server/fil_backend)
- [Concurrent model
  execution](docs/user_guide/architecture.md#concurrent-model-execution)
- [Dynamic batching](docs/user_guide/model_configuration.md#dynamic-batcher)
- [Sequence batching](docs/user_guide/model_configuration.md#sequence-batcher) and
  [implicit state management](docs/user_guide/architecture.md#implicit-state-management)
  for stateful models
- A [Backend API](https://github.com/triton-inference-server/backend) that
  allows adding custom backends and pre/post-processing operations
- Support for writing custom backends in Python, a.k.a.
  [Python-based backends](https://github.com/triton-inference-server/backend/blob/r25.06/docs/python_based_backends.md#python-based-backends)
- Model pipelines using
  [Ensembling](docs/user_guide/architecture.md#ensemble-models) or [Business
  Logic Scripting
  (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
- [HTTP/REST and gRPC inference
  protocols](docs/customization_guide/inference_protocols.md) based on the
  community-developed [KServe
  protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2)
- A [C API](docs/customization_guide/inference_protocols.md#in-process-triton-server-api) and
  [Java API](docs/customization_guide/inference_protocols.md#java-bindings-for-in-process-triton-server-api)
  that allow Triton to be linked directly into your application for edge and other in-process use cases
- [Metrics](docs/user_guide/metrics.md) indicating GPU utilization, server
  throughput, server latency, and more

**New to Triton Inference Server?** Make use of
[these tutorials](https://github.com/triton-inference-server/tutorials)
to begin your Triton journey!

Join the [Triton and TensorRT community](https://www.nvidia.com/en-us/deep-learning-ai/triton-tensorrt-newsletter/) and
stay current on the latest product updates, bug fixes, content, best practices,
and more. Need enterprise support? NVIDIA global support is available for Triton
Inference Server with the
[NVIDIA AI Enterprise software suite](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/).

## Serve a Model in 3 Easy Steps

```bash
# Step 1: Create the example model repository
git clone -b r25.06 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh

# Step 2: Launch Triton from the NGC Triton container
docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:25.06-py3 tritonserver --model-repository=/models --model-control-mode explicit --load-model densenet_onnx

# Step 3: Send an inference request
# In a separate console, launch the image_client example from the NGC Triton SDK container
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:25.06-py3-sdk /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

# Inference should return the following
Image '/workspace/images/mug.jpg':
    15.346230 (504) = COFFEE MUG
    13.224326 (968) = CUP
    10.422965 (505) = COFFEEPOT
```
Please read the [QuickStart](docs/getting_started/quickstart.md) guide for additional information
regarding this example. The QuickStart guide also contains an example of how to launch Triton on [CPU-only systems](docs/getting_started/quickstart.md#run-on-cpu-only-system). New to Triton and wondering where to get started? Watch the [Getting Started video](https://youtu.be/NQDtfSi5QF4).
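
The same server container also runs without a GPU. As a minimal sketch based on the
command above, with explicit port mappings used in place of `--net=host`, Triton can be
launched on a CPU-only machine like this:

```bash
# Launch Triton on a CPU-only system: same image, no --gpus flag.
# The example densenet_onnx model runs on CPU through ONNX Runtime.
# Ports: 8000 = HTTP, 8001 = gRPC, 8002 = metrics.
docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v ${PWD}/model_repository:/models \
  nvcr.io/nvidia/tritonserver:25.06-py3 \
  tritonserver --model-repository=/models
```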

## Examples and Tutorials

Check out [NVIDIA LaunchPad](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/trial/)
for free access to a set of hands-on labs with Triton Inference Server hosted on
NVIDIA infrastructure.

Specific end-to-end examples for popular models, such as ResNet, BERT, and DLRM,
are located on the
[NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples)
page on GitHub. The
[NVIDIA Developer Zone](https://developer.nvidia.com/nvidia-triton-inference-server)
contains additional documentation, presentations, and examples.

## Documentation

### Build and Deploy

The recommended way to build and use Triton Inference Server is with Docker
images.

- [Install Triton Inference Server with Docker containers](docs/customization_guide/build.md#building-with-docker) (*Recommended*)
- [Install Triton Inference Server without Docker containers](docs/customization_guide/build.md#building-without-docker)
- [Build a custom Triton Inference Server Docker container](docs/customization_guide/compose.md)
- [Build Triton Inference Server from source](docs/customization_guide/build.md#building-on-unsupported-platforms)
- [Build Triton Inference Server for Windows 10](docs/customization_guide/build.md#building-for-windows-10)
- Examples for deploying Triton Inference Server with Kubernetes and Helm on [GCP](deploy/gcp/README.md),
  [AWS](deploy/aws/README.md), and [NVIDIA FleetCommand](deploy/fleetcommand/README.md)
- [Secure Deployment Considerations](docs/customization_guide/deploy.md)
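
For the common case, no local build is needed: the pre-built server and SDK (client)
images referenced throughout this README can be pulled directly from NGC. A quick sketch:

```bash
# Pull the pre-built Triton images from NGC (tags match the r25.06 release used in this README).
docker pull nvcr.io/nvidia/tritonserver:25.06-py3      # Triton Inference Server
docker pull nvcr.io/nvidia/tritonserver:25.06-py3-sdk  # client libraries, examples, and perf_analyzer
```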

### Using Triton

#### Preparing Models for Triton Inference Server

The first step in using Triton to serve your models is to place one or
more models into a [model repository](docs/user_guide/model_repository.md). Depending on
the type of the model and on which Triton capabilities you want to enable for
the model, you may need to create a [model
configuration](docs/user_guide/model_configuration.md) for the model (a minimal
repository layout is sketched after the list below).

- [Add custom operations to Triton if needed by your model](docs/user_guide/custom_operations.md)
- Enable model pipelining with [Model Ensemble](docs/user_guide/architecture.md#ensemble-models)
  and [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
- Optimize your models by setting [scheduling and batching](docs/user_guide/architecture.md#models-and-schedulers)
  parameters and [model instances](docs/user_guide/model_configuration.md#instance-groups)
- Use the [Model Analyzer tool](https://github.com/triton-inference-server/model_analyzer)
  to help optimize your model configuration with profiling
- Learn how to [explicitly manage what models are available by loading and
  unloading models](docs/user_guide/model_management.md)
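
For reference, here is a minimal, hypothetical repository for a single ONNX model,
following the layout described in the model repository documentation. The model name,
file, and configuration values below are illustrative assumptions rather than part of
the example above, and many backends can auto-complete most of the configuration:

```bash
# A hypothetical model repository: <repo>/<model-name>/<version>/<model-file> plus config.pbtxt.
mkdir -p model_repository/my_model/1
cp my_model.onnx model_repository/my_model/1/model.onnx   # assumed local ONNX file
cat > model_repository/my_model/config.pbtxt <<'EOF'
name: "my_model"
backend: "onnxruntime"
max_batch_size: 8
EOF

# Point Triton at the repository when launching:
#   tritonserver --model-repository=${PWD}/model_repository
```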

#### Configure and Use Triton Inference Server

- Read the [Quick Start Guide](docs/getting_started/quickstart.md) to run Triton Inference
  Server on both GPU and CPU
- Triton supports multiple execution engines, called
  [backends](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton), including
  [TensorRT](https://github.com/triton-inference-server/tensorrt_backend),
  [TensorFlow](https://github.com/triton-inference-server/tensorflow_backend),
  [PyTorch](https://github.com/triton-inference-server/pytorch_backend),
  [ONNX](https://github.com/triton-inference-server/onnxruntime_backend),
  [OpenVINO](https://github.com/triton-inference-server/openvino_backend),
  [Python](https://github.com/triton-inference-server/python_backend), and more
- Not all of the above backends are supported on every platform supported by Triton.
  Look at the
  [Backend-Platform Support Matrix](https://github.com/triton-inference-server/backend/blob/r25.06/docs/backend_platform_support_matrix.md)
  to learn which backends are supported on your target platform.
- Learn how to [optimize performance](docs/user_guide/optimization.md) using the
  [Performance Analyzer](https://github.com/triton-inference-server/perf_analyzer/blob/r25.06/README.md)
  and
  [Model Analyzer](https://github.com/triton-inference-server/model_analyzer)
- Learn how to [manage loading and unloading models](docs/user_guide/model_management.md) in
  Triton
- Send requests directly to Triton with the [HTTP/REST JSON-based
  or gRPC protocols](docs/customization_guide/inference_protocols.md#httprest-and-grpc-protocols);
  a few example requests are sketched below
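
As a sketch of the HTTP/REST protocol against the server started in the quickstart above
(default ports: 8000 for HTTP, 8002 for metrics), the following requests exercise the
health, metadata, model repository, and metrics endpoints:

```bash
# Readiness check
curl -v localhost:8000/v2/health/ready

# Server and model metadata
curl localhost:8000/v2
curl localhost:8000/v2/models/densenet_onnx

# Model repository index and explicit load/unload
# (load/unload require the server to run with --model-control-mode explicit)
curl -X POST localhost:8000/v2/repository/index
curl -X POST localhost:8000/v2/repository/models/densenet_onnx/load
curl -X POST localhost:8000/v2/repository/models/densenet_onnx/unload

# Prometheus-format metrics
curl localhost:8002/metrics
```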

#### Client Support and Examples

A Triton *client* application sends inference and other requests to Triton. The
[Python and C++ client libraries](https://github.com/triton-inference-server/client)
provide APIs to simplify this communication.

- Review client examples for [C++](https://github.com/triton-inference-server/client/blob/r25.06/src/c%2B%2B/examples),
  [Python](https://github.com/triton-inference-server/client/blob/r25.06/src/python/examples),
  and [Java](https://github.com/triton-inference-server/client/blob/r25.06/src/java/src/main/java/triton/client/examples)
- Configure [HTTP](https://github.com/triton-inference-server/client#http-options)
  and [gRPC](https://github.com/triton-inference-server/client#grpc-options)
  client options
- Send input data (e.g., a JPEG image) directly to Triton in the [body of an HTTP
  request without any additional metadata](https://github.com/triton-inference-server/server/blob/r25.06/docs/protocol/extension_binary_data.md#raw-binary-request)
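
Outside the SDK container, the Python client libraries can be installed from PyPI; a
quick sketch (the `[all]` extra pulls in both the `tritonclient.http` and
`tritonclient.grpc` modules):

```bash
# Install the Triton Python client libraries (HTTP and gRPC) plus utilities.
pip install "tritonclient[all]"
```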

### Extend Triton

[Triton Inference Server's architecture](docs/user_guide/architecture.md) is specifically
designed for modularity and flexibility.

- [Customize the Triton Inference Server container](docs/customization_guide/compose.md) for your use case
  (see the sketch after this list)
- [Create custom backends](https://github.com/triton-inference-server/backend)
  in either [C/C++](https://github.com/triton-inference-server/backend/blob/r25.06/README.md#triton-backend-api)
  or [Python](https://github.com/triton-inference-server/python_backend)
- Create [decoupled backends and models](docs/user_guide/decoupled_models.md) that can send
  multiple responses for a request, or no response at all
- Use a [Triton repository agent](docs/customization_guide/repository_agents.md) to add functionality
  that operates when a model is loaded and unloaded, such as authentication,
  decryption, or conversion
- Deploy Triton on [Jetson and JetPack](docs/user_guide/jetson.md)
- [Use Triton on AWS
  Inferentia](https://github.com/triton-inference-server/python_backend/tree/r25.06/inferentia)
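
As a rough sketch of the container customization mentioned in the first item above,
assuming the `--backend` and `--repoagent` options described in
docs/customization_guide/compose.md, a slimmed-down image containing only the backends
you need might be composed like this:

```bash
# Compose a custom Triton image with only selected backends (flags are assumptions
# based on docs/customization_guide/compose.md; check that document for the full option list).
git clone -b r25.06 https://github.com/triton-inference-server/server.git
cd server
python3 compose.py --backend onnxruntime --backend python --repoagent checksum
```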

### Additional Documentation

- [FAQ](docs/user_guide/faq.md)
- [User Guide](docs/README.md#user-guide)
- [Customization Guide](docs/README.md#customization-guide)
- [Release Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
- [GPU, Driver, and CUDA Support
  Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)

## Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute, please review the [contribution
guidelines](CONTRIBUTING.md). If you have a backend, client,
example, or similar contribution that does not modify the core of
Triton, then you should file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).

## Reporting problems, asking questions

We appreciate any feedback, questions, or bug reports regarding this project.
When posting [issues in GitHub](https://github.com/triton-inference-server/server/issues),
follow the process outlined in the [Stack Overflow document](https://stackoverflow.com/help/mcve).
Ensure posted examples are:
- minimal – use as little code as possible that still produces the
  same problem
- complete – provide all parts needed to reproduce the problem. Check
  whether you can strip external dependencies and still show the problem. The
  less time we spend reproducing problems, the more time we have to
  fix them
- verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.

For issues, please use the provided bug report and feature request templates.

For questions, we recommend posting in our community
[GitHub Discussions](https://github.com/triton-inference-server/server/discussions).

## For more information

Please refer to the [NVIDIA Developer Triton page](https://developer.nvidia.com/nvidia-triton-inference-server)
for more information.