2 changes: 1 addition & 1 deletion README.md
@@ -38,7 +38,7 @@ Triton Inference Server is an open source inference serving software that
streamlines AI inferencing. Triton enables teams to deploy any AI model from
multiple deep learning and machine learning frameworks, including TensorRT,
TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton
-Inference Server supports inference across cloud, data center, edge and embedded
+Inference Server supports inference across cloud, data center, edge, and embedded
devices on NVIDIA GPUs, x86 and ARM CPU, or AWS Inferentia. Triton Inference
Server delivers optimized performance for many query types, including real time,
batched, ensembles and audio/video streaming. Triton inference Server is part of
8 changes: 4 additions & 4 deletions deploy/alibaba-cloud/README.md
@@ -152,15 +152,15 @@ You will get the following result by running the python script:
[10] Avg rt(ms): 34.27
```
# Additional Resources
-See the following resources to learn more about how to use Alibaba Cloud's OSS orEAS.
+See the following resources to learn more about how to use Alibaba Cloud's OSS or EAS.
- [Alibaba Cloud OSS's Document](https://help.aliyun.com/product/31815.html?spm=a2c4g.11186623.6.540.3c0f62e7q3jw8b)


# Known Issues
- [Binary Tensor Data Extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md) is not fully supported yet, users want to use service with binary extension supported, it is only available in cn-shanghai region of PAI-EAS.
-- Currently only HTTP/1 is supported, hence gRPC cannot be used when query Triton servers on EAS. HTP/2 will be officially supported in a short time.
-- Users should not mount a whole OSS bucket when launching Triton processor, but an arbitrarily deep sub-directory in bucket. Otherwise the mounted path will no be as expected.
-- Not all of Triton Server parameters are be supported on EAS, the following params are supported on EAS:
+- Currently only HTTP/1 is supported, hence gRPC cannot be used when query Triton servers on EAS. HTTP/2 will be officially supported in a short time.
+- Users should not mount a whole OSS bucket when launching Triton processor, but an arbitrarily deep sub-directory in bucket. Otherwise the mounted path will not be as expected.
+- Not all of Triton Server parameters are supported on EAS, the following params are supported on EAS:
```
model-repository
log-verbose
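For context on the supported-parameter list above (the diff truncates it after the first two entries), here is a minimal sketch of a Triton launch that uses only those two visible parameters; the model repository path is hypothetical.

```
# Minimal sketch: only the two parameter names visible in the truncated list
# above are used here; adjust the repository path to your environment.
tritonserver --model-repository=/models --log-verbose=1
```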
8 changes: 4 additions & 4 deletions deploy/aws/README.md
@@ -39,7 +39,7 @@ This guide assumes you already have a functional Kubernetes cluster
and helm installed (see below for instructions on installing
helm). Note the following requirements:

-* The helm chart deploys Prometheus and Grafana to collect and display Triton metrics. To use this helm chart you must install Prpmetheus and Grafana in your cluster as described below and your cluster must contain sufficient CPU resources to support these services.
+* The helm chart deploys Prometheus and Grafana to collect and display Triton metrics. To use this helm chart you must install Prometheus and Grafana in your cluster as described below and your cluster must contain sufficient CPU resources to support these services.

* If you want Triton Server to use GPUs for inferencing, your cluster
must be configured to contain the desired number of GPU nodes (EC2 G4 instances recommended)
@@ -113,7 +113,7 @@ To load the model from the AWS S3, you need to convert the following AWS credent
echo -n 'REGION' | base64
```
```
-echo -n 'SECRECT_KEY_ID' | base64
+echo -n 'SECRET_KEY_ID' | base64
```
```
echo -n 'SECRET_ACCESS_KEY' | base64
@@ -189,7 +189,7 @@ $ cat << EOF > config.yaml
namespace: MyCustomNamespace
image:
imageName: nvcr.io/nvidia/tritonserver:custom-tag
-modelRepositoryPath: gs://my_model_repository
+modelRepositoryPath: s3://my_model_repository
EOF
$ helm install example -f config.yaml .
```
@@ -258,5 +258,5 @@ You may also want to delete the AWS bucket you created to hold the
model repository.

```
-$ aws s3 rm -r gs://triton-inference-server-repository
+$ aws s3 rm -r s3://triton-inference-server-repository
```
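As a companion to the corrected `s3://` paths above, a short sketch of how such a bucket and model repository might be created with the AWS CLI; the bucket name follows the one used in the hunk, and the local directory name is hypothetical.

```
# Sketch only: create the bucket referenced above and copy a local model
# repository into it; adjust region and paths to your environment.
aws s3 mb s3://triton-inference-server-repository
aws s3 cp ./model_repository s3://triton-inference-server-repository/model_repository --recursive
```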
2 changes: 1 addition & 1 deletion deploy/fleetcommand/README.md
@@ -77,7 +77,7 @@ section when creating the Fleet Command Deployment.

```
echo -n 'REGION' | base64
-echo -n 'SECRECT_KEY_ID' | base64
+echo -n 'SECRET_KEY_ID' | base64
echo -n 'SECRET_ACCESS_KEY' | base64
# Optional for using session token
echo -n 'AWS_SESSION_TOKEN' | base64
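When temporary credentials are in play, the optional AWS_SESSION_TOKEN shown above typically comes from STS. A hedged sketch of obtaining such credentials before base64-encoding them:

```
# Sketch only: returns AccessKeyId, SecretAccessKey, and SessionToken values,
# which can then be base64-encoded as shown in the hunk above.
aws sts get-session-token --duration-seconds 3600
```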
2 changes: 1 addition & 1 deletion deploy/gcp/README.md
@@ -118,7 +118,7 @@ can access the model repository. If the bucket is public then no
additional changes are needed and you can proceed to "Deploy
Prometheus and Grafana" section.

-If bucket premissions need to be set with the
+If bucket permissions need to be set with the
GOOGLE_APPLICATION_CREDENTIALS environment variable then perform the
following steps:

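A brief sketch of the GOOGLE_APPLICATION_CREDENTIALS setup the corrected sentence refers to, assuming a downloaded service-account key file (path hypothetical); the secret and namespace wiring is covered by the steps that follow in the guide.

```
# Sketch only: point GOOGLE_APPLICATION_CREDENTIALS at a service-account key
# that can read the model repository bucket, then activate it for gcloud.
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
gcloud auth activate-service-account --key-file="$GOOGLE_APPLICATION_CREDENTIALS"
```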
2 changes: 1 addition & 1 deletion deploy/gke-marketplace-app/README.md
@@ -151,7 +151,7 @@ Where <xx.yy> is the version of NGC Triton container needed.

![GKE Marketplace Application UI](ui.png)

-We want to discuss HPA autoscaling metrics users can leverage. GPU Power(Percentage of Power) tends to be a reliable metric, especially for larger GPU like V100 and A100. GKE currently natively support GPU duty cycle which is GPU utilization in `nvidia-smi`. We ask users always profile their model to determine the autoscaling target and metrics. When attempting to select the right metrics for autoscaling, the goal should be to pick metrics based on the following: 1, meet SLA rrequirement. 2, give consideration to transient request load, 3, keep GPU as fully utilized as possible. Profiling comes in 2 aspects: If user decided to use Duty Cycle or other GPU metric, it is recommend establish baseline to link SLA requirement such as latency with GPU metrics, for example, for model A, latency will be below 10ms 99% of time when Duty Cycle is below 80% utilized. Additionally, profiling also provide insight to model optimization for inference, with tools like [Nsight](https://developer.nvidia.com/nsight-systems).
+We want to discuss HPA autoscaling metrics users can leverage. GPU Power(Percentage of Power) tends to be a reliable metric, especially for larger GPU like V100 and A100. GKE currently natively support GPU duty cycle which is GPU utilization in `nvidia-smi`. We ask users always profile their model to determine the autoscaling target and metrics. When attempting to select the right metrics for autoscaling, the goal should be to pick metrics based on the following: 1, meet SLA requirement. 2, give consideration to transient request load, 3, keep GPU as fully utilized as possible. Profiling comes in 2 aspects: If user decided to use Duty Cycle or other GPU metric, it is recommend establish baseline to link SLA requirement such as latency with GPU metrics, for example, for model A, latency will be below 10ms 99% of time when Duty Cycle is below 80% utilized. Additionally, profiling also provide insight to model optimization for inference, with tools like [Nsight](https://developer.nvidia.com/nsight-systems).

Once the application is deployed successfully, get the public ip from ingress:
```
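Since the paragraph above recommends profiling to tie SLA latency to a GPU metric before choosing an autoscaling target, a hedged sketch of such a profiling run with Triton's perf_analyzer; the model name and concurrency range are hypothetical.

```
# Sketch only: sweep request concurrency and report p99 latency so it can be
# correlated with the GPU duty-cycle / power readings observed during the run.
perf_analyzer -m model_a --concurrency-range 1:8 --percentile 99
```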
2 changes: 1 addition & 1 deletion deploy/mlflow-triton-plugin/README.md
@@ -171,7 +171,7 @@ To delete a MLflow deployment using CLI
mlflow deployments delete -t triton --name model_name
```

-To delete a MLflow deployment using CLI
+To delete a MLflow deployment using Python API

```
from mlflow.deployments import get_deploy_client
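For context around the corrected heading, a sketch of the CLI lifecycle the delete command above belongs to; the model URI and name are hypothetical and depend on how the model was registered in MLflow.

```
# Sketch only: create, list, and delete a Triton deployment via the MLflow CLI.
mlflow deployments create -t triton --flavor triton --name model_name -m models:/model_name/1
mlflow deployments list -t triton
mlflow deployments delete -t triton --name model_name
```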
2 changes: 1 addition & 1 deletion deploy/oci/README.md
@@ -132,7 +132,7 @@ To load the model from the OCI Object Storage Bucket, you need to convert the fo
echo -n 'REGION' | base64
```
```
-echo -n 'SECRECT_KEY_ID' | base64
+echo -n 'SECRET_KEY_ID' | base64
```
```
echo -n 'SECRET_ACCESS_KEY' | base64
4 changes: 2 additions & 2 deletions docs/README.md
@@ -64,7 +64,7 @@ Triton Inference Server has a considerable list versatile and powerful features.
The User Guide describes how to configure Triton, organize and configure your models, use the C++ and Python clients, etc. This guide includes the following:
* Creating a Model Repository [[Overview](README.md#model-repository) || [Details](user_guide/model_repository.md)]
* Writing a Model Configuration [[Overview](README.md#model-configuration) || [Details](user_guide/model_configuration.md)]
-* Buillding a Model Pipeline [[Overview](README.md#model-pipeline)]
+* Building a Model Pipeline [[Overview](README.md#model-pipeline)]
* Managing Model Availability [[Overview](README.md#model-management) || [Details](user_guide/model_management.md)]
* Collecting Server Metrics [[Overview](README.md#metrics) || [Details](user_guide/metrics.md)]
* Supporting Custom Ops/layers [[Overview](README.md#framework-custom-operations) || [Details](user_guide/custom_operations.md)]
@@ -169,7 +169,7 @@ Use the [Triton Client](https://github.com/triton-inference-server/client) API t
### Cancelling Inference Requests
Triton can detect and handle requests that have been cancelled from the client-side. This [document](user_guide/request_cancellation.md) discusses scope and limitations of the feature.
### Performance Analysis
-Understanding Inference performance is key to better resource utilization. Use Triton's Tools to costomize your deployment.
+Understanding Inference performance is key to better resource utilization. Use Triton's Tools to customize your deployment.
- [Performance Tuning Guide](user_guide/performance_tuning.md)
- [Optimization](user_guide/optimization.md)
- [Model Analyzer](user_guide/model_analyzer.md)
2 changes: 1 addition & 1 deletion docs/getting_started/llm.md
@@ -28,7 +28,7 @@

# Deploying Phi-3 Model with Triton and TRT-LLM

-This guide captures the steps to build Phi-3 with TRT-LLM and deploy with Triton Inference Server. It also shows a shows how to use GenAI-Perf to run benchmarks to measure model performance in terms of throughput and latency.
+This guide captures the steps to build Phi-3 with TRT-LLM and deploy with Triton Inference Server. It also shows how to use GenAI-Perf to run benchmarks to measure model performance in terms of throughput and latency.

This guide is tested on A100 80GB SXM4 and H100 80GB PCIe. It is confirmed to work with Phi-3-mini-128k-instruct and Phi-3-mini-4k-instruct (see [Support Matrix](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/phi) for full list) using TRT-LLM v0.11 and Triton Inference Server 24.07.

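Since the corrected sentence mentions GenAI-Perf benchmarking, a heavily hedged sketch of such a run follows; flag names vary between GenAI-Perf releases and the model name "ensemble" is only the conventional TRT-LLM top-level model, so treat all of these as assumptions and confirm with `genai-perf profile --help`.

```
# Sketch only (flags are assumptions): measure throughput and latency of the
# deployed TRT-LLM model served by Triton.
genai-perf profile -m ensemble --service-kind triton --backend tensorrtllm --streaming
```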
2 changes: 1 addition & 1 deletion docs/getting_started/quickstart.md
@@ -28,7 +28,7 @@

# Quickstart

-**New to Triton Inference Server and want do just deploy your model quickly?**
+**New to Triton Inference Server and want to just deploy your model quickly?**
Make use of
[these tutorials](https://github.com/triton-inference-server/tutorials#quick-deploy)
to begin your Triton journey!