diff --git a/README.md b/README.md
index 48bd0baf1b..4a4ce9572b 100644
--- a/README.md
+++ b/README.md
@@ -38,7 +38,7 @@ Triton Inference Server is an open source inference serving software that
 streamlines AI inferencing. Triton enables teams to deploy any AI model from
 multiple deep learning and machine learning frameworks, including TensorRT,
 TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton
-Inference Server supports inference across cloud, data center, edge and embedded
+Inference Server supports inference across cloud, data center, edge, and embedded
 devices on NVIDIA GPUs, x86 and ARM CPU, or AWS Inferentia. Triton Inference
 Server delivers optimized performance for many query types, including real time,
 batched, ensembles and audio/video streaming. Triton inference Server is part of
diff --git a/deploy/alibaba-cloud/README.md b/deploy/alibaba-cloud/README.md
index 7b45551f5c..9d5395b813 100644
--- a/deploy/alibaba-cloud/README.md
+++ b/deploy/alibaba-cloud/README.md
@@ -152,15 +152,15 @@ You will get the following result by running the python script:
 [10] Avg rt(ms): 34.27
 ```

 # Additional Resources
-See the following resources to learn more about how to use Alibaba Cloud's OSS orEAS.
+See the following resources to learn more about how to use Alibaba Cloud's OSS or EAS.
 - [Alibaba Cloud OSS's Document](https://help.aliyun.com/product/31815.html?spm=a2c4g.11186623.6.540.3c0f62e7q3jw8b)

 # Known Issues
 - [Binary Tensor Data Extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md) is not fully supported yet, users want to use service with binary extension supported, it is only available in cn-shanghai region of PAI-EAS.
-- Currently only HTTP/1 is supported, hence gRPC cannot be used when query Triton servers on EAS. HTP/2 will be officially supported in a short time.
-- Users should not mount a whole OSS bucket when launching Triton processor, but an arbitrarily deep sub-directory in bucket. Otherwise the mounted path will no be as expected.
-- Not all of Triton Server parameters are be supported on EAS, the following params are supported on EAS:
+- Currently only HTTP/1 is supported, hence gRPC cannot be used when query Triton servers on EAS. HTTP/2 will be officially supported in a short time.
+- Users should not mount a whole OSS bucket when launching Triton processor, but an arbitrarily deep sub-directory in bucket. Otherwise the mounted path will not be as expected.
+- Not all of Triton Server parameters are supported on EAS, the following params are supported on EAS:
 ```
 model-repository
 log-verbose
diff --git a/deploy/aws/README.md b/deploy/aws/README.md
index af411df772..8430e87120 100644
--- a/deploy/aws/README.md
+++ b/deploy/aws/README.md
@@ -39,7 +39,7 @@ This guide assumes you already have a functional Kubernetes cluster and helm
 installed (see below for instructions on installing helm). Note the following
 requirements:

-* The helm chart deploys Prometheus and Grafana to collect and display Triton metrics. To use this helm chart you must install Prpmetheus and Grafana in your cluster as described below and your cluster must contain sufficient CPU resources to support these services.
+* The helm chart deploys Prometheus and Grafana to collect and display Triton metrics. To use this helm chart you must install Prometheus and Grafana in your cluster as described below and your cluster must contain sufficient CPU resources to support these services.

 * If you want Triton Server to use GPUs for inferencing, your cluster must be configured to contain the desired number of GPU nodes (EC2 G4 instances recommended)
@@ -113,7 +113,7 @@ To load the model from the AWS S3, you need to convert the following AWS credent
 echo -n 'REGION' | base64
 ```
 ```
-echo -n 'SECRECT_KEY_ID' | base64
+echo -n 'SECRET_KEY_ID' | base64
 ```
 ```
 echo -n 'SECRET_ACCESS_KEY' | base64
@@ -189,7 +189,7 @@ $ cat << EOF > config.yaml
 namespace: MyCustomNamespace
 image:
   imageName: nvcr.io/nvidia/tritonserver:custom-tag
-  modelRepositoryPath: gs://my_model_repository
+  modelRepositoryPath: s3://my_model_repository
 EOF
 $ helm install example -f config.yaml .
 ```
@@ -258,5 +258,5 @@ You may also want to delete the AWS bucket you created to hold the model
 repository.

 ```
-$ aws s3 rm -r gs://triton-inference-server-repository
+$ aws s3 rm -r s3://triton-inference-server-repository
 ```
diff --git a/deploy/fleetcommand/README.md b/deploy/fleetcommand/README.md
index 217162279c..a7199c143f 100644
--- a/deploy/fleetcommand/README.md
+++ b/deploy/fleetcommand/README.md
@@ -77,7 +77,7 @@ section when creating the Fleet Command Deployment.

 ```
 echo -n 'REGION' | base64
-echo -n 'SECRECT_KEY_ID' | base64
+echo -n 'SECRET_KEY_ID' | base64
 echo -n 'SECRET_ACCESS_KEY' | base64
 # Optional for using session token
 echo -n 'AWS_SESSION_TOKEN' | base64
diff --git a/deploy/gcp/README.md b/deploy/gcp/README.md
index 6e100df7d5..a9bf090917 100644
--- a/deploy/gcp/README.md
+++ b/deploy/gcp/README.md
@@ -118,7 +118,7 @@ can access the model repository. If the bucket is public then no additional
 changes are needed and you can proceed to "Deploy Prometheus and Grafana"
 section.

-If bucket premissions need to be set with the
+If bucket permissions need to be set with the
 GOOGLE_APPLICATION_CREDENTIALS environment variable then perform the
 following steps:
diff --git a/deploy/gke-marketplace-app/README.md b/deploy/gke-marketplace-app/README.md
index 595d4634ab..92740aa414 100644
--- a/deploy/gke-marketplace-app/README.md
+++ b/deploy/gke-marketplace-app/README.md
@@ -151,7 +151,7 @@ Where is the version of NGC Triton container needed.
 ![GKE Marketplace Application UI](ui.png)

-We want to discuss HPA autoscaling metrics users can leverage. GPU Power(Percentage of Power) tends to be a reliable metric, especially for larger GPU like V100 and A100. GKE currently natively support GPU duty cycle which is GPU utilization in `nvidia-smi`. We ask users always profile their model to determine the autoscaling target and metrics. When attempting to select the right metrics for autoscaling, the goal should be to pick metrics based on the following: 1, meet SLA rrequirement. 2, give consideration to transient request load, 3, keep GPU as fully utilized as possible. Profiling comes in 2 aspects: If user decided to use Duty Cycle or other GPU metric, it is recommend establish baseline to link SLA requirement such as latency with GPU metrics, for example, for model A, latency will be below 10ms 99% of time when Duty Cycle is below 80% utilized. Additionally, profiling also provide insight to model optimization for inference, with tools like [Nsight](https://developer.nvidia.com/nsight-systems).
+We want to discuss HPA autoscaling metrics users can leverage. GPU Power(Percentage of Power) tends to be a reliable metric, especially for larger GPU like V100 and A100. GKE currently natively support GPU duty cycle which is GPU utilization in `nvidia-smi`. We ask users always profile their model to determine the autoscaling target and metrics. When attempting to select the right metrics for autoscaling, the goal should be to pick metrics based on the following: 1, meet SLA requirement. 2, give consideration to transient request load, 3, keep GPU as fully utilized as possible. Profiling comes in 2 aspects: If user decided to use Duty Cycle or other GPU metric, it is recommend establish baseline to link SLA requirement such as latency with GPU metrics, for example, for model A, latency will be below 10ms 99% of time when Duty Cycle is below 80% utilized. Additionally, profiling also provide insight to model optimization for inference, with tools like [Nsight](https://developer.nvidia.com/nsight-systems).

 Once the application is deployed successfully, get the public ip from ingress:
 ```
diff --git a/deploy/mlflow-triton-plugin/README.md b/deploy/mlflow-triton-plugin/README.md
index c011194299..66758c6ec2 100644
--- a/deploy/mlflow-triton-plugin/README.md
+++ b/deploy/mlflow-triton-plugin/README.md
@@ -171,7 +171,7 @@ To delete a MLflow deployment using CLI
 mlflow deployments delete -t triton --name model_name
 ```

-To delete a MLflow deployment using CLI
+To delete a MLflow deployment using Python API

 ```
 from mlflow.deployments import get_deploy_client
diff --git a/deploy/oci/README.md b/deploy/oci/README.md
index 76022264bb..4b3c2c0ecc 100644
--- a/deploy/oci/README.md
+++ b/deploy/oci/README.md
@@ -132,7 +132,7 @@ To load the model from the OCI Object Storage Bucket, you need to convert the fo
 echo -n 'REGION' | base64
 ```
 ```
-echo -n 'SECRECT_KEY_ID' | base64
+echo -n 'SECRET_KEY_ID' | base64
 ```
 ```
 echo -n 'SECRET_ACCESS_KEY' | base64
diff --git a/docs/README.md b/docs/README.md
index a9604c0eae..1321aa185c 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -64,7 +64,7 @@ Triton Inference Server has a considerable list versatile and powerful features.
 The User Guide describes how to configure Triton, organize and configure your models, use the C++ and Python clients, etc. This guide includes the following:
 * Creating a Model Repository [[Overview](README.md#model-repository) || [Details](user_guide/model_repository.md)]
 * Writing a Model Configuration [[Overview](README.md#model-configuration) || [Details](user_guide/model_configuration.md)]
-* Buillding a Model Pipeline [[Overview](README.md#model-pipeline)]
+* Building a Model Pipeline [[Overview](README.md#model-pipeline)]
 * Managing Model Availability [[Overview](README.md#model-management) || [Details](user_guide/model_management.md)]
 * Collecting Server Metrics [[Overview](README.md#metrics) || [Details](user_guide/metrics.md)]
 * Supporting Custom Ops/layers [[Overview](README.md#framework-custom-operations) || [Details](user_guide/custom_operations.md)]
@@ -169,7 +169,7 @@ Use the [Triton Client](https://github.com/triton-inference-server/client) API t
 ### Cancelling Inference Requests
 Triton can detect and handle requests that have been cancelled from the client-side. This [document](user_guide/request_cancellation.md) discusses scope and limitations of the feature.
 ### Performance Analysis
-Understanding Inference performance is key to better resource utilization. Use Triton's Tools to costomize your deployment.
+Understanding Inference performance is key to better resource utilization. Use Triton's Tools to customize your deployment.
 - [Performance Tuning Guide](user_guide/performance_tuning.md)
 - [Optimization](user_guide/optimization.md)
 - [Model Analyzer](user_guide/model_analyzer.md)
diff --git a/docs/getting_started/llm.md b/docs/getting_started/llm.md
index cecf565f51..f20fd5dca9 100644
--- a/docs/getting_started/llm.md
+++ b/docs/getting_started/llm.md
@@ -28,7 +28,7 @@
 # Deploying Phi-3 Model with Triton and TRT-LLM

-This guide captures the steps to build Phi-3 with TRT-LLM and deploy with Triton Inference Server. It also shows a shows how to use GenAI-Perf to run benchmarks to measure model performance in terms of throughput and latency.
+This guide captures the steps to build Phi-3 with TRT-LLM and deploy with Triton Inference Server. It also shows how to use GenAI-Perf to run benchmarks to measure model performance in terms of throughput and latency.

 This guide is tested on A100 80GB SXM4 and H100 80GB PCIe. It is confirmed to work with Phi-3-mini-128k-instruct and Phi-3-mini-4k-instruct (see [Support Matrix](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/phi) for full list) using TRT-LLM v0.11 and Triton Inference Server 24.07.
diff --git a/docs/getting_started/quickstart.md b/docs/getting_started/quickstart.md
index c016572535..1095eabc33 100644
--- a/docs/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
@@ -28,7 +28,7 @@
 # Quickstart

-**New to Triton Inference Server and want do just deploy your model quickly?**
+**New to Triton Inference Server and want to just deploy your model quickly?**
 Make use of [these tutorials](https://github.com/triton-inference-server/tutorials#quick-deploy) to begin your Triton journey!
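For readers applying the deploy/aws changes above, the corrected pieces fit together roughly as follows. This is a minimal shell sketch, not part of the patch: the region, credential strings, namespace, image tag, and bucket name are placeholders modeled on the examples in deploy/aws/README.md, and the encoded secrets still need to be placed into the chart's values as that guide describes.

```
# Placeholders only -- substitute your own region, credentials, and bucket name.
echo -n 'us-west-2' | base64              # encoded region for the chart's secret
echo -n 'MY_SECRET_KEY_ID' | base64       # encoded access key id
echo -n 'MY_SECRET_ACCESS_KEY' | base64   # encoded secret access key

# Point the chart at an s3:// (not gs://) model repository and install it.
cat << EOF > config.yaml
namespace: MyCustomNamespace
image:
  imageName: nvcr.io/nvidia/tritonserver:custom-tag
  modelRepositoryPath: s3://my_model_repository
EOF
helm install example -f config.yaml .
```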