2 changes: 1 addition & 1 deletion README.md
@@ -38,7 +38,7 @@ Triton Inference Server is an open source inference serving software that
streamlines AI inferencing. Triton enables teams to deploy any AI model from
multiple deep learning and machine learning frameworks, including TensorRT,
TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton
-Inference Server supports inference across cloud, data center, edge and embedded
+Inference Server supports inference across cloud, data center, edge, and embedded
devices on NVIDIA GPUs, x86 and ARM CPU, or AWS Inferentia. Triton Inference
Server delivers optimized performance for many query types, including real time,
batched, ensembles and audio/video streaming. Triton inference Server is part of
8 changes: 4 additions & 4 deletions deploy/alibaba-cloud/README.md
@@ -152,15 +152,15 @@ You will get the following result by running the python script:
[10] Avg rt(ms): 34.27
```
# Additional Resources
-See the following resources to learn more about how to use Alibaba Cloud's OSS orEAS.
+See the following resources to learn more about how to use Alibaba Cloud's OSS or EAS.
- [Alibaba Cloud OSS's Document](https://help.aliyun.com/product/31815.html?spm=a2c4g.11186623.6.540.3c0f62e7q3jw8b)


# Known Issues
- [Binary Tensor Data Extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md) is not fully supported yet, users want to use service with binary extension supported, it is only available in cn-shanghai region of PAI-EAS.
-- Currently only HTTP/1 is supported, hence gRPC cannot be used when query Triton servers on EAS. HTP/2 will be officially supported in a short time.
-- Users should not mount a whole OSS bucket when launching Triton processor, but an arbitrarily deep sub-directory in bucket. Otherwise the mounted path will no be as expected.
-- Not all of Triton Server parameters are be supported on EAS, the following params are supported on EAS:
+- Currently only HTTP/1 is supported, hence gRPC cannot be used when query Triton servers on EAS. HTTP/2 will be officially supported in a short time.
+- Users should not mount a whole OSS bucket when launching Triton processor, but an arbitrarily deep sub-directory in bucket. Otherwise the mounted path will not be as expected.
+- Not all of Triton Server parameters are supported on EAS, the following params are supported on EAS:
```
model-repository
log-verbose
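For context on the supported-parameter list above (the diff truncates it after the first two entries), here is a minimal sketch of a Triton launch that uses only those two visible parameters; the model repository path is hypothetical.

```
# Minimal sketch: only the two parameter names visible in the truncated list
# above are used here; adjust the repository path to your environment.
tritonserver --model-repository=/models --log-verbose=1
```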
8 changes: 4 additions & 4 deletions deploy/aws/README.md
@@ -39,7 +39,7 @@ This guide assumes you already have a functional Kubernetes cluster
and helm installed (see below for instructions on installing
helm). Note the following requirements:

-* The helm chart deploys Prometheus and Grafana to collect and display Triton metrics. To use this helm chart you must install Prpmetheus and Grafana in your cluster as described below and your cluster must contain sufficient CPU resources to support these services.
+* The helm chart deploys Prometheus and Grafana to collect and display Triton metrics. To use this helm chart you must install Prometheus and Grafana in your cluster as described below and your cluster must contain sufficient CPU resources to support these services.

* If you want Triton Server to use GPUs for inferencing, your cluster
must be configured to contain the desired number of GPU nodes (EC2 G4 instances recommended)
@@ -113,7 +113,7 @@ To load the model from the AWS S3, you need to convert the following AWS credent
echo -n 'REGION' | base64
```
```
-echo -n 'SECRECT_KEY_ID' | base64
+echo -n 'SECRET_KEY_ID' | base64
```
```
echo -n 'SECRET_ACCESS_KEY' | base64
@@ -189,7 +189,7 @@ $ cat << EOF > config.yaml
namespace: MyCustomNamespace
image:
imageName: nvcr.io/nvidia/tritonserver:custom-tag
-modelRepositoryPath: gs://my_model_repository
+modelRepositoryPath: s3://my_model_repository
EOF
$ helm install example -f config.yaml .
```
@@ -258,5 +258,5 @@ You may also want to delete the AWS bucket you created to hold the
model repository.

```
-$ aws s3 rm -r gs://triton-inference-server-repository
+$ aws s3 rm -r s3://triton-inference-server-repository
```
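As a companion to the corrected `s3://` paths above, a short sketch of how such a bucket and model repository might be created with the AWS CLI; the bucket name follows the one used in the hunk, and the local directory name is hypothetical.

```
# Sketch only: create the bucket referenced above and copy a local model
# repository into it; adjust region and paths to your environment.
aws s3 mb s3://triton-inference-server-repository
aws s3 cp ./model_repository s3://triton-inference-server-repository/model_repository --recursive
```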
2 changes: 1 addition & 1 deletion deploy/fleetcommand/README.md
@@ -77,7 +77,7 @@ section when creating the Fleet Command Deployment.

```
echo -n 'REGION' | base64
-echo -n 'SECRECT_KEY_ID' | base64
+echo -n 'SECRET_KEY_ID' | base64
echo -n 'SECRET_ACCESS_KEY' | base64
# Optional for using session token
echo -n 'AWS_SESSION_TOKEN' | base64
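When temporary credentials are in play, the optional AWS_SESSION_TOKEN shown above typically comes from STS. A hedged sketch of obtaining such credentials before base64-encoding them:

```
# Sketch only: returns AccessKeyId, SecretAccessKey, and SessionToken values,
# which can then be base64-encoded as shown in the hunk above.
aws sts get-session-token --duration-seconds 3600
```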
2 changes: 1 addition & 1 deletion deploy/gcp/README.md
@@ -118,7 +118,7 @@ can access the model repository. If the bucket is public then no
additional changes are needed and you can proceed to "Deploy
Prometheus and Grafana" section.

-If bucket premissions need to be set with the
+If bucket permissions need to be set with the
GOOGLE_APPLICATION_CREDENTIALS environment variable then perform the
following steps:

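A brief sketch of the GOOGLE_APPLICATION_CREDENTIALS setup the corrected sentence refers to, assuming a downloaded service-account key file (path hypothetical); the secret and namespace wiring is covered by the steps that follow in the guide.

```
# Sketch only: point GOOGLE_APPLICATION_CREDENTIALS at a service-account key
# that can read the model repository bucket, then activate it for gcloud.
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
gcloud auth activate-service-account --key-file="$GOOGLE_APPLICATION_CREDENTIALS"
```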
2 changes: 1 addition & 1 deletion deploy/gke-marketplace-app/README.md
@@ -151,7 +151,7 @@ Where <xx.yy> is the version of NGC Triton container needed.

![GKE Marketplace Application UI](ui.png)

-We want to discuss HPA autoscaling metrics users can leverage. GPU Power(Percentage of Power) tends to be a reliable metric, especially for larger GPU like V100 and A100. GKE currently natively support GPU duty cycle which is GPU utilization in `nvidia-smi`. We ask users always profile their model to determine the autoscaling target and metrics. When attempting to select the right metrics for autoscaling, the goal should be to pick metrics based on the following: 1, meet SLA rrequirement. 2, give consideration to transient request load, 3, keep GPU as fully utilized as possible. Profiling comes in 2 aspects: If user decided to use Duty Cycle or other GPU metric, it is recommend establish baseline to link SLA requirement such as latency with GPU metrics, for example, for model A, latency will be below 10ms 99% of time when Duty Cycle is below 80% utilized. Additionally, profiling also provide insight to model optimization for inference, with tools like [Nsight](https://developer.nvidia.com/nsight-systems).
+We want to discuss HPA autoscaling metrics users can leverage. GPU Power(Percentage of Power) tends to be a reliable metric, especially for larger GPU like V100 and A100. GKE currently natively support GPU duty cycle which is GPU utilization in `nvidia-smi`. We ask users always profile their model to determine the autoscaling target and metrics. When attempting to select the right metrics for autoscaling, the goal should be to pick metrics based on the following: 1, meet SLA requirement. 2, give consideration to transient request load, 3, keep GPU as fully utilized as possible. Profiling comes in 2 aspects: If user decided to use Duty Cycle or other GPU metric, it is recommend establish baseline to link SLA requirement such as latency with GPU metrics, for example, for model A, latency will be below 10ms 99% of time when Duty Cycle is below 80% utilized. Additionally, profiling also provide insight to model optimization for inference, with tools like [Nsight](https://developer.nvidia.com/nsight-systems).

Once the application is deployed successfully, get the public ip from ingress:
```
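Since the paragraph above recommends profiling to tie SLA latency to a GPU metric before choosing an autoscaling target, a hedged sketch of such a profiling run with Triton's perf_analyzer; the model name and concurrency range are hypothetical.

```
# Sketch only: sweep request concurrency and report p99 latency so it can be
# correlated with the GPU duty-cycle / power readings observed during the run.
perf_analyzer -m model_a --concurrency-range 1:8 --percentile 99
```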
2 changes: 1 addition & 1 deletion deploy/mlflow-triton-plugin/README.md
@@ -171,7 +171,7 @@ To delete a MLflow deployment using CLI
mlflow deployments delete -t triton --name model_name
```

-To delete a MLflow deployment using CLI
+To delete a MLflow deployment using Python API

```
from mlflow.deployments import get_deploy_client
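For context around the corrected heading, a sketch of the CLI lifecycle the delete command above belongs to; the model URI and name are hypothetical and depend on how the model was registered in MLflow.

```
# Sketch only: create, list, and delete a Triton deployment via the MLflow CLI.
mlflow deployments create -t triton --flavor triton --name model_name -m models:/model_name/1
mlflow deployments list -t triton
mlflow deployments delete -t triton --name model_name
```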
2 changes: 1 addition & 1 deletion deploy/oci/README.md
@@ -132,7 +132,7 @@ To load the model from the OCI Object Storage Bucket, you need to convert the fo
echo -n 'REGION' | base64
```
```
-echo -n 'SECRECT_KEY_ID' | base64
+echo -n 'SECRET_KEY_ID' | base64
```
```
echo -n 'SECRET_ACCESS_KEY' | base64
4 changes: 2 additions & 2 deletions docs/README.md
@@ -64,7 +64,7 @@ Triton Inference Server has a considerable list versatile and powerful features.
The User Guide describes how to configure Triton, organize and configure your models, use the C++ and Python clients, etc. This guide includes the following:
* Creating a Model Repository [[Overview](README.md#model-repository) || [Details](user_guide/model_repository.md)]
* Writing a Model Configuration [[Overview](README.md#model-configuration) || [Details](user_guide/model_configuration.md)]
-* Buillding a Model Pipeline [[Overview](README.md#model-pipeline)]
+* Building a Model Pipeline [[Overview](README.md#model-pipeline)]
* Managing Model Availability [[Overview](README.md#model-management) || [Details](user_guide/model_management.md)]
* Collecting Server Metrics [[Overview](README.md#metrics) || [Details](user_guide/metrics.md)]
* Supporting Custom Ops/layers [[Overview](README.md#framework-custom-operations) || [Details](user_guide/custom_operations.md)]
@@ -169,7 +169,7 @@ Use the [Triton Client](https://github.com/triton-inference-server/client) API t
### Cancelling Inference Requests
Triton can detect and handle requests that have been cancelled from the client-side. This [document](user_guide/request_cancellation.md) discusses scope and limitations of the feature.
### Performance Analysis
-Understanding Inference performance is key to better resource utilization. Use Triton's Tools to costomize your deployment.
+Understanding Inference performance is key to better resource utilization. Use Triton's Tools to customize your deployment.
- [Performance Tuning Guide](user_guide/performance_tuning.md)
- [Optimization](user_guide/optimization.md)
- [Model Analyzer](user_guide/model_analyzer.md)
2 changes: 1 addition & 1 deletion docs/getting_started/llm.md
@@ -28,7 +28,7 @@

# Deploying Phi-3 Model with Triton and TRT-LLM

-This guide captures the steps to build Phi-3 with TRT-LLM and deploy with Triton Inference Server. It also shows a shows how to use GenAI-Perf to run benchmarks to measure model performance in terms of throughput and latency.
+This guide captures the steps to build Phi-3 with TRT-LLM and deploy with Triton Inference Server. It also shows how to use GenAI-Perf to run benchmarks to measure model performance in terms of throughput and latency.

This guide is tested on A100 80GB SXM4 and H100 80GB PCIe. It is confirmed to work with Phi-3-mini-128k-instruct and Phi-3-mini-4k-instruct (see [Support Matrix](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/phi) for full list) using TRT-LLM v0.11 and Triton Inference Server 24.07.

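Since the corrected sentence mentions GenAI-Perf benchmarking, a heavily hedged sketch of such a run follows; flag names vary between GenAI-Perf releases and the model name "ensemble" is only the conventional TRT-LLM top-level model, so treat all of these as assumptions and confirm with `genai-perf profile --help`.

```
# Sketch only (flags are assumptions): measure throughput and latency of the
# deployed TRT-LLM model served by Triton.
genai-perf profile -m ensemble --service-kind triton --backend tensorrtllm --streaming
```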
2 changes: 1 addition & 1 deletion docs/getting_started/quickstart.md
@@ -28,7 +28,7 @@

# Quickstart

-**New to Triton Inference Server and want do just deploy your model quickly?**
+**New to Triton Inference Server and want to just deploy your model quickly?**
Make use of
[these tutorials](https://github.com/triton-inference-server/tutorials#quick-deploy)
to begin your Triton journey!