diff --git a/docs/source/blogs/H100vsA100.md b/docs/source/blogs/H100vsA100.md
index 06edd816202..9359863b548 100644
--- a/docs/source/blogs/H100vsA100.md
+++ b/docs/source/blogs/H100vsA100.md
@@ -28,7 +28,7 @@ TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x
 FP8 H100, FP16 A100, SXM 80GB GPUs, TP1, ISL/OSL's provided, TensorRT LLM v0.5.0., TensorRT 9.1

-The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html)
+The full data behind these charts and tables, including larger models with higher TP values, can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html)

 Stay tuned for a highlight on Llama coming soon!

diff --git a/docs/source/blogs/H200launch.md b/docs/source/blogs/H200launch.md
index 6fd0737c33d..39463990368 100644
--- a/docs/source/blogs/H200launch.md
+++ b/docs/source/blogs/H200launch.md
@@ -21,7 +21,7 @@ TensorRT LLM evaluation of the [new H200 GPU](https://nvidianews.nvidia.com/news
 *(1) Largest batch supported on given TP configuration by power of 2.*
 *(2) TP = Tensor Parallelism*

-Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html).
+Additional performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, and soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html).

 ### H200 vs H100

diff --git a/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md b/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
index f0d7647d001..fef8dcc93a2 100644
--- a/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
+++ b/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
@@ -124,7 +124,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
 Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.

-For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).

 ### Triton Inference Server

diff --git a/docs/source/features/disagg-serving.md b/docs/source/features/disagg-serving.md
index 8af2c188a5c..56267208975 100644
--- a/docs/source/features/disagg-serving.md
+++ b/docs/source/features/disagg-serving.md
@@ -186,7 +186,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
 Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection.
 The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.

-For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).

 ## Environment Variables

diff --git a/docs/source/index.rst b/docs/source/index.rst
index 0389ebd489e..54fd218afdf 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -74,6 +74,7 @@ Welcome to TensorRT LLM's Documentation!
    features/checkpoint-loading.md
    features/auto_deploy/auto-deploy.md
+

 .. toctree::
    :maxdepth: 2
    :caption: Developer Guide

diff --git a/examples/llm-api/extra-llm-api-config.yml b/examples/llm-api/extra-llm-api-config.yml
new file mode 100644
index 00000000000..120cfea82e4
--- /dev/null
+++ b/examples/llm-api/extra-llm-api-config.yml
@@ -0,0 +1,5 @@
+cuda_graph_config:
+  enable_padding: True
+  max_batch_size: 16
+moe_config:
+  backend: trtllm

diff --git a/examples/models/core/multimodal/README.md b/examples/models/core/multimodal/README.md
index d001424bfc8..d92ec168bb8 100644
--- a/examples/models/core/multimodal/README.md
+++ b/examples/models/core/multimodal/README.md
@@ -901,7 +901,7 @@ Note that for instruct Vision model, please set the `max_encoder_input_len` as `

 ## NeVA

-[NeVA](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/multimodal/mllm/neva.html) is a groundbreaking addition to the NeMo Multimodal ecosystem. This model seamlessly integrates large language-centric models with a vision encoder, that can be deployed in TensorRT-LLM.
+[NeVA](https://docs.nvidia.com/nemo-framework/user-guide/latest/vlms/neva.html) is a groundbreaking addition to the NeMo Multimodal ecosystem. This model seamlessly integrates large language-centric models with a vision encoder and can be deployed in TensorRT-LLM.

 1. Generate TRT-LLM engine for NVGPT following example in `examples/models/core/gpt/README.md`. To adhere to the NVGPT conventions of the conversion script, some layer keys have to be remapped using `--nemo_rename_key`.

diff --git a/examples/sample_weight_stripping/README.md b/examples/sample_weight_stripping/README.md
index a427dd3df45..dcc7d754f79 100644
--- a/examples/sample_weight_stripping/README.md
+++ b/examples/sample_weight_stripping/README.md
@@ -241,7 +241,7 @@ python3 ../summarize.py --engine_dir engines/llama2-70b-hf-fp8-tp2.refit \

 ## Prototype

 ### Checkpoint Pruner
-The checkpoint pruner allows you to strip `Conv` and `Gemm` weights out of a TensorRT LLM [checkpoint](https://nvidia.github.io/TensorRT-LLM/latest/architecture/checkpoint.html). Since these make up the vast majority of weights, the pruner will decrease the size of your checkpoint up to 99%.
+The checkpoint pruner allows you to strip `Conv` and `Gemm` weights out of a TensorRT LLM [checkpoint](https://nvidia.github.io/TensorRT-LLM/0.21.0/architecture/checkpoint.html). Since these make up the vast majority of weights, the pruner will decrease the size of your checkpoint by up to 99%.

 When building an engine with a pruned checkpoint, TensorRT LLM fills in the missing weights with random ones. These weights should later be [refit](#engine-refitter) with the original weights to preserve the intended behavior.
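
The new `examples/llm-api/extra-llm-api-config.yml` above carries PyTorch-workflow tuning options: CUDA graphs with padding up to a batch size of 16, and the `trtllm` MoE kernel backend. As a minimal sketch of how such a file is typically consumed, assuming the `--extra_llm_api_options` flag exposed by `trtllm-serve` in recent TensorRT-LLM releases and using a placeholder model name:

```bash
# Sketch only: pass the YAML above to the OpenAI-compatible server.
# Assumptions: --extra_llm_api_options is available (recent trtllm-serve),
# and the MoE model name below is just a placeholder.
trtllm-serve Qwen/Qwen3-30B-A3B \
  --extra_llm_api_options examples/llm-api/extra-llm-api-config.yml
```

The YAML keys are expected to mirror LLM API options of the same names (`cuda_graph_config`, `moe_config`), so the file mainly spares you from repeating those settings on every run.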