diff --git a/docs/source/blogs/H100vsA100.md b/docs/source/blogs/H100vsA100.md
index 06edd816202..9359863b548 100644
--- a/docs/source/blogs/H100vsA100.md
+++ b/docs/source/blogs/H100vsA100.md
@@ -28,7 +28,7 @@ TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x
FP8 H100, FP16 A100, SXM 80GB GPUs, TP1, ISL/OSLs provided, TensorRT LLM v0.5.0, TensorRT 9.1
-The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html)
+The full data behind these charts and tables, including larger models with higher TP values, can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html).
Stay tuned for a highlight on Llama coming soon!
diff --git a/docs/source/blogs/H200launch.md b/docs/source/blogs/H200launch.md
index 6fd0737c33d..39463990368 100644
--- a/docs/source/blogs/H200launch.md
+++ b/docs/source/blogs/H200launch.md
@@ -21,7 +21,7 @@ TensorRT LLM evaluation of the [new H200 GPU](https://nvidianews.nvidia.com/news
*(1) Largest batch supported on given TP configuration by power of 2.* *(2) TP = Tensor Parallelism*
-Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html).
+Additional performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, and soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html).
### H200 vs H100
diff --git a/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md b/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
index f0d7647d001..fef8dcc93a2 100644
--- a/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
+++ b/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
@@ -124,7 +124,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
-For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).
### Triton Inference Server
diff --git a/docs/source/features/disagg-serving.md b/docs/source/features/disagg-serving.md
index 8af2c188a5c..56267208975 100644
--- a/docs/source/features/disagg-serving.md
+++ b/docs/source/features/disagg-serving.md
@@ -186,7 +186,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
-For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).
## Environment Variables
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 0389ebd489e..54fd218afdf 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -74,6 +74,7 @@ Welcome to TensorRT LLM's Documentation!
features/checkpoint-loading.md
features/auto_deploy/auto-deploy.md
+
.. toctree::
:maxdepth: 2
:caption: Developer Guide
diff --git a/examples/llm-api/extra-llm-api-config.yml b/examples/llm-api/extra-llm-api-config.yml
new file mode 100644
index 00000000000..120cfea82e4
--- /dev/null
+++ b/examples/llm-api/extra-llm-api-config.yml
@@ -0,0 +1,5 @@
+cuda_graph_config:
+ enable_padding: True
+ max_batch_size: 16
+moe_config:
+ backend: trtllm
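
This options file is typically passed to the serving or benchmarking CLIs through the `--extra_llm_api_options` flag, which forwards the YAML keys to the LLM API. A minimal sketch is shown below; the model name is a placeholder, so substitute your own checkpoint.

```bash
# Minimal sketch: forward the extra LLM API options to trtllm-serve.
# <model_name_or_path> is a placeholder for a Hugging Face model ID or a local path.
trtllm-serve <model_name_or_path> \
    --extra_llm_api_options ./extra-llm-api-config.yml
```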
diff --git a/examples/models/core/multimodal/README.md b/examples/models/core/multimodal/README.md
index d001424bfc8..d92ec168bb8 100644
--- a/examples/models/core/multimodal/README.md
+++ b/examples/models/core/multimodal/README.md
@@ -901,7 +901,7 @@ Note that for instruct Vision model, please set the `max_encoder_input_len` as `
## NeVA
-[NeVA](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/multimodal/mllm/neva.html) is a groundbreaking addition to the NeMo Multimodal ecosystem. This model seamlessly integrates large language-centric models with a vision encoder, that can be deployed in TensorRT-LLM.
+[NeVA](https://docs.nvidia.com/nemo-framework/user-guide/latest/vlms/neva.html) is a groundbreaking addition to the NeMo Multimodal ecosystem. This model seamlessly integrates large language-centric models with a vision encoder and can be deployed with TensorRT-LLM.
1. Generate the TRT-LLM engine for NVGPT following the example in `examples/models/core/gpt/README.md`. To adhere to the NVGPT conventions of the conversion script, some layer keys have to be remapped using `--nemo_rename_key`.
diff --git a/examples/sample_weight_stripping/README.md b/examples/sample_weight_stripping/README.md
index a427dd3df45..dcc7d754f79 100644
--- a/examples/sample_weight_stripping/README.md
+++ b/examples/sample_weight_stripping/README.md
@@ -241,7 +241,7 @@ python3 ../summarize.py --engine_dir engines/llama2-70b-hf-fp8-tp2.refit \
## Prototype
### Checkpoint Pruner
-The checkpoint pruner allows you to strip `Conv` and `Gemm` weights out of a TensorRT LLM [checkpoint](https://nvidia.github.io/TensorRT-LLM/latest/architecture/checkpoint.html). Since these make up the vast majority of weights, the pruner will decrease the size of your checkpoint up to 99%.
+The checkpoint pruner allows you to strip `Conv` and `Gemm` weights out of a TensorRT LLM [checkpoint](https://nvidia.github.io/TensorRT-LLM/0.21.0/architecture/checkpoint.html). Since these make up the vast majority of the weights, the pruner can decrease the size of your checkpoint by up to 99%.
When building an engine with a pruned checkpoint, TensorRT LLM fills in the missing weights with random ones. These weights should later be [refit](#engine-refitter) with the original weights to preserve the intended behavior.
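
For orientation, a hypothetical end-to-end sketch of the prune, build, and refit flow is shown below. The entry-point and flag names here are assumptions for illustration only; defer to the exact pruner and refitter invocations documented in this README.

```bash
# Hypothetical sketch of the prune -> build -> refit flow; command and flag
# names are illustrative assumptions, not verbatim commands from this README.
trtllm-prune --checkpoint_dir ./llama2-70b-hf-fp8-tp2 \
             --out_dir ./llama2-70b-hf-fp8-tp2.pruned           # strip Conv/Gemm weights

trtllm-build --checkpoint_dir ./llama2-70b-hf-fp8-tp2.pruned \
             --output_dir engines/llama2-70b-hf-fp8-tp2.pruned  # missing weights are filled randomly

trtllm-refit --checkpoint_dir ./llama2-70b-hf-fp8-tp2 \
             --engine_dir engines/llama2-70b-hf-fp8-tp2.pruned \
             --output_dir engines/llama2-70b-hf-fp8-tp2.refit   # restore the original weights
```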