docs/source/overview.md (3 additions, 3 deletions)
@@ -4,7 +4,7 @@
## About TensorRT LLM
-[TensorRT LLM](https://developer.nvidia.com/tensorrt) is NVIDIA's comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs.
+[TensorRT LLM](https://developer.nvidia.com/tensorrt) is NVIDIA's comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs.
## Key Capabilities
@@ -40,7 +40,7 @@ TensorRT LLM strives to support the most popular models on **Day 0**.
### 🚀 **Advanced Optimization & Production Features**
-**[In-Flight Batching & Paged Attention](./features/paged-attention-ifb-scheduler.md)**: In-flight batching eliminates wait times by dynamically managing request execution, processing context and generation phases together for maximum GPU utilization and reduced latency.
-**[Multi-GPU Multi-Node Inference](./features/parallel-strategy.md)**: Seamless distributed inference with tensor, pipeline, and expert parallelism across multiple GPUs and nodes through the Model Definition API.

examples/wide_ep/README.md (5 additions, 5 deletions)
@@ -21,13 +21,13 @@ Wide-EP solves these challenges through:
### Prerequisites
-* GPU: GB200 NVL72, H20, or RTX 6000D.
+* GPU: GB200 NVL72, GB300 NVL72, H20, or RTX 6000D.
* OS: Linux
* Drivers: CUDA Driver 575 or Later
* Docker with NVIDIA Container Toolkit installed
* Python3 and python3-pip (Optional, for accuracy evaluation only)
-For GB200 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly setup, check if the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.
+For GB200/GB300 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly setup, check if the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.
For more information on NVIDIA IMEX service for NVLink networks, refer to https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html.
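As an illustration of the mount described above, here is a minimal sketch. It assumes the container is started with a plain `docker run`; the image name and the other flags are placeholders, not part of the original instructions:

```bash
# Hypothetical launch command: bind-mount the IMEX channel device nodes so
# MNNVL can be set up inside the container. <image> is a placeholder for the
# TensorRT LLM container image you actually use.
docker run --rm -it --gpus all \
  -v /dev/nvidia-caps-imex-channels:/dev/nvidia-caps-imex-channels \
  <image>
```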
@@ -108,16 +108,16 @@ If `never` is highlighted, enable Transparent HugePages by the following command
-GPU memory is also on NUMA nodes on GB200 and the system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
+GPU memory is also on NUMA nodes on GB200/GB300 NVL72 and the system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
```bash
numactl -m 0,1 <command>
```
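To see which NUMA nodes are CPU nodes before binding, the standard `numactl` tooling can be used. This is a hedged sketch; the node IDs `0,1` are simply the values from the example above and may differ on your system:

```bash
# List all NUMA nodes and their sizes; on GB200/GB300 NVL72 the GPU memory
# shows up as additional NUMA nodes alongside the CPU nodes.
numactl --hardware

# Then bind allocations to the CPU nodes only (0,1 here, as in the example
# above) when launching the workload. <command> is a placeholder.
numactl -m 0,1 <command>
```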
### Shared Memory on EPLB
-To achieve online load balancing, all expert weights are stored in shared host memory. Four ranks on the same GB200 node share the same expert weights to save memory.
+To achieve online load balancing, all expert weights are stored in shared host memory. Four ranks on the same GB200/GB300 NVL72 node share the same expert weights to save memory.
There is one environment variable `TRTLLM_EPLB_SHM_NAME` to specify the base name of the shared memory. This environment variable may need to be specified if there are multiple instances on one node. If not, you can ignore it.
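For example, a hedged sketch of launching two instances on one node with distinct shared-memory base names; the names and the `<command ...>` placeholders are illustrative, not prescribed by the README:

```bash
# Each instance gets its own base name so their EPLB expert-weight
# shared-memory segments do not collide (names are illustrative).
TRTLLM_EPLB_SHM_NAME=moe_shared_inst0 <command for instance 0> &
TRTLLM_EPLB_SHM_NAME=moe_shared_inst1 <command for instance 1> &
```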