@@ -10,51 +10,51 @@ state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.<
 [![python](https://img.shields.io/badge/python-3.10-green)](https://www.python.org/downloads/release/python-31012/)
 [![cuda](https://img.shields.io/badge/cuda-13.0.0-green)](https://developer.nvidia.com/cuda-downloads)
 [![trt](https://img.shields.io/badge/TRT-10.13.2-green)](https://developer.nvidia.com/tensorrt)
-[![version](https://img.shields.io/badge/release-1.2.0rc3-green)](./tensorrt_llm/version.py)
-[![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)
+[![version](https://img.shields.io/badge/release-1.2.0rc3-green)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/version.py)
+[![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/LICENSE)
 
-[Architecture](./docs/source/torch/arch_overview.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Performance](./docs/source/performance/perf-overview.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](https://nvidia.github.io/TensorRT-LLM/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Roadmap](https://github.com/NVIDIA/TensorRT-LLM/issues?q=is%3Aissue%20state%3Aopen%20label%3Aroadmap)
+[Architecture](https://nvidia.github.io/TensorRT-LLM/developer-guide/overview.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Performance](https://nvidia.github.io/TensorRT-LLM/developer-guide/perf-overview.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](https://nvidia.github.io/TensorRT-LLM/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Roadmap](https://github.com/NVIDIA/TensorRT-LLM/issues?q=is%3Aissue%20state%3Aopen%20label%3Aroadmap)
 
 ---
 <div align="left">
 
 ## Tech Blogs
 
 * [10/13] Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog14_Scaling_Expert_Parallelism_in_TensorRT-LLM_part3.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog14_Scaling_Expert_Parallelism_in_TensorRT-LLM_part3.html)
 
 * [09/26] Inference Time Compute Implementation in TensorRT LLM
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog13_Inference_Time_Compute_Implementation_in_TensorRT-LLM.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog13_Inference_Time_Compute_Implementation_in_TensorRT-LLM.html)
 
 * [09/19] Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.html)
 
 * [08/29] ADP Balance Strategy
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog10_ADP_Balance_Strategy.html)
 
 * [08/05] Running a High-Performance GPT-OSS-120B Inference Server with TensorRT LLM
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.html)
 
 * [08/01] Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.html)
 
 * [07/26] N-Gram Speculative Decoding in TensorRT LLM
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog7_NGram_performance_Analysis_And_Auto_Enablement.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog7_NGram_performance_Analysis_And_Auto_Enablement.html)
 
 * [06/19] Disaggregated Serving in TensorRT LLM
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.html)
 
 * [06/05] Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.html)
 
 * [05/30] Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.html)
 
 * [05/23] DeepSeek R1 MTP Implementation and Optimization
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.html)
 
 * [05/16] Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.html)
 
 ## Latest News
 * [08/05] 🌟 TensorRT LLM delivers Day-0 support for OpenAI's latest open-weights models: GPT-OSS-120B [➡️ link](https://huggingface.co/openai/gpt-oss-120b) and GPT-OSS-20B [➡️ link](https://huggingface.co/openai/gpt-oss-20b)
@@ -63,11 +63,11 @@ state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.<
 * [05/22] Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick
 ✨ [➡️ link](https://developer.nvidia.com/blog/blackwell-breaks-the-1000-tps-user-barrier-with-metas-llama-4-maverick/)
 * [04/10] TensorRT LLM DeepSeek R1 performance benchmarking best practices now published.
-✨ [➡️ link](./docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.html)
 
 * [04/05] TensorRT LLM can run Llama 4 at over 40,000 tokens per second on B200 GPUs!
 
-![L4_perf](./docs/source/media/l4_launch_perf.png)
+![L4_perf](https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/docs/source/media/l4_launch_perf.png)
 
 
 * [03/22] TensorRT LLM is now fully open-source, with developments moved to GitHub!