# Optimizing LLMs in Torch-TensorRT
This directory provides utilities and scripts for compiling, optimizing, and benchmarking Large Language Models (LLMs) and Visual Language Models (VLMs) using Torch-TensorRT, with a focus on efficient inference on NVIDIA GPUs. The main entry points are `run_llm.py` for text-only LLMs and `run_vlm.py` for vision-language models. Note that this is an **experimental release** and APIs may change in future versions.
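
For example, a minimal text-only run could look like the following sketch; the model name and flag values are illustrative (the key arguments are documented below):

```bash
# Compile and run a HuggingFace LLM with Torch-TensorRT (illustrative model name)
python run_llm.py \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --prompt "What is parallel programming?" \
  --precision FP16 \
  --num_tokens 128 \
  --cache static_v1
```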
### Key Features

- **Model Support:** Works with popular LLMs such as Llama-3, Qwen2.5, etc.
- **VLM Support:** Supports Visual Language Models like Qwen2.5-VL and Eagle2.
- **Precision Modes:** Supports FP16, BF16, and FP32.
- **KV Cache:** Supports static and dynamic KV cache for efficient autoregressive decoding.
- **Benchmarking:** Measures and compares throughput and latency for PyTorch and TensorRT backends.

### Key Arguments

- `--model`: Name or path of the HuggingFace LLM/VLM.
- `--tokenizer`: (Optional) Tokenizer name; defaults to the model name.
- `--prompt`: Input prompt for generation.
- `--image_path`: (Optional) Path to an input image file for VLM models. If not provided, a sample image is used.
- `--precision`: Precision mode (`FP16`, `FP32`).
- `--num_tokens`: Number of output tokens to generate.
- `--cache`: KV cache type (`static_v1`, `static_v2`, or empty for no KV caching).
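
Putting these flags together, a VLM invocation might look like the following sketch (the model identifier and image path are illustrative assumptions):

```bash
# Compile and run a vision-language model; omitting --image_path falls back to a sample image
python run_vlm.py \
  --model Qwen/Qwen2.5-VL-3B-Instruct \
  --image_path ./example.jpg \
  --prompt "Describe this image." \
  --precision FP16 \
  --num_tokens 128
```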
## Limitations

- We do not yet support sliding window attention (used in Gemma3 and Qwen3 models).
- **Flash Attention Limitation**: Some models (e.g., Eagle2-2B) internally use flash attention operations (`torch.ops.flash_attn._flash_attn_forward.default`) that require the `flash-attn` package to be installed (see the install note below). Without `flash-attn`, these models will fail to load or run properly.
- **Qwen2.5‑VL vision is not compiled (LLM-only)**: We only compile the language model for Qwen2.5‑VL. The vision encoder is skipped because its `get_window_index` relies on dynamic Python operations.
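
If you want to run flash-attention-dependent models such as Eagle2-2B, the `flash-attn` package is available on PyPI; its upstream documentation recommends installing with build isolation disabled, roughly:

```bash
# flash-attn compiles CUDA extensions against your local torch install
pip install flash-attn --no-build-isolation
```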