
Commit 7020b3e

integrated vlm code for benchmark for Eagle2 and Qwen2.5-VL (#3698)
1 parent 0559c35 commit 7020b3e

File tree

10 files changed (+1388 −37 lines)


docsrc/tutorials/compile_hf_models.rst

Lines changed: 61 additions & 2 deletions
@@ -18,6 +18,7 @@ Overview of tools/llm Directory
 The ``tools/llm`` directory provides the following tools to compile LLM models from Huggingface:
 
 * **run_llm.py**: Main entry point for model compilation, generating outputs, and benchmarking
+* **run_vlm.py**: Entry point for compiling and benchmarking Visual Language Models (VLMs)
 * **Static Cache Utilities**: ``static_cache_v1.py`` and ``static_cache_v2.py`` for KV cache optimization
 * **SDPA Attention**: ``sdpa_converter.py`` and ``register_sdpa.py`` for registering scaled dot-product attention converter and lowering pass.
 * **Testing Components**: Model-specific test files for validation
@@ -64,6 +65,30 @@ We have officially verified support for the following LLM families:
     - FP16, FP32
     - Yes
 
+Supported VLM Models
+--------------------
+We have officially verified support for the following Visual Language Models (VLMs):
+
+.. list-table::
+   :widths: 20 40 20 20 20
+   :header-rows: 1
+
+   * - Model Series
+     - HuggingFace Model Card
+     - Precision
+     - KV Cache Support ?
+     - Component Support
+   * - Qwen 2.5 VL
+     - Qwen/Qwen2.5-VL-3B-Instruct
+     - FP16, FP32
+     - Yes (static_v1 only)
+     - Language Model only (Image Encoder not supported)
+   * - Eagle2
+     - nvidia/Eagle2-2B
+     - FP16, FP32
+     - Yes (static_v1 only)
+     - Language Model and Image Encoder both supported
+
 Getting Started with run_llm.py
 -------------------------------
 
@@ -116,6 +141,36 @@ Other Usage Examples
    python tools/llm/run_llm.py --model Qwen/Qwen2.5-1.5B-Instruct --precision FP32 --benchmark
 
 
+Getting Started with run_vlm.py
+-------------------------------
+
+For Visual Language Models (VLMs), use ``run_vlm.py`` to compile and benchmark models that process both text and images.
+
+Basic Usage
+^^^^^^^^^^^
+
+.. code-block:: bash
+
+   python tools/llm/run_vlm.py \
+     --model Qwen/Qwen2.5-VL-3B-Instruct \
+     --precision FP16 \
+     --num_tokens 128 \
+     --cache static_v1 \
+     --enable_pytorch_run \
+     --benchmark
+
+Key Arguments
+^^^^^^^^^^^^^
+
+* ``--model``: Name or path of the HuggingFace VLM
+* ``--prompt``: Input prompt for generation
+* ``--image_path``: (Optional) Path to input image file. If not provided, will use a sample image
+* ``--precision``: Precision mode (``FP16``, ``FP32``)
+* ``--num_tokens``: Number of output tokens to generate
+* ``--cache``: KV cache type (``static_v1`` or empty for no KV caching)
+* ``--benchmark``: Enable benchmarking mode
+* ``--enable_pytorch_run``: Also run and compare PyTorch baseline
+
 KV Caching in Torch-TensorRT
 ---------------------------------
 
@@ -126,7 +181,7 @@ The length of KV cache = input sequence length + output sequence length (specifi
 Static Cache v1
 ^^^^^^^^^^^^^^^^
 
-The ``static_cache_v1.py`` implements KV cache in the model graph as follows:
+The ``static_cache_v1.py`` implements KV cache in the model graph as follows:
 
 .. code-block:: python
 
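The Static Cache v1 change above appears to be whitespace-only, but since this part of the doc describes the cache (its length is the input sequence length plus the number of generated tokens), here is a conceptual sketch of a static KV cache in plain PyTorch. This is illustrative only and not the actual `static_cache_v1.py` graph transformation; the `StaticKVCache` class and its method names are made up for the example.

```python
import torch

class StaticKVCache:
    """Preallocated per-layer key/value buffers written in place at each decode step."""

    def __init__(self, max_len: int, num_heads: int, head_dim: int, dtype=torch.float16):
        # Cache length = input sequence length + number of tokens to generate.
        self.k = torch.zeros(1, num_heads, max_len, head_dim, dtype=dtype)
        self.v = torch.zeros(1, num_heads, max_len, head_dim, dtype=dtype)

    def update(self, pos: int, k_new: torch.Tensor, v_new: torch.Tensor):
        # Write new keys/values at the current position instead of concatenating,
        # so tensor shapes stay fixed across decode steps.
        seq = k_new.shape[2]
        self.k[:, :, pos : pos + seq] = k_new
        self.v[:, :, pos : pos + seq] = v_new
        # Attention should only read the filled prefix of the cache.
        return self.k[:, :, : pos + seq], self.v[:, :, : pos + seq]

cache = StaticKVCache(max_len=128, num_heads=8, head_dim=64)
k, v = cache.update(0, torch.randn(1, 8, 16, 64, dtype=torch.float16),
                    torch.randn(1, 8, 16, 64, dtype=torch.float16))
print(k.shape)  # torch.Size([1, 8, 16, 64])
```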
@@ -214,9 +269,13 @@ Limitations and Known Issues
 
 * Sliding window attention (used in Gemma3 and Qwen 3 models) is not yet supported
 * Some model architectures (e.g. Phi-4) have issues with exporting the torch model.
+* For VLMs, Qwen2.5-VL image encoder compilation is not supported due to dynamic operations incompatible with torch.export.
 
 Requirements
 ^^^^^^^^^^^^
 
 * Torch-TensorRT 2.8.0 or later
-* Transformers v4.52.3
+* Transformers v4.52.3
+* For VLM models (run_vlm.py):
+    - ``pip install qwen-vl-utils`` (for Qwen2.5-VL-3B-Instruct model)
+    - ``pip install flash-attn --no-build-isolation -v`` (for Eagle2-2B model)

py/torch_tensorrt/dynamo/conversion/impl/matmul.py

Lines changed: 6 additions & 1 deletion
@@ -48,9 +48,13 @@ def matrix_multiply(
     input, other = broadcast(
         ctx, input, other, f"{name}_input", f"{name}_other", preset_diff
     )
+    # Get the original input dtype
+    input_dtype = _enums.dtype._from(input.dtype).to(torch.dtype)
+
     if (
         ctx.net.get_flag(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
         and ctx.compilation_settings.use_fp32_acc
+        and input_dtype == torch.float16
     ):
         input = cast_trt_tensor(ctx, input, torch.float32, f"{name}_input_casted")
         other = cast_trt_tensor(ctx, other, torch.float32, f"{name}_other_casted")
@@ -63,9 +67,10 @@ def matrix_multiply(
     if (
         ctx.net.get_flag(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
         and ctx.compilation_settings.use_fp32_acc
+        and input_dtype == torch.float16
     ):
         matmul_output = cast_trt_tensor(
-            ctx, matmul_output, torch.float16, f"{name}_output_casted"
+            ctx, matmul_output, input_dtype, f"{name}_output_casted"
         )
 
     set_layer_name(matmul_layer, target, name, source_ir)
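In effect, the converter change above inserts the FP32-accumulation casts only when the matmul operands are FP16, and it restores the original input dtype on the output instead of hard-coding FP16. A minimal PyTorch sketch of the same pattern, illustrative only (the real converter manipulates TensorRT tensors via `cast_trt_tensor`; `matmul_fp32_acc` is a made-up helper name):

```python
import torch

def matmul_fp32_acc(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Remember the original dtype so the result can be cast back to it,
    # mirroring the `input_dtype` bookkeeping added in the diff above.
    input_dtype = a.dtype
    if input_dtype == torch.float16:
        # Upcast FP16 operands so accumulation happens in FP32 ...
        out = a.float() @ b.float()
        # ... then hand the result back in the original precision.
        return out.to(input_dtype)
    # Non-FP16 inputs are left untouched, matching the new dtype guard.
    return a @ b

a = torch.randn(64, 64, dtype=torch.float16)
b = torch.randn(64, 64, dtype=torch.float16)
print(matmul_fp32_acc(a, b).dtype)  # torch.float16
```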

tests/py/dynamo/models/test_llm_models.py

Lines changed: 1 addition & 4 deletions
@@ -44,10 +44,7 @@ def test_llm_decoder_layer(precision):
         .to("cuda")
     )
 
-    if register_sdpa._SDPA_MAPPING.get(args.model, None) is not None:
-        register_sdpa._SDPA_MAPPING[args.model](model_config=model.config)
-    else:
-        register_sdpa._SDPA_MAPPING["default"](model_config=model.config)
+    register_sdpa.enable_sdpa_converter(args.model, model.config)
     model = model.to(dtype)
     # use randint will generate nan values in the logits, use a fixed input_ids for now
     # input_ids = torch.randint(0, model.config.vocab_size, (1, args.num_tokens)).to("cuda")
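This call site, and the matching one in `tools/llm/run_llm.py` further down, folds the explicit `_SDPA_MAPPING` lookup into a single `register_sdpa.enable_sdpa_converter` call. Judging from the removed lines, the helper presumably performs the same lookup with a fallback to the `"default"` entry; here is a hedged, self-contained sketch of that behaviour, where the mapping contents are placeholders rather than the real registry:

```python
from typing import Any, Callable, Dict

# Placeholder stand-in for register_sdpa._SDPA_MAPPING: the real mapping holds
# model-specific SDPA registration callables keyed by HuggingFace model name.
_SDPA_MAPPING: Dict[str, Callable[..., None]] = {
    "default": lambda model_config: print("registering default SDPA lowering"),
    "some-org/some-model": lambda model_config: print("registering model-specific SDPA lowering"),
}

def enable_sdpa_converter(model_name: str, model_config: Any) -> None:
    # Use the model-specific handler when one is registered, otherwise fall
    # back to "default" -- the branch the old call sites spelled out by hand.
    handler = _SDPA_MAPPING.get(model_name, _SDPA_MAPPING["default"])
    handler(model_config=model_config)

enable_sdpa_converter("meta-llama/Llama-3.2-1B-Instruct", model_config=None)
```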

tools/llm/README.md

Lines changed: 25 additions & 4 deletions
@@ -1,10 +1,11 @@
 # Optimizing LLMs in Torch-TensorRT
 
-This directory provides utilities and scripts for compiling, optimizing, and benchmarking Large Language Models (LLMs) using Torch-TensorRT, with a focus on efficient inference on NVIDIA GPUs. The main entry point is `run_llm.py`, which demonstrates how to export, compile, and run LLMs with various caching strategies and precision modes. Note that this is an **experimental release** and APIs may change in future versions.
+This directory provides utilities and scripts for compiling, optimizing, and benchmarking Large Language Models (LLMs) and Visual Language Models (VLMs) using Torch-TensorRT, with a focus on efficient inference on NVIDIA GPUs. The main entry points are `run_llm.py` for text-only LLMs and `run_vlm.py` for vision-language models. Note that this is an **experimental release** and APIs may change in future versions.
 
 ### Key Features
 
 - **Model Support:** Works with popular LLMs such as Llama-3, Qwen2.5, etc.
+- **VLM Support:** Supports Visual Language Models like Qwen2.5-VL and Eagle2.
 - **Precision Modes:** Supports FP16, BF16, and FP32.
 - **KV Cache:** Supports static and dynamic KV cache for efficient autoregressive decoding.
 - **Benchmarking:** Measures and compares throughput and latency for PyTorch and TensorRT backends.
@@ -25,20 +26,33 @@ We have officially verified support for the following models:
 | Qwen 3 | Qwen/Qwen3-0.6B<br>Qwen/Qwen3-1.7B<br>Qwen/Qwen3-4B<br>Qwen/Qwen3-8B | FP16, FP32 | Yes |
 | Gemma 3 | google/gemma-3-1b-it | FP16, FP32 | Yes |
 
+### Supported VLM Models
+
+| Model Series | HF Model Card | Precision | KV Cache Supported ? |
+|--------------|---------------|-----------|-------------------|
+| Qwen 2.5 VL | Qwen/Qwen2.5-VL-3B-Instruct | FP16, FP32 | Yes |
+| Eagle2 | nvidia/Eagle2-2B | FP16, FP32 | Yes |
 
 ### Usage
 
-The main entry point is : `run_llm.py`
+#### Text-only LLMs: `run_llm.py`
 
 ```bash
 python run_llm.py --model meta-llama/Llama-3.2-1B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark
 ```
 
+#### Vision Language Models: `run_vlm.py`
+
+```bash
+python run_vlm.py --model nvidia/Eagle2-2B --precision FP16 --num_tokens 128 --cache static_v1 --enable_pytorch_run --benchmark
+```
+
 #### Key Arguments
 
-- `--model`: Name or path of the HuggingFace LLM.
+- `--model`: Name or path of the HuggingFace LLM/VLM.
 - `--tokenizer`: (Optional) Tokenizer name; defaults to model.
 - `--prompt`: Input prompt for generation.
+- `--image_path`: (Optional) Path to input image file for VLM models. If not provided, will use a sample image.
 - `--precision`: Precision mode (`FP16`, `FP32`).
 - `--num_tokens`: Number of output tokens to generate.
 - `--cache`: KV cache type (`static_v1`, `static_v2`, or empty for no KV caching).
@@ -61,8 +75,15 @@ This codebase can be extended to
 
 ## Limitations
 - We do not currently support sliding window attention (used in Gemma3 and Qwen 3 models) yet.
+- **Flash Attention Limitation**: Some models (e.g., Eagle2-2B) internally use flash attention operations (`torch.ops.flash_attn._flash_attn_forward.default`) which require the `flash-attn` package to be installed. Without flash-attn, these models will fail to load or run properly.
+- **Qwen2.5‑VL vision is not compiled (LLM-only)**: We only compile the language model for Qwen2.5‑VL. The vision encoder is skipped because its `get_window_index` relies on dynamic Python operations.
 
 ## Requirements
 
 - Torch-TensorRT 2.8.0
-- Transformers v4.52.3
+- Transformers v4.52.3
+- For VLM models (run_vlm.py):
+  - `pip install qwen-vl-utils` (for Qwen2.5-VL-3B-Instruct model)
+  - **Flash Attention**: For models using flash attention operations (e.g., Eagle2-2B), install one of the following:
+    - **Fast installation (recommended)**: `pip install flash-attn==2.8.1` (pre-built wheel, should work)
+    - **Source build (slow)**: `pip install flash-attn --no-build-isolation -v` (fallback if pre-built wheels fail)

tools/llm/run_llm.py

Lines changed: 1 addition & 4 deletions
@@ -59,10 +59,7 @@ def get_model(args):
         .cuda()
     )
     # register SDPA variant for the model
-    if register_sdpa._SDPA_MAPPING.get(args.model, None) is not None:
-        register_sdpa._SDPA_MAPPING[args.model](model_config=model.config)
-    else:
-        register_sdpa._SDPA_MAPPING["default"](model_config=model.config)
+    register_sdpa.enable_sdpa_converter(args.model, model.config)
 
     if args.precision == "FP16":
         model = model.to(torch.float16)
