A comprehensive tool for visualizing attention patterns in the Qwen2.5-VL vision-language model during text generation. It lets you see exactly which parts of an image the model attends to while generating each token.
- Python 3.8+
- CUDA-capable GPU (recommended, 8GB+ VRAM for 3B model)
- 16GB+ RAM
```shell
cd Qwen_VL_2_5_Visualizer

# Using conda
conda create -n qwen_viz python=3.10
conda activate qwen_viz

# Or using venv
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install -r requirements.txt
python app.py
```

Then open your browser and navigate to: http://127.0.0.1:7861
Edit config.py to customize:
```python
# Model settings
MODEL_NAME = "Qwen/Qwen2.5-VL-3B-Instruct"

# Attention extraction
EXTRACT_ALL_LAYERS = True   # Or specify SPECIFIC_LAYERS

# Visualization
HEATMAP_COLORMAP = "jet"    # 'hot', 'viridis', 'plasma', etc.
HEATMAP_ALPHA = 0.5         # Transparency (0-1)

# Memory optimization
STORE_ON_CPU = True         # Offload attention to CPU
USE_FLOAT16 = True          # Use half precision
```

```
┌─────────────────────────────────────────────────────┐
│                  Qwen2.5-VL Model                   │
│  ┌──────────────────┐         ┌──────────────────┐  │
│  │  Vision Encoder  │────────▶│  Language Model  │  │
│  │  (Patch Embed +  │         │ (Decoder Layers) │  │
│  │   Transformer)   │         │                  │  │
│  └──────────────────┘         └──────────────────┘  │
│           │                            │            │
│           │                     ┌──────▼──────┐     │
│           │                     │  Attention  │◀────┤
│           │                     │   Weights   │Hooks│
│           │                     └─────────────┘     │
└───────────┼────────────────────────────┬────────────┘
            │                            │
            ▼                            ▼
    ┌───────────────┐           ┌──────────────────┐
    │ Image Patches │           │  Attention Maps  │
    │  Coordinates  │           │   (per token)    │
    └───────┬───────┘           └────────┬─────────┘
            │                            │
            └─────────────┬──────────────┘
                          ▼
               ┌──────────────────────┐
               │    Visualization     │
               │  (Heatmap Overlay)   │
               └──────────────────────┘
```
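The "Hooks" path in the diagram, which captures per-layer attention weights from the decoder, can be sketched roughly as below. This is an illustrative sketch, not the project's `attention_extractor.py`: the `model.model.layers` / `self_attn` module paths and the assumption that each attention module returns `(hidden_states, attn_weights, ...)` hold for typical Hugging Face decoder implementations only when attention weights are actually materialized (e.g. with the eager attention implementation).

```python
import torch

class AttentionExtractor:
    """Capture per-layer attention weights via forward hooks (illustrative sketch)."""

    def __init__(self, model, store_on_cpu=True):
        self.store_on_cpu = store_on_cpu
        self.attentions = {}  # layer index -> list of [batch, heads, q_len, k_len] tensors
        self.handles = []
        # Assumption: decoder layers live at model.model.layers, each exposing .self_attn.
        for i, layer in enumerate(model.model.layers):
            self.handles.append(
                layer.self_attn.register_forward_hook(self._make_hook(i))
            )

    def _make_hook(self, layer_idx):
        def hook(module, inputs, output):
            # Assumption: the module returns (hidden_states, attn_weights, ...) when
            # attention weights are materialized (e.g. eager attention implementation).
            if isinstance(output, tuple) and len(output) > 1 and output[1] is not None:
                attn = output[1].detach()
                if self.store_on_cpu:  # mirrors the STORE_ON_CPU config option
                    attn = attn.to("cpu")
                self.attentions.setdefault(layer_idx, []).append(attn)
        return hook

    def remove(self):
        """Detach all hooks once generation is finished."""
        for h in self.handles:
            h.remove()
        self.handles.clear()
```

After a generation pass, the per-step attention tensors for each layer can then be read back from `attentions[layer_idx]`.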
- Attention Extractor (`attention_extractor.py`)
  - Registers forward hooks on decoder attention modules
  - Captures attention weights during generation
  - Stores per-layer, per-head attention
- Attention Processor (`attention_processor.py`)
  - Maps vision tokens to image patch positions
  - Aggregates attention across layers/heads
  - Creates 2D attention maps
- Visualizer (`visualization.py`)
  - Generates heatmap overlays
  - Supports multiple colormaps and transparency
  - Creates comparison views
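To make the Visualizer's heatmap overlay concrete, here is a minimal sketch using matplotlib and Pillow. The `overlay_heatmap` helper is hypothetical (not the actual `visualization.py` API); it assumes a 2D attention map at patch-grid resolution and upsamples it to the image size:

```python
import numpy as np
from PIL import Image
from matplotlib import colormaps

def overlay_heatmap(image, attn_map, colormap="jet", alpha=0.5):
    """Blend a 2D attention map over an image (hypothetical helper).

    image:    PIL.Image
    attn_map: 2D numpy array (patch-grid resolution), arbitrary scale
    """
    # Normalize attention to [0, 1] so the colormap covers the full range.
    attn = (attn_map - attn_map.min()) / (np.ptp(attn_map) + 1e-8)
    # Map to RGB, drop the alpha channel, and upsample to the image size.
    heat = colormaps[colormap](attn)[..., :3]
    heat = Image.fromarray((heat * 255).astype(np.uint8)).resize(image.size)
    # alpha mirrors the HEATMAP_ALPHA config setting.
    return Image.blend(image.convert("RGB"), heat, alpha)
```

The same blend call works for any colormap name registered with matplotlib, which is why config values like `'hot'` or `'viridis'` can be swapped in directly.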
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Qwen Team: For the amazing Qwen2.5-VL model
- Hugging Face: For the Transformers library
- Gradio: For the easy-to-use interface framework
For questions or issues, please open an issue on GitHub.