This repository provides a stupidly simple demonstration and visualization of Key-Value (KV) caching in transformer models, specifically using TinyLlama with Grouped Query Attention (GQA). It helps you understand how KV caching works and how it affects attention computation.
```
.
├── src/
│   ├── get_attention_projections.py   # Extract Q, K, V, O projections from TinyLlama
│   ├── visualize_attention.py         # Visualize attention patterns
│   ├── visualize_kv_caching.py        # Demonstrate KV caching effects
│   ├── attention_helpers/             # Helper functions for attention computation
│   └── plot_helpers/                  # Utilities for visualization
└── notebooks/
    ├── kv_cache_demo.ipynb            # Main demo notebook
    └── kv_cache_demo.py               # Python version of the demo
```
The source code is organized into three main components:
- **Attention Projections** (`get_attention_projections.py`)
  - Loads the TinyLlama model
  - Extracts the Query (Q), Key (K), Value (V), and Output (O) projections
  - Captures internal attention states during inference (see the sketch after this list)
- **Attention Visualization** (`visualize_attention.py`)
  - Visualizes raw attention patterns
  - Demonstrates Grouped Query Attention (GQA) mechanics
  - Shows the attention distribution across heads
- **KV Caching Visualization** (`visualize_kv_caching.py`)
  - Demonstrates how KV caching works
  - Visualizes attention patterns with and without caching
  - Shows cache reuse across multiple inputs
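To make the projection-capture step concrete, here is a minimal, self-contained sketch of how the Q, K, V, and O activations can be pulled out of TinyLlama with forward hooks. It assumes the `TinyLlama/TinyLlama-1.1B-Chat-v1.0` checkpoint from Hugging Face and only inspects the first decoder layer; the repository's `get_attention_projections.py` may be organized differently.

```python
# Sketch: capture Q/K/V/O projection outputs with forward hooks.
# Assumes the TinyLlama chat checkpoint on Hugging Face (hypothetical choice).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()  # store the projection output
    return hook

# Register hooks on the projection layers of the first decoder block.
attn = model.model.layers[0].self_attn
handles = [
    attn.q_proj.register_forward_hook(make_hook("q")),
    attn.k_proj.register_forward_hook(make_hook("k")),
    attn.v_proj.register_forward_hook(make_hook("v")),
    attn.o_proj.register_forward_hook(make_hook("o")),
]

inputs = tokenizer("KV caching is", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

for h in handles:
    h.remove()

for name, tensor in captured.items():
    print(name, tuple(tensor.shape))
```

The shapes already hint at GQA: the Q (and O) projections are 2048-dimensional (32 heads × 64), while K and V are only 256-dimensional (4 KV heads × 64).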
The `kv_cache_demo.ipynb` notebook (and its Python equivalent) serves as a one-stop shop for the entire demonstration. It:
- Covers everything the individual scripts do, in a single Jupyter notebook!
- Uses TinyLlama (1.1B parameters) as the base model
- Implements Grouped Query Attention (GQA) for efficient attention computation (see the sketch after this list)
- Demonstrates practical KV caching implementation
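Since GQA is central to how TinyLlama computes attention, the following hand-rolled sketch (illustrative only, not the notebook's exact code) shows the core mechanic: with 32 query heads but only 4 KV heads, each K/V head is repeated so that a group of 8 query heads attends against the same keys and values.

```python
# Sketch of the Grouped Query Attention step, using TinyLlama's shapes
# (32 query heads, 4 KV heads, head_dim 64). Random tensors, illustrative only.
import torch
import torch.nn.functional as F

batch, seq_len = 1, 6
n_q_heads, n_kv_heads, head_dim = 32, 4, 64
group_size = n_q_heads // n_kv_heads  # 8 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Repeat each KV head so every query head has a matching K/V head.
k = k.repeat_interleave(group_size, dim=1)  # -> (1, 32, 6, 64)
v = v.repeat_interleave(group_size, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5          # (1, 32, 6, 6)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
attn_weights = F.softmax(scores + causal_mask, dim=-1)
out = attn_weights @ v                                       # (1, 32, 6, 64)
print(out.shape)
```

Because only 4 K/V heads are stored per layer, the KV cache is 8× smaller than it would be with full multi-head attention over 32 heads.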
The repository demonstrates KV caching through the following approach:
- **Cache Creation**
  - Computes and stores the Key (K) and Value (V) vectors for input sequences
  - Shows how these cached vectors can be reused
- **Cache Utilization**
  - Demonstrates how subsequent tokens can reuse cached KV pairs (see the sketch after this list)
  - Shows the performance gain from avoiding redundant computation
- **Visualization**
  - Provides visual comparisons of attention patterns with and without caching
  - Shows how the attention computation changes when using cached values
  - Illustrates cache reuse across different input sequences
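A quick way to see cache reuse in action, outside the visualization scripts, is Hugging Face's built-in `past_key_values` mechanism. This is an illustrative sketch rather than the repository's own implementation; it assumes the `TinyLlama/TinyLlama-1.1B-Chat-v1.0` checkpoint, and note that recent `transformers` versions return a `Cache` object (which still supports per-layer indexing) for `past_key_values`.

```python
# Sketch: reuse the KV cache across two forward passes with TinyLlama.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt_ids = tokenizer("KV caching is", return_tensors="pt").input_ids

with torch.no_grad():
    # First pass: K/V for every prompt token are computed once and cached.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Second pass: only the new token is fed in; the cached K/V are reused,
    # so keys and values for earlier positions are never recomputed.
    out = model(next_id, past_key_values=past, use_cache=True)

print("cached layers:", len(past))
print("cached K shape (layer 0):", past[0][0].shape)  # (1, 4 KV heads, seq_len, 64)
```

On the second call only the new token's Q, K, and V projections are computed; the keys and values for the prompt come straight from the cache, which is exactly the saving the visualizations illustrate.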
It's high time you use uv to install the dependencies!

```bash
uv pip install -r requirements.txt
```
Sagar Sarkale