This repository records my own journey to learn AI. It includes the following categories.
- transformer/self_attention_example.py: Basic self-attention implementation with fixed 64-dimension key/value
- transformer/self_attn.py: BERT attention visualization
- transformer/self_attn.ipynb: Jupyter notebook for attention analysis
- transformer/flex_attention.py: Custom multihead attention with performance benchmarking
- transformer/flash_attention_comparison.py: FlashAttention vs FlexAttention comparison
- transformer/attention_benchmark_comparison.py: NEW - Comprehensive comparison of Simple vs Flash vs Flex Attention (inference focus)
- transformer/test_flash_vs_flex.py: Test script for attention comparison
- pytorch/flex_attn_test.py: PyTorch FlexAttention test with 64-dimension key/value
- pytorch/simple_flex_test.py: NEW - Simple FlexAttention test for inference (< 50 lines)
To run the examples (each from the repository root):

```bash
# Simple FlexAttention inference test
cd pytorch
python simple_flex_test.py
```

```bash
# Comprehensive Simple vs Flash vs Flex attention benchmark
cd transformer
python attention_benchmark_comparison.py
```

```bash
# Flash vs Flex attention test script
cd transformer
python test_flash_vs_flex.py
```

```bash
# PyTorch FlexAttention test
cd pytorch
python flex_attn_test.py
```

```bash
# Custom multihead attention benchmark
cd transformer
python flex_attention.py
```
All attention implementations use a fixed 64-dimensional size for key and value tensors (a minimal sketch follows the list below), providing:
- Consistent performance characteristics
- Easier comparison between implementations
- Optimized memory usage patterns
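As an illustration of this fixed-dimension setup, here is a minimal, self-contained sketch of single-head scaled dot-product attention with a 64-dimensional key/value size. It is illustrative only, not code taken from the repository files:

```python
import torch

# Illustrative single-head scaled dot-product attention using the
# fixed 64-dimensional key/value size shared by these examples.
head_dim = 64
batch, seq_len = 2, 128

q = torch.randn(batch, seq_len, head_dim)
k = torch.randn(batch, seq_len, head_dim)
v = torch.randn(batch, seq_len, head_dim)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5  # (batch, seq_len, seq_len)
weights = scores.softmax(dim=-1)                     # attention weights
out = weights @ v                                    # (batch, seq_len, 64)
print(out.shape)  # torch.Size([2, 128, 64])
```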
- Simple Attention: Baseline implementation for comparison
- FlashAttention: Optimized attention using PyTorch's SDPA with the FlashAttention backend
- FlexAttention: PyTorch's flexible attention implementation with score modification (a usage sketch of both APIs follows this list)
- Inference Performance: Optimized for inference workloads (no gradient computation)
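For context, here is a hedged sketch of how the two PyTorch APIs can be invoked for inference. It assumes PyTorch 2.5+ (where torch.nn.attention.flex_attention is available), a CUDA GPU, and half precision for the FlashAttention backend; the comparison scripts in this repository may wrap these calls differently:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel
from torch.nn.attention.flex_attention import flex_attention

batch, heads, seq_len, head_dim = 2, 8, 256, 64
q, k, v = (torch.randn(batch, heads, seq_len, head_dim,
                       device="cuda", dtype=torch.float16) for _ in range(3))

with torch.inference_mode():
    # FlashAttention: scaled_dot_product_attention restricted to the flash backend.
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        flash_out = F.scaled_dot_product_attention(q, k, v)

    # FlexAttention: the same computation expressed through a score_mod hook.
    # The identity score_mod is used here; causal masks, ALiBi biases, etc.
    # would be implemented inside this function.
    def identity_score_mod(score, b, h, q_idx, kv_idx):
        return score

    flex_out = flex_attention(q, k, v, score_mod=identity_score_mod)

print(torch.allclose(flash_out, flex_out, atol=1e-2))
```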
- Performance comparison across different sequence lengths
- Memory usage analysis
- Output correctness validation
- Speedup calculations relative to the fastest implementation
- Tokens per second metrics for throughput analysis (see the measurement sketch below)
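As a rough sketch of how such latency, speedup, and tokens-per-second numbers are commonly obtained, the helper below times an attention callable in inference mode. The function is illustrative and not one of the repository scripts' actual helpers:

```python
import time
import torch
import torch.nn.functional as F

def benchmark(attn_fn, q, k, v, iters=50):
    """Time an attention callable; return average latency and tokens/second.
    Illustrative only; the repository scripts may measure differently."""
    for _ in range(5):            # warm-up to exclude kernel selection overhead
        attn_fn(q, k, v)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.inference_mode():  # inference focus: no gradient bookkeeping
        for _ in range(iters):
            attn_fn(q, k, v)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    batch, _, seq_len, _ = q.shape
    return elapsed, batch * seq_len / elapsed  # seconds/iter, tokens/second

q = k = v = torch.randn(2, 8, 256, 64)  # (batch, heads, seq_len, head_dim=64)
latency, tps = benchmark(F.scaled_dot_product_attention, q, k, v)
print(f"{latency * 1e3:.2f} ms/iter, {tps:,.0f} tokens/s")
# Speedup for each implementation = fastest latency / that implementation's latency.
```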