A language model architecture that uses multi-resolution fractal attention patterns for efficient autoregressive text generation.
FractalGPT organizes transformer layers into a fractal hierarchy with level-specific attention patterns:
- Level 0 (coarse): Strided causal attention -- sparse, long-range context
- Level 1 (medium): Mixed local window + global landmark attention
- Level 2 (fine): Dense local causal attention -- recent context focus
This creates a multi-scale representation where coarse levels capture global structure and fine levels handle local coherence, similar to how fractals exhibit self-similar structure at different scales.
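A minimal sketch of what the three level-specific masks could look like in PyTorch (function names, strides, and window sizes are illustrative assumptions, not the repo's actual API):

```python
import torch

def strided_causal_mask(seq_len: int, stride: int = 4) -> torch.Tensor:
    """Level 0: causal attention restricted to every `stride`-th key (sparse, long-range)."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (L, 1)
    k = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, L)
    return (k <= q) & (k % stride == 0)

def local_plus_landmark_mask(seq_len: int, window: int = 16, landmark_every: int = 32) -> torch.Tensor:
    """Level 1: causal local window unioned with sparse global 'landmark' keys."""
    q = torch.arange(seq_len).unsqueeze(1)
    k = torch.arange(seq_len).unsqueeze(0)
    causal = k <= q
    return (causal & (q - k < window)) | (causal & (k % landmark_every == 0))

def dense_local_mask(seq_len: int, window: int = 64) -> torch.Tensor:
    """Level 2: dense causal attention over recent context only."""
    q = torch.arange(seq_len).unsqueeze(1)
    k = torch.arange(seq_len).unsqueeze(0)
    return (k <= q) & (q - k < window)

# Each mask is a boolean (L, L) matrix (True = key may be attended) and can be
# passed as `attn_mask` to torch.nn.functional.scaled_dot_product_attention.
```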
Built on a hierarchical transformer with:
- Fractal attention patterns: Each level uses a different causal attention sparsity pattern
- Hierarchical FFN sizing: Finer levels get larger feed-forward networks
- Adaptive computation: Optional layer skipping at inference time based on activation norms (see the sketch after this list)
- Standard training: AdamW optimizer with cosine LR schedule
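As a rough illustration of the adaptive-computation idea, here is a hedged sketch of norm-based layer skipping. The `SkippableBlock` wrapper and the threshold heuristic are assumptions for illustration; the repo's actual skip criterion may differ.

```python
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    """Wraps a transformer block and, at inference time, skips it when the
    mean per-token activation norm of the input falls below a threshold."""

    def __init__(self, block: nn.Module, skip_threshold: float = 0.5):
        super().__init__()
        self.block = block
        self.skip_threshold = skip_threshold  # assumed heuristic cutoff

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training:
            # Cheap proxy for "is there enough signal here to refine?":
            # the mean L2 norm of token activations in the residual stream.
            if x.norm(dim=-1).mean() < self.skip_threshold:
                return x  # identity shortcut: skip this layer entirely
        return self.block(x)
```

Deciding from the block's *input* keeps the check cheap: the layer is skipped without ever being evaluated.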
| Config | Hidden | Layers | Heads | Params |
|---|---|---|---|---|
| Tiny | 128 | 4 | 2 | ~2M |
| Small | 256 | 6 | 4 | ~8M |
| Medium | 512 | 8 | 8 | ~26M |
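These presets map to the `--tiny`/`--small`/`--medium` flags used below. A hypothetical sketch of how they might be expressed in code (`FractalGPTConfig` and its field names are illustrative, not the repo's actual class):

```python
from dataclasses import dataclass

@dataclass
class FractalGPTConfig:
    hidden: int   # model width
    layers: int   # transformer depth
    heads: int    # attention heads

PRESETS = {
    "tiny":   FractalGPTConfig(hidden=128, layers=4, heads=2),   # ~2M params
    "small":  FractalGPTConfig(hidden=256, layers=6, heads=4),   # ~8M params
    "medium": FractalGPTConfig(hidden=512, layers=8, heads=8),   # ~26M params
}
```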
```bash
# Install dependencies
pip install torch numpy datasets tokenizers tqdm matplotlib jax
```

```bash
# Train on TinyStories (recommended for convergence)
python train_fractal_voronoi_combined.py --small --tinystories

# Train on mixed TinyStories + WikiText
python train_fractal_voronoi_combined.py --medium

# Custom settings
python train_fractal_voronoi_combined.py --medium --tinystories --steps 5000 --epochs 20 --lr 3e-4
```

- `--tiny`, `--small`, `--medium`: Model size presets
- `--tinystories`: Train on TinyStories only (simpler distribution, lower achievable loss)
- `--steps N`: Steps per epoch
- `--epochs N`: Number of epochs
- `--lr RATE`: Learning rate
- `--resume PATH`: Resume from checkpoint
- `fractal_gpt_production.py` -- Model architecture (attention, layers, FractalGPT)
- `llm_fractal_training.py` -- Configuration and fractal layer distribution
- `train_fractal_voronoi_combined.py` -- Training script with data loading and evaluation
- `ellipsoid_core.py` -- Ellipsoid geometry primitives
- `fractal_tree.py` -- Fractal tree decomposition
- `adaptive_moments.py` -- Hierarchical moment tracking
Small model (8M params, 8 epochs on TinyStories + WikiText):
- Validation loss: 2.82
- Validation perplexity: 16.8
- Generates coherent short stories from prompts
The fractal attention hierarchy is motivated by:
- Multi-resolution analysis: Different layers attend at different scales, analogous to wavelet decomposition
- Fractal self-similarity: The hierarchical layer structure mirrors fractal geometry where each level has a structurally similar but differently-scaled attention pattern
- Ellipsoid decomposition: Parameter space geometry is modeled via ellipsoid primitives for hierarchical optimization
MIT