Standalone implementation of the TitanMAC architecture, combining Titans-style neural long-term memory with Nested Learning optimization.
| Paper | Link | Key Contribution |
|---|---|---|
| Titans: Learning to Memorize at Test Time | arXiv:2501.00663 | Neural long-term memory module, MAC/MAG/MAL variants, surprise-based updates |
| Titans Revisited | arXiv:2510.09551 | Lightweight reimplementation, empirical validation across domains |
| Titans & MIRAS Blog | Google Research | Unified framework viewing attention as associative memory |
| It's All Connected (MIRAS) | arXiv:2504.13173 | MIRAS framework, attentional bias, retention regularization, MONETA/YAAD/MEMORA variants |
| Nested Learning | Google Research | DMGD, CMS, multi-frequency updates, continual learning paradigm |
- Windowed Attention: O(T·w) memory instead of O(T²)
- Persistent Tokens: Global context via bidirectional attention
- Neural Long-Term Memory: Gradient-based memory with surprise updates (see the sketch after this list)
- MAC/MAG/MAL Variants: Three memory integration strategies
- Deep Momentum Gradient Descent (DMGD): Learned momentum
- Continuum Memory System (CMS): Multi-frequency updates
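
The surprise-based update behind the neural long-term memory is worth seeing concretely: the memory is a small MLP trained at test time to map keys to values, with gradient momentum playing the role of "surprise" and weight decay playing the role of forgetting. Below is a minimal, self-contained sketch of that rule; the class name, layer sizes, and hyperparameters are illustrative and do not mirror the repo's actual `NeuralMemory` API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralMemorySketch(nn.Module):
    """Test-time-trainable MLP memory with momentum ("surprise") and decay."""

    def __init__(self, d_model: int, lr: float = 0.1,
                 momentum: float = 0.9, forget: float = 0.01):
        super().__init__()
        self.mem = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.SiLU(),
            nn.Linear(d_model, d_model),
        )
        self.lr, self.momentum, self.forget = lr, momentum, forget
        # One "surprise" (momentum) buffer per memory parameter.
        self.surprise = [torch.zeros_like(p) for p in self.mem.parameters()]

    def update(self, keys: torch.Tensor, values: torch.Tensor) -> None:
        # Associative loss: the memory should reproduce v_t given k_t.
        loss = F.mse_loss(self.mem(keys), values)
        grads = torch.autograd.grad(loss, list(self.mem.parameters()))
        with torch.no_grad():
            for p, s, g in zip(self.mem.parameters(), self.surprise, grads):
                s.mul_(self.momentum).add_(g, alpha=-self.lr)  # past + momentary surprise
                p.mul_(1.0 - self.forget).add_(s)              # forget, then write

    def retrieve(self, queries: torch.Tensor) -> torch.Tensor:
        return self.mem(queries)
```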
```bash
pip install -e .
```

```python
import torch
from titans_core import TitanMACConfig, TitanMAC

# Create config
config = TitanMACConfig(
    vocab_size=50000,
    d_model=512,
    n_heads=8,
    n_layers=12,
    window_size=256,
    n_persistent=16,
)

# Create model
model = TitanMAC(config)

# Forward pass
input_ids = torch.randint(0, 50000, (2, 512))
outputs = model(input_ids=input_ids, labels=input_ids)
loss = outputs["loss"]
```
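
As a quick sanity check on the returned loss: for a cross-entropy language-modeling loss, perplexity is just its exponential.

```python
# Perplexity from the quick-start loss above (exp of cross-entropy).
ppl = torch.exp(outputs["loss"]).item()
print(f"loss={outputs['loss'].item():.3f}  perplexity={ppl:.1f}")
```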
Train on generated math problems:

```bash
python examples/train_math.py --steps 1000 --batch-size 4 --device cuda
```

With neural memory enabled:

```bash
python examples/train_math.py --steps 1000 --use-neural-memory --memory-capacity 256
```

```
TitanMAC
├── Token Embeddings + Positional Embeddings
├── Persistent Tokens (learnable, bidirectional attention)
├── N × TitanBlock
│   ├── RMSNorm → WindowedAttention → Residual
│   └── RMSNorm → MLP → Residual
├── Optional: NeuralMemory (gradient-based)
├── Final RMSNorm
└── LM Head (optionally tied to embeddings)
```
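
Each `TitanBlock` is the usual pre-norm residual pair. A rough sketch of that layout, with a minimal `RMSNorm` standing in for `blocks/norms.py` (the repo's actual `TitanBlock` signature may differ; the attention and MLP submodules are injected here for brevity):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Minimal RMSNorm: rescale by the root-mean-square, then a learned gain.
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class TitanBlockSketch(nn.Module):
    def __init__(self, d_model: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.norm1, self.attn = RMSNorm(d_model), attn
        self.norm2, self.mlp = RMSNorm(d_model), mlp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))  # RMSNorm → WindowedAttention → Residual
        x = x + self.mlp(self.norm2(x))   # RMSNorm → MLP → Residual
        return x
```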
Key parameters in TitanMACConfig:
| Parameter | Default | Description |
|---|---|---|
| `d_model` | 640 | Model dimension |
| `n_heads` | 10 | Attention heads |
| `n_layers` | 16 | Transformer layers |
| `window_size` | 512 | Local attention window |
| `n_persistent` | 16 | Global persistent tokens |
| `use_neural_memory` | `False` | Enable gradient-based memory |
| `titans_variant` | `"MAC"` | Memory integration (MAC/MAG/MAL) |
```
titans_core/
├── config.py                    # TitanMACConfig dataclass
├── blocks/
│   ├── norms.py                 # RMSNorm
│   ├── mlp.py                   # MLPBlock
│   └── titan_block.py           # TitanBlock
├── attn/
│   └── windowed_attention.py
├── memory/
│   ├── memory_bank.py           # Key-value retrieval
│   └── neural_memory.py         # Gradient-based memory
├── models/
│   ├── titanmac.py              # Main model
│   └── titanmac_wrapper.py      # HuggingFace wrapper
└── opt/
    ├── continuum_optimizer.py   # Nested optimizer
    ├── nested_controller.py     # LR modulation MLP
    ├── param_groups.py          # Parameter grouping
    ├── dmgd.py                  # Deep Momentum GD
    └── cms.py                   # Continuum Memory System
```
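
The `opt/` modules implement the Nested Learning pieces. The core idea of DMGD is that the momentum rule itself is learned: instead of linearly accumulating gradients, a small MLP maps each (momentum, gradient) pair to the next momentum. Below is a hedged, element-wise sketch of that idea; all names are illustrative, and `opt/dmgd.py` may structure this differently.

```python
import torch
import torch.nn as nn

class DMGDSketch:
    """Learned momentum: new_m = MLP(m, g) element-wise, then p -= lr * new_m."""

    def __init__(self, params, lr: float = 1e-3, hidden: int = 8):
        self.params = list(params)
        self.lr = lr
        # Learned momentum rule, shared across all parameters.
        self.rule = nn.Sequential(nn.Linear(2, hidden), nn.SiLU(), nn.Linear(hidden, 1))
        self.momenta = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self) -> None:
        for p, m in zip(self.params, self.momenta):
            if p.grad is None:
                continue
            pair = torch.stack([m.flatten(), p.grad.flatten()], dim=-1)  # (N, 2)
            m.copy_(self.rule(pair).squeeze(-1).view_as(m))  # learned momentum
            p.add_(m, alpha=-self.lr)
```

In the Nested Learning setup, the rule's own weights are trained in an outer loop, and the CMS extends the scheme by updating different parameter groups at different frequencies.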
- PyTorch >= 2.1
- Transformers >= 4.35 (needed only for the HuggingFace wrapper)

Licensed under the GPL.