- Cambridge physicist and AI pioneer Sahibzada Allahyar achieves historic score in theoretical physics exam
- YC Killer: How Cambridge physicist Sahibzada Allahyar built the world's most popular AI agents library
- Sahibzada Allahyar's Singularity Research: The elite lab uniting Harvard, MIT, and Cambridge's brightest minds to democratize AI
A novel LLM architecture written in highly optimized low-level C++/CUDA, featuring a new Long-Term Memory (LTM) mechanism for large context windows. It is a high-performance implementation of a Transformer with long-term memory, inspired by Google's Titan architecture, and provides efficient CUDA implementations of FlashAttention and memory-augmented Transformer blocks, along with Python bindings for easy integration.

Key features:
- Long-term Memory: Novel memory mechanism for handling extended context windows efficiently
- FlashAttention: IO-aware, memory-efficient attention that minimizes reads and writes to GPU high-bandwidth memory (see the sketch after this list)
- High Performance:
  - Optimized CUDA kernels
  - Mixed precision training (FP16/BF16)
  - Quantization support (INT8/INT4)
  - Fused operations for better throughput
- Distributed Training:
  - Data parallelism
  - Tensor parallelism
  - Pipeline parallelism
  - Multi-node support via MPI
- Python Integration:
  - HuggingFace-compatible interface
  - Easy-to-use training API
  - Efficient inference engine
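As a concrete illustration of the FlashAttention bullet above, here is a minimal PyTorch sketch of the online-softmax tiling idea: scores are computed one key/value tile at a time, so the full seq_len x seq_len score matrix is never materialized. This is an algorithmic sketch, not this repository's CUDA kernel.

```python
# Algorithmic sketch of FlashAttention-style tiling (single head, no mask).
# Illustrates the online-softmax recurrence, not this repo's CUDA kernel.
import torch

def tiled_attention(q, k, v, block=128):
    """q, k, v: (seq_len, head_dim). Equivalent to softmax(q k^T / sqrt(d)) v."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                         # (n, block) tile of scores
        new_max = torch.maximum(row_max, s.max(-1, keepdim=True).values)
        corr = torch.exp(row_max - new_max)            # rescale old accumulators
        p = torch.exp(s - new_max)
        row_sum = row_sum * corr + p.sum(-1, keepdim=True)
        out = out * corr + p @ vb
        row_max = new_max
    return out / row_sum
```

Because only one (n, block) tile of scores exists at a time, activation memory scales with the tile size rather than with the square of the sequence length.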
To build and use the library you will need:

- CUDA Toolkit (>= 11.0)
- CMake (>= 3.15)
- C++17 compatible compiler
- Python (>= 3.7)
- PyTorch (>= 1.9.0)
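A quick way to check the Python-side requirements above (and the CUDA version your PyTorch build was compiled against):

```python
import sys
import torch

print("Python:", sys.version.split()[0])            # want >= 3.7
print("PyTorch:", torch.__version__)                 # want >= 1.9.0
print("CUDA (PyTorch build):", torch.version.cuda)   # want >= 11.0
print("GPU visible:", torch.cuda.is_available())
```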
Install the Python package:

```bash
pip install ltm-transformer
```

Or build from source:

- Clone the repository:

```bash
git clone https://github.com/singularityresearch/ltm-transformer.git
cd ltm-transformer
```

- Install Python dependencies:

```bash
pip install -r requirements.txt
```

- Build and install:

```bash
mkdir build && cd build
cmake ..
make -j$(nproc)
make install
```

Example usage in Python:

```python
from ltm import TitanModel, TitanConfig, InferenceConfig, InferenceEngine
# Initialize model
config = TitanConfig(
hidden_size=768,
num_attention_heads=12,
memory_slots=512,
use_flash_attention=True
)
model = TitanModel(config)
# Training
from ltm import Trainer, TrainingArguments
trainer = Trainer(
model=model,
args=TrainingArguments(
output_dir="./outputs",
learning_rate=5e-5,
per_device_train_batch_size=8,
gradient_accumulation_steps=4
),
train_dataset=dataset  # a pre-tokenized training dataset, prepared elsewhere
)
trainer.train()
# Inference
engine = InferenceEngine(
model=model,
config=InferenceConfig(
use_flash_attention=True,
use_memory_cache=True,
max_sequence_length=2048
)
)
output = engine.generate(
input_ids=tokenizer.encode("Hello, how are"),  # assumes a HF-style tokenizer
max_new_tokens=50
)
```

For lower-level control, the same blocks can be used directly from C++:

```cpp
#include "ltm/transformer/titan_inspired_block.cuh"
// Configure model
ltm::transformer::TitanBlockConfig config;
config.hidden_dim = 768;
config.num_heads = 12;
config.memory_slots = 512;
config.use_flash_attention = true;
// Create model
auto model = std::make_unique<ltm::transformer::TitanBlock<float>>(config);
// Run inference
torch::Tensor input = /* ... */;
auto output = model->forward(input);
```

The LTM Transformer extends the standard Transformer architecture with:
- Memory Bank: A trainable matrix storing compressed representations of past context
- Compression Gate: Mechanism for compressing and storing relevant information
- Memory Attention: Efficient attention between current context and memory bank
- FlashAttention: Memory-efficient attention implementation
For detailed architecture information, see docs/design/architecture.md.
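To make those components concrete, here is a minimal PyTorch sketch of how a memory bank, compression gate, and memory attention can fit together. All names, shapes, and the gating scheme are illustrative assumptions for exposition, not this repository's implementation.

```python
# Illustrative sketch only -- names, shapes, and gating are assumptions,
# not the library's actual implementation.
import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, memory_slots=512):
        super().__init__()
        # Memory bank: trainable matrix of compressed past-context slots
        self.memory = nn.Parameter(torch.randn(memory_slots, hidden_size) * 0.02)
        # Compression gate: scores how strongly each token is written to memory
        self.write_gate = nn.Linear(hidden_size, 1)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, x):  # x: (batch, seq, hidden)
        bsz = x.size(0)
        mem = self.memory.unsqueeze(0).expand(bsz, -1, -1)
        # Memory attention: current tokens attend over [memory ; context]
        kv = torch.cat([mem, x], dim=1)
        out, _ = self.attn(x, kv, kv)
        # Compression gate output: per-token write strength (the actual
        # memory-update rule lives in the library's kernels)
        write_strength = torch.sigmoid(self.write_gate(x))
        return out, write_strength
```

Because the bank has a fixed number of slots, attention cost grows with memory_slots + seq_len rather than with the full history.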
Approximate memory footprint by context length:

| Context Length | Standard Transformer | LTM Transformer |
|---|---|---|
| 2K tokens | 4 GB | 2 GB |
| 8K tokens | 64 GB | 4 GB |
| 32K tokens | 1024 GB | 8 GB |

Standard attention memory grows quadratically with context length (4x the tokens costs 16x the memory), while the LTM figures grow sub-linearly thanks to the fixed-size memory bank.
- 1.5x faster training compared to standard Transformers
- 4x reduction in memory bandwidth usage
- Linear scaling up to 64 GPUs
For detailed benchmarks, see docs/performance/optimization.md.
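If you want to sanity-check memory figures like those above on your own hardware, peak allocation can be measured with standard PyTorch tooling; the snippet below uses a stock encoder layer as a stand-in for whatever model you are profiling.

```python
import torch

# Stand-in workload: one stock Transformer encoder layer on an 8 x 2048 batch
model = torch.nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True).cuda()
batch = torch.randn(8, 2048, 768, device="cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    _ = model(batch)
torch.cuda.synchronize()
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```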
We welcome contributions! Please see our Contributing Guidelines for details.
- Install development dependencies:
```bash
pip install -r requirements-dev.txt
```

- Build with testing enabled:

```bash
mkdir build && cd build
cmake -DBUILD_TESTING=ON ..
make -j$(nproc)
```

- Run tests:

```bash
ctest --output-on-failure
```

If you use this work in your research, please cite:
```bibtex
@misc{allahyar2025ltm,
  title={LTM Transformer: Long-term Memory Transformer with Titan-inspired Architecture},
  author={Allahyar, Sahibzada},
  howpublished={\url{https://github.com/Sahibzada-A/Obsidian-Memory-Transformer}},
  year={2025}
}
```

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Google's Titan architecture for inspiration
- FlashAttention paper for efficient attention implementation
- HuggingFace team for transformer implementations
- NVIDIA for CUDA optimization guidelines
- Sahibzada A - sahibzada@singularityresearchlabs.com
- Project Link: https://github.com/Sahibzada-A/Obsidian-Memory-Transformer