This repo contains the code implementation of our paper "MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning".
- MKA (3-path hierarchical memory attention)
- FastMKA (route-fused variant for speed)
- CUDA extension scaffolding for fused routing + online softmax
- Reproducible training/evaluation scripts following the paper setup
- `mka/layers/`: PyTorch modules (`MKAFullAttention`, `FastMKAAttention`)
- `mka/hf/`: HuggingFace monkey patch support (Qwen/Llama style `self_attn`)
- `mka/cuda/`: CUDA/C++ extension skeleton for fused kernels
- `scripts/`: train/eval/benchmark entry points
- `configs/`: experiment configs
```bash
cd code
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python scripts/train_wikitext2.py --config configs/qwen7b_fastmka.yaml
```
FastMKA forward path in PyTorch:
- L1 local memory (`X`)
- L2 causal session summary (prefix EMA)
- Optional L3 retrieved memory
- Learned routing gate (`softmax(MLP(Q))`)
- Route fusion before a single KV projection
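The route-fusion idea above can be sketched in a few lines of PyTorch. This is an illustrative module, not the repo's actual API: the class name `RouteFusion`, the gate MLP layout, and all tensor shapes are assumptions; the only fixed ingredients are the `softmax(MLP(Q))` gate and the fusion of the memory paths before one shared KV projection.

```python
# Hypothetical sketch of FastMKA-style route fusion: a learned gate over
# three memory paths, applied BEFORE a single shared KV projection.
# Names and shapes are illustrative assumptions, not the repo's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RouteFusion(nn.Module):
    def __init__(self, d_model: int, n_paths: int = 3):
        super().__init__()
        # Routing gate: softmax(MLP(Q)) over the memory paths
        self.gate = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, n_paths),
        )
        self.kv = nn.Linear(d_model, 2 * d_model)  # single KV projection

    def forward(self, q, mems):
        # q: (B, T, D); mems: n_paths tensors, each (B, T, D)
        w = F.softmax(self.gate(q), dim=-1)         # (B, T, n_paths)
        stacked = torch.stack(mems, dim=-1)         # (B, T, D, n_paths)
        fused = (stacked * w.unsqueeze(2)).sum(-1)  # fuse paths per token
        k, v = self.kv(fused).chunk(2, dim=-1)      # one KV projection
        return k, v

q = torch.randn(2, 8, 64)
mems = [torch.randn(2, 8, 64) for _ in range(3)]
k, v = RouteFusion(64)(q, mems)
print(k.shape)  # torch.Size([2, 8, 64])
```

Because the gate mixes the paths before projection, only one KV projection (and one attention call) is needed per layer, which is where the speed advantage over the full per-level path comes from.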
MKA full path in PyTorch:
- Per-level attention over L1/L2/L3
- Soft mixture over outputs
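In contrast to route fusion, the full path runs attention separately per memory level and then soft-mixes the outputs. A minimal sketch, assuming per-level `(k, v)` pairs and learned mixture logits (function name and shapes are illustrative):

```python
# Illustrative sketch of the full MKA path: one attention call per memory
# level (L1/L2/L3), then a soft mixture over the per-level outputs.
import torch
import torch.nn.functional as F

def mka_full(q, levels, gate_logits):
    # q: (B, H, T, Dh); levels: list of (k, v) pairs, each (B, H, S, Dh)
    # gate_logits: (B, H, T, n_levels) learned mixture logits
    outs = [F.scaled_dot_product_attention(q, k, v) for k, v in levels]
    w = F.softmax(gate_logits, dim=-1)                  # mixture weights
    # (B, H, T, Dh, n_levels) weighted-summed over the level axis
    return (torch.stack(outs, dim=-1) * w.unsqueeze(-2)).sum(-1)

q = torch.randn(1, 2, 5, 8)
levels = [(torch.randn(1, 2, 7, 8), torch.randn(1, 2, 7, 8)) for _ in range(3)]
out = mka_full(q, levels, torch.randn(1, 2, 5, 3))
print(out.shape)  # torch.Size([1, 2, 5, 8])
```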
CUDA design:
- Tiled QK score calculation
- Online max/denominator (`m`, `z`) update
- Fused route application before attention
- Causal masking support
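The online `m`/`z` update is the standard online-softmax recurrence: scores arrive tile by tile, and the running max `m` and denominator `z` are rescaled so the full score row never has to be materialized. A NumPy sketch of the recurrence (the kernel itself tiles over shared memory, but the math is the same):

```python
# Online softmax statistics over tiled scores: maintain running max m and
# denominator z, rescaling z whenever a tile raises the max.
import numpy as np

def online_softmax_stats(scores, tile=4):
    m, z = -np.inf, 0.0
    for start in range(0, len(scores), tile):
        s = scores[start:start + tile]
        m_new = max(m, s.max())
        # Rescale the old denominator to the new max, then add this tile
        z = z * np.exp(m - m_new) + np.exp(s - m_new).sum()
        m = m_new
    return m, z

scores = np.random.randn(16)
m, z = online_softmax_stats(scores)
# Invariant: z * exp(m) equals the exact softmax denominator sum(exp(s))
assert np.isclose(z * np.exp(m), np.exp(scores).sum())
```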
HuggingFace direct patch path:
- `mka/hf/attention.py`: `HFFastMKAAttention` wrapper
- `mka/hf/patch.py`: monkey patch over decoder `self_attn`
- `scripts/train_hf_patch.py`: train loop with patched attention
- `configs/hf_qwen_fastmka.yaml`, `configs/hf_llama_fastmka.yaml`
- `scripts/launch_dp_torchrun.sh`: DP launch (torchrun)
- `scripts/launch_tp_dp_accelerate.sh`: TP+DP launch path (accelerate + HF TP)
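The monkey-patch approach can be sketched as follows. The `PatchedAttention`/`patch_model` names are hypothetical (the repo's wrapper is `HFFastMKAAttention` in `mka/hf/attention.py`); the `model.model.layers[i].self_attn` layout is the common Qwen/Llama decoder convention and is an assumption about the target model:

```python
# Hypothetical sketch of the self_attn monkey patch: wrap each decoder
# layer's attention module, keeping the original weights accessible.
import torch.nn as nn

class PatchedAttention(nn.Module):
    def __init__(self, orig_attn: nn.Module):
        super().__init__()
        self.orig = orig_attn  # keep original projections/weights

    def forward(self, hidden_states, **kwargs):
        # The FastMKA routing/fusion would replace this call; for the
        # sketch we simply delegate to the original attention.
        return self.orig(hidden_states, **kwargs)

def patch_model(model):
    # Assumes a Qwen/Llama-style layout: model.model.layers[i].self_attn
    for layer in model.model.layers:
        layer.self_attn = PatchedAttention(layer.self_attn)
    return model
```

Wrapping (rather than replacing) `self_attn` lets the patched module reuse the pretrained Q/K/V/O projections, so no weight surgery is needed.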
```bash
cd mka/cuda
python build.py build_ext --inplace
cd ../..
```

If the build fails with `CUDA_HOME environment variable is not set`, export your CUDA path first, e.g. `export CUDA_HOME=/usr/local/cuda`.
```bash
python scripts/train_hf_patch.py --config configs/hf_qwen_fastmka.yaml
bash scripts/launch_dp_torchrun.sh configs/hf_qwen_fastmka.yaml 4
```

For TP+DP:
- Set `tp_size` in the config (>1).
- Launch: `bash scripts/launch_tp_dp_accelerate.sh configs/hf_qwen_fastmka.yaml 4`

Notes:
- TP relies on HuggingFace-native `tp_plan="auto"` support for the model/version.
- For strict reproducibility, pin the `transformers` and CUDA versions used in your paper runs.
- The FastMKA CUDA kernel is used automatically when:
  - the `fastmka_cuda` extension is available,
  - the tensors are on a CUDA device,
  - `head_dim <= 256`,
  - no extra additive attention mask is required.
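The dispatch conditions can be written down as a single predicate. This helper is illustrative only (name and signature are assumptions, not the repo's API); it just mirrors the four conditions listed above:

```python
# Illustrative mirror of the FastMKA CUDA dispatch conditions; not the
# repo's actual dispatch function.
import torch

def use_fastmka_cuda(q: torch.Tensor, head_dim: int,
                     attn_mask, have_ext: bool) -> bool:
    return (have_ext            # fastmka_cuda extension importable
            and q.is_cuda       # tensors live on a CUDA device
            and head_dim <= 256
            and attn_mask is None)  # no extra additive mask

q_cpu = torch.randn(2, 4, 8)
print(use_fastmka_cuda(q_cpu, 8, None, True))  # False (CPU tensor)
```

When any condition fails, the PyTorch forward path is used instead, so results stay correct either way.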
The paper reports LongBench and RULER results. Use:
- `scripts/run_longbench.sh` for the LongBench workflow
- `scripts/run_ruler.sh` for the RULER workflow
@inproceedings{mka2026,
title = {MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning},
author = {Dong Liu and Yanxuan Yu and Ben Lengerich and Ying Nian Wu},
booktitle = {Proceedings of the ACM International Conference on Computing Frontiers (CF '26)},
year = {2026}
}