This repository was archived by the owner on Jan 15, 2026. It is now read-only.
TransMLA: Multi-Head Latent Attention Converter

This project implements the TransMLA approach described in the paper "TransMLA: Multi-Head Latent Attention Is All You Need" by Fanxu Meng, Zengwei Yao, and Muhan Zhang.

The implementation provides tools to convert Grouped-Query Attention (GQA) based models to Multi-Head Latent Attention (MLA) based models, enhancing expressiveness while keeping the KV cache the same size.

Overview

Modern large language models (LLMs) often face communication bottlenecks rather than purely computational limitations. The TransMLA project addresses this issue by converting existing GQA-based models to use MLA, which offers greater expressive power with the same memory requirements.

Note: This implementation currently supports LLaMA architecture models. For other model architectures, please refer to the original TransMLA repository.

Key Features

  • Model Conversion: Convert GQA-based models to MLA models (currently LLaMA; for Qwen, Mistral, and others, see the original TransMLA repository)
  • Performance Testing: Tools to benchmark and compare original vs. MLA models
  • KV Cache Reduction: Maintain the same KV cache size while improving model expressiveness
  • SVD Initialization: Specialized initialization using Singular Value Decomposition (SVD)

Installation

# Clone the repository
git clone https://github.com/bet0x/transmla-converter.git
cd transmla-converter

# Install dependencies
pip install -r requirements.txt

Usage

Converting a Model

python transmla_converter.py --model "meta-llama/Llama-3-8B" --output "llama-mla-model" --test

Testing Model Performance

python transmla_tester.py --model "llama-mla-model" --original "meta-llama/Llama-3-8B" --tokens 100

Advanced Testing Options

# Test with GPU warm-up (recommended for accurate benchmarking)
python transmla_tester.py --model "llama-mla-model" --original "meta-llama/Llama-3-8B" --tokens 100

# Skip GPU warm-up (not recommended for benchmarking)
python transmla_tester.py --model "llama-mla-model" --original "meta-llama/Llama-3-8B" --tokens 100 --no-warmup

# Test with longer context to better observe KV cache benefits
python transmla_tester.py --model "llama-mla-model" --original "meta-llama/Llama-3-8B" --tokens 100 --long-context

Fine-tuning a Converted Model

# Example of fine-tuning a converted model
python transmla_finetune.py --model "llama-mla-model" --dataset "your_dataset.jsonl" --output "fine-tuned-mla"

Alternatively, fine-tune with Unsloth: https://docs.unsloth.ai/basics/reasoning-grpo-and-rl/tutorial-train-your-own-reasoning-model-with-grpo

Recent Updates

March 2025 Update

  • MLA Model Detection: Added proper detection of MLA architecture in the tester script. The tester now correctly identifies and reports whether a model is using MLA or standard GQA architecture.
  • GPU Warm-up: Implemented GPU warm-up before performance testing to ensure accurate benchmarking. This addresses the issue where the first model run is typically slower due to GPU initialization and compilation.
  • Testing Options: Added new command-line options for more flexible testing:
    • --no-warmup: Skip GPU warm-up (not recommended for accurate benchmarking)
    • --long-context: Test with longer context to better observe KV cache benefits

How It Works

TransMLA works through several key steps:

  1. Matrix Decomposition: The key and value projection matrices from GQA are decomposed using SVD.
  2. Low-Rank Factorization: The matrices are factorized into smaller components (Wa and Wb).
  3. Enhanced Expressiveness: The decomposition allows for more expressive representation in the same memory footprint.
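The first two steps can be sketched with NumPy. This is an illustrative toy, not the project's actual conversion code: a GQA key projection matrix is decomposed with SVD and truncated to the latent rank, yielding the two low-rank factors (named `W_a` and `W_b` here, after the Wa/Wb factors described above).

```python
import numpy as np

# Toy sketch: low-rank factorization of a key projection matrix via
# truncated SVD. Dimensions are illustrative, not model-accurate.
rng = np.random.default_rng(0)
hidden_dim, kv_dim, latent_dim = 64, 32, 16

W_k = rng.standard_normal((hidden_dim, kv_dim))  # stand-in GQA key projection

# SVD: W_k = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(W_k, full_matrices=False)

# Keep the top `latent_dim` singular values, splitting sqrt(S) between factors
W_a = U[:, :latent_dim] * np.sqrt(S[:latent_dim])          # hidden -> latent
W_b = np.sqrt(S[:latent_dim])[:, None] * Vt[:latent_dim]   # latent -> kv

approx = W_a @ W_b
err = np.linalg.norm(W_k - approx) / np.linalg.norm(W_k)
print(f"rank-{latent_dim} relative reconstruction error: {err:.3f}")
```

The product `W_a @ W_b` approximates the original projection at the latent rank; after conversion, fine-tuning lets the factors move beyond this SVD initialization.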

Technical Details

The MLA approach introduces compression and decompression matrices:

  • k_compress and v_compress: Map from hidden dimension to latent dimension
  • k_decompress and v_decompress: Map from latent dimension to full attention dimension
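A minimal sketch of how these matrices interact with the KV cache, using assumed names and toy dimensions (not the project's actual API): only the low-dimensional latent is cached per token, and the full-dimension key is reconstructed at attention time.

```python
import numpy as np

# Toy sketch of MLA-style caching: cache the compressed latent,
# decompress to the full attention dimension on the fly.
rng = np.random.default_rng(1)
hidden_dim, latent_dim, attn_dim = 64, 16, 32

k_compress = rng.standard_normal((hidden_dim, latent_dim))    # hidden -> latent
k_decompress = rng.standard_normal((latent_dim, attn_dim))    # latent -> attention

h = rng.standard_normal((1, hidden_dim))  # hidden state for one token
latent = h @ k_compress                   # (1, latent_dim) -- this is what gets cached
k_full = latent @ k_decompress            # (1, attn_dim) -- rebuilt per attention call

print("cached per token:", latent.shape[1], "vs full key dim:", k_full.shape[1])
```

The cache stores `latent_dim` values per token instead of `attn_dim`, which is where the memory saving comes from; the same pattern applies to `v_compress`/`v_decompress`.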

This design allows:

  • KV cache size unchanged relative to GQA
  • Better expressiveness (theoretically proven superior to GQA)
  • Improved performance on downstream tasks

Results

Our experiments show that TransMLA models consistently outperform their GQA counterparts:

  • Faster convergence during fine-tuning
  • Higher accuracy on benchmark tasks
  • Better performance on coding and mathematical reasoning tasks

Limitations

  • Slight increase in computation during inference
  • Minor increase in parameter count (typically <2%)
  • Requires fine-tuning to fully realize performance benefits

Citation

Please cite both the original paper and this implementation:

@article{meng2025transmla,
  title={TransMLA: Multi-Head Latent Attention Is All You Need},
  author={Meng, Fanxu and Yao, Zengwei and Zhang, Muhan},
  journal={arXiv preprint arXiv:2502.07864},
  year={2025}
}

@software{ferrer2025transmlaimplementation,
  author = {Ferrer, Alberto},
  title = {TransMLA Converter: Implementation of Multi-Head Latent Attention for LLMs},
  url = {https://github.com/bet0x/transmla-converter},
  year = {2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
