This repo contains code used for experiments in the paper "Cross-layer Attention Sharing for Pre-trained Large Language Models".
- [2025/10/16] We’re delighted to share that our work has been accepted by TACL! We released our training and evaluation codebase.
- [2024/08/04] We released the paper.
To enhance the efficiency of the attention mechanism in large language models (LLMs), previous work has primarily compressed the KV cache or grouped attention heads, largely overlooking redundancy between layers. Our comprehensive analyses across various LLMs show that highly similar attention patterns persist in most layers. It is intuitive to reduce this redundancy by sharing attention weights across layers. However, further analysis reveals two challenges: (1) directly sharing the weight matrix without carefully rearranging the attention heads is ineffective; (2) shallow layers are vulnerable to small deviations in attention weights.
Driven by these insights, we introduce LISA, a lightweight substitute for self-attention in well-trained LLMs. LISA employs tiny feed-forward networks to align attention heads between adjacent layers and low-rank matrices to approximate differences in layer-wise attention weights. Evaluations on 13 typical benchmarks demonstrate that LISA maintains high response quality in terms of accuracy and perplexity while reducing redundant attention calculations in 53%-84% of the total layers. Our implementations of LISA achieve a 6X compression of the Q and K matrices within the attention mechanism, with maximum throughput improvements of 19.5%, 32.3%, and 40.1% for LLaMA3-8B, LLaMA2-7B, and LLaMA2-13B, respectively.
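The low-rank idea above can be sketched in a few lines of NumPy. This is an illustration only, not the paper's implementation: the dimensions are arbitrary, and `delta_w` stands in for the difference between adjacent layers' attention weight matrices.

```python
import numpy as np

# Hypothetical dimensions: model width d, low rank r (the paper's
# hidden_size_low_rank hyperparameter plays the role of r).
d, r = 1024, 64

rng = np.random.default_rng(0)
# Stand-in for the difference between adjacent layers' attention weights.
delta_w = rng.standard_normal((d, d))

# Best rank-r approximation via truncated SVD.
u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
a = u[:, :r] * s[:r]   # (d, r)
b = vt[:r, :]          # (r, d)
approx = a @ b         # rank-r approximation of delta_w

# The two factors store 2*d*r parameters instead of d*d.
compression = (d * d) / (2 * d * r)
print(compression)  # 8.0 for these dimensions
```

The achievable compression ratio depends on the chosen rank relative to the model width; smaller ranks compress more at the cost of approximation error.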
Our code is based on LLaMA-Factory (0.9.0) and lm-evaluation-harness. Please follow the instructions for LLaMA-Factory and install it first.
```shell
pip install -e LLaMA-Factory-0.9.0
pip install -e lm-evaluation-harness
```

**Note:** See requirements for our Python environment.
You can use the train-LA scripts to train LISA models. Our training hyperparameters are as follows:
- `attn_type`: The architecture of the attention-head alignment module, e.g., `two_layers`, `one_layer`, or `direct_share`.
- `hidden_size_low_rank`: The rank of the Q and K matrices, e.g., `800`.
- `special_training_attn_layers`: The indices of the layers where LISA is applied, e.g., `7,8,9,10,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39`.
- `crossentropy_loss_weight`: The weight of the cross-entropy (CE) loss, e.g., `0.75`.
- `mse_loss_weight`: The weight of the mean squared error (MSE) loss, e.g., `0.25`.
- `if_cal_eval_loss`: Set to `true` to print the loss during evaluation.
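As a quick sanity check on the `special_training_attn_layers` example above, the snippet below parses the layer-index string and reports what fraction of a model's layers LISA would replace (a hypothetical helper; the layer count of 40 matches LLaMA2-13B):

```python
# Example value of special_training_attn_layers from the list above.
layers = "7,8,9,10,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39"

# Parse the comma-separated indices into integers.
lisa_layers = [int(i) for i in layers.split(",")]

num_layers = 40  # LLaMA2-13B has 40 transformer layers
print(len(lisa_layers))               # 26
print(len(lisa_layers) / num_layers)  # 0.65
```

A coverage of 65% falls inside the 53%-84% range of layers reported in the paper.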
We switch between two versions of modeling_llama.py to change the shapes of the feed-forward networks (FFNs) in the attention heads alignment module:
- `transformers-4.45.0/src/transformers/models/llama/modeling_llama_256.py` implements LISA with a two-layer FFN whose shapes are [2 * num_heads, 256] and [256, num_heads].
- `transformers-4.45.0/src/transformers/models/llama/modeling_llama_320.py` implements LISA with a two-layer FFN whose shapes are [2 * num_heads, 320] and [320, num_heads].
- You can also use either of the above `modeling_llama.py` files to apply a one-layer FFN with the shape [2 * num_heads, num_heads].
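The FFN shapes above can be sketched as follows. This is a minimal NumPy illustration of the two-layer variant, not the code in `modeling_llama_256.py`; the head count matches LLaMA2-7B, and the choice of ReLU between the layers is an assumption for the sketch.

```python
import numpy as np

# Two-layer alignment FFN with the shapes described above:
# [2 * num_heads, 256] followed by [256, num_heads].
num_heads, hidden = 32, 256  # 32 attention heads, as in LLaMA2-7B

rng = np.random.default_rng(0)
w1 = rng.standard_normal((2 * num_heads, hidden))  # [2*num_heads, 256]
w2 = rng.standard_normal((hidden, num_heads))      # [256, num_heads]

# Input: per-head features from two adjacent layers, concatenated along
# the head dimension.
x = rng.standard_normal((2 * num_heads,))

# ReLU non-linearity between the two layers (an assumption here).
aligned = np.maximum(x @ w1, 0.0) @ w2

print(aligned.shape)  # (32,)
```

The one-layer variant would simply be a single [2 * num_heads, num_heads] matrix applied to the same concatenated input.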
To obtain high-quality pre-training data, we applied different sampling proportions to subsets of RedPajama-Data-1T, including 10% of ArXiv, 2% of C4, 100% of StackExchange, 100% of Wikipedia, and 10% of GitHub. The resulting dataset contains 20 billion tokens, and we sampled 4.2 billion tokens from the first 7 billion tokens of this dataset for the uptraining experiments.
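The per-subset proportions above can be expressed as a small mixing helper. The proportions come from the text, but the subset sizes below are placeholders rather than the real RedPajama-Data-1T counts, so the total here will not match the paper's 20 billion tokens.

```python
# Sampling proportions for each RedPajama-Data-1T subset, as stated above.
PROPORTIONS = {
    "arxiv": 0.10,
    "c4": 0.02,
    "stackexchange": 1.00,
    "wikipedia": 1.00,
    "github": 0.10,
}

def sampled_tokens(subset_tokens: dict[str, float]) -> float:
    """Total tokens kept after applying the per-subset proportions."""
    return sum(PROPORTIONS[name] * n for name, n in subset_tokens.items())

# Placeholder subset sizes in billions of tokens (not the real counts).
example_sizes = {"arxiv": 30, "c4": 175, "stackexchange": 20,
                 "wikipedia": 24, "github": 60}
print(sampled_tokens(example_sizes))
```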
You can download the corpus containing the first 7 billion tokens here.
We extend our sincere gratitude to LLaMA-Factory, lm-evaluation-harness, and LLaMA Team for their contributions. We also thank Siming Wu and Peinan Feng for their valuable advice, and extend our sincere gratitude to action editor Xavier Carreras and the anonymous TACL reviewers for their insightful feedback and constructive suggestions.
If you find our work helpful, please kindly cite us as:
```bibtex
@article{DBLP:journals/corr/abs-2408-01890,
  author       = {Yongyu Mu and
                  Yuzhang Wu and
                  Yuchun Fan and
                  Chenglong Wang and
                  Hengyu Li and
                  Qiaozhi He and
                  Murun Yang and
                  Tong Xiao and
                  Jingbo Zhu},
  title        = {Cross-layer Attention Sharing for Large Language Models},
  journal      = {CoRR},
  volume       = {abs/2408.01890},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2408.01890},
  doi          = {10.48550/ARXIV.2408.01890},
  eprinttype   = {arXiv},
  eprint       = {2408.01890},
  timestamp    = {Thu, 21 Aug 2025 16:25:58 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2408-01890.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
```