takagi97/lisa

This repo contains code used for experiments in the paper "Cross-layer Attention Sharing for Pre-trained Large Language Models".

News

  • [2025/10/16] We’re delighted to share that our work has been accepted by TACL! We released our training and evaluation codebase.
  • [2024/08/04] We released the paper.

Introduction

To enhance the efficiency of the attention mechanism within large language models (LLMs), previous works primarily compress the KV cache or group attention heads, while largely overlooking redundancy between layers. Our comprehensive analyses across various LLMs show that highly similar attention patterns persist within most layers. It's intuitive to reduce the redundancy by sharing attention weights across layers. However, further analysis reveals two challenges: (1) Directly sharing the weight matrix without carefully rearranging the attention heads proves to be ineffective; (2) Shallow layers are vulnerable to small deviations in attention weights.

Driven by these insights, we introduce LISA, a lightweight substitute for self-attention in well-trained LLMs. LISA employs tiny feed-forward networks to align attention heads between adjacent layers and low-rank matrices to approximate differences in layer-wise attention weights. Evaluations on 13 typical benchmarks demonstrate that LISA maintains high response quality in terms of accuracy and perplexity while reducing redundant attention calculations in 53%-84% of the total layers. Our implementations of LISA achieve a 6X compression of the Q and K matrices within the attention mechanism, with maximum throughput improvements of 19.5%, 32.3%, and 40.1% for LLaMA3-8B, LLaMA2-7B, and LLaMA2-13B, respectively.
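To make the parameter savings concrete, here is a minimal NumPy sketch of the low-rank idea with toy sizes (`d_model = 512`, rank 64). The real correction matrices are learned during uptraining, not derived by SVD, so this only illustrates the shape and parameter arithmetic:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 512, 64         # toy sizes; the paper's example rank is 800

# Suppose W_q of layer L differs only slightly from layer L-1's W_q.
wq_prev = rng.standard_normal((d_model, d_model))
wq_curr = wq_prev + 0.05 * rng.standard_normal((d_model, d_model))

# Stand in for the learned correction with a rank-`rank` truncated SVD of
# the difference: the shared wq_prev plus this small delta replaces wq_curr.
u, s, vt = np.linalg.svd(wq_curr - wq_prev)
low_rank_delta = (u[:, :rank] * s[:rank]) @ vt[:rank]
wq_approx = wq_prev + low_rank_delta

# Parameter cost of the correction vs. storing a full matrix per layer.
full_params = d_model * d_model
lr_params = 2 * d_model * rank
print(full_params / lr_params)  # 4.0x fewer parameters at these toy sizes
```

At the paper's real sizes, the same arithmetic is what yields the reported 6X compression of Q and K.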

Quick Start

Our code is based on LLaMA-Factory (0.9.0) and lm-evaluation-harness. Please follow the installation instructions for LLaMA-Factory first, then install both packages:

pip install -e LLaMA-Factory-0.9.0
pip install -e lm-evaluation-harness

[NOTE!] See requirements for our Python environment.

Pointers

Scripts

You can use the train-LA scripts to train LISA models. Our training hyperparameters are as follows:

  • attn_type: The architecture of the attention heads alignment module, e.g., two_layers, one_layer, or direct_share.
  • hidden_size_low_rank: The rank of the low-rank approximations of the Q and K matrices, e.g., 800.
  • special_training_attn_layers: The indices of the layers where LISA is applied, e.g., 7,8,9,10,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39.
  • crossentropy_loss_weight: The weight of the cross-entropy (CE) loss, e.g., 0.75.
  • mse_loss_weight: The weight of the mean squared error (MSE) loss, e.g., 0.25.
  • if_cal_eval_loss: Set to true to print the loss during evaluation.

We switch between two versions of modeling_llama.py to change the shapes of the feed-forward networks (FFNs) in the attention heads alignment module:

  • transformers-4.45.0/src/transformers/models/llama/modeling_llama_256.py implements LISA with a two-layer FFN whose shapes are [2 * num_heads, 256] and [256, num_heads].
  • transformers-4.45.0/src/transformers/models/llama/modeling_llama_320.py implements LISA with a two-layer FFN whose shapes are [2 * num_heads, 320] and [320, num_heads].
  • You can also use either of the above modeling_llama.py files to apply a one-layer FFN with the shape [2 * num_heads, num_heads].
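The three FFN shape variants can be sanity-checked with a quick NumPy sketch, using `num_heads = 32` (as in LLaMA2-7B). The weights here are zero-initialized placeholders, purely to show the dimensions:

```python
import numpy as np

num_heads = 32             # e.g., LLaMA2-7B; the alignment module's input is
heads_in = 2 * num_heads   # the concatenated heads of two adjacent layers

def two_layer_ffn(hidden_size):
    """Zero-initialized placeholders with the documented weight shapes."""
    w1 = np.zeros((heads_in, hidden_size))    # [2 * num_heads, hidden]
    w2 = np.zeros((hidden_size, num_heads))   # [hidden, num_heads]
    return w1, w2

w1_256, w2_256 = two_layer_ffn(256)   # modeling_llama_256.py variant
w1_320, w2_320 = two_layer_ffn(320)   # modeling_llama_320.py variant
w_one_layer = np.zeros((heads_in, num_heads))  # one-layer variant

assert w1_256.shape == (64, 256) and w2_256.shape == (256, 32)
assert w_one_layer.shape == (64, 32)
```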

Training Datasets

To obtain high-quality pre-training data, we applied different sampling proportions to subsets of RedPajama-Data-1T, including 10% of ArXiv, 2% of C4, 100% of StackExchange, 100% of Wikipedia, and 10% of GitHub. The resulting dataset contains 20 billion tokens, and we sampled 4.2 billion tokens from the first 7 billion tokens of this dataset for the uptraining experiments.
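The mixture above amounts to per-subset sampling rates. In the sketch below, the per-subset token counts are hypothetical placeholders, not the actual RedPajama-Data-1T sizes; only the rates come from the text:

```python
# Sampling rates from the text; the per-subset token counts (in billions)
# are HYPOTHETICAL placeholders, not the real RedPajama-Data-1T sizes.
rates = {"arxiv": 0.10, "c4": 0.02, "stackexchange": 1.00,
         "wikipedia": 1.00, "github": 0.10}
subset_tokens_b = {"arxiv": 30, "c4": 180, "stackexchange": 6,
                   "wikipedia": 5, "github": 60}

sampled_b = {name: rates[name] * subset_tokens_b[name] for name in rates}
total_b = round(sum(sampled_b.values()), 1)
print(total_b)   # 23.6 billion tokens under these placeholder counts
```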

You can download the corpus containing the first 7 billion tokens here.

Acknowledgements

We extend our sincere gratitude to LLaMA-Factory, lm-evaluation-harness, and LLaMA Team for their contributions. We also thank Siming Wu and Peinan Feng for their valuable advice, and extend our sincere gratitude to action editor Xavier Carreras and the anonymous TACL reviewers for their insightful feedback and constructive suggestions.

Citation

If you find our work helpful, please cite us as:

@article{DBLP:journals/corr/abs-2408-01890,
  author       = {Yongyu Mu and
                  Yuzhang Wu and
                  Yuchun Fan and
                  Chenglong Wang and
                  Hengyu Li and
                  Qiaozhi He and
                  Murun Yang and
                  Tong Xiao and
                  Jingbo Zhu},
  title        = {Cross-layer Attention Sharing for Large Language Models},
  journal      = {CoRR},
  volume       = {abs/2408.01890},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2408.01890},
  doi          = {10.48550/ARXIV.2408.01890},
  eprinttype    = {arXiv},
  eprint       = {2408.01890},
  timestamp    = {Thu, 21 Aug 2025 16:25:58 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2408-01890.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
