
LongPPL

This repository is the official implementation for ICLR 2025 paper What is Wrong with Perplexity for Long-context Language Modeling?

Introduction

Handling long-context inputs is crucial for large language models (LLMs). While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Additionally, we introduce LongCE (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens.


Our experiments demonstrate that LongPPL correlates strongly with performance on various long-context benchmarks (e.g., a Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. In addition, our experiments show that LongCE delivers consistent improvements as a plug-and-play solution.
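The long-short contrastive idea described above can be sketched in a few lines. This is a simplified illustration, not the repository's implementation: the threshold value and the fallback behavior are assumptions, and the inputs are per-token log-probabilities you would obtain from a model with and without the long context.

```python
import numpy as np

def select_key_tokens(logprobs_long, logprobs_short, threshold=0.5):
    """Mark tokens whose log-probability improves by more than `threshold`
    nats when the full long context is visible (illustrative criterion)."""
    gain = np.asarray(logprobs_long) - np.asarray(logprobs_short)
    return gain > threshold

def longppl_sketch(logprobs_long, logprobs_short, threshold=0.5):
    """Perplexity averaged over key tokens only, evaluated with the long
    context. Falls back to all tokens if no token qualifies as key."""
    key = select_key_tokens(logprobs_long, logprobs_short, threshold)
    lp = np.asarray(logprobs_long)
    if not key.any():
        key = np.ones_like(key)  # fallback: ordinary perplexity
    return float(np.exp(-lp[key].mean()))
```

Tokens that benefit little from the long context (e.g., function words) are excluded from the average, so the score reflects exactly the tokens that require long-range information.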

Requirements

Python 3.10, PyTorch 2.3, Transformers 4.45

pip install -r requirements.txt

LongPPL

The code supports computing LongPPL on custom LLMs and datasets. Please run:

pip install longppl

or

git clone https://github.com/PKU-ML/LongPPL.git
cd LongPPL
pip install -e .

and use the following code to calculate LongPPL:

from longppl import compute_longppl

# `model`/`tokenizer` are the model under evaluation; `evaluator_model`/
# `evaluator_tokenizer` are a (typically stronger) model used to identify key tokens.
output = compute_longppl(text, model, evaluator_model, tokenizer, evaluator_tokenizer)
print(output['longppl'])

Reproduce the paper

LongPPL

To reproduce the LongPPL experiments in our paper, please run:

cd perplexity
sh run_ppl.sh

The evaluation data can be downloaded from GovReport (tokenized). Here are our main results.

| Models | LongPPL (Qwen-72B-Instruct) | LongPPL (Mistral Large 2) | LongPPL (Llama-3.1-8B) | PPL |
| --- | --- | --- | --- | --- |
| Mixtral-8x7B | 1.99 | 2.33 | 1.70 | 3.59 |
| FILM-7B | 2.28 | 2.81 | 1.95 | 4.35 |
| Mistral-7B | 2.48 | 3.10 | 2.11 | 4.14 |
| Qwen1.5-14B | 2.67 | 2.57 | 2.19 | 5.07 |
| Qwen2-7B | 2.66 | 2.48 | 2.16 | 4.82 |
| Phi-3-small | 2.66 | 2.58 | 2.28 | 5.29 |
| CLEX-7B | 3.28 | 3.95 | 2.74 | 4.04 |
| Yi-6B | 3.19 | 3.38 | 2.65 | 4.96 |
| Yarn-7B | 3.47 | 4.51 | 2.98 | 4.06 |
- While perplexity shows almost no correlation with the models' long-context performance on the benchmarks (please refer to our paper), LongPPL demonstrates a strong correlation.

LongCE

To conduct long-context fine-tuning with LongCE, run `accelerate config` and enable DeepSpeed acceleration. We used `deepspeed/zero3.json` as the training configuration file.

cd finetune
sh train.sh

The training data can be downloaded from PG19 and Pile-arxiv. To run models with eabf, please downgrade `transformers` to version 4.37.0.
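LongCE re-weights the token-level cross-entropy so that key tokens dominate the fine-tuning objective. A minimal sketch, assuming per-token weights have already been derived from the long-short probability gap (the actual weighting function in the training code may differ):

```python
import numpy as np

def longce_loss_sketch(token_ce, key_weight):
    """Weighted cross-entropy: tokens with a higher key-token weight
    contribute more to the fine-tuning loss (illustrative form)."""
    token_ce = np.asarray(token_ce, dtype=float)
    w = np.asarray(key_weight, dtype=float)
    return float((w * token_ce).sum() / w.sum())
```

With uniform weights this reduces to the standard mean cross-entropy, which is why LongCE can be dropped into an existing fine-tuning loop as a plug-and-play replacement.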

Evaluation on Long-context Benchmark

In the paper, we evaluate models on LongBench, LongEval and RULER. Please refer to the respective code repositories.

Citation

If you use our code, please cite

@article{fang2024wrong,
      title={What is Wrong with Perplexity for Long-context Language Modeling?}, 
      author={Lizhe Fang and Yifei Wang and Zhaoyang Liu and Chenheng Zhang and Stefanie Jegelka and Jinyang Gao and Bolin Ding and Yisen Wang},
      year={2024},
      journal={arXiv preprint arXiv:2410.23771}
}