1 | | -# TEDD-Ranker |
2 | | -## Overview |
3 | | -One stop workplace to calculate the efficiency and feasibility of your data selection methods and compare with existing methods! |
| 1 | +# Take the Essence and Discard the Dross (TEDD-Ranker): A Rethinking on Data Selection for Fine-Tuning Large Language Models |
| 2 | + |
| 3 | +## ✨ Latest News |
| 4 | +- [02/08/2025]: 🎉🎉🎉 Our paper has been accepted at **NAACL 2025**! The full paper is available [here](https://arxiv.org/abs/XXXX.XXXXX). |
| 5 | +- [02/10/2025]: Our latest **TEDD-Ranker** implementation and dataset releases are now available! Check them out at [TEDD-Ranker Website](https://zicheliu.com/TEDD-Ranker/). |
| 6 | +- [02/12/2025]: Fixed minor errors in the **feasibility ranking plot** and **feasibility rank table** (Appendix Figure 5). The corrected rankings appear on our website and in the latest **arXiv version**.
| 7 | + |
| 8 | +## ⚡ Introduction |
| 9 | +Fine-tuning Large Language Models (LLMs) benefits significantly from selecting high-quality data rather than merely increasing dataset size. Our work introduces: |
| 10 | + |
| 11 | +- A **three-stage framework** for data selection: **feature extraction, criteria design, and selector evaluation**. |
| 12 | +- A **unified comparison approach** to measure data selection methods using **efficiency (Performance Improvement Ratio - PIR)** and **feasibility (flexibility and simplicity ranks)**. |
| 13 | +- A ranking-based **TEDD-Ranker** that evaluates methods based on their efficiency-feasibility tradeoff. |
| 14 | + |
| 15 | +Our key findings indicate that **targeted quality measurement leads to higher efficiency at the cost of feasibility**. Our **unified ranking approach provides a standardized comparison** across existing data selection methods. |
| 16 | + |
| 17 | +<div align=center> |
| 18 | +<img src="assets/tedd_pipeline.png" width = "640" alt="TEDD-Ranker Pipeline" align=center/> |
| 19 | +</div> |
| 20 | + |
| 21 | +## 💡 Key Insights |
| 22 | + |
| 23 | +1. **Efficiency vs. Feasibility Tradeoff**: The selection methods with the highest **PIR** achieve it at the expense of general applicability.
| 24 | +2. **Three-Stage Framework**: |
| 25 | + - **Feature Extraction**: Extracts linguistic and model-oriented features. |
| 26 | + - **Criteria Design**: Defines internal and external quality labels. |
| 27 | + - **Selector Evaluation**: Assesses models via counterpart evaluations and win-tie-loss metrics. |
| 28 | +3. **Unified Ranking System**: Provides **comparable rankings** based on a mix of **efficiency and feasibility indicators**. |
| 29 | + |
| 30 | +## 🔗 TEDD-Ranker: Code & Visualization |
| 31 | +We provide an **interactive visualization** of our method rankings and selection efficiency comparisons at: |
| 32 | +🔗 [TEDD-Ranker Visualization](https://zicheliu.com/TEDD-Ranker/) |
| 33 | + |
| 34 | +- **Efficiency Rank**: Performance Improvement Ratio (PIR) vs. Selected Dataset Fraction (SDF). |
| 35 | +- **Feasibility Rank**: Simplicity and flexibility of each method. |
| 36 | + |
| 37 | +*Note: The feasibility ranking table and feasibility rank plot contained minor errors in the original version. Both are corrected in the latest arXiv version and on the TEDD-Ranker website.*
| 38 | + |
| 39 | +## 📈 Key Results |
| 40 | + |
| 41 | + |
| 42 | +<div align=center> |
| 43 | +<img src="assets/efficiency_feasibility_ranking.png" width = "640" alt="Efficiency vs. Feasibility" align=center/> |
| 44 | +</div> |
| 45 | + |
| 46 | + |
| 47 | + |
| 48 | +## 🧐 Limitations |
| 49 | + |
| 50 | +- **Error Corrections**: Our feasibility ranking plot (Appendix Figure 5) had **minor ranking errors** in early versions. The website and the **latest arXiv version** are now correct.
| 51 | +- **Ongoing Updates**: TEDD-Ranker is evolving. We welcome feedback and **will update rankings with new datasets/methods**. |
| 52 | +- **Contact for Fixes**: If you spot any inconsistencies, **email [email protected] or [email protected]**. Confirmed errors will be corrected.
| 53 | + |
| 54 | +## 🤝 Acknowledgements |
| 55 | +This research is supported by: |
| 56 | +- The School of Data Science, **The Chinese University of Hong Kong, Shenzhen**. |
| 57 | +- **Shenzhen Research Institute of Big Data**. |
| 58 | + |
| 59 | +## 📜 Citation |
| 60 | +```bibtex |
| 61 | +@article{liu2024take, |
| 62 | + title={Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models}, |
| 63 | + author={Liu, Ziche and Ke, Rui and Jiang, Feng and Li, Haizhou}, |
| 64 | + journal={arXiv preprint arXiv:2406.14115}, |
| 65 | + year={2024} |
| 66 | +} |
| 67 | +``` |
| 68 | +<!-- |
| 69 | +## ⭐ Star History |
| 70 | +<a href="https://star-history.com/#tREeFrOGcoder/TEDD-Ranker&Date"> |
| 71 | + <picture> |
| 72 | + <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=tREeFrOGcoder/TEDD-Ranker&type=Date&theme=dark" /> |
| 73 | + <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=tREeFrOGcoder/TEDD-Ranker&type=Date" /> |
| 74 | + <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=tREeFrOGcoder/TEDD-Ranker&type=Date" /> |
| 75 | + </picture> |
| 76 | +</a> --> |
4 | 77 |
|
5 | | ---- |
6 | | -## How to use: |
7 | | -Check out the link: https://zicheliu.com/TEDD-Ranker/ |
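
The new README describes the efficiency rank as Performance Improvement Ratio (PIR) versus Selected Dataset Fraction (SDF) but does not spell out either formula. The sketch below is one plausible reading, not the authors' implementation: it assumes PIR is the score of a model fine-tuned on the selected subset divided by the score of a model fine-tuned on the full dataset, and SDF is the selected subset size divided by the full dataset size. The `SelectionResult` class, the `rank_by_efficiency` helper, the tie-breaking rule, and all toy numbers are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionResult:
    """Hypothetical record for one data selection method (illustrative only)."""
    name: str
    selected_size: int     # number of samples kept by the selector
    full_size: int         # number of samples in the full dataset
    score_selected: float  # benchmark score after fine-tuning on the subset
    score_full: float      # benchmark score after fine-tuning on the full set

    @property
    def sdf(self) -> float:
        # Assumed definition: fraction of the full dataset that was selected.
        return self.selected_size / self.full_size

    @property
    def pir(self) -> float:
        # Assumed definition: subset score relative to the full-data score.
        return self.score_selected / self.score_full


def rank_by_efficiency(results):
    """Order methods by higher PIR first, breaking ties with lower SDF.

    This ordering is an assumption made for the sketch, not the paper's rule.
    """
    return sorted(results, key=lambda r: (-r.pir, r.sdf))


# Toy inputs, invented for illustration; they are not results from the paper.
toy_results = [
    SelectionResult("toy_method_a", 1_000, 10_000, 66.0, 60.0),
    SelectionResult("toy_method_b", 5_000, 10_000, 63.0, 60.0),
    SelectionResult("toy_method_c", 500, 10_000, 57.0, 60.0),
]

ranking = rank_by_efficiency(toy_results)
for rank, result in enumerate(ranking, start=1):
    print(f"{rank}. {result.name}: PIR={result.pir:.2f}, SDF={result.sdf:.2f}")
```

Under these assumptions, a PIR above 1 means the selected subset beat full-data fine-tuning, and a small SDF means it did so with little data. The authoritative definitions and rankings are those in the paper and on the TEDD-Ranker website.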
|