pengr/LLM-Synthetic-Data

Live LLM-Synthetic-Data Papers (Updated to July, 2025)

This repo collects up-to-date, finely categorized work on LLM synthetic data, including papers, tools, datasets, blogs, and more.

If you find this useful, feel free to follow us and give the repo a ⭐. Thanks to all the great GitHub contributors!

Entries marked with 🔥 are those we highly recommend.


Latest updates

  • Section 3 (Surveys): added domain-specific synthesis surveys.
  • Section 4 (Method): reorganized by LLM training stages with ultra-fine subcategories for each paper (highly recommended).
  • Section 5 (Analysis): new section for synthetic-data analyses.
  • Section 6 (Application): expanded to 19 new sub-areas.

Contents

1. GitHub Repos

2. Blogs

3. Surveys

4. Methods

4.1. Pre-training

4.2. Continual Pre-training

  • Phi-1: Textbooks Are All You Need. Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li. arXiv 2023.
  • MAmmoTH2: Scaling Instructions from the Web. Xiang Yue, Tuney Zheng, Ge Zhang, Wenhu Chen. NeurIPS 2024.🔥
  • Scaling Laws of Synthetic Data for Language Models. Zeyu Qin, Qingxiu Dong, Xingxing Zhang, Li Dong, Xiaolong Huang, Ziyi Yang, Mahmoud Khademi, Dongdong Zhang, Hany Hassan Awadalla, Yi R. Fung, Weizhu Chen, Minhao Cheng, Furu Wei. arXiv 2025.

4.3. Instruction Tuning

4.3.1 Before ChatGPT came out

4.3.2 Rephrasing instructions

4.3.3 Instruction Inversion

4.3.4 Synthesizing instructions and outputs using LLMs

4.3.5 Generating context from instruction-answer pairs

4.4. Alignment

4.4.1 Self-align

4.4.2 Human Principle Alignment

4.4.3 RLAIF

4.4.4 Safety

4.5. Refinement Learning

4.6. LLM Benchmarking

4.7. Using synthetic and real data jointly

5. Analysis

5.1. Effect of Synthetic Data

5.2. Evaluation of Synthetic Data

5.2.1 Artifactuality

5.2.2 Fidelity

5.2.3 Diversity

6. Application Areas

6.1 Mathematical Reasoning

6.2 Code Generation

6.3 Agent and Tool Use

6.4 Vision and Language

6.5 Retrieval-Augmented Generation

6.6 Long Context

6.7 Writing

6.8 AI For Science

6.9 Text-to-SQL

6.10 Synergy between Large and Small Models

6.11 Weak-to-Strong

6.12 Distill Small Model

6.13 Multilingual Data

6.14 Structured Data

6.15 Natural Language Understanding

6.16 Logic Reasoning

6.17 Dialogue System

6.18 Federated Learning

6.19 Generative Design

6.20 Knowledge-Intensive Data

7. Tools

8. Datasets

  • Open Artificial Knowledge. Vadim Borisov, Richard Schreiber. ICML Workshop 2024.
  • PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts. Stephen H. Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-David, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Alan Fries, Maged S. Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Dragomir Radev, Mike Tian-Jian Jiang, Alexander M. Rush. ACL 2022 Demo.
  • Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks. Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Maitreya Patel, Kuntal Kumar Pal, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Shailaja Keyur Sampat, Savan Doshi, Siddhartha Mishra, Sujan Reddy, Sumanta Patro, Tanay Dixit, Xudong Shen, Chitta Baral, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi, Daniel Khashabi. EMNLP 2022.
