AI can autonomously evolve pretraining data curation.
This repository contains the implementation behind DataEvolve, the framework described in Data Darwinism Part II: DataEvolve. Instead of relying on a single manually designed cleaning recipe, DataEvolve improves data curation through an iterative loop: observe the data, generate a strategy, execute it, evaluate the result, and refine the next strategy from feedback.
🔗 Dataset: [Darwin-CC on Hugging Face](https://huggingface.co/datasets/GAIR/Darwin-CC)
Large pretraining corpora are heterogeneous. Different domains, content types, and quality levels exhibit different failure modes, which makes static filtering rules hard to scale.
DataEvolve addresses that problem by evolving a cleaning strategy for each category. The system is organized around four core components:
- Data Observer identifies recurring quality issues in raw data
- Strategy Designer proposes the next cleaning strategy
- Data Cleaner applies that strategy to sample data
- Quality Judge scores the result and surfaces remaining issues
Across iterations, the system keeps track of:
- an experience pool of discovered data issues
- a strategy pool of prompts, scores, and diagnostic feedback
This gives the pipeline memory, so later iterations can improve on earlier ones rather than restart from scratch.
The report motivates DataEvolve as a way to replace expensive manual strategy design with automated, category-specific refinement. The key idea is not just cleaning data with an LLM, but evolving the cleaning strategy itself through execution and feedback.
This repository captures that core mechanism.
This codebase focuses on the strategy-evolution loop:
- issue discovery on sampled data
- prompt evolution across iterations
- sample-based cleaning
- quality evaluation and feedback accumulation
It is best understood as the pipeline layer of the larger research system, not a full one-command reproduction of all large-scale cleaning, pretraining, and benchmark experiments from the report.
At a high level, one iteration looks like this:
1. Observe raw samples and record category-specific quality issues.
2. Generate a new cleaning prompt using accumulated issues and prior prompt history.
3. Run the data cleaner on sample data.
4. Evaluate raw/cleaned pairs and assign quality scores.
5. Feed the new analysis back into the next iteration.
Over time, the best strategy becomes the parent for subsequent refinements.
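The iteration above can be sketched end to end with stub components. Every function body here is an illustrative placeholder for the LLM-backed observer, designer, cleaner, and judge; none of this is the repository's actual code:

```python
# Stub components standing in for LLM calls (placeholder logic only).
def observe(samples):                 # Data Observer
    return ["boilerplate navigation text"]

def design_prompt(issues, history):   # Strategy Designer
    return f"Remove: {', '.join(issues)} (iteration {len(history)})"

def clean(prompt, samples):           # Data Cleaner
    return [s.replace("NAV>", "") for s in samples]

def judge(raw, cleaned):              # Quality Judge
    score = sum(len(c) < len(r) for r, c in zip(raw, cleaned)) / len(raw)
    return score, ["residual headers"]

def evolve(samples, iterations=3):
    issues, history = [], []
    for _ in range(iterations):
        issues += [i for i in observe(samples) if i not in issues]  # 1. observe
        prompt = design_prompt(issues, history)                     # 2. design
        cleaned = clean(prompt, samples)                            # 3. clean
        score, feedback = judge(samples, cleaned)                   # 4. evaluate
        history.append({"prompt": prompt, "score": score,
                        "feedback": feedback})                      # 5. feed back
        issues += [f for f in feedback if f not in issues]
    # The best strategy so far becomes the parent for later refinements.
    return max(history, key=lambda h: h["score"])
```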
Set the required API key:
Set the required API key:

```shell
export OPENAI_API_KEY="YOUR_KEY"
```

Optional endpoint override:

```shell
export OPENAI_BASE_URL="https://api.openai.com/v1"
```

Prepare category-specific sample data under:

```
data/level={level}--domain={domain}--content={content}/sample_data/sample_data.jsonl
```

Then run:

```shell
python pipeline.py
```

The pipeline will run data observation, strategy design, sample cleaning, and quality evaluation, while saving prompt history and discovered issues to disk.
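The category directory scheme above can be built with a small helper. This is a hypothetical convenience function following the path layout shown, not a function from the repository:

```python
from pathlib import Path

def sample_path(level: str, domain: str, content: str,
                root: str = "data") -> Path:
    """Locate a category's sample file under the documented layout.

    Hypothetical helper; the directory scheme is taken from the README,
    the function itself is not part of the repo.
    """
    category = f"level={level}--domain={domain}--content={content}"
    return Path(root) / category / "sample_data" / "sample_data.jsonl"
```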
🤗 Darwin-CC dataset: https://huggingface.co/datasets/GAIR/Darwin-CC
```
.
├── pipeline.py
├── config.py
├── clean/
├── observer/
├── prompt_designer/
├── quality_judge/
├── executor/
├── tools/
├── utils/
├── report.pdf
└── pic.png
```
During execution, the pipeline writes category-scoped artifacts such as:
- `experience_database.jsonl`
- `prompt.jsonl`
- `sample_cleaned_data/{prompt_id}.jsonl`
Together, these files represent the accumulated memory of the evolutionary loop.
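Since these artifacts are written as JSON Lines (one record per line), a later run can reload them to resume the loop. The file names follow the list above; the record fields and the `resume` helper are assumptions for illustration:

```python
import json
from pathlib import Path

def load_jsonl(path: Path) -> list[dict]:
    """Read a JSON Lines artifact; missing files mean an empty memory."""
    if not path.exists():
        return []
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

def resume(category_dir: Path) -> tuple[list[dict], list[dict]]:
    # Hypothetical resume step: reload both pools from disk.
    issues = load_jsonl(category_dir / "experience_database.jsonl")
    prompts = load_jsonl(category_dir / "prompt.jsonl")
    return issues, prompts
```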
```bibtex
@misc{mi2026dataevolve,
  title        = {Data Darwinism Part II: DataEvolve: AI can Autonomously Evolve Pretraining Data Curation},
  author       = {Tiantian Mi and Dongming Shan and Zhen Huang and Yiwei Qin and Muhang Xie and Yuxuan Qiao and Yixiu Liu and Chenyang Zhou and Pengfei Liu},
  year         = {2026},
  note         = {Technical report and code repository},
  howpublished = {\url{https://github.com/GAIR-NLP/DataEvolve}}
}
```