Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

datajuicer/data-juicer

Data-Juicer: The Data Operating System for the Foundation Model Era

Multimodal | Cloud-Native | AI-Ready | Large-Scale

Data-Juicer (DJ) transforms raw data chaos into AI-ready intelligence. It treats data processing as composable infrastructure, providing modular building blocks to clean, synthesize, and analyze data across the entire AI lifecycle and unlocking latent value in every byte.

Whether you're deduplicating web-scale pre-training corpora, curating agent interaction traces, or preparing domain-specific RAG indices, DJ scales seamlessly from your laptop to thousand-node clusters, with no glue code required.

Alibaba Cloud PAI has deeply integrated Data-Juicer into its data processing products. See Quickly submit a DataJuicer job.


🚀 Quick Start

Zero-install exploration:

Install & run:

uv pip install py-data-juicer
dj-process --config demos/process_simple/process.yaml

Or compose in Python:

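from data_juicer.core.data import NestedDataset
from data_juicer.ops.filter import TextLengthFilter
from data_juicer.ops.mapper import WhitespaceNormalizationMapper

# Build a small in-memory dataset with a single "text" field.
ds = NestedDataset.from_dict({
    "text": ["Short", "This passes the filter.", "Text   with   spaces"]
})

# Apply the operators in order: filter by text length, then normalize whitespace.
res_ds = ds.process([
    TextLengthFilter(min_len=10),
    WhitespaceNormalizationMapper()
])

# Print the samples that remain after processing.
for s in res_ds:
    print(s)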
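
To make the filter-then-map flow concrete without installing anything, here is a minimal plain-Python sketch of the same idea. It is not Data-Juicer's implementation: the helper names `keep_by_length`, `normalize_whitespace`, and `run_pipeline` are hypothetical, and the whitespace rule (map every Unicode whitespace character to a plain space, then strip the ends) is an assumption made for illustration.

```python
from typing import Dict, List

Sample = Dict[str, str]


def keep_by_length(sample: Sample, min_len: int = 10) -> bool:
    """Hypothetical filter: keep samples whose text has at least min_len characters."""
    return len(sample["text"]) >= min_len


def normalize_whitespace(sample: Sample) -> Sample:
    """Hypothetical mapper: replace each whitespace character with ' ' and strip the ends."""
    text = "".join(" " if ch.isspace() else ch for ch in sample["text"])
    return {**sample, "text": text.strip()}


def run_pipeline(samples: List[Sample], min_len: int = 10) -> List[Sample]:
    """Filter first, then map, mirroring the order of the operator list."""
    kept = [s for s in samples if keep_by_length(s, min_len)]
    return [normalize_whitespace(s) for s in kept]


samples = [
    {"text": "Short"},
    {"text": "This passes the filter."},
    {"text": "Text\u00a0with\tspaces "},  # non-breaking space, tab, trailing space
]
result = run_pipeline(samples)
for s in result:
    print(s)
```

The first sample is dropped by the length check; the remaining two pass through the mapper, which only changes the one containing non-standard whitespace.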

✨ Why Data-Juicer?

1. Modular & Extensible Architecture

  • 200+ operators spanning text, image, audio, video, and multimodal data
  • Recipe-first: Reproducible YAML pipelines you can version, share, and fork like code
  • Composable: Drop in a single operator, chain complex workflows, or orchestrate full pipelines
  • Hot-reload: Iterate on operators without pipeline restarts
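
As a rough illustration of the recipe-first and composable ideas, the sketch below runs a recipe expressed as plain data (the shape a parsed YAML file might take) against a small operator registry. Everything here is hypothetical: `OP_REGISTRY`, `register`, `run_recipe`, and the operator names `min_length_filter` and `lowercase_mapper` are invented for this example and are not Data-Juicer APIs or its recipe schema.

```python
from typing import Any, Callable, Dict, List

Sample = Dict[str, str]
Op = Callable[[List[Sample]], List[Sample]]

# Hypothetical registry: operator name -> factory that builds a configured operator.
OP_REGISTRY: Dict[str, Callable[..., Op]] = {}


def register(name: str):
    """Decorator that records an operator factory under a recipe-friendly name."""
    def decorator(factory: Callable[..., Op]) -> Callable[..., Op]:
        OP_REGISTRY[name] = factory
        return factory
    return decorator


@register("min_length_filter")
def min_length_filter(min_len: int = 0) -> Op:
    def op(samples: List[Sample]) -> List[Sample]:
        return [s for s in samples if len(s["text"]) >= min_len]
    return op


@register("lowercase_mapper")
def lowercase_mapper() -> Op:
    def op(samples: List[Sample]) -> List[Sample]:
        return [{**s, "text": s["text"].lower()} for s in samples]
    return op


def run_recipe(recipe: List[Dict[str, Dict[str, Any]]], samples: List[Sample]) -> List[Sample]:
    """Apply each recipe step in order; each step is {op_name: {param: value}}."""
    data = list(samples)
    for step in recipe:
        (name, params), = step.items()
        data = OP_REGISTRY[name](**params)(data)
    return data


# A recipe is just data, so it can be versioned, shared, and diffed like code.
recipe = [
    {"min_length_filter": {"min_len": 6}},
    {"lowercase_mapper": {}},
]
out = run_recipe(recipe, [{"text": "Tiny"}, {"text": "Hello World"}])
print(out)
```

Because the recipe is data rather than code, swapping, reordering, or removing a step does not require touching the operators themselves.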

2. Full-Spectrum Data Intelligence

  • Foundation Models: Pre-training, fine-tuning, RL, and evaluation-grade curation
  • Agent Systems: Tool-trace cleaning, context structuring, de-identification, and quality gating
  • RAG & Analytics: Extraction, normalization, semantic chunking, deduplication, and data profiling

3. Production-Ready Performance

  • Scale: Process 70B samples in 2h on 50 Ray nodes (6400 cores)
  • Efficiency: Deduplicate 5TB in 2.8h using 1280 cores
  • Optimization: Automatic OP fusion (2-10x speedup), adaptive parallelism, CUDA acceleration, robustness
  • Observability: Built-in tracing for debugging, auditing, and iterative improvement
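
For a sense of scale, the Scale and Efficiency figures above can be converted into approximate per-core rates. This is back-of-the-envelope arithmetic only, assuming 1 TB = 10^12 bytes and an even load across cores; the function name `per_core_rate` is just a helper for this example.

```python
def per_core_rate(total: float, hours: float, cores: int) -> float:
    """Average units processed per second per core."""
    return total / (hours * 3600 * cores)


# 70B samples in 2h on 6400 cores
samples_rate = per_core_rate(70e9, 2, 6400)

# 5 TB deduplicated in 2.8h on 1280 cores (assuming 1 TB = 1e12 bytes)
dedup_rate = per_core_rate(5e12, 2.8, 1280)

print(f"{samples_rate:,.0f} samples/s per core")
print(f"{dedup_rate / 1e6:.2f} MB/s per core")
```

Under these assumptions the figures work out to roughly 1,500 samples per second per core for processing and about 0.39 MB per second per core for deduplication.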

⭐ If Data-Juicer saved you time or improved your data work, please consider starring the repo. It helps more people discover the project and keeps you notified of new releases and features.


📰 News

[2026-02-02] Release v1.4.6: Copilot, Video Bytes I/O & Ray Tracing
  • 🤖 Q&A Copilot: Now live on our Doc Site | DingTalk | Discord. Feel free to ask anything related to the Data-Juicer ecosystem!
  • 🎬 Video Bytes I/O: Direct bytes processing for video pipelines
  • 🫆 Ray Mode Tracer: Track changed samples in distributed processing
  • 🐳 Enhancements & fixes: refreshed Docker image, small perf boosts, GitHub Insights traffic workflow, Ray compatibility updates, and bug/doc fixes.
[2026-01-15] Release v1.4.5: 20+ New OPs, Ray vLLM Pipelines & Sphinx Docs Upgrade
  • Embodied-AI OPs: added/enhanced mappers for video captioning (VLM), video object segmentation (YOLOE+SAM2), video depth estimation (viz + point cloud), human pose (MMPose), image tagging (VLM), single-image 3D body mesh recovery (SAM 3D Body), plus S3 upload/download.
  • New Pipeline OP: compose multiple OPs into one pipeline; introduced Ray + vLLM pipelines for LLM/VLM inference.
  • Docs upgrade: moved to a unified Sphinx-based documentation build/deploy workflow with isolated theme/architecture repo.
  • Enhancements & fixes: dependency updates, improved Ray deduplication and S3 loading, OpenAI Responses API support, tracer consistency, Docker base updated to CUDA 12.6.3 + Ubuntu 24.04 + Py3.11, and multiple bug fixes.
[2025-12-01] Release v1.4.4: NeurIPS'25 Spotlight, 6 New Video/MM OPs & S3 I/O
  • NeurIPS'25 Spotlight for Data-Juicer 2.0
  • Repo split: sandbox/recipes/agents moved to standalone repos
  • S3 I/O added to loader/exporter
  • 6 new video & multimodal OPs (character detection, VGGT, whole-body pose, hand reconstruction) + docs/Ray/video I/O improvements and bug fixes

View All Release and News Archive


🔌 Users & Ecosystems

The list below focuses on developer-facing integrations and usage, in alphabetical order.
Missing your project / name? Feel free to open a PR or reach out.

Data-Juicer plugs into your existing stack and evolves with community contributions:

Extensions

Frameworks & Platforms

AgentScope · Apache Arrow · Apache HDFS · Apache Hudi · Apache Iceberg · Apache Paimon · Alibaba PAI · Delta Lake · DiffSynth-Studio · EasyAnimate · Eval-Scope · Huawei Ascend · Hugging Face · LanceDB · LLaMA-Factory · ModelScope · ModelScope Swift · NVIDIA NeMo · Ray · RM-Gallery · Trinity-RFT · Volcano Engine

Industry

Alibaba Group, Ant Group, BYD Auto, ByteDance, DTSTACK, JD.com, NVIDIA, OPPO, Xiaohongshu, Xiaomi, Ximalaya, and more.

Academia

CAS, Nanjing University, Peking University, RUC, Tsinghua University, UCAS, Zhejiang University, and more.

Contributing & Community

We believe in building together. Whether you're fixing a typo, crafting a new operator, or sharing a breakthrough recipe, every contribution shapes the future of data processing.

We welcome contributions at all levels:

Discord DingTalk

Data-Juicer is made possible by the users and community:

  • Initiated by: Alibaba Tongyi Lab
  • Co-developed with: Alibaba Cloud PAI, Anyscale (Ray team), Sun Yat-sen University, NVIDIA (NeMo team), and contributors worldwide
  • Inspired by: Apache Arrow, Ray, Hugging Face Datasets, BLOOM, RedPajama-Data, ...

Documentation

For detailed documentation, please see here.

Quick Links:


📄 License & Attribution

Data-Juicer is released under the Apache License 2.0.
Attribution is appreciated: please use our badge, or text such as "This project uses Data-Juicer: https://github.com/datajuicer".


📖 Citation

If you find Data-Juicer useful in your work, please cite:

@inproceedings{djv1,
  title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
  author={Chen, Daoyuan and Huang, Yilun and Ma, Zhijian and Chen, Hesen and Pan, Xuchen and Ge, Ce and Gao, Dawei and Xie, Yuexiang and Liu, Zhaoyang and Gao, Jinyang and Li, Yaliang and Ding, Bolin and Zhou, Jingren},
  booktitle={SIGMOD},
  year={2024}
}

@article{djv2,
  title={Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models},
  author={Chen, Daoyuan and Huang, Yilun and Pan, Xuchen and Jiang, Nana and Wang, Haibin and Zhang, Yilei and Ge, Ce and Chen, Yushuo and Zhang, Wenhao and Ma, Zhijian and Huang, Jun and Lin, Wei and Li, Yaliang and Ding, Bolin and Zhou, Jingren},
  journal={NeurIPS},
  year={2025}
}