Releases: datajuicer/data-juicer

Release v1.5.0: Partitioned Ray Executor; Embodied-AI OPs; OP-level Env Management

26 Feb 05:06
2e62d2a

Major Updates

  • 📊 Stats: 244 files changed with 22,394 additions and 2,053 deletions, from 12 contributors
  • 🗂️ New partitioned ray executor: #748
    • Support data partitioning, checkpointing, event logging in ray mode.
    • Improved fault tolerance, extensibility, observability, flexibility, and processing performance.
  • 🤖 New OPs for embodied AI: improved processing capability to handle camera-view videos.
  • 🧩 Support OP-level isolated environments in Ray mode to help resolve dependency conflicts between different OPs. #892
    • Environments of different OPs that share common dependencies can be merged under different strategies, and created environments are reused.
    • Built on the Ray runtime environment.

New OPs

  • video_camera_calibration_static_deepcalib_mapper: Compute the camera intrinsics and field of view (FOV) for a static camera using DeepCalib. #871
  • video_camera_calibration_static_moge_mapper: Compute the camera intrinsics and field of view (FOV) for a static camera using MoGe-2. #871
  • video_undistort_mapper: Undistort raw videos with corresponding camera intrinsics and distortion coefficients. #871
  • video_hand_reconstruction_hawor_mapper: Use HaWoR and MoGe-2 for hand reconstruction. #893
  • video_camera_pose_mapper: Extract camera poses with MegaSaM and MoGe-2. #894

Enhancements

  • Allow batch inference for image_captioning_mapper to improve processing performance. #901
  • Optimize the logic of a branch by avoiding unnecessary function calls. #903
  • Refactor Operator Search and Metadata Extraction for Enhanced Accuracy. #889
  • Allow the extract_keyframes func to return meta info only, and remove sample info from error logs to reduce log size. #904
  • Reduce the memory usage in convert_to_absolute_paths func by iterating only over the specified columns. #907
  • Reorganize the main readme and update the tutorials in the playground to the latest version. #908
  • Optimize issue templates: emphasize English usage and add Q&A Copilot check. #912
  • Convert abs path for dataset in object store. #913

Fixed Bugs

  • Fix a bug so that the minhash deduplicator can trace all duplicate items. #906
  • Fix the "multiple values for num_proc" bug in TextFormatter. #905
  • Fix the homepage rendering issue and remove outdated OP docs. #910
  • Fix several bugs affecting test stability and robustness. #918

Acknowledgement

  • @dubin555 helps to improve the processing performance of some OPs and funcs. #901 #903
  • @HunterLine helps to fix a bug in minhash deduplicator to trace all duplicate items. #906

All Contributors

@HYLcool @dubin555 @claude @Qirui-jiao @cmgzn @Cathy0908 @Dludora @yxdyc @gemini-code-assist @HunterLine @ext.wanghao204 @cyruszhang

Full Changelog: v1.4.6...v1.5.0

Release v1.4.6: introduce Q&A Copilot; Video bytes I/O; Tracer for Ray mode

02 Feb 12:53
a1596f9

Major Updates

  • 🤖 A Q&A Copilot is introduced to answer questions from users. It is now available in the docs, DingTalk group, Discord, etc. #891
  • 🎬 I/O for video bytes: support bytes reading/storing for videos. #882
  • 🫆 Tracer for Ray mode: the tracer can now trace changed samples in Ray mode. #885

Enhancements

  • Prepare a new Dockerfile for the embodied-AI use case, and update the cuda/system/... versions of the basic docker image. #887
  • Add Copilot News & Refined DingTalk link/QR code & Discord link/QR code in the docs. #891
  • Convert the word retrieval from lists to sets to speed up two OPs. #890
  • Add a new workflow to automatically fetch the traffic report from GitHub Insights. #899 #900
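
The speed-up in #890 rests on a general property of Python containers: a membership test scans a list element by element, while a set uses hashing. A minimal sketch with a made-up word list and a hypothetical `count_flagged` helper (this is not the code of the two OPs):

```python
def count_flagged(words, flagged):
    """Count how many words appear in the flagged collection."""
    return sum(1 for w in words if w in flagged)


flagged_list = ["spam", "junk", "noise"]
flagged_set = set(flagged_list)  # built once, then reused for every lookup

words = "this text has spam and more spam but no other junk".split()
list_count = count_flagged(words, flagged_list)  # O(len(flagged)) per lookup
set_count = count_flagged(words, flagged_set)    # O(1) on average per lookup
```

Both calls return the same count; only the cost per lookup differs, which matters once the flagged collection holds thousands of words.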

Fixed Bugs

  • Fix TypeError when using field_types in YAML config for RequiredFieldsValidator. #886
  • Replace the deprecated concurrency parameter with compute parameter in the ray.data.Dataset.map_batches() call. #888
  • Prevent divide-by-zero in calculate_ray_np when Ray cluster not ready. #864
  • Add thread limiting for multi-process workloads to prevent over-subscription. #877
  • Fix the bug where the unittest of standalone mode could be stuck. #896
  • Update several out-of-date links in the docs. #898
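
The divide-by-zero fix in #864 follows a common guard pattern: when a resource value can still be zero because the cluster is not ready, return a safe fallback instead of dividing. The sketch below only illustrates that pattern; `safe_num_proc` and its parameters are hypothetical and do not mirror calculate_ray_np.

```python
def safe_num_proc(available_cpus, cpus_per_proc, fallback=1):
    """Return how many processes fit into the available CPUs.

    Hypothetical helper: either value may be 0 while the cluster starts up.
    """
    if available_cpus <= 0 or cpus_per_proc <= 0:
        # Not enough information yet: skip the division entirely.
        return fallback
    # Always allow at least one process, even if one process needs more
    # CPUs than are currently reported.
    return max(1, int(available_cpus / cpus_per_proc))
```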

Acknowledgement

  • @dubin555 helps to fix several bugs and enhance the processing performance for some OPs. #886 #890
  • @xyuzh helps to update the ray usage to the latest version in some OPs, fix some bugs and optimize the parallel strategies. #888 #864
  • @XinyuLiu1999 helps to fix a bug of over-subscription on multi-process workloads. #877

Full Changelog: v1.4.5...v1.4.6

Release v1.4.5: Embodied-AI OPs; Doc System Upgrading

13 Jan 06:36
923faf3

Major Updates

  • Add several new OPs for embodied-AI.
  • Upgrade the documentation system: #842
    • Transition documentation generation and deployment to a unified Sphinx-based framework.
    • Architecture and styles are maintained in a separate repo, which is pulled before building the docs of each sub-repo.

New OPs

Mapper

  • video_captioning_from_vlm_mapper: generate video captions from latest VLMs (e.g., Qwen3-VL). #820
  • video_object_segmenting_mapper: perform text-guided semantic segmentation of valid objects throughout the video (using YOLOE and SAM2), with support for saving segmentation visualization results. #801
  • video_depth_estimation_mapper: perform depth estimation on the video, with support for saving both visualization results and point cloud data. #801
  • image_mmpose_mapper: perform human keypoint detection inference using MMPose models. #800
  • image_tagging_vlm_mapper: generates image tags with VLMs. #800
  • image_sam_3d_body_mapper: perform single-image full-body 3D human mesh recovery (HMR) with the promptable model SAM 3D Body (3DB). #843
  • s3_download_file_mapper: download files from S3 to local files or load them into memory. #839
  • s3_upload_file_mapper: upload local files to S3 and update paths to S3 URLs. #839
  • text_tagging_by_prompt_mapper: generate text tags using prompt with LLM. #408

Filter

  • image_subplot_filter: detect and remove samples with images containing subplots. #840 #822
  • video_motion_score_ptlflow_filter: a new motion score filter where the optical flows are computed by the ptlflow library. #824

Deduplicator

  • document_minhash_deduplicator_with_uid: a more robust version of document_minhash_deduplicator for datasets with unique ID for each sample. #832 #677
  • ray_bts_minhash_deduplicator_with_uid: a more robust version of ray_bts_minhash_deduplicator for datasets with unique ID for each sample. #832 #677
  • ray_bts_minhash_cpp_deduplicator: enhance the performance of the basic BTS MinHash deduplicator by migrating its computationally intensive parts to C++ and Cython. #851

Pipeline

A new type of OP that combines multiple OPs into one pipeline, or integrates a whole pipeline that is not easy to split into multiple atomic OPs. #835

  • ray_vllm_engine_pipeline: basic OP for making use of the vLLM engine of Ray.
  • llm_ray_vllm_engine_pipeline: generate response with LLMs using vLLM engine on Ray.
  • vlm_ray_vllm_engine_pipeline: generate response with VLMs using vLLM engine on Ray.

Enhancements

  • Several major dependencies of data-juicer are updated to the (nearly) latest version. #820
  • Rename and align some core arguments of base OP about resource allocation to the ones in Ray. #837
  • Refine the badges on the homepage to enhance user experience. #841
  • Support specifying extra ffmpeg arguments for some video OPs. #847
  • Update the "Used by & Valuable Feedback from" list to add new customers/users of Data-Juicer. #852
  • Improve Ray-based deduplicators by lazily initializing actors, allowing the cluster to autoscale before actors consume resources. #855
  • Improve the RayS3DataLoadStrategy class to provide better format detection, more informative logging, and support for loading multiple files from S3 directories in Ray mode. #860
  • Support the OpenAI Responses API. #856
  • Use consistent key naming and allow including extra fields in the tracer output with the trace_keys arg. #873 #874
  • Support bytes data as input and add the auto_op_parallelism parameter to control whether to enable automatic calculation of OP parallelism. #867
  • Support saving optical flows computed in the OP. #824
  • Update the base image of official data-juicer docker image to cuda12.6.3 and ubuntu 24.04; update the python version to py311; install several embodied-ai-related packages. #881
  • Add memory reservation parameters for Ray minhash deduplication to allow users to reserve memory for actors and tasks. #863

Fixed Bugs

  • Fix the issue where export_aws_credentials cannot be read properly in RayExecutor when using S3 export paths. #834
  • Change the logger level of import timing from info to debug. #859
  • Cache the dataset columns once at the start of process() and pass the cached set through the operator pipeline to fix the issue where Ray's Dataset.columns() breaks streaming pipelines when called repeatedly during operator processing. #854
  • Fix out-of-date URLs of PAI demos. #865
  • Limit versions of several dependencies to avoid new issues from their latest versions. #876
  • Use mount disk to avoid "No space" error on the / path when building docs. #857
  • Use persist to avoid OOM in the PySpark version of distributed deduplication. #836 #586
  • Fix the issue where a change in the number of OPs does not trigger the OP doc building hook. #824
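
The streaming fix in #854 is an instance of a simple pattern: query an expensive or side-effecting property once and pass the result along. A minimal sketch under stated assumptions: `FakeDataset`, `run_pipeline`, and the op callables are hypothetical stand-ins, not Ray or Data-Juicer APIs.

```python
class FakeDataset:
    """Stand-in dataset that counts how often its columns are queried."""

    def __init__(self, columns):
        self._columns = list(columns)
        self.columns_calls = 0

    def columns(self):
        self.columns_calls += 1
        return list(self._columns)


def run_pipeline(dataset, ops):
    # Query the columns once up front, then hand the cached set to every op
    # instead of letting each op call dataset.columns() again.
    cached_columns = set(dataset.columns())
    for op in ops:
        op(dataset, cached_columns)
    return cached_columns


seen = []
ops = [lambda ds, cols: seen.append("text" in cols) for _ in range(3)]
dataset = FakeDataset(["text", "meta"])
run_pipeline(dataset, ops)
```

However many ops run, `columns()` is called exactly once in this sketch.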

Acknowledgement

  • @kyo-tom helps fix several bugs and enhance the functions of I/O, and implement 2 new OPs for s3. #834 #860 #839
  • @xyuzh helps fix and enhance the core parts of Ray distributed mode, drawing on expertise from the Ray repo. #859 #855 #854 #863
  • @JohnGiorgi helps to support new type of OpenAI API and enhance the tracer. #856 #873 #874
  • @coolderli helps to optimize the spark distributed deduplication to avoid OOM. #586
  • @ZiyiTsang helps to implement a new OP image_subplot_filter. #822

Full Changelog: v1.4.4...v1.4.5

Release v1.4.4: NeurIPS 2025 Spotlight; New Video & Multimodal Ops; Repo Reorganization; S3 I/O Support

01 Dec 04:06
deb99e5

Major Updates

  • 🎉 Update NeurIPS 2025 News: our Data-Juicer 2.0 paper is accepted as a NeurIPS'25 Spotlight (top 3.1% of all submissions)! Two of our other works are also accepted by NeurIPS'25. #788
  • 🧩 The sandbox component, data-juicer recipes, and data-juicer agents have been officially split from the main repository as data-juicer-sandbox/hub/agents respectively, to enable independent development and faster iteration. #817 #827 #830
  • 🤝 S3 I/O support: Added S3 support in data loader and exporter for seamless cloud storage integration. #806

New OPs

  • detect_main_character_mapper: Extract all main character names based on the given image and its caption. #795
  • detect_character_locations_mapper: Given an image and a list of main character names, extract the bounding boxes for each present character. (YOLOE + MLLM) #795
  • detect_character_attributes_mapper: Takes an image, a caption, and main character names as input to extract the characters' attributes. #795
  • vggt_mapper: Input a video of a single scene, and use VGGT to extract information including Camera Pose, Depth Maps, Point Maps, and 3D Point Tracks. #804
  • video_whole_body_pose_estimation_mapper: Input a video containing people, and use the DWPose model to extract the body, hand, feet, and face keypoints of the human subjects in the video, i.e., 2D Whole-body Pose Estimation. #812
  • video_hand_reconstruction_mapper: Use the WiLoR model for hand localization and reconstruction. #818

Enhancements

  • Enhanced documentation for operator details, significantly expanding coverage of effect demonstrations and usage examples, and improved homepage styling for better readability. #778 #819
  • Added notebook detection and auto-redirect in logger setup for better user experience in Jupyter environments. #790
  • Optimized the build_op_doc hook for more reliable documentation generation. #794
  • Improved auto num_proc calculation in Ray mode for better resource utilization across operators. #789 #825
  • Enabled support for videos and audios in WebDataset I/O, expanding multimodal data handling capabilities. #803
  • Updated repository URLs and links across the project for consistency and correctness. #805
  • Added support for FFmpeg and Decord backends in video data processing, improving flexibility and performance. #826 #829
  • Added an MCP server CLI entry point to facilitate modular service deployment and updated the MCP documentation. #798

Fixed Bugs

  • Fixed the Auto Prompt pipeline in sandbox to restore correct prompt generation behavior. #791
  • Fixed a Ray connection error by properly passing the config parameter through resource utility functions. #808
  • Fixed several CUDA-based operators to use internal resource monitor. #809
  • Fixed custom op module loading issues and optimized video_extract_frames_mapper for saving extracted frames. #803
  • Reset num_proc for vLLM and set default batch_size to 10 for CUDA operators to improve stability. #814
  • Fixed Sphinx autodoc compatibility issue in the SpecialTokens metaclass to restore documentation build. #816
  • Resolved a bug in trace_filter by excluding the __dj_stats__ column during dataset comparison. #828
  • Fix several typos in video_split_by_scene_mapper. #744
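
The trace_filter fix in #828 amounts to ignoring a bookkeeping column when comparing samples before and after a Filter. A minimal sketch of that comparison on plain dicts; `strip_stats`, `filtered_out`, and the sample data are hypothetical, and only the column name __dj_stats__ comes from the note above.

```python
STATS_KEY = "__dj_stats__"


def strip_stats(sample):
    """Return a copy of the sample without the stats column."""
    return {k: v for k, v in sample.items() if k != STATS_KEY}


def filtered_out(before, after):
    """Return samples in `before` that have no counterpart in `after`,
    comparing on every column except the stats column."""
    kept = [strip_stats(s) for s in after]
    return [s for s in before if strip_stats(s) not in kept]


before = [{"text": "short"}, {"text": "a long enough sample"}]
after = [{"text": "a long enough sample", STATS_KEY: {"text_len": 20}}]
removed = filtered_out(before, after)
```

Without stripping the stats column, the surviving sample would look different from its original and be reported as removed too.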

Full Changelog: v1.4.3...v1.4.4

Release v1.4.3: OP Doc Enhancement; Optimized Auto Parallelism; Optimized Sandbox

11 Sep 09:11
613882b

Major Updates

  • 🤝 OP Document Updates: Optimized multi-version docs; Doc strings are rewritten and enhanced by qwen-max. #755 #765 #768 #769 #787
  • 💪🏻 Auto Parallelism Optimization: support cpu/gpu/mem requirement specification for each OP; optimize calculate_np for ray mode. #679 #774 #782 #786
  • 🛠️ Sandbox Optimization: support iterative pipelines and early-stop targets; refactor the context infos; a new example on auto prompt optimization and several related hooks are added. #757
  • 📈 Upgrade spacy from 3.8.0 to 3.8.7 because the previous version was yanked. #763

New OPs

  • image_detection_yolo_mapper: perform object detection (with YOLO) on images and return the bounding box values and class labels. #764
  • optimize_prompt_mapper: optimize prompts based on the existing ones. #757

Enhancements

  • Support shard_size and extra args for write methods in export_extra_args for RayExporter. #739
  • Support min/max_closed_interval args to control filtering with open/closed intervals and reversed_range arg to allow keeping samples outside a specified range for Filters. #741
  • Support API models for existing optimize_qa_mapper. #771
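
The interval options in #741 can be read as follows. This is a sketch of the described semantics only: the function `keep_sample` is hypothetical, and just the argument names min_closed_interval, max_closed_interval, and reversed_range come from the note above, so the exact behavior in Data-Juicer should be checked against the OP docs.

```python
def keep_sample(value, min_val, max_val,
                min_closed_interval=True,
                max_closed_interval=True,
                reversed_range=False):
    """Decide whether a stat value passes a range filter.

    Closed bounds include the endpoint; open bounds exclude it.
    With reversed_range=True, samples outside the range are kept instead.
    """
    above_min = value >= min_val if min_closed_interval else value > min_val
    below_max = value <= max_val if max_closed_interval else value < max_val
    in_range = above_min and below_max
    return not in_range if reversed_range else in_range
```

For example, `keep_sample(10, 10, 20)` keeps a value on the lower bound, while passing `min_closed_interval=False` drops it.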

Fixed Bugs

  • Fix and re-enable the disabled op_list_to_trace argument. #766
  • Add missing skip tag to several API-based test cases for forked repos. #767
  • Limit the version of transformers to "<4.55.0" to avoid computing on None value. #781
  • Fix out-of-date invoking methods in several tools. #785 (from issue #750)
  • Fix 500 error in API service. #785 (from issue #777)
  • Remove specified_xxx_filter from NON_STATS_FILTER. #785 (from issue #783)

Full Changelog: v1.4.2...v1.4.3

Release v1.4.2: Python > 3.10 are supported; Data Attribution OPs; External OPs are supported; Install with "uv"

18 Aug 03:22
14f6594

Major Updates

  • 💪🏻 Data-Juicer is now compatible with Python 3.11 & 3.12. #749
  • 🧩 5 OPs for data attribution are added. #735
  • 🤝 Data-Juicer now supports registering and applying custom OPs from external paths via the custom_operator_paths argument. #758
  • 🔧 "uv" is now the first choice for installing Data-Juicer because it can resolve dependency conflicts. #760

New Operators

Filter

  • Validation-free
    • llm_perplexity_filter: Filter to keep samples with perplexity score, computed using a specified llm, within a specific range. #735
    • instruction_following_difficulty_filter: Filter to keep texts whose instruction-following difficulty (IFD, https://arxiv.org/abs/2308.12032) falls within a specific range. #735
  • Validation-based
    • in_context_influence_filter: Filter to keep texts whose in-context influence on the validation set falls within a specific range. #735
    • llm_task_relevance_filter: Filter to keep samples with a high relevance score to validation tasks, as estimated by an LLM. #735
    • text_embd_similarity_filter: Filter to keep texts whose average embedding similarity to a set of given validation texts falls within a specific range. #735

Enhancements

  • A new environment variable DATA_JUICER_EXTERNAL_MODELS_HOME is added to specify private or read-only paths for storing external and extra models. #740
  • Optimize the video link transformation and multi-version maintenance in the docs. Update demo videos with higher-resolution versions. #746
  • Support custom save_dir for OPs that produce extra multimodal data. #751
  • Add official and detailed docs about Data-Juicer Agent. #759
  • Enhance unit tests: show the name of the current test case; recycle resources after each test case in Ray mode. #749
  • Refine the developer guide for better practice in building new OPs. #760

Bugs Fixed

  • Move the update of multimodal special tokens into the initialization of base_op, which fixes the bug where special tokens might not be synced with the main process when processing data in parallel. #752
  • Fix some test cases. #754

Full Changelog: v1.4.1...v1.4.2

Release v1.4.1: MCP server; GPU-based Minhash deduplicator; Improved unit test coverage.

16 Jul 13:05
7505686

Major Updates

  • 🔧 Introduce the Data-Juicer MCP server, which lets users access the data processing capabilities through MCP. #690 #737
  • 💪🏻 Unit test coverage rate is improved to 85%+ and several bugs in test cases are resolved (OOM, encoding error, and so on), which makes Data-Juicer more reliable. #698 #717 #720 #727
  • 🤝 GPU-based Minhash deduplication is supported, in collaboration with developers from NVIDIA. #694 #644
  • 🧩 RayExporter supports more formats to export a ray dataset in addition to json/jsonl. #687
  • 🎥 Two demo videos are added to introduce the Data-Juicer core functions, agentic usages, and sandbox. #738

New Operators

  • download_file_mapper downloads data from URLs to local files or specified fields. #709

Enhancements

  • New analysis method: correlation analysis among stats is added. #663
  • Several core dependencies are updated and fixed to a newer version, and dependency conflicts are resolved. #715 #717 #723
  • The EasyAnimate pipelines in the sandbox are updated to follow the refactoring of sandbox. #710
  • Apply more reliable pre-commit tools to improve the code style of Data-Juicer. #714
  • Support storing and processing image bytes data in the dataset. #725

Bugs Fixed

  • The wheel & docker image building bug is fixed. #706
  • Fix bugs in log_summarization. #710
  • Fix "no module named data_juicer" error after installing from the wheel file. #727

Acknowledgement

  • @fanronghai helps to fix the param error in dataset_splitting_by_language tool. #713
  • @ayushdg helps to support a GPU-version Minhash deduplicator. #644
  • @ricksun2023 helps to fix the bugs that occur when multiple OPs with the same name appear in the configs. #730

Full Changelog: v1.4.0...v1.4.1

v1.4.0 Major Refactor for Env Management, Doc, Sandbox; Derivative Works (TPAMI Survey; Trinity-RFT & DetailMaster)

13 Jun 11:43
714df97

Summary: 200+ files changed with 18,535 additions and 3,720 deletions.


🔧 Major Refactors & Improvements

  • 🔄 Sandbox Usability (#686):

    • Support for multiple pipelines, context info, and an environment manager to run different commands in various environments.
    • Includes the InternVL example as a showcase.
  • 📘 DJ-Doc Redesign (#675):

    • Now with multilingual support (English / Chinese) and a modernized style.
  • 📦 Dependency Management Update (#660, #680):

    • Migrated to uv for faster dependency resolution.
    • Added sub-groups for better organization.

🌍 New Features & Integrations (#683, #688, #692)

  • 🆕 Additional Repo Supported:

  • 📜 DJ-Awesome-List:

    • A survey paper accepted by TPAMI'25!
  • 🧪 Synthetic Benchmark Added:

    • DetailMaster – a new benchmark for synthetic data evaluation.
  • 🛠️ New Operators Introduced (#673, #701):

    • llm_analysis_filter
    • general_field_filter

🚀 Core Optimizations & Bug Fixes

  • Ray Executor Enhancements (#697):

    • File extension detection added.
    • Support for more data formats.
  • ⏱️ Startup Time Optimization:

    • Improved startup performance. (#684)
  • 🧠 Text Embedding Support:

    • Added support for text embedding via API and local model. (#681)
  • 🐳 Docker Build Improvement:

    • Ignore installed distutils libraries during Docker image building. (#668)
  • 🛠️ Mapper Module Fix:

    • Fixed issue with module initialization. (#700)
  • 🗑️ Warning Suppression:

    • Suppressed unnecessary warnings from fasttext. (#696)

📚 Full Changelog

View all changes since v1.3.3 →

Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes.

09 May 10:20
444537e

Major Updates

  • 🎉 Our work of Data-Juicer Sandbox has been accepted as a Spotlight by ICML 2025 (top 2.6% of all submissions)!
  • Add new OPs and recipes for Img-Diff. #658

Enhancements

  • Support HF llm for two llm_xxx_score_filter OPs. #655
  • Sync the docker image to Aliyun OSS as a download source when Docker Hub is not accessible. #657
  • Split standalone and distributed unit tests to save time when re-running failed ones. #666

Bugs Fixed

  • Address possibly missing cfg in unify_format. #653
  • Improve clarity & fix bad links for some docs. #659

Full Changelog: v1.3.2...v1.3.3

Release v1.3.2: Enhancements on usability & two OPs; some bugs fixes

25 Apr 11:17
2172698

What's Changed

  • Human OP enhancements, in #642 #645
    • update label-studio version
    • make service script more robust
    • add documentation
    • optimizing fields mapping
  • OP efficiency optimization of document_minhash_deduplicator, in #639
  • Set temp_parser.usage to argparse.SUPPRESS to skip excessive help logs, in #643
  • Fix date typo in #648
  • Fix docker building failure in #650
  • Fix StreamToLoguru compatibility issue with torch._dynamo in #651
  • Add init file for annotation module, fix dj-process command error in #652