26 Feb 05:06

HYLcool

2e62d2a

Release v1.5.0: Partitioned Ray Executor; Embodied-AI OPs; OP-level Env Management

Major Updates

📊 Stats: 244 files changed with 22,394 additions and 2,053 deletions, from 12 contributors
🗂️ New partitioned ray executor: #748
- Support data partitioning, checkpointing, and event logging in ray mode.
- Improved fault tolerance, extensibility, observability, flexibility, and processing performance.
🤖 New OPs for embodied AI: improved processing capability to handle camera-view videos.
🧩 Support OP-level isolated environment management in ray mode to help resolve dependency conflicts between different OPs. #892
- Allow merging compatible environments of different OPs that share common dependencies, using different strategies, and reusing the created environments.
- Based on the Ray runtime environment.
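
To make the idea of merging environments across OPs more concrete, here is a minimal sketch, assuming each OP declares its dependencies as pinned {package: version} pairs. The names merge_op_envs and compatible, the greedy strategy, and the example OP names and versions are hypothetical and only for illustration; they are not Data-Juicer's actual implementation or API.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: each OP declares {package: pinned_version}.
Requirements = Dict[str, str]


def compatible(a: Requirements, b: Requirements) -> bool:
    """True if no package shared by a and b is pinned to different versions."""
    return all(b[pkg] == ver for pkg, ver in a.items() if pkg in b)


def merge_op_envs(op_reqs: Dict[str, Requirements]) -> List[Tuple[List[str], Requirements]]:
    """Greedily place each OP into the first environment it does not conflict with."""
    envs: List[Tuple[List[str], Requirements]] = []
    for op_name, reqs in op_reqs.items():
        for ops, merged in envs:
            if compatible(merged, reqs):
                ops.append(op_name)   # reuse the existing environment
                merged.update(reqs)   # and extend it with the new packages
                break
        else:
            # No compatible environment found: create a new isolated one.
            envs.append(([op_name], dict(reqs)))
    return envs


# Made-up OPs and versions, purely for demonstration.
op_reqs = {
    "op_a": {"torch": "2.3.0", "numpy": "1.26.4"},
    "op_b": {"torch": "2.3.0", "opencv-python": "4.10.0"},
    "op_c": {"torch": "2.1.0"},
}
envs = merge_op_envs(op_reqs)
for ops, reqs in envs:
    print(ops, reqs)
```

In this sketch op_a and op_b end up sharing one environment, while op_c gets its own because it pins a different torch version. Each merged requirement set could then be turned into the pip list of a Ray runtime environment, though how Data-Juicer does this internally is not stated in these notes.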

New OPs

video_camera_calibration_static_deepcalib_mapper: Compute the camera intrinsics and field of view (FOV) for a static camera using DeepCalib. #871
video_camera_calibration_static_moge_mapper: Compute the camera intrinsics and field of view (FOV) for a static camera using MoGe-2. #871
video_undistort_mapper: Undistort raw videos with corresponding camera intrinsics and distortion coefficients. #871
video_hand_reconstruction_hawor_mapper: Use HaWoR and MoGe-2 for hand reconstruction. #893
video_camera_pose_mapper: Extract camera poses with MegaSaM and MoGe-2. #894

Enhancements

Allow batch inference for image_captioning_mapper to improve processing performance. #901
Optimize the logic of a branch by avoiding unnecessary function calls. #903
Refactor Operator Search and Metadata Extraction for Enhanced Accuracy. #889
Allow the extract_keyframes func to return meta info only, and remove the sample info from error logs to reduce log size. #904
Reduce the memory usage in convert_to_absolute_paths func by iterating only over the specified columns. #907
Reorganize the main readme and update the tutorials in the playground to the latest version. #908
Optimize issue templates: emphasize English usage and add Q&A Copilot check. #912
Convert abs path for dataset in object store. #913

Fixed Bugs

Fix a bug so that the minhash deduplicator can trace all duplicate items. #906
Fix the "multiple values for num_proc" bug in TextFormatter. #905
Fix the homepage rendering issue and remove outdated OP docs. #910
Fix several bugs in test stability and robustness. #918

Acknowledgement

@dubin555 helps to improve the processing performance of some OPs and funcs. #901 #903
@HunterLine helps to fix a bug in minhash deduplicator to trace all duplicate items. #906

New Contributors

@HunterLine made their first contribution in #906
@Dludora made their first contribution in #907

All Contributors

@HYLcool @dubin555 @claude @Qirui-jiao @cmgzn @Cathy0908 @Dludora @yxdyc @gemini-code-assist @HunterLine @ext.wanghao204 @cyruszhang

Full Changelog: v1.4.6...v1.5.0

02 Feb 12:53

HYLcool

v1.4.6

a1596f9

Release v1.4.6: introduce Q&A Copilot; Video bytes I/O; Tracer for Ray mode

Major Updates

🤖 Our Q&A copilot is introduced to resolve questions from users. Now the robot is available in the docs, DingTalk group, Discord, etc. #891
🎬 I/O for video bytes: support bytes reading/storing for videos. #882
🫆 Tracer for ray mode: the tracer now supports tracing changed samples in ray mode. #885

Enhancements

Prepare a new dockerfile for embodied-AI use cases, and update the cuda/system/... versions of the basic docker image. #887
Add Copilot News & Refined DingTalk link/QR code & Discord link/QR code in the docs. #891
Convert the word retrieval from lists to sets to speed up two OPs. #890
Add a new workflow to automatically fetch the traffic report from GitHub Insights. #899 #900
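
The list-to-set change for word retrieval mentioned above relies on a general Python property: membership tests scan a list linearly but use hashing on a set. The snippet below is a minimal sketch of that effect; the vocabulary, the tokens, and the count_hits helper are made up for illustration and are not taken from the affected OPs.

```python
import timeit

# Hypothetical vocabulary; real OPs would load their own word lists.
words_list = [f"word{i}" for i in range(10_000)]
words_set = set(words_list)


def count_hits(tokens, vocabulary):
    """Count how many tokens appear in the vocabulary."""
    return sum(1 for tok in tokens if tok in vocabulary)


tokens = ["word9999", "missing", "word0"] * 100

# Both containers give the same answer ...
hits_list = count_hits(tokens, words_list)
hits_set = count_hits(tokens, words_set)

# ... but `in` on a list scans element by element, while a set uses hashing.
t_list = timeit.timeit(lambda: count_hits(tokens, words_list), number=3)
t_set = timeit.timeit(lambda: count_hits(tokens, words_set), number=3)
print(f"hits: {hits_list}, list: {t_list:.4f}s, set: {t_set:.4f}s")
```

The exact timings depend on the machine, so none are quoted here; the gap grows with the size of the vocabulary.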

Fixed Bugs

Fix TypeError when using field_types in YAML config for RequiredFieldsValidator. #886
Replace the deprecated concurrency parameter with compute parameter in the ray.data.Dataset.map_batches() call. #888
Prevent divide-by-zero in calculate_ray_np when the Ray cluster is not ready. #864
Add thread limiting for multi-process workloads to prevent over-subscription. #877
Fix the bug where the unittest of standalone mode could be stuck. #896
Update several out-of-date links in the docs. #898
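
For readers unfamiliar with over-subscription: when several worker processes each start a native thread pool sized to the whole machine, the total number of threads can far exceed the number of cores. One common mitigation, sketched below, is to cap the thread-pool environment variables before the workers start. The helper names threads_per_process and limit_threads are hypothetical, and this is not necessarily how the fix in #877 is implemented.

```python
import os


def threads_per_process(total_cores: int, num_proc: int) -> int:
    """Split the available cores evenly, never dropping below one thread."""
    if num_proc <= 0:
        # Guard against a zero or negative process count.
        return max(1, total_cores)
    return max(1, total_cores // num_proc)


def limit_threads(num_proc: int) -> int:
    """Cap common native thread pools; call before spawning worker processes."""
    n = threads_per_process(os.cpu_count() or 1, num_proc)
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(n)
    return n


if __name__ == "__main__":
    print("threads per worker:", limit_threads(num_proc=4))
```

These variables are generally read when the numerical libraries initialize, so in this sketch they would need to be set before those libraries are imported in the workers.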

Acknowledgement

@dubin555 helps to fix several bugs and enhance the processing performance for some OPs. #886 #890
@xyuzh helps to update the ray usage to the latest version in some OPs, fix some bugs and optimize the parallel strategies. #888 #864
@XinyuLiu1999 helps to fix a bug of over-subscription on multi-process workloads. #877

Full Changelog: v1.4.5...v1.4.6

Contributors

dubin555, xyuzh, and XinyuLiu1999

13 Jan 06:36

HYLcool

v1.4.5

923faf3

Release v1.4.5: Embodied-AI OPs; Doc System Upgrading

Major Updates

Add several new OPs for embodied-AI.
Upgrade the documentation system: #842
- Transition the documentation generation and deployment to a unified Sphinx-based framework.
- Architectures and styles are maintained in a separate repo, which is pulled before building the docs of each sub-repo.

New OPs

Mapper

video_captioning_from_vlm_mapper: generate video captions from latest VLMs (e.g., Qwen3-VL). #820
video_object_segmenting_mapper: perform text-guided semantic segmentation of valid objects throughout the video (using YOLOE and SAM2), with support for saving segmentation visualization results. #801
video_depth_estimation_mapper: perform depth estimation on the video, with support for saving both visualization results and point cloud data. #801
image_mmpose_mapper: perform human keypoint detection inference using MMPose models. #800
image_tagging_vlm_mapper: generate image tags with VLMs. #800
image_sam_3d_body_mapper: perform single-image full-body 3D human mesh recovery (HMR) with the promptable model SAM 3D Body (3DB). #843
s3_download_file_mapper: download files from S3 to local files or load them into memory. #839
s3_upload_file_mapper: upload local files to S3 and update paths to S3 URLs. #839

Filter

text_tagging_by_prompt_mapper: generate text tags using prompt with LLM. #408
image_subplot_filter: detect and remove samples with images containing subplots. #840 #822
video_motion_score_ptlflow_filter: a new motion score filter where the optical flows are computed by the ptlflow library. #824

Deduplicator

document_minhash_deduplicator_with_uid: a more robust version of document_minhash_deduplicator for datasets with unique ID for each sample. #832 #677
ray_bts_minhash_deduplicator_with_uid: a more robust version of ray_bts_minhash_deduplicator for datasets with unique ID for each sample. #832 #677
ray_bts_minhash_cpp_deduplicator: enhance the performance of the basic BTS MinHash deduplicator by migrating its computationally intensive parts to C++ and Cython. #851

Pipeline

A new type of OP that allows combining multiple OPs into one pipeline, or integrating a whole pipeline that is not easy to split into multiple atomic OPs. #835

ray_vllm_engine_pipeline: basic OP for making use of the vLLM engine of Ray.
llm_ray_vllm_engine_pipeline: generate response with LLMs using vLLM engine on Ray.
vlm_ray_vllm_engine_pipeline: generate response with VLMs using vLLM engine on Ray.

Enhancements

Several major dependencies of data-juicer are updated to the (nearly) latest version. #820
Rename and align some core arguments of base OP about resource allocation to the ones in Ray. #837
Refine the badges on the homepage to enhance user experience. #841
Support specifying extra ffmpeg arguments for some video OPs. #847
Update the "Used by & Valuable Feedback from" list to add new customers/users of Data-Juicer. #852
Improve Ray-based deduplicators by lazily initializing actors, allowing the cluster to autoscale before actors consume resources. #855
Improve the RayS3DataLoadStrategy class to provide better format detection, more informative logging, and support for loading multiple files from S3 directories in Ray mode. #860
Support OpenAI Responses API. #856
Use consistent key naming and allow including extra fields in the tracer output with the trace_keys arg. #873 #874
Support bytes data as input and add the auto_op_parallelism parameter to control whether to enable automatic calculation of OP parallelism. #867
Support saving the optical flows computed in the OP. #824
Update the base image of official data-juicer docker image to cuda12.6.3 and ubuntu 24.04; update the python version to py311; install several embodied-ai-related packages. #881
Add memory reservation parameters for Ray minhash deduplication to allow users to reserve memory for actors and tasks. #863

Fixed Bugs

Fix the issue where export_aws_credentials cannot be read properly in RayExecutor when using S3 export paths. #834
Change the logger level of import timing from info to debug. #859
Cache the dataset columns once at the start of process() and pass the cached set through the operator pipeline to fix the issue where Ray's Dataset.columns() breaks streaming pipelines when called repeatedly during operator processing. #854
Fix out-of-date URLs of PAI demos. #865
Limit versions of several dependencies to avoid new issues from their latest versions. #876
Use mount disk to avoid "No space" error on the / path when building docs. #857
Use persist to avoid OOM in distributed deduplication of pyspark version. #836 #586
Fix the issue where a change in the number of OPs could not trigger the OP doc building hook. #824

Acknowledgement

@kyo-tom helps fix several bugs and enhance the functions of I/O, and implement 2 new OPs for s3. #834 #860 #839
@xyuzh helps fix and enhance the core parts of ray distributed mode, drawing on professional expertise from the Ray repo. #859 #855 #854 #863
@JohnGiorgi helps to support new type of OpenAI API and enhance the tracer. #856 #873 #874
@coolderli helps to optimize the spark distributed deduplication to avoid OOM. #586
@ZiyiTsang helps to implement a new OP image_subplot_filter. #822

Full Changelog: v1.4.4...v1.4.5

Contributors

JohnGiorgi, kyo-tom, and 3 other contributors

01 Dec 04:06

HYLcool

v1.4.4

deb99e5

Release v1.4.4: NeurIPS 2025 Spotlight; New Video & Multimodal Ops; Repo Reorganization; S3 I/O Support

Major Updates

🎉 Update NeurIPS 2025 News: our Data-Juicer 2.0 paper is accepted as a NeurIPS'25 Spotlight (top 3.1% of all submissions)! And our two other works are also accepted by NeurIPS'25. #788
🧩 The sandbox component, data-juicer recipes, and data-juicer agents have been officially split from the main repository as data-juicer-sandbox/hub/agents respectively, to enable independent development and faster iteration. #817 #827 #830
🤝 S3 I/O support: Added S3 support in data loader and exporter for seamless cloud storage integration. #806

New OPs

detect_main_character_mapper: Extract all main character names based on the given image and its caption. #795
detect_character_locations_mapper: Given an image and a list of main character names, extract the bounding boxes for each present character. (YOLOE + MLLM) #795
detect_character_attributes_mapper: Takes an image, a caption, and main character names as input to extract the characters' attributes. #795
vggt_mapper: Input a video of a single scene, and use VGGT to extract information including Camera Pose, Depth Maps, Point Maps, and 3D Point Tracks. #804
video_whole_body_pose_estimation_mapper: Input a video containing people, and use the DWPose model to extract the body, hand, feet, and face keypoints of the human subjects in the video, i.e., 2D Whole-body Pose Estimation. #812
video_hand_reconstruction_mapper: Use the WiLoR model for hand localization and reconstruction. #818

Enhancements

Enhanced documentation for operator details, significantly expanding coverage of effect demonstrations and usage examples, and improved homepage styling for better readability. #778 #819
Added notebook detection and auto-redirect in logger setup for better user experience in Jupyter environments. #790
Optimized the build_op_doc hook for more reliable documentation generation. #794
Improved auto num_proc calculation in Ray mode for better resource utilization across operators. #789 #825
Enabled support for videos and audios in WebDataset I/O, expanding multimodal data handling capabilities. #803
Updated repository URLs and links across the project for consistency and correctness. #805
Added support for FFmpeg and Decord backends in video data processing, improving flexibility and performance. #826 #829
Added an MCP server CLI entry point to facilitate modular service deployment and updated the MCP documentation. #798

Fixed Bugs

Fixed the Auto Prompt pipeline in sandbox to restore correct prompt generation behavior. #791
Fixed a Ray connection error by properly passing the config parameter through resource utility functions. #808
Fixed several CUDA-based operators to use internal resource monitor. #809
Fixed custom op module loading issues and optimized video_extract_frames_mapper for saving extracted frames. #803
Reset num_proc for vLLM and set default batch_size to 10 for CUDA operators to improve stability. #814
Fixed Sphinx autodoc compatibility issue in the SpecialTokens metaclass to restore documentation build. #816
Resolved a bug in trace_filter by excluding the __dj_stats__ column during dataset comparison. #828
Fix several typos in video_split_by_scene_mapper. #744

Acknowledgement

@kyo-tom helps to fix the ray connection error in #808
@liuyuhanalex helps to fix several small typos in #744

Full Changelog: v1.4.3...v1.4.4

Contributors

liuyuhanalex and kyo-tom

11 Sep 09:11

HYLcool

v1.4.3

613882b

Release v1.4.3: OP Doc Enhancement; Optimized Auto Parallelism; Optimized Sandbox

Major Updates

🤝 OP Document Updates: Optimized multi-version docs; Doc strings are rewritten and enhanced by qwen-max. #755 #765 #768 #769 #787
💪🏻 Auto Parallelism Optimization: support cpu/gpu/mem requirement specification for each OP; optimize calculate_np for ray mode. #679 #774 #782 #786
🛠️ Sandbox Optimization: support iterative pipelines and early-stop targets; refactor the context infos; a new example on auto prompt optimization and several related hooks are added. #757
📈 Upgrade spacy from 3.8.0 to 3.8.7 because the previous version was yanked. #763

New OPs

image_detection_yolo_mapper: perform object detection (with YOLO) on images and return the bounding box values and class labels. #764
optimize_prompt_mapper: optimize prompts based on the existing ones. #757

Enhancements

Support shard_size and extra args for write methods in export_extra_args for RayExporter. #739
Support min/max_closed_interval args to control filtering with open/closed intervals and reversed_range arg to allow keeping samples outside a specified range for Filters. #741
Support API models for existing optimize_qa_mapper. #771
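
The interplay of the min/max_closed_interval and reversed_range args listed above can be illustrated with a small sketch. The function keep_sample and its signature are hypothetical; only the three argument names come from these notes, and the default values shown are an assumption.

```python
def keep_sample(value, min_val, max_val,
                min_closed_interval=True,
                max_closed_interval=True,
                reversed_range=False):
    """Return True if the sample should be kept (illustrative logic only)."""
    # A closed bound includes the endpoint; an open bound excludes it.
    above_min = value >= min_val if min_closed_interval else value > min_val
    below_max = value <= max_val if max_closed_interval else value < max_val
    in_range = above_min and below_max
    # reversed_range keeps the samples that fall outside the range instead.
    return not in_range if reversed_range else in_range


print(keep_sample(5, 5, 10))                             # closed lower bound
print(keep_sample(5, 5, 10, min_closed_interval=False))  # open lower bound
print(keep_sample(3, 5, 10, reversed_range=True))        # outside the range
```

With these assumptions, a value equal to the lower bound is kept for a closed interval and dropped for an open one, and reversed_range simply inverts the in-range decision.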

Fixed Bugs

Fix and re-enable the disabled op_list_to_trace argument. #766
Add missing skip tag to several API-based test cases for forked repos. #767
Limit the version of transformers to "<4.55.0" to avoid computing on None value. #781
Fix out-of-date invoking methods in several tools. #785 (from issue #750)
Fix 500 error in API service. #785 (from issue #777)
Remove specified_xxx_filter from NON_STATS_FILTER. #785 (from issue #783)

Full Changelog: v1.4.2...v1.4.3

18 Aug 03:22

HYLcool

v1.4.2

14f6594

Release v1.4.2: Python > 3.10 are supported; Data Attribution OPs; External OPs are supported; Install with "uv"

Major Updates

💪🏻 Data-Juicer now is compatible with Python 3.11 & 3.12. #749
🧩 5 OPs for data attribution are added. #735
🤝 Data-Juicer now supports registering and applying custom OPs from external paths using the argument custom_operator_paths. #758
🔧 "uv" is now the first choice for installing Data-Juicer because of its ability to resolve dependency conflicts. #760

New Operators

Filter

Validation-free
- llm_perplexity_filter: Filter to keep samples with perplexity score, computed using a specified llm, within a specific range. #735
- instruction_following_difficulty_filter: Filter to keep texts whose instruction-following difficulty (IFD, https://arxiv.org/abs/2308.12032) falls within a specific range. #735
Validation-based
- in_context_influence_filter: Filter to keep texts whose in-context influence on the validation set falls within a specific range. #735
- llm_task_relevance_filter: Filter to keep samples with a high relevance score to validation tasks, as estimated by an LLM. #735
- text_embd_similarity_filter: Filter to keep texts whose average embedding similarity to a set of given validation texts falls within a specific range. #735
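
As an illustration of the range-based filtering used by the validation-based filters above, here is a minimal sketch of the text_embd_similarity_filter idea: average the similarity between one text embedding and a set of validation embeddings, then keep the sample if the average falls within a range. Using cosine similarity is an assumption, as are the helper names cosine, avg_similarity, and keep and the toy 2-D embeddings; the real OP may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def avg_similarity(text_emb, valid_embs):
    """Average similarity of one text embedding to all validation embeddings."""
    return sum(cosine(text_emb, v) for v in valid_embs) / len(valid_embs)


def keep(text_emb, valid_embs, min_score, max_score):
    """Keep the sample if its average similarity lies within [min_score, max_score]."""
    return min_score <= avg_similarity(text_emb, valid_embs) <= max_score


# Toy 2-D embeddings, purely for demonstration.
valid_embs = [[1.0, 0.0], [0.0, 1.0]]
print(avg_similarity([1.0, 0.0], valid_embs))
print(keep([1.0, 0.0], valid_embs, 0.4, 0.9))
print(keep([-1.0, 0.0], valid_embs, 0.4, 0.9))
```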

Enhancements

A new environment variable DATA_JUICER_EXTERNAL_MODELS_HOME is added to allow specifying private or read-only paths for storing external and extra models. #740
Optimize the video link transformation and multi-version maintenance in the docs. Update demo videos with higher-resolution versions. #746
Support custom save_dir for OPs that produce extra multimodal data. #751
Add official and detailed docs about Data-Juicer Agent. #759
Enhance unit tests: show the name of the current test case; recycle resources after each test case in ray mode. #749
Refine the developer guide for better practice on building new OPs. #760

Bugs Fixed

Move the update of special tokens for multimodal data into the initialization of base_op, which fixes the bug where special tokens might not be synced with the main process when processing data in parallel. #752
Fix some test cases. #754

Acknowledgement

@ShenQianli made their first contribution to 5 new OPs. #735

Full Changelog: v1.4.1...v1.4.2

Contributors

ShenQianli

16 Jul 13:05

HYLcool

v1.4.1

7505686

Release v1.4.1: MCP server; GPU-based Minhash deduplicator; Improved unit test coverage.

Major Updates

🔧 Introduce Data-Juicer MCP server. Users can make use of the data processing capabilities in the MCP way conveniently. #690 #737
💪🏻 Unit test coverage rate is improved to 85%+ and several bugs in test cases are resolved (OOM, encoding error, and so on), which makes Data-Juicer more reliable. #698 #717 #720 #727
🤝 Minhash deduplication based on GPU is supported, in collaboration with developers from NVIDIA. #694 #644
🧩 RayExporter supports more formats to export a ray dataset in addition to json/jsonl. #687
🎥 Two demo videos are added to introduce the Data-Juicer core functions, agentic usages, and sandbox. #738

New Operators

download_file_mapper downloads data from URLs to local files or specified fields. #709

Enhancements

New analysis method: correlation analysis among stats is added. #663
Several core dependencies are updated and fixed to a newer version, and dependency conflicts are resolved. #715 #717 #723
The EasyAnimate pipelines in the sandbox are updated to follow the refactoring of sandbox. #710
Apply more reliable pre-commit tools to improve the code style of Data-Juicer. #714
Support storing and processing image bytes data in the dataset. #725

Bugs Fixed

The wheel & docker image building bug is fixed. #706
Fix bugs in log_summarization. #710
Fix "no module named data_juicer" error after installing from the wheel file. #727

Acknowledgement

@fanronghai helps to fix the param error in dataset_splitting_by_language tool. #713
@ayushdg helps to support a GPU-version Minhash deduplicator. #644
@ricksun2023 helps to fix the bugs that occur when more than one OP with the same name appears in the configs. #730

Full Changelog: v1.4.0...v1.4.1

Contributors

ayushdg, fanronghai, and ricksun2023

13 Jun 11:43

yxdyc

v1.4.0

714df97

v1.4.0 Major Refactor for Env Management, Doc, Sandbox; Derivative Works (TPAMI Survey; Trinity-RFT & DetailMaster)

Summarization: 200+ files changed with 18,535 additions and 3,720 deletions.

🔧 Major Refactors & Improvements

🔄 Sandbox Usability (#686):
- Support for multiple pipelines, context info, and an environment manager to run different commands in various environments.
- Includes the InternVL example as a showcase.
📘 DJ-Doc Redesign (#675):
- Now with multilingual support (English / Chinese) and a modernized style.
📦 Dependency Management Update (#660, #680):
- Migrated to uv for faster dependency resolution.
- Added sub-groups for better organization.

🌍 New Features & Integrations (#683, #688, #692)

🆕 Additional Repo Supported:
- Trinity-RFT now supported by Data-Juicer.
📜 DJ-Awesome-List:
- A survey paper accepted by TPAMI'25!
🧪 Synthetic Benchmark Added:
- DetailMaster – a new benchmark for synthetic data evaluation.
🛠️ New Operators Introduced (#673, #701):
- llm_analysis_filter
- general_field_filter

🚀 Core Optimizations & Bug Fixes

✅ Ray Executor Enhancements (#697):
- File extension detection added.
- Support for more data formats.
⏱️ Startup Time Optimization:
- Improved startup performance. (#684)
🧠 Text Embedding Support:
- Added support for text embedding via API and local model. (#681)
🐳 Docker Build Improvement:
- Ignore installed distutils libraries during Docker image building. (#668)
🛠️ Mapper Module Fix:
- Fixed issue with module initialization. (#700)
🗑️ Warning Suppression:
- Suppressed unnecessary warnings from fasttext. (#696)

📚 Full Changelog

View all changes since v1.3.3 →

09 May 10:20

HYLcool

v1.3.3

444537e

Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes.

Major Updates

🎉 Our work of Data-Juicer Sandbox has been accepted as a Spotlight by ICML 2025 (top 2.6% of all submissions)!
Add new OPs and recipes for Img-Diff. #658

Enhancements

Support HF llm for two llm_xxx_score_filter OPs. #655
Sync the docker image to Aliyun OSS for downloading when Docker Hub is not accessible. #657
Split standalone and distributed unit tests to save time when re-running failed ones. #666

Bugs Fixed

Address possibly missing cfg in unify_format. #653
Improve clarity & fix bad links for some docs. #659

Acknowledgement

@co63oc helps to fix some typos. #654

Full Changelog: v1.3.2...v1.3.3

Contributors

co63oc

25 Apr 11:17

yxdyc

v1.3.2

2172698

Release v1.3.2: Enhancements on usability & two OPs; some bugs fixes

What's Changed

Human OP enhancements, in #642 #645
- update label-studio version
- make service script more robust
- add documentation
- optimizing fields mapping
OP efficiency optimization of document_minhash_deduplicator, in #639
Set temp_parser.usage to argparse.SUPPRESS to skip excessive help logs, in #643
Fix date typo in #648
Fix docker building failure in #650
Fix StreamToLoguru compatibility issue with torch._dynamo in #651
add init file for annotation module, fix dj-process command error in #652

New Contributor

@cmgzn made their first contribution in #651

Contributors

cmgzn

Releases: datajuicer/data-juicer
