Releases: datajuicer/data-juicer
Release v1.5.0: Partitioned Ray Executor; Embodied-AI OPs; OP-level Env Management
Major Updates
- 📊 Stats: 244 files changed with 22,394 additions and 2,053 deletions, from 12 contributors
- 🗂️ New partitioned ray executor: #748
  - Support data partitioning, checkpointing, and event logging in ray mode.
  - Improved fault tolerance, extensibility, observability, flexibility, and processing performance.
- 🤖 New OPs for embodied AI: improved processing capability to handle camera-view videos.
- 🧩 Support maintaining OP-level isolated environments in ray mode to help resolve dependency conflicts between different OPs. #892
  - Environments from different OPs that share common dependencies can be merged under different strategies, and the created environments are reused.
  - Based on the ray runtime environment.
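The environment-merging idea above can be illustrated with a minimal sketch. It is not Data-Juicer's implementation: the greedy "first compatible environment" strategy, the function names, and the example package pins are all assumptions made for illustration. The only external fact relied on is that Ray's `runtime_env` accepts a `pip` field holding a list of requirement strings.

```python
from typing import Dict, List

# Hypothetical per-OP requirements: package name -> exact pinned version.
OpReqs = Dict[str, str]


def compatible(a: OpReqs, b: OpReqs) -> bool:
    """True if every package required by both sides is pinned to the same version."""
    return all(a[pkg] == b[pkg] for pkg in a.keys() & b.keys())


def merge_environments(op_reqs: Dict[str, OpReqs]) -> List[dict]:
    """Greedily place each OP into the first environment it is compatible with."""
    envs: List[dict] = []
    for op_name, reqs in op_reqs.items():
        for env in envs:
            if compatible(env["reqs"], reqs):
                env["reqs"].update(reqs)  # reuse the existing environment
                env["ops"].append(op_name)
                break
        else:
            # No compatible environment yet: create a new isolated one.
            envs.append({"reqs": dict(reqs), "ops": [op_name]})
    return envs


def to_runtime_env(reqs: OpReqs) -> dict:
    """Build a dict in the shape of Ray's runtime_env `pip` field."""
    return {"pip": [f"{pkg}=={ver}" for pkg, ver in sorted(reqs.items())]}


# Illustrative OP names and pins (assumptions, not taken from the release).
ops = {
    "op_a": {"torch": "2.1.0", "numpy": "1.26.4"},
    "op_b": {"torch": "2.1.0", "opencv-python": "4.9.0.80"},
    "op_c": {"torch": "1.13.1"},
}
for env in merge_environments(ops):
    print(env["ops"], to_runtime_env(env["reqs"]))
```

Here `op_a` and `op_b` agree on their shared pin and end up in one environment, while `op_c` pins a conflicting version and gets its own.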
New OPs
- `video_camera_calibration_static_deepcalib_mapper`: Compute the camera intrinsics and field of view (FOV) for a static camera using DeepCalib. #871
- `video_camera_calibration_static_moge_mapper`: Compute the camera intrinsics and field of view (FOV) for a static camera using MoGe-2. #871
- `video_undistort_mapper`: Undistort raw videos with corresponding camera intrinsics and distortion coefficients. #871
- `video_hand_reconstruction_hawor_mapper`: Use HaWoR and MoGe-2 for hand reconstruction. #893
- `video_camera_pose_mapper`: Extract camera poses with MegaSaM and MoGe-2. #894
Enhancements
- Allow batch inference for `image_captioning_mapper` to improve processing performance. #901
- Optimize the logic of a branch by avoiding unnecessary function calls. #903
- Refactor Operator Search and Metadata Extraction for Enhanced Accuracy. #889
- Allow the extract_keyframes func to return meta infos only, and remove the sample info from error logs to reduce log size. #904
- Reduce the memory usage in convert_to_absolute_paths func by iterating only over the specified columns. #907
- Reorganize the main readme and update the tutorials in the playground to the latest version. #908
- Optimize issue templates: emphasize English usage and add Q&A Copilot check. #912
- Convert abs path for dataset in object store. #913
Fixed Bugs
- Fix a bug so that the minhash deduplicator can trace all duplicate items. #906
- Fix the "multiple values for num_proc" bug in TextFormatter. #905
- Fix the homepage rendering issue and remove outdated OP docs. #910
- Fix several bugs in test stability and robustness. #918
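To make "tracing all duplicate items" concrete, here is a minimal sketch of one common way to group duplicates transitively: a union-find over candidate duplicate pairs. This is an illustration under stated assumptions, not the deduplicator's actual code; the function name, the integer sample ids, and the choice to keep the smallest id in each cluster are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def trace_duplicates(pairs: Iterable[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Map each kept sample id to every sample id found to duplicate it."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Assumption: the smallest id in a cluster is the one that is kept.
            parent[max(ra, rb)] = min(ra, rb)

    clusters: Dict[int, List[int]] = defaultdict(list)
    for x in list(parent):
        root = find(x)
        if x != root:
            clusters[root].append(x)
    return {kept: sorted(dups) for kept, dups in clusters.items()}


# Samples 1~2 and 2~3 were flagged as near-duplicates, as were 5~6.
print(trace_duplicates([(1, 2), (2, 3), (5, 6)]))
```

Because the grouping is transitive, sample 3 is traced back to sample 1 even though the pair (1, 3) was never reported directly.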
Acknowledgement
- @dubin555 helps to improve the processing performance of some OPs and funcs. #901 #903
- @HunterLine helps to fix a bug in minhash deduplicator to trace all duplicate items. #906
New Contributors
- @HunterLine made their first contribution in #906
- @Dludora made their first contribution in #907
All Contributors
@HYLcool @dubin555 @claude @Qirui-jiao @cmgzn @Cathy0908 @Dludora @yxdyc @gemini-code-assist @HunterLine @ext.wanghao204 @cyruszhang
Full Changelog: v1.4.6...v1.5.0
Release v1.4.6: introduce Q&A Copilot; Video bytes I/O; Tracer for Ray mode
Major Updates
- 🤖 Our Q&A copilot is introduced to resolve questions from users. Now the robot is available in the docs, DingTalk group, Discord, etc. #891
- 🎬 I/O for video bytes: support bytes reading/storing for videos. #882
- Tracer for ray mode: the tracer now supports tracing changed samples in ray mode. #885
Enhancements
- Prepare a new dockerfile for use case of embodied AI, and update the cuda/system/... versions of the basic docker image. #887
- Add Copilot News & Refined DingTalk link/QR code & Discord link/QR code in the docs. #891
- Convert the word retrieval from lists to sets to speed up two OPs. #890
- Add a new workflow to automatically fetch the traffic report from GitHub Insights. #899 #900
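The list-to-set change listed above speeds up membership tests, since looking up an item in a Python `set` is constant time on average while a `list` lookup scans linearly. A minimal sketch follows; the function name and the flagged-word example are hypothetical and do not correspond to the two OPs touched by #890.

```python
from typing import Iterable, List


def count_flagged(tokens: List[str], flagged_words: Iterable[str]) -> int:
    """Count tokens that appear in the flagged-word collection."""
    flagged = set(flagged_words)  # build once; each `in` check is O(1) on average
    return sum(1 for token in tokens if token in flagged)


tokens = ["a", "b", "a", "c"]
print(count_flagged(tokens, ["a", "c"]))
```

With a list, each `token in flagged_words` check would cost time proportional to the vocabulary size, which adds up for large word lists and long documents.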
Fixed Bugs
- Fix `TypeError` when using `field_types` in YAML config for `RequiredFieldsValidator`. #886
- Replace the deprecated `concurrency` parameter with `compute` parameter in the ray.data.Dataset.map_batches() call. #888
- Prevent divide-by-zero in calculate_ray_np when the Ray cluster is not ready. #864
- Add thread limiting for multi-process workloads to prevent over-subscription. #877
- Fix the bug where the unittest of standalone mode could be stuck. #896
- Update several out-of-date links in the docs. #898
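As a rough illustration of thread limiting for multi-process workloads, the sketch below splits the available CPU cores across worker processes and expresses the result as the environment variables that OpenMP, MKL, and OpenBLAS read. The helper name and the even-split policy are assumptions, not the logic from #877; the guard against non-positive inputs also shows one generic way to avoid a divide-by-zero when resources are not yet known.

```python
import os
from typing import Dict

# Environment variables honored by common numerical backends.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def thread_limit_env(num_proc: int, cpu_count: int) -> Dict[str, str]:
    """Return per-process thread limits so that num_proc workers share cpu_count cores."""
    if num_proc <= 0 or cpu_count <= 0:
        threads = 1  # fall back instead of dividing by zero
    else:
        threads = max(1, cpu_count // num_proc)
    return {name: str(threads) for name in THREAD_ENV_VARS}


limits = thread_limit_env(num_proc=4, cpu_count=os.cpu_count() or 1)
print(limits)
```

To take effect, such limits would typically be applied with `os.environ.update(limits)` before the numerical libraries are imported in each worker.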
Acknowledgement
- @dubin555 helps to fix several bugs and enhance the processing performance for some OPs. #886 #890
- @xyuzh helps to update the ray usage to the latest version in some OPs, fix some bugs and optimize the parallel strategies. #888 #864
- @XinyuLiu1999 helps to fix a bug of over-subscription on multi-process workloads. #877
Full Changelog: v1.4.5...v1.4.6
Release v1.4.5: Embodied-AI OPs; Doc System Upgrading
Major Updates
- Add several new OPs for embodied-AI.
- Upgrade the documentation system: #842
  - Transition the documentation generation and deployment to a unified Sphinx-based framework.
  - Architectures and styles are maintained in an isolated repo, which is pulled before building the docs of each sub-repo.
New OPs
Mapper
- `video_captioning_from_vlm_mapper`: generate video captions from latest VLMs (e.g., Qwen3-VL). #820
- `video_object_segmenting_mapper`: perform text-guided semantic segmentation of valid objects throughout the video (using YOLOE and SAM2), with support for saving segmentation visualization results. #801
- `video_depth_estimation_mapper`: perform depth estimation on the video, with support for saving both visualization results and point cloud data. #801
- `image_mmpose_mapper`: perform human keypoint detection inference using MMPose models. #800
- `image_tagging_vlm_mapper`: generates image tags with VLMs. #800
- `image_sam_3d_body_mapper`: perform single-image full-body 3D human mesh recovery (HMR) with the promptable model SAM 3D Body (3DB). #843
- `s3_download_file_mapper`: download files from S3 to local files or load them into memory. #839
- `s3_upload_file_mapper`: upload local files to S3 and update paths to S3 URLs. #839
Filter
- `text_tagging_by_prompt_mapper`: generate text tags using prompt with LLM. #408
- `image_subplot_filter`: detect and remove samples with images containing subplots. #840 #822
- `video_motion_score_ptlflow_filter`: a new motion score filter where the optical flows are computed by the ptlflow library. #824
Deduplicator
- `document_minhash_deduplicator_with_uid`: a more robust version of `document_minhash_deduplicator` for datasets with unique ID for each sample. #832 #677
- `ray_bts_minhash_deduplicator_with_uid`: a more robust version of `ray_bts_minhash_deduplicator` for datasets with unique ID for each sample. #832 #677
- `ray_bts_minhash_cpp_deduplicator`: enhance the performance of the basic BTS MinHash deduplicator by migrating its computationally intensive parts to C++ and Cython. #851
Pipeline
A new type of OP, which allows combining multiple OPs into one pipeline, or integrating a whole pipeline that is not easy to split into multiple atomic OPs. #835
- `ray_vllm_engine_pipeline`: basic OP for making use of the vLLM engine of Ray.
- `llm_ray_vllm_engine_pipeline`: generate response with LLMs using vLLM engine on Ray.
- `vlm_ray_vllm_engine_pipeline`: generate response with VLMs using vLLM engine on Ray.
Enhancements
- Several major dependencies of data-juicer are updated to the (nearly) latest version. #820
- Rename and align some core arguments of base OP about resource allocation to the ones in Ray. #837
- Refine the badges on the homepage to enhance user experience. #841
- Support specifying extra arguments for ffmpeg for some video OPs. #847
- Update `Used by & Valuable Feedback from` list to add new customers/users of Data-Juicer. #852
- Improve Ray-based deduplicators by lazily initializing actors, allowing the cluster to autoscale before actors consume resources. #855
- Improve the `RayS3DataLoadStrategy` class to provide better format detection, more informative logging, and support for loading multiple files from S3 directories in Ray mode. #860
- Support OpenAI Responses API. #856
- Use consistent key naming and allow including extra fields in the tracer output with `trace_keys` arg. #873 #874
- Support bytes data as input and add the `auto_op_parallelism` parameter to control whether to enable automatic calculation of OP parallelism. #867
- Support saving optical flows computed in the OP. #824
- Update the base image of official data-juicer docker image to cuda12.6.3 and ubuntu 24.04; update the python version to py311; install several embodied-ai-related packages. #881
- Add memory reservation parameters for Ray minhash deduplication to allow users to reserve memory for actors and tasks. #863
Fixed Bugs
- Fix the issue where `export_aws_credentials` cannot be read properly in RayExecutor when using S3 export paths. #834
- Change the logger level of import timing from `info` to `debug`. #859
- Cache the dataset columns once at the start of `process()` and pass the cached set through the operator pipeline to fix the issue where Ray's `Dataset.columns()` breaks streaming pipelines when called repeatedly during operator processing. #854
- Fix out-of-date URLs of PAI demos. #865
- Limit versions of several dependencies to avoid new issues from their latest versions. #876
- Use mount disk to avoid "No space" error on the `/` path when building docs. #857
- Use `persist` to avoid OOM in distributed deduplication of pyspark version. #836 #586
- Fix the issue where a change in the number of OPs cannot trigger the OP doc building hook. #824
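The "cache the columns once" fix described in this list follows a general pattern that can be sketched without Ray. In the sketch below, `CountingDataset` is a hypothetical stand-in for a dataset whose `columns()` call is expensive or has side effects, and `run_ops` and `has_text_column` are illustrative names; none of this is the code from #854.

```python
from typing import Callable, Iterable, List, Set


class CountingDataset:
    """Hypothetical stand-in that counts how often columns() is called."""

    def __init__(self, columns: Iterable[str]):
        self._columns = list(columns)
        self.columns_calls = 0

    def columns(self) -> List[str]:
        self.columns_calls += 1
        return list(self._columns)


def run_ops(dataset: CountingDataset, ops: List[Callable[[Set[str]], bool]]) -> List[bool]:
    """Look up the columns once, then hand the cached set to every operator."""
    cached_columns = set(dataset.columns())  # single lookup up front
    return [op(cached_columns) for op in ops]


def has_text_column(columns: Set[str]) -> bool:
    return "text" in columns


ds = CountingDataset(["text", "meta"])
print(run_ops(ds, [has_text_column] * 3), ds.columns_calls)
```

Each operator receives the same cached set, so the dataset is queried once regardless of how many operators run.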
Acknowledgement
- @kyo-tom helps fix several bugs and enhance the functions of I/O, and implement 2 new OPs for s3. #834 #860 #839
- @xyuzh helps fix and enhance the core parts of the ray distributed mode, drawing on expertise from the Ray repo. #859 #855 #854 #863
- @JohnGiorgi helps to support new type of OpenAI API and enhance the tracer. #856 #873 #874
- @coolderli helps to optimize the spark distributed deduplication to avoid OOM. #586
- @ZiyiTsang helps to implement a new OP `image_subplot_filter`. #822
Full Changelog: v1.4.4...v1.4.5
Release v1.4.4: NeurIPS 2025 Spotlight; New Video & Multimodal Ops; Repo Reorganization; S3 I/O Support
Major Updates
- 🎉 Update NeurIPS 2025 News: our Data-Juicer 2.0 paper is accepted as a NeurIPS'25 Spotlight (top 3.1% of all submissions)! And our two other works are also accepted by NeurIPS'25. #788
- 🧩 The sandbox component, data-juicer recipes, and data-juicer agents have been officially split from the main repository as data-juicer-sandbox/hub/agents respectively, to enable independent development and faster iteration. #817 #827 #830
- 🤝 S3 I/O support: Added S3 support in data loader and exporter for seamless cloud storage integration. #806
New OPs
- `detect_main_character_mapper`: Extract all main character names based on the given image and its caption. #795
- `detect_character_locations_mapper`: Given an image and a list of main character names, extract the bounding boxes for each present character. (YOLOE + MLLM) #795
- `detect_character_attributes_mapper`: Takes an image, a caption, and main character names as input to extract the characters' attributes. #795
- `vggt_mapper`: Input a video of a single scene, and use VGGT to extract information including Camera Pose, Depth Maps, Point Maps, and 3D Point Tracks. #804
- `video_whole_body_pose_estimation_mapper`: Input a video containing people, and use the DWPose model to extract the body, hand, feet, and face keypoints of the human subjects in the video, i.e., 2D Whole-body Pose Estimation. #812
- `video_hand_reconstruction_mapper`: Use the WiLoR model for hand localization and reconstruction. #818
Enhancements
- Enhanced documentation for operator details, significantly expanding coverage of effect demonstrations and usage examples, and improved homepage styling for better readability. #778 #819
- Added notebook detection and auto-redirect in logger setup for better user experience in Jupyter environments. #790
- Optimized the build_op_doc hook for more reliable documentation generation. #794
- Improved auto num_proc calculation in Ray mode for better resource utilization across operators. #789 #825
- Enabled support for videos and audios in WebDataset I/O, expanding multimodal data handling capabilities. #803
- Updated repository URLs and links across the project for consistency and correctness. #805
- Added support for FFmpeg and Decord backends in video data processing, improving flexibility and performance. #826 #829
- Added an MCP server CLI entry point to facilitate modular service deployment and updated the MCP documentation. #798
Fixed Bugs
- Fixed the Auto Prompt pipeline in sandbox to restore correct prompt generation behavior. #791
- Fixed a Ray connection error by properly passing the config parameter through resource utility functions. #808
- Fixed several CUDA-based operators to use internal resource monitor. #809
- Fixed custom op module loading issues and optimized video_extract_frames_mapper for saving extracted frames. #803
- Reset num_proc for vLLM and set default batch_size to 10 for CUDA operators to improve stability. #814
- Fixed Sphinx autodoc compatibility issue in the SpecialTokens metaclass to restore documentation build. #816
- Resolved a bug in trace_filter by excluding the `__dj_stats__` column during dataset comparison. #828
- Fix several typos in `video_split_by_scene_mapper`. #744
Acknowledgement
- @kyo-tom helps to fix the ray connection error in #808
- @liuyuhanalex helps to fix several small typos in #744
Full Changelog: v1.4.3...v1.4.4
Release v1.4.3: OP Doc Enhancement; Optimized Auto Parallelism; Optimized Sandbox
Major Updates
- 🤝 OP Document Updates: Optimized multi-version docs; Doc strings are rewritten and enhanced by qwen-max. #755 #765 #768 #769 #787
- 💪🏻 Auto Parallelism Optimization: support cpu/gpu/mem requirement specification for each OP; optimize `calculate_np` for ray mode. #679 #774 #782 #786
- 🛠️ Sandbox Optimization: support iterative pipelines and early-stop targets; refactor the context infos; a new example on auto prompt optimization and several related hooks are added. #757
- 📈 Upgrade spacy from 3.8.0 to 3.8.7 because the previous version was yanked. #763
New OPs
- `image_detection_yolo_mapper`: perform object detection (with YOLO) on images and return the bounding box values and class labels. #764
- `optimize_prompt_mapper`: optimize prompts based on the existing ones. #757
Enhancements
- Support shard_size and extra args for write methods in `export_extra_args` for RayExporter. #739
- Support min/max_closed_interval args to control filtering with open/closed intervals and reversed_range arg to allow keeping samples outside a specified range for Filters. #741
- Support API models for existing `optimize_qa_mapper`. #771
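The interval arguments for Filters mentioned in this list can be read as the following keep/drop rule. This is a minimal sketch of the semantics as described in the note, not the library's code: the argument names `min_closed_interval`, `max_closed_interval`, and `reversed_range` come from the release note, while the function names and the default values are assumptions.

```python
def in_range(value, min_val, max_val, min_closed_interval=True, max_closed_interval=True):
    """Check whether value lies in the interval, with each bound open or closed."""
    lower_ok = value >= min_val if min_closed_interval else value > min_val
    upper_ok = value <= max_val if max_closed_interval else value < max_val
    return lower_ok and upper_ok


def keep_sample(value, min_val, max_val,
                min_closed_interval=True, max_closed_interval=True,
                reversed_range=False):
    """Keep samples inside the interval, or outside it when reversed_range is set."""
    inside = in_range(value, min_val, max_val, min_closed_interval, max_closed_interval)
    return not inside if reversed_range else inside


print(keep_sample(10, 10, 20))                             # closed lower bound
print(keep_sample(10, 10, 20, min_closed_interval=False))  # open lower bound
print(keep_sample(25, 10, 20, reversed_range=True))        # keep values outside the range
```

Under these assumptions, a value equal to the lower bound is kept with a closed bound and dropped with an open one, and `reversed_range=True` inverts the decision.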
Fixed Bugs
- Fix and re-enable the disabled op_list_to_trace argument. #766
- Add missing `skip` tag to several API-based test cases for forked repos. #767
- Limit the version of `transformers` to "<4.55.0" to avoid computing on None value. #781
- Fix out-of-date invoking methods in several tools. #785 (from issue #750)
- Fix 500 error in API service. #785 (from issue #777)
- Remove `specified_xxx_filter` from NON_STATS_FILTER. #785 (from issue #783)
Full Changelog: v1.4.2...v1.4.3
Release v1.4.2: Python > 3.10 are supported; Data Attribution OPs; External OPs are supported; Install with "uv"
Major Updates
- 💪🏻 Data-Juicer now is compatible with Python 3.11 & 3.12. #749
- 🧩 5 OPs for data attribution are added. #735
- 🤝 Data-Juicer now supports registering and applying custom OPs in external paths using the argument `custom_operator_paths`. #758
- 🔧 "uv" is now the first choice for installing Data-Juicer due to its capability to solve dependency conflicts. #760
New Operators
Filter
- Validation-free
  - `llm_perplexity_filter`: Filter to keep samples whose perplexity score, computed using a specified LLM, falls within a specific range. #735
  - `instruction_following_difficulty_filter`: Filter to keep texts whose instruction-following difficulty (IFD, https://arxiv.org/abs/2308.12032) falls within a specific range. #735
- Validation-based
  - `in_context_influence_filter`: Filter to keep texts whose in-context influence upon the validation set falls within a specific range. #735
  - `llm_task_relevance_filter`: Filter to keep samples with a high relevance score to validation tasks estimated by LLM. #735
  - `text_embd_similarity_filter`: Filter to keep texts whose average embedding similarity to a set of given validation texts falls within a specific range. #735
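As an illustration of the idea behind `text_embd_similarity_filter`, the sketch below averages the cosine similarity between one sample embedding and a set of validation embeddings, then keeps the sample if the average falls within a range. It is a simplified stand-in, not the OP's implementation: the function names, the use of cosine similarity, the closed interval, and the toy 2-D vectors are assumptions, and real embeddings would come from an embedding model.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_similarity(sample: Sequence[float], validation: List[Sequence[float]]) -> float:
    """Mean cosine similarity between one sample and all validation embeddings."""
    return sum(cosine(sample, v) for v in validation) / len(validation)


def keep(sample: Sequence[float], validation: List[Sequence[float]],
         min_score: float, max_score: float) -> bool:
    """Keep the sample if its average similarity lies in [min_score, max_score]."""
    return min_score <= average_similarity(sample, validation) <= max_score


validation_embs = [[1.0, 0.0], [0.0, 1.0]]  # toy validation embeddings
print(average_similarity([1.0, 0.0], validation_embs))
print(keep([1.0, 0.0], validation_embs, min_score=0.4, max_score=1.0))
```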
Enhancements
- A new environment variable DATA_JUICER_EXTERNAL_MODELS_HOME is added to allow to specify some private or read-only paths to store external and extra models. #740
- Optimize the video link transformation and multi-version maintenance in the docs. Update demo videos with higher-resolution versions. #746
- Support custom save_dir for OPs that produce extra multimodal data. #751
- Add official and detailed docs about Data-Juicer Agent. #759
- Enhance unit tests: show the name of the current test case; recycle resources after each test case for ray mode. #749
- Refine the developer guide for better practice on building new OPs. #760
Bugs Fixed
- Move the updating of special tokens of multimodal data into the initialization of base_op, which fixes the bug where special tokens might not be synced with the main process when processing data in parallel. #752
- Fix some test cases. #754
Acknowledgement
- @ShenQianli made their first contribution to 5 new OPs. #735
Full Changelog: v1.4.1...v1.4.2
Release v1.4.1: MCP server; GPU-based Minhash deduplicator; Improved unit test coverage.
Major Updates
- 🔧 Introduce Data-Juicer MCP server. Users can make use of the data processing capabilities in the MCP way conveniently. #690 #737
- 💪🏻 Unit test coverage rate is improved to 85%+ and several bugs in test cases are resolved (OOM, encoding error, and so on), which makes Data-Juicer more reliable. #698 #717 #720 #727
- 🤝 Minhash deduplication based on GPU is supported, collaborated with developers from Nvidia. #694 #644
- 🧩 RayExporter supports more formats to export a ray dataset in addition to json/jsonl. #687
- 🎥 Two demo videos are added to introduce the Data-Juicer core functions, agentic usages, and sandbox. #738
New Operators
- `download_file_mapper` downloads data from URLs to local files or specified fields. #709
Enhancements
- New analysis method: correlation analysis among stats is added. #663
- Several core dependencies are updated and fixed to a newer version, and dependency conflicts are resolved. #715 #717 #723
- The EasyAnimate pipelines in the sandbox are updated to follow the refactoring of sandbox. #710
- Apply more reliable pre-commit tools to improve the code style of Data-Juicer. #714
- Support storing and processing bytes data of images in the dataset. #725
Bugs Fixed
- The wheel & docker image building bug is fixed. #706
- Fix bugs in log_summarization. #710
- Fix "no module named data_juicer" error after installing from the wheel file. #727
Acknowledgement
- @fanronghai helps to fix the param error in dataset_splitting_by_language tool. #713
- @ayushdg helps to support a GPU-version Minhash deduplicator. #644
- @ricksun2023 helps to fix the bugs when there are more than one same-name OPs in the configs. #730
Full Changelog: v1.4.0...v1.4.1
v1.4.0 Major Refactor for Env Management, Doc, Sandbox; Derivative Works (TPAMI Survey; Trinity-RFT & DetailMaster)
Summarization: 200+ files changed with 18,535 additions and 3,720 deletions.
🔧 Major Refactors & Improvements
- 🔄 Sandbox Usability (#686):
  - Support for multiple pipelines, context info, and an environment manager to run different commands in various environments.
  - Includes the InternVL example as a showcase.
- 📘 DJ-Doc Redesign (#675):
  - Now with multilingual support (English / Chinese) and a modernized style.
- 📦 Dependency Management Update (#660, #680):
  - Migrated to `uv` for faster dependency resolution.
  - Added sub-groups for better organization.
🌍 New Features & Integrations (#683, #688, #692)
- 🆕 Additional Repo Supported:
  - Trinity-RFT now supported by Data-Juicer.
- 📜 DJ-Awesome-List:
  - A survey paper accepted by TPAMI'25!
- 🧪 Synthetic Benchmark Added:
  - DetailMaster: a new benchmark for synthetic data evaluation.
- 🛠️ New Operators Introduced (#673, #701):
  - `llm_analysis_filter`
  - `general_field_filter`
🚀 Core Optimizations & Bug Fixes
- ✅ Ray Executor Enhancements (#697):
  - File extension detection added.
  - Support for more data formats.
- ⏱️ Startup Time Optimization:
  - Improved startup performance. (#684)
- 🧠 Text Embedding Support:
  - Added support for text embedding via API and local model. (#681)
- 🐳 Docker Build Improvement:
  - Ignore installed `distutils` libraries during Docker image building. (#668)
- 🛠️ Mapper Module Fix:
  - Fixed issue with module initialization. (#700)
- 🗑️ Warning Suppression:
  - Suppressed unnecessary warnings from fasttext. (#696)
📚 Full Changelog
Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes.
Major Updates
- 🎉 Our work of Data-Juicer Sandbox has been accepted as a Spotlight by ICML 2025 (top 2.6% of all submissions)!
- Add new OPs and recipes for Img-Diff. #658
Enhancements
- Support HF llm for two llm_xxx_score_filter OPs. #655
- Sync docker image to Aliyun OSS for downloading if docker hub is not accessed. #657
- Split standalone and distributed unit tests to save time when re-running failed ones. #666
Bugs Fixed
- Address possibly missing cfg in `unify_format`. #653
- Improve clarity & fix bad links for some docs. #659
Acknowledgement
Full Changelog: v1.3.2...v1.3.3
Release v1.3.2: Enhancements on usability & two OPs; some bugs fixes
What's Changed
- Human OP enhancements, in #642 #645
  - update label-studio version
  - make service script more robust
  - add documentation
  - optimizing fields mapping
- OP efficiency optimization of `document_minhash_deduplicator`, in #639
- set temp_parser.usage to argparse.SUPPRESS, skip too much help log in #643
- fix date typo in #648
- Fix docker building failure in #650
- Fix StreamToLoguru compatibility issue with torch._dynamo in #651
- add init file for annotation module, fix dj-process command error in #652