Release v1.4.5: Embodied-AI OPs; Doc System Upgrading
Major Updates
- Add several new OPs for embodied-AI.
- Upgrade the documentation system: #842
  - Transition documentation generation and deployment to a unified Sphinx-based framework.
  - The architecture and styles are maintained in a separate repo, which is pulled before building the docs of each sub-repo.
New OPs
Mapper
- video_captioning_from_vlm_mapper: generate video captions from latest VLMs (e.g., Qwen3-VL). #820
- video_object_segmenting_mapper: perform text-guided semantic segmentation of valid objects throughout the video (using YOLOE and SAM2), with support for saving segmentation visualization results. #801
- video_depth_estimation_mapper: perform depth estimation on the video, with support for saving both visualization results and point cloud data. #801
- image_mmpose_mapper: perform human keypoint detection inference using MMPose models. #800
- image_tagging_vlm_mapper: generates image tags with VLMs. #800
- image_sam_3d_body_mapper: perform single-image full-body 3D human mesh recovery (HMR) with the promptable model SAM 3D Body (3DB). #843
- s3_download_file_mapper: download files from S3 to local files or load them into memory. #839
- s3_upload_file_mapper: upload local files to S3 and update paths to S3 URLs. #839
Filter
- text_tagging_by_prompt_mapper: generate text tags using prompt with LLM. #408
- image_subplot_filter: detect and remove samples with images containing subplots. #840 #822
- video_motion_score_ptlflow_filter: a new motion score filter where the optical flows are computed by the ptlflow library. #824
Deduplicator
- document_minhash_deduplicator_with_uid: a more robust version of document_minhash_deduplicator for datasets with unique ID for each sample. #832 #677
- ray_bts_minhash_deduplicator_with_uid: a more robust version of ray_bts_minhash_deduplicator for datasets with unique ID for each sample. #832 #677
- ray_bts_minhash_cpp_deduplicator: enhance the performance of the basic BTS MinHash deduplicator by migrating its computationally intensive parts to C++ and Cython. #851
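For readers unfamiliar with the technique behind the MinHash deduplicators listed above, the following is a minimal, standard-library sketch of MinHash similarity estimation. It is not the Data-Juicer implementation: the function names, the 5-character shingles, and the 64 seeded hash functions are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text, num_perm=64):
    """Keep the minimum hash value per seeded hash function (illustrative)."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy dogs"
doc_c = "completely unrelated sentence about databases"

sim_near = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
sim_far = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_c))
```

A deduplicator would typically treat pairs whose estimated similarity exceeds a chosen threshold as duplicates; production implementations add banding (LSH) so that not every pair has to be compared.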
Pipeline
A new type of OP that combines multiple OPs into one pipeline, or integrates a whole pipeline that is not easy to split into multiple atomic OPs. #835
- ray_vllm_engine_pipeline: basic OP for making use of the vLLM engine of Ray.
- llm_ray_vllm_engine_pipeline: generate response with LLMs using vLLM engine on Ray.
- vlm_ray_vllm_engine_pipeline: generate response with VLMs using vLLM engine on Ray.
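The idea of a Pipeline OP, several OPs exposed as a single one, can be sketched in plain Python as below. This is only an illustration of the composition pattern under simplifying assumptions: PipelineOp, strip_text, and count_words are hypothetical names, and the actual Data-Juicer base classes and Ray/vLLM integration are not shown.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]
Op = Callable[[Sample], Sample]


class PipelineOp:
    """Hypothetical wrapper that runs several per-sample OPs in order as one OP."""

    def __init__(self, ops: List[Op]):
        self.ops = list(ops)

    def __call__(self, sample: Sample) -> Sample:
        for op in self.ops:
            sample = op(sample)  # the output of one OP feeds the next
        return sample


def strip_text(sample: Sample) -> Sample:
    """Illustrative atomic OP: trim surrounding whitespace."""
    return {**sample, "text": str(sample["text"]).strip()}


def count_words(sample: Sample) -> Sample:
    """Illustrative atomic OP: record the word count."""
    return {**sample, "num_words": len(str(sample["text"]).split())}


clean_and_count = PipelineOp([strip_text, count_words])
result = clean_and_count({"text": "  hello embodied world  "})
```

Callers see a single callable, so the wrapped steps can be scheduled and configured as one unit.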
Enhancements
- Several major dependencies of data-juicer are updated to the (nearly) latest version. #820
- Rename some core resource-allocation arguments of the base OP to align them with the ones in Ray. #837
- Refine the badges on the homepage to enhance user experience. #841
- Support specifying extra ffmpeg arguments for some video OPs. #847
- Update the "Used by & Valuable Feedback from" list to add new customers/users of Data-Juicer. #852
- Improve Ray-based deduplicators by lazily initializing actors, allowing the cluster to autoscale before actors consume resources. #855
- Improve the RayS3DataLoadStrategy class to provide better format detection, more informative logging, and support for loading multiple files from S3 directories in Ray mode. #860
- Support the OpenAI Responses API. #856
- Use consistent key naming and allow including extra fields in the tracer output with the trace_keys arg. #873 #874
- Support bytes data as input and add the auto_op_parallelism parameter to control whether to enable automatic calculation of OP parallelism. #867
- Support saving optical flows computed in the OP. #824
- Update the base image of official data-juicer docker image to cuda12.6.3 and ubuntu 24.04; update the python version to py311; install several embodied-ai-related packages. #881
- Add memory reservation parameters for Ray minhash deduplication to allow users to reserve memory for actors and tasks. #863
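The lazy actor initialization mentioned in the enhancements above (#855) follows a common pattern: defer creating resource-holding workers until they are first needed. The sketch below shows that pattern in plain Python without Ray; LazyActorPool and make_actor are hypothetical names, and the real change in the Ray-based deduplicators may be structured differently.

```python
class LazyActorPool:
    """Hypothetical pool that creates its workers on first access only."""

    def __init__(self, factory, size):
        self._factory = factory
        self._size = size
        self._actors = None  # nothing is created at construction time

    @property
    def actors(self):
        if self._actors is None:
            # First use: create all workers now and reuse them afterwards.
            self._actors = [self._factory(i) for i in range(self._size)]
        return self._actors


created = []


def make_actor(index):
    """Stand-in for an expensive actor constructor; records each creation."""
    created.append(index)
    return f"actor-{index}"


pool = LazyActorPool(make_actor, size=3)
created_before_use = list(created)  # still empty: no actor exists yet
first_actor = pool.actors[0]        # triggers creation of all three
```

Because no worker exists until the first access, a scheduler has time to provision resources before any of them are claimed.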
Fixed Bugs
- Fix the issue where export_aws_credentials cannot be read properly in RayExecutor when using S3 export paths. #834
- Change the logger level of import timing from info to debug. #859
- Cache the dataset columns once at the start of process() and pass the cached set through the operator pipeline to fix the issue where Ray's Dataset.columns() breaks streaming pipelines when called repeatedly during operator processing. #854
- Fix out-of-date URLs of PAI demos. #865
- Limit versions of several dependencies to avoid new issues from their latest versions. #876
- Use a mounted disk to avoid the "No space" error on the / path when building docs. #857
- Use persist to avoid OOM in the PySpark version of distributed deduplication. #836 #586
- Fix the issue where a change in the number of OPs does not trigger the OP doc building hook. #824
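The column-caching fix in the list above (#854) can be illustrated with a small stand-in: query the column names once at the start of process() and hand the cached set to every operator. The sketch does not use Ray; CountingDataset, add_stats_column, and this process() signature are hypothetical, chosen only to show the pattern.

```python
class CountingDataset:
    """Hypothetical dataset whose columns() call is costly; counts the calls."""

    def __init__(self, rows):
        self.rows = rows
        self.columns_calls = 0

    def columns(self):
        self.columns_calls += 1
        return list(self.rows[0].keys()) if self.rows else []


def add_stats_column(rows, cached_columns):
    """Illustrative operator: add a 'stats' column unless the cache has it."""
    if "stats" in cached_columns:
        return rows
    cached_columns.add("stats")  # keep the cache in sync with the data
    return [{**row, "stats": {}} for row in rows]


def process(dataset, ops):
    cached_columns = set(dataset.columns())  # the only columns() call
    rows = dataset.rows
    for op in ops:
        rows = op(rows, cached_columns)
    return rows


dataset = CountingDataset([{"text": "a"}, {"text": "b"}])
rows = process(dataset, [add_stats_column, add_stats_column])
```

Operators consult and update the shared set instead of asking the dataset again, so the expensive call happens exactly once per run.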
Acknowledgement
- @kyo-tom helped fix several bugs, enhance the I/O functions, and implement 2 new OPs for S3. #834 #860 #839
- @xyuzh helped fix and enhance the core parts of the Ray distributed mode, based on professional techniques from the Ray repo. #859 #855 #854 #863
- @JohnGiorgi helped support a new type of OpenAI API and enhance the tracer. #856 #873 #874
- @coolderli helped optimize the Spark distributed deduplication to avoid OOM. #586
- @ZiyiTsang helped implement a new OP, image_subplot_filter. #822
Full Changelog: v1.4.4...v1.4.5