Release v0.2.0: Multimodal Support & DJ-SORA
New Features
- 🚀 We introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models. #227
- 🚀 We introduce hundreds of dedicated video, image, audio, text, and other multi-modal data processing operators and tools.
- 💥 Our paper has been accepted by SIGMOD'24 industrial track! #211
- 💥 "BetterMixture": our second data-centric LLM competition has kicked off and will end soon. #174
New OPs
Multimodal
- video_frames_text_similarity_filter: keeps samples whose similarities between sampled video frame images and text are within a specific range. #227
- video_tagging_from_frames_mapper: generates video tags from frames extracted from the video. #227
- video_tagging_from_audio_mapper: generates video tags from audio streams extracted from videos. #227
- video_captioning_from_video_mapper: generates captions from frame images extracted from video to augment datasets. #227
- video_captioning_from_audio_mapper: captions a video according to its audio streams. #227
- image_captioning_mapper: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
- image_captioning_from_gpt4v_mapper: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
- image_diffusion_mapper: generates and augments images based on the Stable Diffusion model and their original images and texts. This OP will increase the number of samples in the dataset. #200
Video
Filter
- video_duration_filter: keeps samples whose videos' durations are within a specified range. #227
- video_aspect_ratio_filter: filters samples according to the aspect ratios of videos (a fraction of width by height, r=w/h) in them. #227
- video_resolution_filter: filters samples according to the resolution of videos in them. #227
- video_ocr_area_ratio_filter: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227
- video_aesthetics_filter: filters samples according to the aesthetics score of frame images extracted from videos. #227
- video_motion_score_filter: keeps samples with video motion scores within a specific range. #227
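The range check behind a filter such as video_aspect_ratio_filter can be sketched as follows. This is only an illustration of the r=w/h rule described above, not Data-Juicer's implementation; the function name `keep_by_aspect_ratio` and the default bounds are assumptions made for this example.

```python
from fractions import Fraction


def keep_by_aspect_ratio(width, height,
                         min_ratio=Fraction(9, 21),
                         max_ratio=Fraction(21, 9)):
    """Return True if r = width / height lies in [min_ratio, max_ratio].

    Hypothetical helper: the name and default bounds are illustrative only.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive integers")
    # Fraction keeps the ratio exact, so boundary values compare reliably.
    ratio = Fraction(width, height)
    return min_ratio <= ratio <= max_ratio


if __name__ == "__main__":
    print(keep_by_aspect_ratio(1920, 1080))  # a 16:9 video
    print(keep_by_aspect_ratio(100, 1000))   # a very tall video
```

Using exact fractions rather than floats is a design choice for this sketch: it avoids rounding surprises when a video sits exactly on a bound.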
Mapper
- video_split_by_scene_mapper: splits videos into scene clips. #227
- video_split_by_duration_mapper: splits videos by specified duration interval. #227
- video_split_by_key_frame_mapper: splits videos by their keyframes. #227
- video_resize_aspect_ratio_mapper: resizes aspect ratios of videos (a fraction of width by height, r=w/h) to a specified range. #227
- video_resize_resolution_mapper: maps videos to ones with a given resolution range. #227
- video_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to video data more conveniently. #227
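Splitting by a fixed duration interval, as video_split_by_duration_mapper is described above, comes down to computing clip boundaries. The sketch below shows only that arithmetic; the function `split_intervals` and its `min_last_duration` parameter are hypothetical and are not taken from the Data-Juicer code base.

```python
def split_intervals(total_duration, split_duration, min_last_duration=0.0):
    """Return (start, end) pairs in seconds covering a video of total_duration.

    Hypothetical helper. Only the final clip can be shorter than
    split_duration; it is dropped if shorter than min_last_duration.
    """
    if split_duration <= 0:
        raise ValueError("split_duration must be positive")
    intervals = []
    index = 0
    # Multiply by the index instead of accumulating, to avoid float drift.
    while index * split_duration < total_duration:
        start = float(index * split_duration)
        end = float(min(start + split_duration, total_duration))
        if end - start >= min_last_duration:
            intervals.append((start, end))
        index += 1
    return intervals


if __name__ == "__main__":
    print(split_intervals(25, 10))
    print(split_intervals(25, 10, min_last_duration=6))
```

The resulting boundaries could then be handed to a tool such as ffmpeg to cut the actual clips.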
Deduplicator
- video_deduplicator: deduplicates samples at the document level using exact matching of videos between documents. #227
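Exact-match deduplication is commonly done by hashing content and keeping the first sample for each hash. The sketch below illustrates that general technique under assumed conditions: each sample is a dict with a `videos` list of raw bytes, and SHA-256 is the fingerprint. Both the sample layout and the function names are hypothetical, and the actual OP may hash differently.

```python
import hashlib


def video_fingerprint(data: bytes) -> str:
    """Hash raw video bytes; identical bytes give identical fingerprints."""
    return hashlib.sha256(data).hexdigest()


def deduplicate_by_video(samples):
    """Keep the first sample for each distinct sequence of video contents.

    Assumes (hypothetically) that each sample is a dict whose "videos"
    entry is a list of bytes objects.
    """
    seen = set()
    kept = []
    for sample in samples:
        key = tuple(video_fingerprint(v) for v in sample["videos"])
        if key in seen:
            continue  # an earlier sample has exactly the same videos
        seen.add(key)
        kept.append(sample)
    return kept


if __name__ == "__main__":
    data = [
        {"id": 1, "videos": [b"aaa"]},
        {"id": 2, "videos": [b"bbb"]},
        {"id": 3, "videos": [b"aaa"]},
    ]
    print([s["id"] for s in deduplicate_by_video(data)])
```

Because the match is exact, two encodings of the same footage count as different videos in this sketch.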
Audio
- audio_duration_filter: keeps samples whose audios' durations are within a specified range. #177
- audio_size_filter: keeps samples whose audios' sizes are within a specified range. #184
- audio_nmf_snr_filter: keeps samples whose audios' signal-to-noise ratios (SNRs, computed with the Non-negative Matrix Factorization algorithm) are within a specified range. #189
- audio_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to audio data more conveniently. #227
Image
- image_blur_mapper: adds random noise to images to blur them. #180
- image_aesthetics_filter: filters samples according to the aesthetics scores of images. #227
Document Updates
- "Bad" Data Exhibition EN ZH: shows how Data-Juicer finds "bad" data and what it looks like.
- Awesome LLM Data EN: a collection of awesome LLM datasets with fine-grained tags.
- Developer Guide enhancement EN ZH: adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 #220
- OP Insight Visualization Demo code: adds a demo to visualize how each OP works.
Bugs Fixed
- Fix the stats computation error in Ray mode caused by an inappropriate initialization method. #173
- Fix the bug where some images were lost when converting their paths to absolute paths. #178
- Fix the dependency problems of OPs that depend on other OPs. #181
- Fix the bug that the predict.py tool gets stuck on the help page. #183
- Fix face_area_filter: constrains the detection coordinates within the image. #202
- Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. #195
- Fix or update invalid links in Data-Juicer. #201 #219
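The face_area_filter fix listed above constrains detection coordinates to the image. A minimal sketch of that idea follows, assuming boxes are given as (x1, y1, x2, y2) in pixels. The helper names and the box layout are assumptions for illustration, not the actual patch from #202.

```python
def clamp_box(box, image_width, image_height):
    """Clip an (x1, y1, x2, y2) box so every corner lies inside the image."""
    x1, y1, x2, y2 = box
    x1 = min(max(x1, 0), image_width)
    x2 = min(max(x2, 0), image_width)
    y1 = min(max(y1, 0), image_height)
    y2 = min(max(y2, 0), image_height)
    return (x1, y1, x2, y2)


def face_area_ratio(box, image_width, image_height):
    """Ratio of the clipped box area to the image area (hypothetical helper)."""
    x1, y1, x2, y2 = clamp_box(box, image_width, image_height)
    # max(..., 0) guards against boxes that collapse after clipping.
    area = max(x2 - x1, 0) * max(y2 - y1, 0)
    return area / (image_width * image_height)


if __name__ == "__main__":
    # A detection that spills over the left and right image borders.
    print(clamp_box((-10, 20, 120, 90), 100, 80))
    print(face_area_ratio((-10, 20, 120, 90), 100, 80))
```

Without clipping, a detection that extends past the border could yield an area ratio above 1, which a range-based filter would misjudge.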
Others
- Optimize the model management module. #196 #227
- Optimize the unit test actions. #195 #196 #216 #227
- Optimize the multiprocessing strategy; model inference can now run more efficiently thanks to GPU support. #203 #217 #222 #227
- Update the docker image with JDK. #208
- Support more multimodal (video) dataset conversion tools: #227
  - InternVid: 234M video-caption data
  - Youku-mPLUG: 36TB video-caption data
  - Video-ChatGPT: 100k video-instruction data
- Optimize the generated multimodal data storage. #227
- Support running data-juicer process jobs on Aliyun PAI-DLC. #227
- Better support for multi-machine distributed data processing in Ray mode. #227
Acknowledgment
Here we thank public contributors for their PRs to make Data-Juicer better!