Labels: enhancement (New feature or request), help wanted (Extra attention is needed)
Description
Data preparation and data storage/format is an important and fairly isolated piece of training functionality, which may be a good match for an efficient Rust implementation. This is particularly important for multimodal data. Below are some features that would be useful to support:
- [P1] Shard data per-node, per-GPU (pre-process offline, or online/iterable).
- [P0] Group examples by length, number of images, etc., for more efficient GPU utilization and/or tensor-shape compatibility within a mini-batch.
- [P1] Multimodal:
- [P0] Images: Pre-resize images. Reduce image resolution to the minimum the model requires (configurable). Pre-resize to pre-defined aspect-ratio buckets. Save dims/shape info, image stats, and meta-info.
- [P2] Videos? Models like Qwen 2.5 VL do support video sequences; handling large video data may require support/optimization at the data level.
- [P1] Dataset deduplication and decontamination
- Between 2+ datasets
- Within train/test splits
- [P0] Define reusable underlying file format: Parquet, HDF5, safetensors…?
- Self-describing schema.
- Should have good/scalable support for large multimodal blobs.
- Example: https://github.com/fleonce/safetensors-dataset/ (based on safetensors; uses nested tensors for variable-length data).
- [P1] Dataset mixtures: Pre-compute blended data.
- [P1] Pre-tokenize.
- [P2] Define Conversation format for SFT data (protobuf?)
- [P2] Pre-apply chat templates
- [P0] Support for optimized/async data loading: Mapped vs Iterable.
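For illustration, the per-rank sharding and length-grouping items above can be sketched in plain Rust. This is a minimal sketch, not an existing implementation; the names `shard_indices` and `bucket_by_length` are hypothetical:

```rust
use std::collections::BTreeMap;

/// Strided shard: rank `r` out of `world_size` ranks takes every
/// `world_size`-th example starting at `r`, so shards are disjoint and
/// near-equal in size without materializing the full dataset on any rank.
fn shard_indices(num_examples: usize, rank: usize, world_size: usize) -> Vec<usize> {
    (rank..num_examples).step_by(world_size).collect()
}

/// Group example indices into length buckets of width `bucket_size`, so a
/// mini-batch drawn from a single bucket needs minimal padding.
fn bucket_by_length(lengths: &[usize], bucket_size: usize) -> BTreeMap<usize, Vec<usize>> {
    let mut buckets: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for (idx, &len) in lengths.iter().enumerate() {
        buckets.entry(len / bucket_size).or_default().push(idx);
    }
    buckets
}

fn main() {
    // 2 nodes x 4 GPUs = 8 ranks; global rank for node 1, GPU 2 is 6.
    let (node_id, gpus_per_node, gpu_id) = (1usize, 4usize, 2usize);
    let rank = node_id * gpus_per_node + gpu_id;
    let shard = shard_indices(100, rank, 8);
    println!("rank {rank} sees {} examples, first = {}", shard.len(), shard[0]);

    // Sequence lengths in tokens, bucketed at a width of 128.
    let lengths = [12, 510, 498, 37, 1020];
    let buckets = bucket_by_length(&lengths, 128);
    println!("{buckets:?}");
}
```

The same strided scheme works online (as an iterable that skips `world_size - 1` examples between reads) or offline (writing one shard file per rank).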
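The aspect-ratio-bucket pre-resizing could pick, for each image, the predefined bucket whose aspect ratio is closest to the image's. A sketch under that assumption (the bucket list and `nearest_bucket` are illustrative, not a fixed design):

```rust
/// Pick the bucket (w, h) whose aspect ratio is closest to the image's,
/// so the subsequent resize distorts the image as little as possible.
fn nearest_bucket(width: u32, height: u32, buckets: &[(u32, u32)]) -> (u32, u32) {
    let ar = width as f64 / height as f64;
    *buckets
        .iter()
        .min_by(|a, b| {
            let da = (a.0 as f64 / a.1 as f64 - ar).abs();
            let db = (b.0 as f64 / b.1 as f64 - ar).abs();
            da.partial_cmp(&db).unwrap()
        })
        .unwrap()
}

fn main() {
    // Square, landscape, and portrait buckets at a similar pixel budget.
    let buckets = [(512, 512), (640, 384), (384, 640)];
    // A 1920x1080 photo (AR ~1.78) snaps to the landscape bucket.
    let chosen = nearest_bucket(1920, 1080, &buckets);
    println!("chosen bucket: {chosen:?}");
}
```

Keeping all buckets at a similar pixel count bounds per-image compute, while per-bucket grouping (see the [P0] grouping item) keeps tensor shapes uniform within a mini-batch.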
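One simple baseline for the dedup/decontamination item is exact matching on normalized text: hash every test example and drop train examples whose normalized form collides. A sketch, assuming whitespace/case normalization only (real pipelines would likely add fuzzy or n-gram matching; `decontaminate` is a hypothetical name):

```rust
use std::collections::HashSet;

/// Collapse whitespace and lowercase, so trivial formatting differences
/// do not hide duplicates.
fn normalize(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ").to_lowercase()
}

/// Keep only train examples whose normalized form does not appear in the
/// test set (exact-match decontamination between two datasets).
fn decontaminate<'a>(train: &[&'a str], test: &[&str]) -> Vec<&'a str> {
    let test_set: HashSet<String> = test.iter().map(|s| normalize(s)).collect();
    train
        .iter()
        .copied()
        .filter(|s| !test_set.contains(&normalize(s)))
        .collect()
}

fn main() {
    let train = ["The cat sat.", "A  DOG ran.", "unique example"];
    let test = ["a dog ran."];
    let clean = decontaminate(&train, &test);
    println!("{} / {} train examples kept", clean.len(), train.len());
}
```

The same hash-set pass works within a single dataset (self-dedup) by inserting each normalized example as it is scanned and dropping repeats.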