[Rust] Data preparation #11

@nikg7

Data preparation and data storage/format are an important and fairly isolated piece of training functionality, which may make them a good match for an efficient Rust implementation. This matters most for multimodal data. Below are some features that would be useful to support:

  • [P1] Shard data per-node, per-GPU (pre-process offline, or online/iterable).
  • [P0] Group examples by length, number of images, etc., for more efficient GPU utilization and/or tensor-shape compatibility within a mini-batch.
  • [P1] Multimodal:
    • [P0] Images: Pre-resize images. Reduce image resolution to the minimum the model requires (configurable). Pre-resize to predefined aspect-ratio buckets. Save dimensions/shape info, image stats, and meta-info.
    • [P2] Videos?
      Models like Qwen 2.5 VL support video sequences. Handling large video data may require support/optimization at the data level.
  • [P1] Dataset deduplication and decontamination
    • Between 2+ datasets
    • Within train/test splits
  • [P0] Define reusable underlying file format: Parquet, HDF5, safetensors…?
  • [P1] Dataset mixtures: Pre-compute blended data.
  • [P1] Pre-tokenize.
  • [P2] Define Conversation format for SFT data (protobuf?)
  • [P2] Pre-apply chat templates
  • [P0] Support for optimized/async data loading: Mapped vs Iterable.
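
To make the per-node, per-GPU sharding item concrete, here is a minimal sketch of one possible scheme: strided index assignment, where each GPU worker takes every `world_size`-th example. The function name `shard_indices` and its parameters are hypothetical, not an existing API, and a real implementation might shard by file or by contiguous ranges instead.

```rust
/// Returns the example indices owned by a single GPU worker.
///
/// Hypothetical helper: examples are assigned round-robin across all
/// workers, where the global rank is `node_rank * gpus_per_node + local_rank`.
pub fn shard_indices(
    num_examples: usize,
    num_nodes: usize,
    gpus_per_node: usize,
    node_rank: usize,
    local_rank: usize,
) -> Vec<usize> {
    assert!(num_nodes > 0 && gpus_per_node > 0, "world size must be non-zero");
    assert!(node_rank < num_nodes, "node_rank out of range");
    assert!(local_rank < gpus_per_node, "local_rank out of range");

    let world_size = num_nodes * gpus_per_node;
    let global_rank = node_rank * gpus_per_node + local_rank;

    // Strided assignment: worker r gets r, r + world_size, r + 2 * world_size, ...
    (global_rank..num_examples).step_by(world_size).collect()
}
```

With 10 examples on 2 nodes of 2 GPUs each, the worker on node 0, GPU 1 gets indices 1, 5 and 9. Shards are disjoint and together cover every example, but they can differ in size by one, which the training loop would need to handle (for example by dropping or padding the last step).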
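
For the length-grouping item, one simple approach (an assumption, not a settled design) is to sort examples by length and then cut batches greedily under a padded-token budget. The sketch below uses hypothetical names (`batch_by_length`, `max_padded_tokens`) and treats the cost of a batch as `longest example × batch size`.

```rust
/// Groups example indices into batches of similar length.
///
/// Hypothetical helper: `lengths[i]` is the token count of example `i`.
/// A batch is closed as soon as adding the next example would push its
/// padded size (longest example times number of examples) over the budget.
pub fn batch_by_length(lengths: &[usize], max_padded_tokens: usize) -> Vec<Vec<usize>> {
    // Sort indices by ascending length (stable, so ties keep input order).
    let mut order: Vec<usize> = (0..lengths.len()).collect();
    order.sort_by_key(|&i| lengths[i]);

    let mut batches = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    for i in order {
        // Because of the ascending order, example `i` would be the longest
        // in the batch, so it determines the padded length of every row.
        let padded = lengths[i] * (current.len() + 1);
        if !current.is_empty() && padded > max_padded_tokens {
            batches.push(std::mem::take(&mut current));
        }
        // An example longer than the budget still gets a batch of its own.
        current.push(i);
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```

For lengths `[5, 100, 7, 98, 6]` and a budget of 200 padded tokens, this yields two batches: the three short examples together, and the two long ones together. Sorting the whole dataset removes randomness, so in practice one would likely shuffle within length buckets or shuffle the resulting batches.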
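
For the aspect-ratio bucket item, a rough sketch of bucket selection is shown below. It picks the bucket whose aspect ratio is closest to the image's in log space, so that 2:1 and 1:2 are treated symmetrically. The name `nearest_bucket` and the example bucket sizes are assumptions for illustration only; the actual resize would be done by an image library.

```rust
/// Picks the (width, height) bucket whose aspect ratio best matches the image.
///
/// Hypothetical helper: distance is measured as the absolute difference of
/// log aspect ratios. Returns `None` for a degenerate image or an empty
/// bucket list. Buckets with a zero dimension are ignored.
pub fn nearest_bucket(width: u32, height: u32, buckets: &[(u32, u32)]) -> Option<(u32, u32)> {
    if width == 0 || height == 0 {
        return None;
    }
    let target = (width as f64 / height as f64).ln();
    let distance = |&(w, h): &(u32, u32)| ((w as f64 / h as f64).ln() - target).abs();

    buckets
        .iter()
        .filter(|&&(w, h)| w > 0 && h > 0)
        .min_by(|a, b| distance(a).total_cmp(&distance(b)))
        .copied()
}
```

With buckets `(512, 512)`, `(768, 512)` and `(512, 768)`, a 1920×1080 image maps to `(768, 512)` and a 600×1200 image maps to `(512, 768)`. The chosen bucket, together with the original dimensions, could then be stored as per-image metadata.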
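
For deduplication and decontamination, the simplest baseline is exact matching after light text normalization. The sketch below is only that baseline, with hypothetical names (`normalize`, `decontaminate`); near-duplicate detection (for example MinHash or n-gram overlap) would be a separate, heavier technique and is not shown.

```rust
use std::collections::HashSet;

/// Collapses runs of whitespace and lowercases the text so that trivially
/// different copies of the same example compare equal.
fn normalize(text: &str) -> String {
    text.split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_lowercase()
}

/// Returns the indices of training examples to keep.
///
/// Hypothetical helper: drops any training example that also appears in the
/// test set (decontamination) and any repeat of an earlier training example
/// (deduplication), using exact match on the normalized text.
pub fn decontaminate(train: &[&str], test: &[&str]) -> Vec<usize> {
    let test_keys: HashSet<String> = test.iter().map(|t| normalize(t)).collect();
    let mut seen: HashSet<String> = HashSet::new();
    let mut keep = Vec::new();

    for (i, text) in train.iter().enumerate() {
        let key = normalize(text);
        if test_keys.contains(&key) {
            continue; // leaked from the test split
        }
        if seen.insert(key) {
            keep.push(i); // first occurrence within train
        }
    }
    keep
}
```

Given train `["Hello  world", "hello world", "foo", "Bar baz"]` and test `["bar  BAZ"]`, the kept indices are `[0, 2]`. The same function covers the "between 2+ datasets" case by passing the other dataset in place of the test split.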

Doc link

Labels

enhancement (New feature or request), help wanted (Extra attention is needed)
