Labels: enhancement (New feature or request), help wanted (Extra attention is needed)
Description
Data preparation and data storage/format is an important and fairly isolated piece of training functionality, which may be a good match for an efficient Rust implementation. This is particularly important for multimodal data. Below are some features that would be useful to support:
- [P1] Shard data per-node, per-GPU (pre-process offline, or online/iterable).
- [P0] Group examples by length, number of images, etc., for more efficient GPU utilization and/or tensor-shape compatibility within a mini-batch.
- [P1] Multimodal:
- [P0] Images: Pre-resize images. Reduce image resolution to the minimum the model requires (configurable). Pre-resize to pre-defined aspect-ratio buckets. Save dims/shape info, image stats, and meta-info.
- [P2] Videos? Models like Qwen 2.5 VL do support video sequences; handling large video data may require support/optimization at the data level.
- [P1] Dataset deduplication and decontamination
- Between 2+ datasets
- Within train/test splits
- [P0] Define reusable underlying file format: Parquet, HDF5, safetensors…?
- Self-describing schema.
- Should have good/scalable support for large multimodal blobs.
- Example: https://github.com/fleonce/safetensors-dataset/ (based on safetensors; uses nested tensors for variable-length data).
- [P1] Dataset mixtures: Pre-compute blended data.
- [P1] Pre-tokenize.
- [P2] Define Conversation format for SFT data (protobuf?)
- [P2] Pre-apply chat templates
- [P0] Support for optimized/async data loading: Mapped vs Iterable.
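For illustration, the per-rank sharding and length-grouping items above can be sketched in plain Rust. This is a minimal sketch, not an existing implementation; the names `shard_indices` and `bucket_by_length` are hypothetical:

```rust
use std::collections::BTreeMap;

/// Strided shard: rank `r` out of `world_size` ranks takes every
/// `world_size`-th example starting at `r`, so shards are disjoint and
/// near-equal in size without materializing the full dataset on any rank.
fn shard_indices(num_examples: usize, rank: usize, world_size: usize) -> Vec<usize> {
    (rank..num_examples).step_by(world_size).collect()
}

/// Group example indices into length buckets of width `bucket_size`, so a
/// mini-batch drawn from a single bucket needs minimal padding.
fn bucket_by_length(lengths: &[usize], bucket_size: usize) -> BTreeMap<usize, Vec<usize>> {
    let mut buckets: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for (idx, &len) in lengths.iter().enumerate() {
        buckets.entry(len / bucket_size).or_default().push(idx);
    }
    buckets
}

fn main() {
    // 2 nodes x 4 GPUs = 8 ranks; global rank for node 1, GPU 2 is 6.
    let (node_id, gpus_per_node, gpu_id) = (1usize, 4usize, 2usize);
    let rank = node_id * gpus_per_node + gpu_id;
    let shard = shard_indices(100, rank, 8);
    println!("rank {rank} sees {} examples, first = {}", shard.len(), shard[0]);

    // Sequence lengths in tokens, bucketed at a width of 128.
    let lengths = [12, 510, 498, 37, 1020];
    let buckets = bucket_by_length(&lengths, 128);
    println!("{buckets:?}");
}
```

The same strided scheme works online (as an iterable that skips `world_size - 1` examples between reads) or offline (writing one shard file per rank).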
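The aspect-ratio-bucket pre-resizing could pick, for each image, the predefined bucket whose aspect ratio is closest to the image's. A sketch under that assumption (the bucket list and `nearest_bucket` are illustrative, not a fixed design):

```rust
/// Pick the bucket (w, h) whose aspect ratio is closest to the image's,
/// so the subsequent resize distorts the image as little as possible.
fn nearest_bucket(width: u32, height: u32, buckets: &[(u32, u32)]) -> (u32, u32) {
    let ar = width as f64 / height as f64;
    *buckets
        .iter()
        .min_by(|a, b| {
            let da = (a.0 as f64 / a.1 as f64 - ar).abs();
            let db = (b.0 as f64 / b.1 as f64 - ar).abs();
            da.partial_cmp(&db).unwrap()
        })
        .unwrap()
}

fn main() {
    // Square, landscape, and portrait buckets at a similar pixel budget.
    let buckets = [(512, 512), (640, 384), (384, 640)];
    // A 1920x1080 photo (AR ~1.78) snaps to the landscape bucket.
    let chosen = nearest_bucket(1920, 1080, &buckets);
    println!("chosen bucket: {chosen:?}");
}
```

Keeping all buckets at a similar pixel count bounds per-image compute, while per-bucket grouping (see the [P0] grouping item) keeps tensor shapes uniform within a mini-batch.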
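One simple baseline for the dedup/decontamination item is exact matching on normalized text: hash every test example and drop train examples whose normalized form collides. A sketch, assuming whitespace/case normalization only (real pipelines would likely add fuzzy or n-gram matching; `decontaminate` is a hypothetical name):

```rust
use std::collections::HashSet;

/// Collapse whitespace and lowercase, so trivial formatting differences
/// do not hide duplicates.
fn normalize(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ").to_lowercase()
}

/// Keep only train examples whose normalized form does not appear in the
/// test set (exact-match decontamination between two datasets).
fn decontaminate<'a>(train: &[&'a str], test: &[&str]) -> Vec<&'a str> {
    let test_set: HashSet<String> = test.iter().map(|s| normalize(s)).collect();
    train
        .iter()
        .copied()
        .filter(|s| !test_set.contains(&normalize(s)))
        .collect()
}

fn main() {
    let train = ["The cat sat.", "A  DOG ran.", "unique example"];
    let test = ["a dog ran."];
    let clean = decontaminate(&train, &test);
    println!("{} / {} train examples kept", clean.len(), train.len());
}
```

The same hash-set pass works within a single dataset (self-dedup) by inserting each normalized example as it is scanned and dropping repeats.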