aggregate_datasets loses Image feature schema for image datasets #2715

@fracapuano

Ticket Type

🐛 Bug Report (Something isn't working)

Environment & System Info

- LeRobot version: 0.4.3
- Platform: macOS-15.6.1-arm64-arm-64bit
- Python version: 3.10.13
- Huggingface Hub version: 0.34.3
- Datasets version: 4.1.1
- Numpy version: 2.2.6
- FFmpeg version: 7.1.1
- PyTorch version: 2.7.1
- Is PyTorch built with CUDA support?: False
- Cuda version: N/A
- GPU model: N/A
- Using GPU in script?: NO
- lerobot scripts: ['lerobot-calibrate', 'lerobot-dataset-viz', 'lerobot-edit-dataset', 'lerobot-eval', 'lerobot-find-cameras', 'lerobot-find-joint-limits', 'lerobot-find-port', 'lerobot-imgtransform-viz', 'lerobot-info', 'lerobot-record', 'lerobot-replay', 'lerobot-setup-motors', 'lerobot-teleoperate', 'lerobot-train']

Description

When aggregating image-based datasets with aggregate_datasets, the resulting parquet files lose the Hugging Face Image feature type. Image columns are written with a generic struct schema ({'bytes': Value('binary'), 'path': Value('string')}) instead of the proper Image() feature type. As a result, the aggregated dataset becomes difficult to use: the Image feature is dropped from the parquet file and the columns are written as plain bytes/path structs. Visualizations on the Hub are also broken. The same applies when merging datasets.

This directly causes downstream issues when loading the aggregated dataset (from personal experience 😅).

Context & Reproduction

Root cause

The current to_parquet_with_hf_images function creates a datasets.Dataset without specifying the features schema:

def to_parquet_with_hf_images(df: pandas.DataFrame, path: Path) -> None:
    datasets.Dataset.from_dict(df.to_dict(orient="list")).to_parquet(path)

When features is not provided to Dataset.from_dict(), datasets infers the schema from the data. Because the images are stored internally as {'bytes': ..., 'path': ...} dicts, they get typed as a generic struct rather than the Image() feature type.

Relevant logs or stack trace

Checklist

  • I have searched existing tickets to ensure this isn't a duplicate.
  • I am using the latest version of the main branch.
  • I have verified this is not an environment-specific problem.

Additional Info / Workarounds

(opening a PR solving this issue right now, 😉)

Labels: bug (Something isn't working correctly), dataset (Issues regarding data inputs, processing, or datasets)
