Description
Ticket Type
🐛 Bug Report (Something isn't working)
Environment & System Info
- LeRobot version: 0.4.3
- Platform: macOS-15.6.1-arm64-arm-64bit
- Python version: 3.10.13
- Huggingface Hub version: 0.34.3
- Datasets version: 4.1.1
- Numpy version: 2.2.6
- FFmpeg version: 7.1.1
- PyTorch version: 2.7.1
- Is PyTorch built with CUDA support?: False
- Cuda version: N/A
- GPU model: N/A
- Using GPU in script?: NO
- lerobot scripts: ['lerobot-calibrate', 'lerobot-dataset-viz', 'lerobot-edit-dataset', 'lerobot-eval', 'lerobot-find-cameras', 'lerobot-find-joint-limits', 'lerobot-find-port', 'lerobot-imgtransform-viz', 'lerobot-info', 'lerobot-record', 'lerobot-replay', 'lerobot-setup-motors', 'lerobot-teleoperate', 'lerobot-train']
When aggregating image-based datasets with `aggregate_datasets`, the resulting parquet files lose the Hugging Face `Image` feature type. Image columns are written with a generic struct schema (`{'bytes': Value('binary'), 'path': Value('string')}`) instead of the proper `Image()` feature type. As a result, the aggregated dataset becomes difficult to use, since the feature metadata in the parquet file is dropped and raw bytes are written, and visualizations on the Hub are also broken. The same applies to merging datasets.
This directly causes downstream issues when loading the aggregated dataset (from personal experience, 😅)
Context & Reproduction
Root cause
The current `to_parquet_with_hf_images` function creates a `datasets.Dataset` without specifying the `features` schema:

```python
def to_parquet_with_hf_images(df: pandas.DataFrame, path: Path) -> None:
    datasets.Dataset.from_dict(df.to_dict(orient="list")).to_parquet(path)
```

When `features` is not provided to `Dataset.from_dict()`, `datasets` infers the schema from the data. Since image cells are internally stored as `{'bytes': ..., 'path': ...}` dicts, the columns get typed as a generic struct rather than the `Image()` feature type.
Relevant logs or stack trace
Checklist
- I have searched existing tickets to ensure this isn't a duplicate.
- I am using the latest version of the `main` branch.
- I have verified this is not an environment-specific problem.
Additional Info / Workarounds
(opening a PR solving this issue right now, 😉)