Conversation

@michel-aractingi (Collaborator) commented Oct 29, 2025

What this does

Fixes the logic in modify_features that resulted in datasets with missing episodes.

Bug

When using add_features() in dataset_tools.py, only a small fraction of episodes were being copied to the new dataset.

I caught this when using add_features to add a reward field to the libero dataset (1,693 episodes, 273,465 frames): the resulting dataset contained only 175 episodes and 47,369 frames.

The bug was in the _copy_data_with_feature_changes function: it read the episode metadata parquet files to decide which destination file each episode belonged to, instead of reading the data files directly. Because the metadata's file assignments did not match the actual file contents, multiple source files were written to the same destination file while the metadata was copied verbatim from the source, so data was silently overwritten and lost.

For example:

  • Episode metadata claimed episodes 0-14 were in file-000.parquet
  • But the actual file only contained episodes 0-2
  • This caused multiple source files to overwrite each other when written to the destination
# What metadata said:
# file-000.parquet: episodes [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
# file-001.parquet: episodes [15, 16, 17, 18, 19]

# What files actually contained:
# file-000.parquet: episodes [0, 1, 2]  (only 843 rows)
# file-001.parquet: episodes [3, 4, 5]  (only 803 rows)
# file-002.parquet: episodes [6, 7]     (only 551 rows)
# ... and so on

Fix

The fix changes the approach to:

  1. Directly iterate over parquet files in the source dataset
  2. Extract chunk and file indices from the file paths themselves
  3. Preserve the exact same file structure in the destination

This ensures a 1:1 mapping between source and destination files, preventing any data loss.
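
For illustration, here is a minimal sketch of this approach, assuming the data/chunk-XXX/file-YYY.parquet layout shown above (the function and variable names are illustrative, not the PR's exact code):

import re
from pathlib import Path

import pandas as pd

# Matches paths like data/chunk-000/file-001.parquet
PATH_RE = re.compile(r"chunk-(\d+)[/\\]file-(\d+)\.parquet$")

def copy_data_preserving_layout(src_root: Path, dst_root: Path, transform) -> None:
    """Copy each data file 1:1, applying `transform` (e.g. adding a feature column)."""
    for src_path in sorted(src_root.glob("data/chunk-*/file-*.parquet")):
        match = PATH_RE.search(str(src_path))
        if match is None:
            raise ValueError(f"Unexpected path structure: {src_path}")
        # Indices come from the path itself, never from the episode metadata.
        chunk_idx, file_idx = int(match.group(1)), int(match.group(2))

        df = pd.read_parquet(src_path)
        df = transform(df)

        # Recreate the exact same chunk/file indices in the destination,
        # so two source files can never collide on one destination file.
        dst_path = dst_root / "data" / f"chunk-{chunk_idx:03d}" / f"file-{file_idx:03d}.parquet"
        dst_path.parent.mkdir(parents=True, exist_ok=True)
        df.to_parquet(dst_path)

The key point is that the destination indices are derived from the source path, so the layout is preserved regardless of what the episode metadata claims.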

Testing

Tested with the HuggingFaceVLA/libero dataset by regenerating the same dataset with the next.reward feature; the result is available at aractingi/libero-reward.
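
For reference, the kind of consistency check that catches this bug is a simple count comparison between the source and the regenerated dataset (a sketch; the import path and the num_episodes/num_frames properties follow recent lerobot versions and should be treated as assumptions):

# Sketch: verify no episodes/frames were dropped by the copy.
# Assumption: LeRobotDataset exposes num_episodes / num_frames;
# the import path may differ across lerobot versions.
from lerobot.datasets.lerobot_dataset import LeRobotDataset

src = LeRobotDataset("HuggingFaceVLA/libero")
dst = LeRobotDataset("aractingi/libero-reward")

assert dst.num_episodes == src.num_episodes, (src.num_episodes, dst.num_episodes)
assert dst.num_frames == src.num_frames, (src.num_frames, dst.num_frames)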

Copilot AI review requested due to automatic review settings October 29, 2025 17:22
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR refactors the _copy_data_with_feature_changes function to simplify how it determines chunk and file indices when copying dataset files. Instead of loading episode metadata and mapping files to episodes, the function now directly parses the indices from the file paths.

  • Removed dependency on episode metadata loading for extracting chunk/file indices
  • Changed from episode metadata-based extraction to direct file path parsing
  • Simplified the file processing logic by eliminating the file-to-episodes mapping


for src_path in tqdm(parquet_files, desc="Processing data files"):
    df = pd.read_parquet(src_path).reset_index(drop=True)

    relative_path = src_path.relative_to(dataset.root)

Copilot AI Oct 29, 2025


Hardcoded array indexing without bounds checking. If the path structure differs from expected data/chunk-XXX/file-YYY.parquet, this will raise an IndexError. Consider validating that len(relative_path.parts) >= 3 before accessing indices.

Suggested change
- relative_path = src_path.relative_to(dataset.root)
+ relative_path = src_path.relative_to(dataset.root)
+ if len(relative_path.parts) < 3:
+     raise ValueError(
+         f"Unexpected path structure for {src_path}: expected at least 3 parts, got {len(relative_path.parts)} ({relative_path.parts})"
+     )

Comment on lines +980 to +981
chunk_idx = int(chunk_dir.split("-")[1])
file_idx = int(file_name.split("-")[1].split(".")[0])

Copilot AI Oct 29, 2025


String parsing without validation. If file naming convention differs from chunk-{number} or file-{number}.parquet, this will raise IndexError or ValueError. Consider adding error handling or validation to ensure the expected format before parsing, or using regex with pattern matching for more robust extraction.
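
A regex-based variant of that parsing could look like this (illustrative only, not part of the PR; it reuses the chunk_dir and file_name variables from the snippet above):

import re

# Strict parsing: fail with a clear error if the naming convention changes.
chunk_match = re.fullmatch(r"chunk-(\d+)", chunk_dir)
file_match = re.fullmatch(r"file-(\d+)\.parquet", file_name)
if chunk_match is None or file_match is None:
    raise ValueError(f"Unexpected file naming: {chunk_dir}/{file_name}")
chunk_idx = int(chunk_match.group(1))
file_idx = int(file_match.group(1))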

@michel-aractingi michel-aractingi self-assigned this Oct 29, 2025
@michel-aractingi michel-aractingi added the bug (Something isn’t working correctly) and dataset (Issues regarding data inputs, processing, or datasets) labels Oct 29, 2025