Conversation

@michel-aractingi (Collaborator) commented Oct 29, 2025

What this does

Fixes the logic in modify_features that resulted in datasets with missing episodes.

Bug

When using add_features() in dataset_tools.py, only a small fraction of episodes were being copied to the new dataset.

I caught this when using add_features to add a reward field to the libero dataset (1,693 episodes, 273,465 frames): the resulting dataset contained only 175 episodes and 47,369 frames.

The bug was in the _copy_data_with_feature_changes function: it read the episode metadata parquet files to decide which destination file each episode belonged to, instead of reading the data files directly. Because the metadata's file assignments did not match the actual file contents, multiple source files were written to the same destination file while the metadata was copied verbatim from the source, so data was silently overwritten and lost.

For example:

  • Episode metadata claimed episodes 0-14 were in file-000.parquet
  • But the actual file only contained episodes 0-2
  • This caused multiple source files to overwrite each other when written to the destination
# What metadata said:
# file-000.parquet: episodes [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
# file-001.parquet: episodes [15, 16, 17, 18, 19]

# What files actually contained:
# file-000.parquet: episodes [0, 1, 2]  (only 843 rows)
# file-001.parquet: episodes [3, 4, 5]  (only 803 rows)
# file-002.parquet: episodes [6, 7]     (only 551 rows)
# ... and so on

Fix

The fix changes the approach to:

  1. Directly iterate over parquet files in the source dataset
  2. Extract chunk and file indices from the file paths themselves
  3. Preserve the exact same file structure in the destination

This ensures a 1:1 mapping between source and destination files, preventing any data loss.
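
For illustration, here is a minimal sketch of this approach, assuming the data/chunk-XXX/file-YYY.parquet layout shown above (the function and variable names are illustrative, not the PR's exact code):

import re
from pathlib import Path

import pandas as pd

# Matches paths like data/chunk-000/file-001.parquet
PATH_RE = re.compile(r"chunk-(\d+)[/\\]file-(\d+)\.parquet$")

def copy_data_preserving_layout(src_root: Path, dst_root: Path, transform) -> None:
    """Copy each data file 1:1, applying `transform` (e.g. adding a feature column)."""
    for src_path in sorted(src_root.glob("data/chunk-*/file-*.parquet")):
        match = PATH_RE.search(str(src_path))
        if match is None:
            raise ValueError(f"Unexpected path structure: {src_path}")
        # Indices come from the path itself, never from the episode metadata.
        chunk_idx, file_idx = int(match.group(1)), int(match.group(2))

        df = pd.read_parquet(src_path)
        df = transform(df)

        # Recreate the exact same chunk/file indices in the destination,
        # so two source files can never collide on one destination file.
        dst_path = dst_root / "data" / f"chunk-{chunk_idx:03d}" / f"file-{file_idx:03d}.parquet"
        dst_path.parent.mkdir(parents=True, exist_ok=True)
        df.to_parquet(dst_path)

The key point is that the destination indices are derived from the source path, so the layout is preserved regardless of what the episode metadata claims.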

Testing

Tested with the HuggingFaceVLA/libero dataset by regenerating the same dataset with the next.reward feature; the result is available at aractingi/libero-reward.
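
For reference, the kind of consistency check that catches this bug is a simple count comparison between the source and the regenerated dataset (a sketch; the import path and the num_episodes/num_frames properties follow recent lerobot versions and should be treated as assumptions):

# Sketch: verify no episodes/frames were dropped by the copy.
# Assumption: LeRobotDataset exposes num_episodes / num_frames;
# the import path may differ across lerobot versions.
from lerobot.datasets.lerobot_dataset import LeRobotDataset

src = LeRobotDataset("HuggingFaceVLA/libero")
dst = LeRobotDataset("aractingi/libero-reward")

assert dst.num_episodes == src.num_episodes, (src.num_episodes, dst.num_episodes)
assert dst.num_frames == src.num_frames, (src.num_frames, dst.num_frames)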

Copilot AI review requested due to automatic review settings October 29, 2025 17:22
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR refactors the _copy_data_with_feature_changes function to simplify how it determines chunk and file indices when copying dataset files. Instead of loading episode metadata and mapping files to episodes, the function now directly parses the indices from the file paths.

  • Removed dependency on episode metadata loading for extracting chunk/file indices
  • Changed from episode metadata-based extraction to direct file path parsing
  • Simplified the file processing logic by eliminating the file-to-episodes mapping


for src_path in tqdm(parquet_files, desc="Processing data files"):
    df = pd.read_parquet(src_path).reset_index(drop=True)

    relative_path = src_path.relative_to(dataset.root)

Copilot AI Oct 29, 2025


Hardcoded array indexing without bounds checking. If the path structure differs from expected data/chunk-XXX/file-YYY.parquet, this will raise an IndexError. Consider validating that len(relative_path.parts) >= 3 before accessing indices.

Suggested change
- relative_path = src_path.relative_to(dataset.root)
+ relative_path = src_path.relative_to(dataset.root)
+ if len(relative_path.parts) < 3:
+     raise ValueError(
+         f"Unexpected path structure for {src_path}: expected at least 3 parts, got {len(relative_path.parts)} ({relative_path.parts})"
+     )

Comment on lines +980 to +981
chunk_idx = int(chunk_dir.split("-")[1])
file_idx = int(file_name.split("-")[1].split(".")[0])

Copilot AI Oct 29, 2025


String parsing without validation. If file naming convention differs from chunk-{number} or file-{number}.parquet, this will raise IndexError or ValueError. Consider adding error handling or validation to ensure the expected format before parsing, or using regex with pattern matching for more robust extraction.
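
A regex-based variant of that parsing could look like this (illustrative only, not part of the PR; it reuses the chunk_dir and file_name variables from the snippet above):

import re

# Strict parsing: fail with a clear error if the naming convention changes.
chunk_match = re.fullmatch(r"chunk-(\d+)", chunk_dir)
file_match = re.fullmatch(r"file-(\d+)\.parquet", file_name)
if chunk_match is None or file_match is None:
    raise ValueError(f"Unexpected file naming: {chunk_dir}/{file_name}")
chunk_idx = int(chunk_match.group(1))
file_idx = int(file_match.group(1))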

@michel-aractingi michel-aractingi self-assigned this Oct 29, 2025
@michel-aractingi michel-aractingi added the bug (Something isn’t working correctly) and dataset (Issues regarding data inputs, processing, or datasets) labels Oct 29, 2025