fix(dataset_tools) Critical bug in modify features #2342
base: main
Conversation
Pull Request Overview
This PR refactors the `_copy_data_with_feature_changes` function to simplify how it determines chunk and file indices when copying dataset files. Instead of loading episode metadata and mapping files to episodes, the function now directly parses the indices from the file paths.
- Removed dependency on episode metadata loading for extracting chunk/file indices
- Changed from episode metadata-based extraction to direct file path parsing
- Simplified the file processing logic by eliminating the file-to-episodes mapping
```python
for src_path in tqdm(parquet_files, desc="Processing data files"):
    df = pd.read_parquet(src_path).reset_index(drop=True)

    relative_path = src_path.relative_to(dataset.root)
```
Copilot AI · Oct 29, 2025
Hardcoded array indexing without bounds checking. If the path structure differs from the expected `data/chunk-XXX/file-YYY.parquet`, this will raise an `IndexError`. Consider validating that `len(relative_path.parts) >= 3` before accessing indices.
Suggested change:
```python
relative_path = src_path.relative_to(dataset.root)
if len(relative_path.parts) < 3:
    raise ValueError(
        f"Unexpected path structure for {src_path}: expected at least 3 parts, got {len(relative_path.parts)} ({relative_path.parts})"
    )
```
```python
chunk_idx = int(chunk_dir.split("-")[1])
file_idx = int(file_name.split("-")[1].split(".")[0])
```
Copilot AI · Oct 29, 2025
String parsing without validation. If the file naming convention differs from `chunk-{number}` or `file-{number}.parquet`, this will raise an `IndexError` or `ValueError`. Consider adding error handling or validation to ensure the expected format before parsing, or using a regex with pattern matching for more robust extraction.
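A minimal sketch of the regex alternative the comment suggests (the patterns and helper name `parse_indices` are illustrative, not part of this PR):

```python
import re
from pathlib import Path

# Patterns for the expected layout: data/chunk-XXX/file-YYY.parquet
CHUNK_RE = re.compile(r"^chunk-(\d+)$")
FILE_RE = re.compile(r"^file-(\d+)\.parquet$")

def parse_indices(relative_path: Path) -> tuple[int, int]:
    """Extract (chunk_idx, file_idx), failing loudly on unexpected names."""
    if len(relative_path.parts) < 2:
        raise ValueError(f"Unexpected data file path: {relative_path}")
    chunk_match = CHUNK_RE.match(relative_path.parts[-2])
    file_match = FILE_RE.match(relative_path.name)
    if chunk_match is None or file_match is None:
        raise ValueError(f"Unexpected data file path: {relative_path}")
    return int(chunk_match.group(1)), int(file_match.group(1))
```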
What this does
Fixes the logic in `modify_features` that was resulting in datasets with missing episodes.

Bug
When using `add_features()` in `dataset_tools.py`, only a small fraction of episodes were being copied to the new dataset. I caught this when using `add_features` to add a reward field to the libero dataset (1693 episodes, 273,465 frames): the resulting dataset contained only 175 episodes and 47,369 frames.

The bug was in the `_copy_data_with_feature_changes` function. The buggy implementation read the metadata parquet files instead of the data files directly. This caused multiple source files to be written to the same destination file, while the metadata was copied directly from the source metadata, so data was silently overwritten and lost. For example, several source files could all end up written to `file-000.parquet`, each write replacing the previous one.
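Schematically, the failure mode looked like this (paths are made up for illustration; this is not the repo's code):

```python
# Metadata-derived destination indices let distinct source files resolve
# to the same destination path:
src_to_dst = {
    "data/chunk-000/file-000.parquet": "data/chunk-000/file-000.parquet",
    "data/chunk-000/file-001.parquet": "data/chunk-000/file-000.parquet",  # collision
    "data/chunk-000/file-002.parquet": "data/chunk-000/file-000.parquet",  # collision
}
# Each write to chunk-000/file-000.parquet replaces the previous one, so only
# the last source file's episodes survive in the new dataset.
```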
Fix
The fix changes the approach to parse the chunk and file indices directly from each source file's path, instead of mapping files to episodes via the episode metadata. This ensures a 1:1 mapping between source and destination files, preventing any data loss. A sketch of the resulting index extraction follows.
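A minimal sketch of the path-based extraction, assuming the `data/chunk-XXX/file-YYY.parquet` layout shown in the diff above; the helper name `dest_indices` is illustrative, not the PR's actual code:

```python
from pathlib import Path

def dest_indices(src_path: Path, root: Path) -> tuple[int, int]:
    """Parse chunk/file indices straight from a data file's path, so each
    source file maps to exactly one destination file."""
    relative_path = src_path.relative_to(root)
    # Expected layout: data/chunk-XXX/file-YYY.parquet
    if len(relative_path.parts) < 3:
        raise ValueError(f"Unexpected path structure for {src_path}: {relative_path.parts}")
    chunk_dir, file_name = relative_path.parts[-2], relative_path.name
    chunk_idx = int(chunk_dir.split("-")[1])               # "chunk-000" -> 0
    file_idx = int(file_name.split("-")[1].split(".")[0])  # "file-012.parquet" -> 12
    return chunk_idx, file_idx

# e.g. dest_indices(Path("/ds/data/chunk-000/file-012.parquet"), Path("/ds")) == (0, 12)
```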
Testing
Tested with the `HuggingFaceVLA/libero` dataset by generating the same dataset with the `next.reward` feature in `aractingi/libero-reward`.