[WIP]Multimodal parquet reader wds writer#1559
Draft
VibhuJawa wants to merge 52 commits into NVIDIA-NeMo:main from
Conversation
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
… headers Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…split_table, expand tests

- Rename metadata_source -> source_ref with a cleaner schema (path, member, byte_offset, byte_size) and soft migration for older content_path/content_key payloads
- Fix iter_materialized_bytes index alignment bug (positional vs DataFrame index)
- Fix variable shadowing in webdataset reader materialize-on-read path
- Fix groupby dropping NaN rows in materialization (add dropna=False)
- Optimize split_table_by_group_max_bytes: sort+slice O(n log n) instead of filter-per-group O(groups * rows)
- Remove source_shard from _ReadContext, derive inline from tar_path
- Embed _sample_source provenance in metadata_json instead of source_ref
- Fix BaseMultimodalReader.name default from empty string
- Add 16 new tests (28 total): split_table, validity mask, composite decompose, materialize, source_ref parsing with soft migration
- Add AGENTS.md with environment setup and benchmark dataset paths

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…put_metrics

Use get_all_file_paths_and_size_under once and extract paths from the result tuples instead of calling both functions on the same directory.

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…oqa suppressions

- Add _classify_rows() to partition image rows into tar-extract, range-read, and direct-read groups based on source_ref content
- Add _fill_range_read_rows() using fs.cat_ranges() for batched byte-range reads (key optimization for remote/S3 paths with byte_offset/byte_size)
- Populate byte_offset/byte_size from TarInfo in WebdatasetReaderStage so downstream materialization can use range reads
- Remove duplicate _load_image_bytes_from_tar from webdataset reader; use shared _extract_tar_member helper
- Remove load_bytes_from_content_reference and load_bytes_from_source_ref (consolidated into the three-strategy dispatch)
- Replace all bare except Exception (noqa: BLE001) with specific exception types: OSError, tarfile.TarError, ValueError, SyntaxError
- Add no-noqa rule to AGENTS.md
- Add 9 new tests for classify_rows, range-read, tar-extract, mixed dispatch, and error handling (37 total)

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
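The three-strategy dispatch above can be sketched as a per-row classification over source_ref fields. This is a hypothetical simplification (the actual _classify_rows partitions DataFrame rows, and the precedence between a tar member and explicit byte offsets is an assumption here — the commit suggests offsets from TarInfo enable range reads even for tar-sourced rows):

```python
def classify_row(source_ref: dict) -> str:
    """Pick a read strategy from the fields present in a source_ref.

    Hypothetical sketch of the three-way dispatch described in the
    commit message; field precedence is assumed, not confirmed.
    """
    if source_ref.get("byte_offset") is not None and source_ref.get("byte_size") is not None:
        # Offset + size allow a batched byte-range read (e.g. fsspec's
        # fs.cat_ranges) without downloading the rest of the file.
        return "range-read"
    if source_ref.get("member"):
        # Only a tar member name: open the archive and extract.
        return "tar-extract"
    if source_ref.get("path"):
        # Only a path: read the object in full.
        return "direct-read"
    return "skip"
```

Ordering range-read first means rows that carry TarInfo-derived offsets avoid opening the tar at all, which is the stated optimization for remote/S3 paths.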
…BatchTask

- Add RESERVED_COLUMNS frozenset derived from MULTIMODAL_SCHEMA as the single source of truth for pipeline-managed column names
- Add REQUIRED_COLUMNS class attribute derived from non-nullable schema fields
- Add module docstring with a full reserved vs user column reference table
- Reader _build_passthrough_row now uses RESERVED_COLUMNS instead of re-listing all 9 column names
- Reorganize MultiBatchTask with section comments and method docstrings
- Update README with the reserved/user column distinction and a passthrough example

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Writer:
- Merge BaseMultimodalTabularWriter into BaseMultimodalWriter (3 layers -> 2)
- materialize_on_write, _materialize_dataframe(), and the write_data() template now live in the base; subclasses only implement _write_dataframe()
- MultimodalParquetWriterStage inherits BaseMultimodalWriter directly
- Remove BaseMultimodalTabularWriter from exports

Reader:
- Add _SampleContext dataclass bundling per-sample state (sample_id, sample, tar_path, json_member_name, member_names, member_info, passthrough)
- _metadata_row, _text_rows, _image_rows each take (self, ctx) instead of 6 separate positional args -- cleaner for subclassing
- _build_row is now a @staticmethod taking (ctx, row_fields dict)
- _build_source_ref takes (self, ctx, content_key) instead of 4 params

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
- Deduplicate range reads in _fill_range_read_rows: multiple rows referencing the same byte range now share a single fs.cat_ranges() call (14x I/O reduction for MINT-1T TIFF data)
- Extract individual frames from multi-frame TIFFs during materialization (frame_index in source_ref), eliminating 17x data duplication in parquet output (45GB -> 865MB for 5 shards)
- Fix _resolve_image_content_key: None tokens produce an empty source_ref (no materialization); non-matching strings fall back to the default member with frame_index for TIFF content types
- Fix _build_source_ref: content_key=None emits path=None to prevent reading the entire tar file as raw bytes via the direct_read path
- Reconcile table schema for non-empty tables (canonical types for reserved columns, inferred types for passthrough)
- Optimize collect_parquet_output_metrics: use pq.ParquetFile metadata for row counts and selective column reads instead of pd.read_parquet
- Remove dead content_path/content_key soft-migration from parse_source_ref (net-new feature, no legacy data)
- Clear byte_cache per sample in reader to bound memory during materialize_on_read

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
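The range-read deduplication above can be sketched in plain Python: collapse identical (path, offset, size) requests into one batched fetch, then fan the results back out to every requesting row. The `fetch_ranges` callable is a stand-in for a batched reader such as fsspec's `fs.cat_ranges(paths, starts, ends)`; the function name and shape here are hypothetical:

```python
def dedup_range_reads(requests, fetch_ranges):
    """Issue one fetch per distinct byte range and fan results back out.

    `requests` is a list of (path, offset, size) tuples, one per row.
    `fetch_ranges(paths, starts, ends)` mirrors fsspec's cat_ranges
    signature and returns one bytes object per range. Hypothetical
    sketch of the deduplication described in the commit message.
    """
    unique = list(dict.fromkeys(requests))  # distinct ranges, first-seen order
    paths = [p for p, _, _ in unique]
    starts = [o for _, o, _ in unique]
    ends = [o + s for _, o, s in unique]
    blobs = fetch_ranges(paths, starts, ends)
    by_range = dict(zip(unique, blobs))
    # Every row still gets its bytes; duplicated ranges were read once.
    return [by_range[r] for r in requests]
```

With many rows pointing at the same multi-frame TIFF range, the number of issued reads drops to the number of distinct ranges, which is the reported 14x I/O reduction.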
- Add frame_index parameter to the build_source_ref signature, source_ref JSON example, bullet descriptions, and schema table
- Document TIFF frame extraction behavior in the materialization section
- Fix MultimodalParquetWriter -> MultimodalParquetWriterStage in the architecture diagram

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
- Add MultimodalParquetReaderStage and composite MultimodalParquetReader for reading parquet files in MULTIMODAL_SCHEMA format
- Add MultimodalWebdatasetWriterStage for writing multimodal rows to WebDataset tar shards, with fsspec support for remote (S3) writes
- Extract shared read_parquet_files utility from the text ParquetReaderStage
- Add split_table_by_group_max_bytes to core utils for Arrow table splitting
- Add Parquet-to-WDS benchmark script with S3 source_ref filtering
- Fix test_classify_rows_range_read assertion (4-tuple, not 5-tuple)
- Fix lint: unused static method args and pd.isnull -> pd.isna in tests
- Fix _write_tar to use fsspec.open() for remote filesystem compatibility

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
Skip None placeholder entries in _text_rows and _image_rows so that parallel texts/images arrays produce non-overlapping interleaved positions instead of duplicate rows at every position.

Before: 996 rows per shard (470 text + 470 image + 56 metadata) with overlapping positions between the text and image modalities.
After: 526 rows per shard (228 text + 242 image + 56 metadata) with correct non-overlapping interleaved positions.

Also replace split_table_by_group_max_bytes with a faster Arrow-native implementation using pc.not_equal + pc.indices_nonzero for group boundary detection and average-bytes-per-row estimation.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
…ilter position recompute

- Change metadata_json category from Internal to Content (docs + README)
- Add count(modality=) method for filtered row counts; num_items now returns the unique sample count (distinct sample_id values)
- Add add_rows() and delete_rows() stubs (NotImplementedError)
- Recompute content positions in BaseMultimodalFilterStage after row drops to close gaps (metadata rows keep position=-1)
- Add tests for position recomputation, count(), and the pandas data path

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
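The position-recompute step described above — closing gaps left by dropped rows while keeping metadata rows at the sentinel position — maps naturally onto a pandas groupby-cumcount. A minimal sketch, assuming a DataFrame with sample_id and position columns in interleaved row order (the real BaseMultimodalFilterStage may differ in details):

```python
import pandas as pd

def recompute_positions(df: pd.DataFrame) -> pd.DataFrame:
    """Renumber content positions 0..n-1 per sample after row drops.

    Content rows (position >= 0) get fresh consecutive positions in
    their existing row order; metadata rows keep position=-1. Sketch
    of the gap-closing behavior described in the commit message.
    """
    out = df.copy()
    content = out["position"] >= 0
    # cumcount preserves row order within each sample, so the
    # interleaving survives; only the gaps close.
    out.loc[content, "position"] = out[content].groupby("sample_id").cumcount()
    return out
```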
…mary

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> Made-with: Cursor
…quet_reader_wds_writer
…ation
- Add MultimodalLanceWriterStage extending BaseMultimodalWriter for Lance format
- Fix WebDataset writer to preserve interleaving via position-indexed parallel
texts/images arrays (previously compact arrays lost position information)
- Fix _sanitize_key to strip dots from WebDataset keys (dots are extension
separators per the WDS spec; sample_ids containing ".parquet" caused all
samples to merge into one)
- Use position-based image naming (e.g., "0.jpg", "3.jpg") so images[pos]
directly maps to sample[images[pos]] per WebDataset dict key convention
- Preserve all extra (non-schema) columns in WebDataset JSON via _row_extra
structured by modality: {text: [...], image: [...], metadata: {...}} with
1:1 alignment to texts/images arrays for zero data loss
- Fix BaseMultimodalWriter.write_data to not leak storage_options into
format-specific _write_dataframe calls
- Add test_merged_parquet_3format_write.py benchmark script for Parquet,
WebDataset, and Lance materialized writes from merged domain-bucket data
- Add explore_outputs.ipynb verification notebook with webdataset/lance/PIL
compliance checks and filtering UX demos
- Add unit tests for Lance writer and WDS extra column preservation
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
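The _sanitize_key fix above matters because WebDataset derives the sample key from everything before the first dot in a member filename, so a dot inside a sample id (such as an embedded ".parquet") silently merges unrelated samples. A minimal sketch of such a sanitizer — the function name and choice of replacement character are assumptions; the real _sanitize_key may normalize more than dots:

```python
def sanitize_key(sample_id: str) -> str:
    """Make a sample id safe for use as a WebDataset key.

    WebDataset groups tar members by the prefix before the first dot
    ("key.ext" -> "key"), so dots inside the id must not survive.
    Hypothetical sketch; replacing with "_" keeps keys distinct.
    """
    return sample_id.replace(".", "_")
```

Replacing rather than deleting the dot preserves uniqueness when two ids differ only around a dot (e.g. "a.b" vs "ab").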
The benchmark script and notebook are local test files that should not be tracked. They remain on disk for local use.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
Description
Usage
# Add snippet demonstrating usage

Checklist