[WIP] Multimodal parquet reader / WDS writer #1559

Draft
VibhuJawa wants to merge 52 commits into NVIDIA-NeMo:main from
VibhuJawa:multimodal_parquet_reader_wds_writer

Conversation

@VibhuJawa
Contributor

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

VibhuJawa and others added 30 commits February 24, 2026 01:46
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
… headers

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
…split_table, expand tests

- Rename metadata_source -> source_ref with cleaner schema (path, member, byte_offset, byte_size)
  and soft migration for older content_path/content_key payloads
- Fix iter_materialized_bytes index alignment bug (positional vs DataFrame index)
- Fix variable shadowing in webdataset reader materialize-on-read path
- Fix groupby dropping NaN rows in materialization (add dropna=False)
- Optimize split_table_by_group_max_bytes: sort+slice O(n log n) instead of
  filter-per-group O(groups * rows)
- Remove source_shard from _ReadContext, derive inline from tar_path
- Embed _sample_source provenance in metadata_json instead of source_ref
- Fix BaseMultimodalReader.name default from empty string
- Add 16 new tests (28 total): split_table, validity mask, composite decompose,
  materialize, source_ref parsing with soft migration
- Add AGENTS.md with environment setup and benchmark dataset paths

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…put_metrics

Use get_all_file_paths_and_size_under once and extract paths from the
result tuples instead of calling both functions on the same directory.

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…oqa suppressions

- Add _classify_rows() to partition image rows into tar-extract, range-read,
  and direct-read groups based on source_ref content
- Add _fill_range_read_rows() using fs.cat_ranges() for batched byte-range
  reads (key optimization for remote/S3 paths with byte_offset/byte_size)
- Populate byte_offset/byte_size from TarInfo in WebdatasetReaderStage so
  downstream materialization can use range reads
- Remove duplicate _load_image_bytes_from_tar from webdataset reader; use
  shared _extract_tar_member helper
- Remove load_bytes_from_content_reference and load_bytes_from_source_ref
  (consolidated into the three-strategy dispatch)
- Replace all bare except Exception (noqa: BLE001) with specific exception
  types: OSError, tarfile.TarError, ValueError, SyntaxError
- Add no-noqa rule to AGENTS.md
- Add 9 new tests for classify_rows, range-read, tar-extract, mixed dispatch,
  and error handling (37 total)
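
The three-strategy dispatch described above can be sketched as follows. This is a minimal illustration with hypothetical field access and function names; the actual logic lives in `_classify_rows()` and operates on DataFrame rows:

```python
# Illustrative sketch (hypothetical row/ref shape): partition image rows into
# the three materialization strategies based on what source_ref provides.
def classify_rows(rows):
    """Split rows into tar-extract, range-read, and direct-read groups."""
    tar_extract, range_read, direct_read = [], [], []
    for row in rows:
        ref = row.get("source_ref") or {}
        if ref.get("byte_offset") is not None and ref.get("byte_size") is not None:
            # Byte range known: these can be batched into one fs.cat_ranges() call,
            # the key optimization for remote/S3 paths.
            range_read.append(row)
        elif ref.get("member"):
            # Tar member named but no byte range: open the archive and extract.
            tar_extract.append(row)
        elif ref.get("path"):
            # Plain file reference: read the whole object directly.
            direct_read.append(row)
    return tar_extract, range_read, direct_read
```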

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…BatchTask

- Add RESERVED_COLUMNS frozenset derived from MULTIMODAL_SCHEMA as single
  source of truth for pipeline-managed column names
- Add REQUIRED_COLUMNS class attribute derived from non-nullable schema fields
- Add module docstring with full reserved vs user column reference table
- Reader _build_passthrough_row now uses RESERVED_COLUMNS instead of
  re-listing all 9 column names
- Reorganize MultiBatchTask with section comments and method docstrings
- Update README with reserved/user column distinction and passthrough example
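
The single-source-of-truth idea can be sketched like this. The schema shape below is illustrative (not the real 9-column `MULTIMODAL_SCHEMA`); the point is that both column sets derive from one definition so they cannot drift apart:

```python
# Minimal sketch (hypothetical schema shape): derive reserved and required
# column sets from one schema definition.
SCHEMA_NULLABILITY = {
    # column name -> nullable?  (illustrative subset only)
    "sample_id": False,
    "modality": False,
    "position": False,
    "metadata_json": True,
}

RESERVED_COLUMNS = frozenset(SCHEMA_NULLABILITY)
REQUIRED_COLUMNS = frozenset(
    name for name, nullable in SCHEMA_NULLABILITY.items() if not nullable
)

def build_passthrough_row(row: dict) -> dict:
    """Keep only user (non-reserved) columns, as _build_passthrough_row does."""
    return {k: v for k, v in row.items() if k not in RESERVED_COLUMNS}
```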

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Writer:
- Merge BaseMultimodalTabularWriter into BaseMultimodalWriter (3 layers -> 2)
- materialize_on_write, _materialize_dataframe(), and write_data() template
  now live in the base; subclasses only implement _write_dataframe()
- MultimodalParquetWriterStage inherits BaseMultimodalWriter directly
- Remove BaseMultimodalTabularWriter from exports

Reader:
- Add _SampleContext dataclass bundling per-sample state (sample_id, sample,
  tar_path, json_member_name, member_names, member_info, passthrough)
- _metadata_row, _text_rows, _image_rows each take (self, ctx) instead of
  6 separate positional args -- cleaner for subclassing
- _build_row is now a @staticmethod taking (ctx, row_fields dict)
- _build_source_ref takes (self, ctx, content_key) instead of 4 params

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
- Deduplicate range reads in _fill_range_read_rows: multiple rows
  referencing the same byte range now share a single fs.cat_ranges()
  call (14x I/O reduction for MINT-1T TIFF data)
- Extract individual frames from multi-frame TIFFs during
  materialization (frame_index in source_ref), eliminating 17x data
  duplication in parquet output (45GB -> 865MB for 5 shards)
- Fix _resolve_image_content_key: None tokens produce empty source_ref
  (no materialization), non-matching strings fall back to default member
  with frame_index for TIFF content types
- Fix _build_source_ref: content_key=None emits path=None to prevent
  reading entire tar file as raw bytes via direct_read path
- Reconcile table schema for non-empty tables (canonical types for
  reserved columns, inferred types for passthrough)
- Optimize collect_parquet_output_metrics: use pq.ParquetFile metadata
  for row counts and selective column reads instead of pd.read_parquet
- Remove dead content_path/content_key soft-migration from
  parse_source_ref (net-new feature, no legacy data)
- Clear byte_cache per sample in reader to bound memory during
  materialize_on_read
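
The range-read deduplication can be sketched as below, a plain-Python illustration with hypothetical names; `fetch_ranges` stands in for a batched reader such as fsspec's `cat_ranges`:

```python
from collections import defaultdict

# Illustrative sketch: rows referencing the same (path, offset, size) share one
# fetched buffer instead of triggering one read each.
def fill_range_read_rows(rows, fetch_ranges):
    """fetch_ranges(paths, starts, ends) -> list[bytes], one entry per range."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        ref = row["source_ref"]
        groups[(ref["path"], ref["byte_offset"], ref["byte_size"])].append(i)

    keys = list(groups)
    paths = [k[0] for k in keys]
    starts = [k[1] for k in keys]
    ends = [k[1] + k[2] for k in keys]
    buffers = fetch_ranges(paths, starts, ends)  # one I/O call per unique range

    out = [None] * len(rows)
    for key, buf in zip(keys, buffers):
        for i in groups[key]:
            # A frame_index in source_ref would select a single TIFF frame here.
            out[i] = buf
    return out
```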

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
- Add frame_index parameter to build_source_ref signature, source_ref
  JSON example, bullet descriptions, and schema table
- Document TIFF frame extraction behavior in materialization section
- Fix MultimodalParquetWriter -> MultimodalParquetWriterStage in
  architecture diagram

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
- Add MultimodalParquetReaderStage and composite MultimodalParquetReader
  for reading parquet files in MULTIMODAL_SCHEMA format
- Add MultimodalWebdatasetWriterStage for writing multimodal rows to
  WebDataset tar shards with fsspec support for remote (S3) writes
- Extract shared read_parquet_files utility from text ParquetReaderStage
- Add split_table_by_group_max_bytes to core utils for Arrow table splitting
- Add Parquet-to-WDS benchmark script with S3 source_ref filtering
- Fix test_classify_rows_range_read assertion (4-tuple not 5-tuple)
- Fix lint: unused static method args and pd.isnull -> pd.isna in tests
- Fix _write_tar to use fsspec.open() for remote filesystem compatibility

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
@copy-pr-bot

copy-pr-bot bot commented Feb 27, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


Skip None placeholder entries in _text_rows and _image_rows so that
parallel texts/images arrays produce non-overlapping interleaved
positions instead of duplicate rows at every position.

Before: 996 rows per shard (470 text + 470 image + 56 metadata) with
overlapping positions between text and image modalities.
After: 526 rows per shard (228 text + 242 image + 56 metadata) with
correct non-overlapping interleaved positions.
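
The fix can be sketched as follows (hypothetical row shape): texts and images are position-indexed parallel arrays with `None` placeholders, and a row is emitted only where the entry is not `None`, which is what makes the positions non-overlapping:

```python
# Minimal sketch: skip None placeholders so parallel arrays yield
# non-overlapping interleaved positions.
def interleaved_rows(sample_id, texts, images):
    rows = []
    for pos, text in enumerate(texts):
        if text is not None:
            rows.append({"sample_id": sample_id, "modality": "text", "position": pos})
    for pos, image in enumerate(images):
        if image is not None:
            rows.append({"sample_id": sample_id, "modality": "image", "position": pos})
    return rows
```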

Also replace split_table_by_group_max_bytes with a faster Arrow-native
implementation using pc.not_equal + pc.indices_nonzero for group
boundary detection and average-bytes-per-row estimation.
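
The splitting logic can be sketched in plain Python (the real implementation works on an Arrow table and finds boundaries with `pc.not_equal` + `pc.indices_nonzero`; names and the greedy cut strategy here are illustrative):

```python
# Plain-Python sketch: split row indices at group boundaries so no chunk
# exceeds max_bytes, estimating per-row size as a table-wide average.
# A single group larger than the budget is never split.
def split_by_group_max_bytes(group_ids, avg_row_bytes, max_bytes):
    n = len(group_ids)
    # Boundaries: positions where the group id differs from the previous row.
    boundaries = [i for i in range(1, n) if group_ids[i] != group_ids[i - 1]]
    boundaries.append(n)

    chunks, start, prev = [], 0, 0
    for b in boundaries:
        # If extending the current chunk to this boundary would exceed the
        # budget, cut at the previous boundary first.
        if prev > start and (b - start) * avg_row_bytes > max_bytes:
            chunks.append((start, prev))
            start = prev
        prev = b
    chunks.append((start, n))
    return chunks
```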

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
…ilter position recompute

- Change metadata_json category from Internal to Content (docs + README)
- Add count(modality=) method for filtered row counts; num_items now
  returns unique sample count (distinct sample_id values)
- Add add_rows() and delete_rows() stubs (NotImplementedError)
- Recompute content positions in BaseMultimodalFilterStage after row
  drops to close gaps (metadata rows keep position=-1)
- Add tests for position recomputation, count(), and pandas data path
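
The position recompute can be sketched as below (hypothetical row shape): after a filter drops content rows, surviving rows in each sample are renumbered 0..k-1 to close the gaps, while metadata rows keep the sentinel position -1:

```python
# Illustrative sketch of the post-filter position recompute.
def recompute_positions(rows):
    counters = {}
    for row in sorted(rows, key=lambda r: (r["sample_id"], r["position"])):
        if row["modality"] == "metadata":
            row["position"] = -1  # metadata rows are not part of the interleaving
            continue
        next_pos = counters.get(row["sample_id"], 0)
        row["position"] = next_pos
        counters[row["sample_id"]] = next_pos + 1
    return rows
```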

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
…mary

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
…ation

- Add MultimodalLanceWriterStage extending BaseMultimodalWriter for Lance format
- Fix WebDataset writer to preserve interleaving via position-indexed parallel
  texts/images arrays (previously compact arrays lost position information)
- Fix _sanitize_key to strip dots from WebDataset keys (dots are extension
  separators per the WDS spec; sample_ids containing ".parquet" caused all
  samples to merge into one)
- Use position-based image naming (e.g., "0.jpg", "3.jpg") so images[pos]
  directly maps to sample[images[pos]] per WebDataset dict key convention
- Preserve all extra (non-schema) columns in WebDataset JSON via _row_extra
  structured by modality: {text: [...], image: [...], metadata: {...}} with
  1:1 alignment to texts/images arrays for zero data loss
- Fix BaseMultimodalWriter.write_data to not leak storage_options into
  format-specific _write_dataframe calls
- Add test_merged_parquet_3format_write.py benchmark script for Parquet,
  WebDataset, and Lance materialized writes from merged domain-bucket data
- Add explore_outputs.ipynb verification notebook with webdataset/lance/PIL
  compliance checks and filtering UX demos
- Add unit tests for Lance writer and WDS extra column preservation
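
The key-sanitization and naming rules can be sketched as follows (illustrative helpers, not the real signatures): WebDataset treats the first dot in a member name as the key/extension separator, so dots inside a `sample_id` must be stripped, and images are named by position so `images[pos]` maps straight to `sample[images[pos]]`:

```python
# Illustrative sketch of the WDS key handling described above.
def sanitize_key(sample_id: str) -> str:
    """Strip dots so the whole sample_id stays in the WDS key prefix."""
    return sample_id.replace(".", "")

def image_member_name(sample_id: str, position: int, ext: str = "jpg") -> str:
    """Tar member "<key>.<pos>.<ext>" decodes to dict key "<pos>.<ext>"."""
    return f"{sanitize_key(sample_id)}.{position}.{ext}"
```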

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
The benchmark script and notebook are local test files that should not
be tracked. They remain on disk for local use.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor