[WIP] Multimodal parquet reader / WDS writer #1559

Draft
VibhuJawa wants to merge 52 commits into NVIDIA-NeMo:main from
VibhuJawa:multimodal_parquet_reader_wds_writer

Conversation

@VibhuJawa
Contributor

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

VibhuJawa and others added 30 commits February 24, 2026 01:46
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
… headers

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
…split_table, expand tests

- Rename metadata_source -> source_ref with cleaner schema (path, member, byte_offset, byte_size)
  and soft migration for older content_path/content_key payloads
- Fix iter_materialized_bytes index alignment bug (positional vs DataFrame index)
- Fix variable shadowing in webdataset reader materialize-on-read path
- Fix groupby dropping NaN rows in materialization (add dropna=False)
- Optimize split_table_by_group_max_bytes: sort+slice O(n log n) instead of
  filter-per-group O(groups * rows)
- Remove source_shard from _ReadContext, derive inline from tar_path
- Embed _sample_source provenance in metadata_json instead of source_ref
- Fix BaseMultimodalReader.name default from empty string
- Add 16 new tests (28 total): split_table, validity mask, composite decompose,
  materialize, source_ref parsing with soft migration
- Add AGENTS.md with environment setup and benchmark dataset paths

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…put_metrics

Use get_all_file_paths_and_size_under once and extract paths from the
result tuples instead of calling both functions on the same directory.

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…oqa suppressions

- Add _classify_rows() to partition image rows into tar-extract, range-read,
  and direct-read groups based on source_ref content
- Add _fill_range_read_rows() using fs.cat_ranges() for batched byte-range
  reads (key optimization for remote/S3 paths with byte_offset/byte_size)
- Populate byte_offset/byte_size from TarInfo in WebdatasetReaderStage so
  downstream materialization can use range reads
- Remove duplicate _load_image_bytes_from_tar from webdataset reader; use
  shared _extract_tar_member helper
- Remove load_bytes_from_content_reference and load_bytes_from_source_ref
  (consolidated into the three-strategy dispatch)
- Replace all bare except Exception (noqa: BLE001) with specific exception
  types: OSError, tarfile.TarError, ValueError, SyntaxError
- Add no-noqa rule to AGENTS.md
- Add 9 new tests for classify_rows, range-read, tar-extract, mixed dispatch,
  and error handling (37 total)
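
The three-strategy dispatch described above can be sketched as follows. This is a minimal illustration with hypothetical field access and function names; the actual logic lives in `_classify_rows()` and operates on DataFrame rows:

```python
# Illustrative sketch (hypothetical row/ref shape): partition image rows into
# the three materialization strategies based on what source_ref provides.
def classify_rows(rows):
    """Split rows into tar-extract, range-read, and direct-read groups."""
    tar_extract, range_read, direct_read = [], [], []
    for row in rows:
        ref = row.get("source_ref") or {}
        if ref.get("byte_offset") is not None and ref.get("byte_size") is not None:
            # Byte range known: these can be batched into one fs.cat_ranges() call,
            # the key optimization for remote/S3 paths.
            range_read.append(row)
        elif ref.get("member"):
            # Tar member named but no byte range: open the archive and extract.
            tar_extract.append(row)
        elif ref.get("path"):
            # Plain file reference: read the whole object directly.
            direct_read.append(row)
    return tar_extract, range_read, direct_read
```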

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…BatchTask

- Add RESERVED_COLUMNS frozenset derived from MULTIMODAL_SCHEMA as single
  source of truth for pipeline-managed column names
- Add REQUIRED_COLUMNS class attribute derived from non-nullable schema fields
- Add module docstring with full reserved vs user column reference table
- Reader _build_passthrough_row now uses RESERVED_COLUMNS instead of
  re-listing all 9 column names
- Reorganize MultiBatchTask with section comments and method docstrings
- Update README with reserved/user column distinction and passthrough example
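
The single-source-of-truth idea can be sketched like this. The schema shape below is illustrative (not the real 9-column `MULTIMODAL_SCHEMA`); the point is that both column sets derive from one definition so they cannot drift apart:

```python
# Minimal sketch (hypothetical schema shape): derive reserved and required
# column sets from one schema definition.
SCHEMA_NULLABILITY = {
    # column name -> nullable?  (illustrative subset only)
    "sample_id": False,
    "modality": False,
    "position": False,
    "metadata_json": True,
}

RESERVED_COLUMNS = frozenset(SCHEMA_NULLABILITY)
REQUIRED_COLUMNS = frozenset(
    name for name, nullable in SCHEMA_NULLABILITY.items() if not nullable
)

def build_passthrough_row(row: dict) -> dict:
    """Keep only user (non-reserved) columns, as _build_passthrough_row does."""
    return {k: v for k, v in row.items() if k not in RESERVED_COLUMNS}
```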

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Writer:
- Merge BaseMultimodalTabularWriter into BaseMultimodalWriter (3 layers -> 2)
- materialize_on_write, _materialize_dataframe(), and write_data() template
  now live in the base; subclasses only implement _write_dataframe()
- MultimodalParquetWriterStage inherits BaseMultimodalWriter directly
- Remove BaseMultimodalTabularWriter from exports

Reader:
- Add _SampleContext dataclass bundling per-sample state (sample_id, sample,
  tar_path, json_member_name, member_names, member_info, passthrough)
- _metadata_row, _text_rows, _image_rows each take (self, ctx) instead of
  6 separate positional args -- cleaner for subclassing
- _build_row is now a @staticmethod taking (ctx, row_fields dict)
- _build_source_ref takes (self, ctx, content_key) instead of 4 params

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
- Deduplicate range reads in _fill_range_read_rows: multiple rows
  referencing the same byte range now share a single fs.cat_ranges()
  call (14x I/O reduction for MINT-1T TIFF data)
- Extract individual frames from multi-frame TIFFs during
  materialization (frame_index in source_ref), eliminating 17x data
  duplication in parquet output (45GB -> 865MB for 5 shards)
- Fix _resolve_image_content_key: None tokens produce empty source_ref
  (no materialization), non-matching strings fall back to default member
  with frame_index for TIFF content types
- Fix _build_source_ref: content_key=None emits path=None to prevent
  reading entire tar file as raw bytes via direct_read path
- Reconcile table schema for non-empty tables (canonical types for
  reserved columns, inferred types for passthrough)
- Optimize collect_parquet_output_metrics: use pq.ParquetFile metadata
  for row counts and selective column reads instead of pd.read_parquet
- Remove dead content_path/content_key soft-migration from
  parse_source_ref (net-new feature, no legacy data)
- Clear byte_cache per sample in reader to bound memory during
  materialize_on_read
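
The range-read deduplication can be sketched as below, a plain-Python illustration with hypothetical names; `fetch_ranges` stands in for a batched reader such as fsspec's `cat_ranges`:

```python
from collections import defaultdict

# Illustrative sketch: rows referencing the same (path, offset, size) share one
# fetched buffer instead of triggering one read each.
def fill_range_read_rows(rows, fetch_ranges):
    """fetch_ranges(paths, starts, ends) -> list[bytes], one entry per range."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        ref = row["source_ref"]
        groups[(ref["path"], ref["byte_offset"], ref["byte_size"])].append(i)

    keys = list(groups)
    paths = [k[0] for k in keys]
    starts = [k[1] for k in keys]
    ends = [k[1] + k[2] for k in keys]
    buffers = fetch_ranges(paths, starts, ends)  # one I/O call per unique range

    out = [None] * len(rows)
    for key, buf in zip(keys, buffers):
        for i in groups[key]:
            # A frame_index in source_ref would select a single TIFF frame here.
            out[i] = buf
    return out
```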

Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
- Add frame_index parameter to build_source_ref signature, source_ref
  JSON example, bullet descriptions, and schema table
- Document TIFF frame extraction behavior in materialization section
- Fix MultimodalParquetWriter -> MultimodalParquetWriterStage in
  architecture diagram

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
- Add MultimodalParquetReaderStage and composite MultimodalParquetReader
  for reading parquet files in MULTIMODAL_SCHEMA format
- Add MultimodalWebdatasetWriterStage for writing multimodal rows to
  WebDataset tar shards with fsspec support for remote (S3) writes
- Extract shared read_parquet_files utility from text ParquetReaderStage
- Add split_table_by_group_max_bytes to core utils for Arrow table splitting
- Add Parquet-to-WDS benchmark script with S3 source_ref filtering
- Fix test_classify_rows_range_read assertion (4-tuple not 5-tuple)
- Fix lint: unused static method args and pd.isnull -> pd.isna in tests
- Fix _write_tar to use fsspec.open() for remote filesystem compatibility

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
@copy-pr-bot

copy-pr-bot bot commented Feb 27, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


Skip None placeholder entries in _text_rows and _image_rows so that
parallel texts/images arrays produce non-overlapping interleaved
positions instead of duplicate rows at every position.

Before: 996 rows per shard (470 text + 470 image + 56 metadata) with
overlapping positions between text and image modalities.
After: 526 rows per shard (228 text + 242 image + 56 metadata) with
correct non-overlapping interleaved positions.
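
The fix can be sketched as follows (hypothetical row shape): texts and images are position-indexed parallel arrays with `None` placeholders, and a row is emitted only where the entry is not `None`, which is what makes the positions non-overlapping:

```python
# Minimal sketch: skip None placeholders so parallel arrays yield
# non-overlapping interleaved positions.
def interleaved_rows(sample_id, texts, images):
    rows = []
    for pos, text in enumerate(texts):
        if text is not None:
            rows.append({"sample_id": sample_id, "modality": "text", "position": pos})
    for pos, image in enumerate(images):
        if image is not None:
            rows.append({"sample_id": sample_id, "modality": "image", "position": pos})
    return rows
```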

Also replace split_table_by_group_max_bytes with a faster Arrow-native
implementation using pc.not_equal + pc.indices_nonzero for group
boundary detection and average-bytes-per-row estimation.
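
The splitting logic can be sketched in plain Python (the real implementation works on an Arrow table and finds boundaries with `pc.not_equal` + `pc.indices_nonzero`; names and the greedy cut strategy here are illustrative):

```python
# Plain-Python sketch: split row indices at group boundaries so no chunk
# exceeds max_bytes, estimating per-row size as a table-wide average.
# A single group larger than the budget is never split.
def split_by_group_max_bytes(group_ids, avg_row_bytes, max_bytes):
    n = len(group_ids)
    # Boundaries: positions where the group id differs from the previous row.
    boundaries = [i for i in range(1, n) if group_ids[i] != group_ids[i - 1]]
    boundaries.append(n)

    chunks, start, prev = [], 0, 0
    for b in boundaries:
        # If extending the current chunk to this boundary would exceed the
        # budget, cut at the previous boundary first.
        if prev > start and (b - start) * avg_row_bytes > max_bytes:
            chunks.append((start, prev))
            start = prev
        prev = b
    chunks.append((start, n))
    return chunks
```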

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
…ilter position recompute

- Change metadata_json category from Internal to Content (docs + README)
- Add count(modality=) method for filtered row counts; num_items now
  returns unique sample count (distinct sample_id values)
- Add add_rows() and delete_rows() stubs (NotImplementedError)
- Recompute content positions in BaseMultimodalFilterStage after row
  drops to close gaps (metadata rows keep position=-1)
- Add tests for position recomputation, count(), and pandas data path
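
The position recompute can be sketched as below (hypothetical row shape): after a filter drops content rows, surviving rows in each sample are renumbered 0..k-1 to close the gaps, while metadata rows keep the sentinel position -1:

```python
# Illustrative sketch of the post-filter position recompute.
def recompute_positions(rows):
    counters = {}
    for row in sorted(rows, key=lambda r: (r["sample_id"], r["position"])):
        if row["modality"] == "metadata":
            row["position"] = -1  # metadata rows are not part of the interleaving
            continue
        next_pos = counters.get(row["sample_id"], 0)
        row["position"] = next_pos
        counters[row["sample_id"]] = next_pos + 1
    return rows
```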

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
…mary

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
…ation

- Add MultimodalLanceWriterStage extending BaseMultimodalWriter for Lance format
- Fix WebDataset writer to preserve interleaving via position-indexed parallel
  texts/images arrays (previously compact arrays lost position information)
- Fix _sanitize_key to strip dots from WebDataset keys (dots are extension
  separators per the WDS spec; sample_ids containing ".parquet" caused all
  samples to merge into one)
- Use position-based image naming (e.g., "0.jpg", "3.jpg") so images[pos]
  directly maps to sample[images[pos]] per WebDataset dict key convention
- Preserve all extra (non-schema) columns in WebDataset JSON via _row_extra
  structured by modality: {text: [...], image: [...], metadata: {...}} with
  1:1 alignment to texts/images arrays for zero data loss
- Fix BaseMultimodalWriter.write_data to not leak storage_options into
  format-specific _write_dataframe calls
- Add test_merged_parquet_3format_write.py benchmark script for Parquet,
  WebDataset, and Lance materialized writes from merged domain-bucket data
- Add explore_outputs.ipynb verification notebook with webdataset/lance/PIL
  compliance checks and filtering UX demos
- Add unit tests for Lance writer and WDS extra column preservation
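
The key-sanitization and naming rules can be sketched as follows (illustrative helpers, not the real signatures): WebDataset treats the first dot in a member name as the key/extension separator, so dots inside a `sample_id` must be stripped, and images are named by position so `images[pos]` maps straight to `sample[images[pos]]`:

```python
# Illustrative sketch of the WDS key handling described above.
def sanitize_key(sample_id: str) -> str:
    """Strip dots so the whole sample_id stays in the WDS key prefix."""
    return sample_id.replace(".", "")

def image_member_name(sample_id: str, position: int, ext: str = "jpg") -> str:
    """Tar member "<key>.<pos>.<ext>" decodes to dict key "<pos>.<ext>"."""
    return f"{sanitize_key(sample_id)}.{position}.{ext}"
```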

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
The benchmark script and notebook are local test files that should not
be tracked. They remain on disk for local use.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor