Proposal for more scalable seq-pack data format #2019

marcromeyn wants to merge 13 commits into main from romeyn/parquet-sequence-pack
Conversation
Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
```diff
-        if path.suffix == ".npy":
+        # Check for .npy packed dataset (legacy format)
+        if path_str.lower().endswith(".npy"):
```
Let's add a deprecation message here for `.npy` and instruct people to re-run data generation.
/ok to test
@marcromeyn, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

/ok to test e6e40ec
Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
…/Megatron-Bridge into romeyn/parquet-sequence-pack
/ok to test 3a8d112
Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
📝 Walkthrough

The changes add comprehensive support for packed Parquet datasets in the supervised fine-tuning module. This includes introducing pyarrow as an optional dependency, implementing a new GPTSFTPackedParquetDataset class with path resolution and validation utilities, updating the dataset builder to handle Parquet formats alongside legacy .npy paths, and deprecating .npy packed sequence handling in favor of the Parquet format.

Changes
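The routing described above hinges on detecting packed Parquet specs by suffix. A minimal sketch of that check, assuming only the suffix rule discussed later in this review (the real `is_packed_parquet_spec()` in `packed_parquet.py` also handles directories and globs):

```python
# Illustrative sketch; the real is_packed_parquet_spec() in packed_parquet.py
# also accepts directory and glob specs.
def is_packed_parquet_spec(path_str: str) -> bool:
    """Treat any path ending in .parquet or .pq as a packed Parquet spec."""
    return str(path_str).lower().endswith((".parquet", ".pq"))

# Routing decisions under this rule:
assert is_packed_parquet_spec("data/train.idx.parquet")
assert is_packed_parquet_spec("data/shard_*.idx.pq")       # glob, still matched
assert not is_packed_parquet_spec("data/train.npy")        # legacy path
```

Note that under this rule a plain `test.parquet` is also treated as packed, which is exactly the behavior the test-routing comment below calls out.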
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Client
    participant Factory as SFT Factory<br/>(create_sft_dataset)
    participant PathDetector as Path Detection<br/>(is_packed_parquet_spec)
    participant Dataset as GPTSFTPackedParquetDataset
    participant Resolver as Path Resolver<br/>(_resolve_parquet_paths)
    participant FileSystem as File System
    participant PyArrow as PyArrow Reader

    Client->>Factory: create_sft_dataset(path, ...)
    Factory->>PathDetector: is_packed_parquet_spec(path)?
    PathDetector->>FileSystem: glob/stat path
    FileSystem-->>PathDetector: matches .parquet/.pq
    PathDetector-->>Factory: true
    Factory->>Dataset: __init__(file_path, tokenizer, ...)
    Dataset->>Resolver: resolve_packed_parquet_paths(spec)
    Resolver->>FileSystem: glob/list files
    FileSystem-->>Resolver: sorted file list
    Resolver-->>Dataset: [file1.parquet, file2.parquet, ...]
    Dataset->>Dataset: _load_dataset()
    Dataset->>PyArrow: lazy initialize readers
    Dataset->>Dataset: build cumulative row/rowgroup offsets
    Dataset-->>Factory: initialized dataset
    Client->>Dataset: __getitem__(idx)
    Dataset->>Dataset: _locate_row(idx) via offsets
    Dataset->>PyArrow: _ensure_reader(file_idx)
    PyArrow-->>Dataset: reader handle
    Dataset->>PyArrow: read row_group
    PyArrow-->>Dataset: {input_ids, loss_mask, seq_start_id}
    Dataset->>Dataset: compute seq_boundaries
    Dataset-->>Client: {input_ids, seq_boundaries, loss_mask}
```
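The offset bookkeeping in the diagram ("build cumulative row/rowgroup offsets", "_locate_row(idx) via offsets") amounts to a bisect over cumulative row counts. A sketch under that assumption; the helper names are illustrative, not the real implementation:

```python
# Sketch of cumulative-offset lookup across sharded files, assuming each shard
# reports its row count up front. Names (build_offsets, locate_row) are
# illustrative stand-ins for the dataset's private helpers.
import bisect
from itertools import accumulate

def build_offsets(rows_per_file):
    # Cumulative row counts: file i owns global rows [offsets[i], offsets[i+1]).
    return [0] + list(accumulate(rows_per_file))

def locate_row(offsets, idx):
    # Find which file a global index falls into, plus the local row index.
    file_idx = bisect.bisect_right(offsets, idx) - 1
    return file_idx, idx - offsets[file_idx]

offsets = build_offsets([3, 5, 2])      # three shards
assert locate_row(offsets, 0) == (0, 0)  # first row of first shard
assert locate_row(offsets, 4) == (1, 1)  # second row of second shard
assert locate_row(offsets, 9) == (2, 1)  # last row of last shard
```

The same pattern extends to row-group offsets, so a `__getitem__` only ever opens the one reader that owns the requested index.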
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels
Suggested reviewers
🚥 Pre-merge checks: ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/megatron/bridge/data/datasets/packed_sequence.py (1)
77-98: ⚠️ Potential issue | 🟠 Major — Handle `pad_seq_to_mult=None` before comparing it.

This function explicitly supports `None` and even normalizes it on line 81, but line 98 still does `pad_seq_to_mult > 1` unconditionally. A caller using `None` will get a `TypeError` during packing instead of disabling extra pre-padding.

🔧 Minimal fix

```diff
-    if pad_seq_to_mult > 1:
+    if pad_seq_to_mult is not None and pad_seq_to_mult > 1:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/data/datasets/packed_sequence.py` around lines 77 - 98, The code currently compares pad_seq_to_mult > 1 unguarded which raises TypeError when pad_seq_to_mult is None; fix by using the normalized value or an explicit None check—replace the unconditional check with either "if pad_seq_to_mult is not None and pad_seq_to_mult > 1" or "if pad_seq_length_to_mult > 1" (since pad_seq_length_to_mult is set to 1 when pad_seq_to_mult is None) in packed_sequence.py so extra pre-padding is only applied when a valid multiplier was provided.
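A minimal reproduction of the flagged failure mode, with a hypothetical `pad_len` helper standing in for the packing code:

```python
# Hypothetical helper demonstrating the None-vs-int comparison bug described
# above; pad_len is illustrative, not the real packed_sequence.py function.
def pad_len(seq_len, pad_seq_to_mult=None):
    # Guarded check: only apply extra pre-padding for a valid multiplier.
    if pad_seq_to_mult is not None and pad_seq_to_mult > 1:
        # Round seq_len up to the nearest multiple.
        return -(-seq_len // pad_seq_to_mult) * pad_seq_to_mult
    return seq_len

assert pad_len(10) == 10        # None disables extra pre-padding
assert pad_len(10, 8) == 16     # rounded up to a multiple of 8

try:
    None > 1                    # the unguarded comparison from line 98
except TypeError:
    pass                        # this is what callers hit before the fix
```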
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/data/builders/finetuning_dataset.py`:
- Around line 142-150: The current check in the builder where
packed_path_str.lower().endswith(".npy") only emits a DeprecationWarning and
returns, which lets the build proceed and later fail with a None-packed path;
change this to fail fast by raising an explicit error (e.g., ValueError or
RuntimeError) in the same branch inside the method handling packed paths (the
block referencing packed_path_str and .endswith(".npy")), replacing the
warnings.warn + return with a raised exception that includes the deprecation
message and actionable guidance to use packed parquet format so callers
immediately see the unsupported input.
In `@src/megatron/bridge/data/datasets/packed_sequence.py`:
- Around line 272-320: _validate_packed_path currently requires packed parquet
specs to resolve to existing files which prevents using explicit output targets;
change the logic inside the is_packed_parquet_spec branch to skip resolution
when the spec is a literal single-file target (no glob characters like '*?[]'
and not a directory spec), i.e. treat plain file paths (e.g.
"custom/train.idx.parquet") as valid write targets and simply setattr(self,
attr_name, path_str) without calling resolve_packed_parquet_paths; continue to
call resolve_packed_parquet_paths and raise FileNotFoundError for patterns/globs
or directory-like specs, referencing _validate_packed_path,
is_packed_parquet_spec, and resolve_packed_parquet_paths to locate where to
change the behavior.
In `@src/megatron/bridge/data/datasets/sft.py`:
- Around line 210-216: The packed-Parquet branch instantiates
GPTSFTPackedParquetDataset without forwarding the pad_seq_to_mult argument,
causing _pad_seq_to_mult to remain 1 in GPTSFTPackedDataset.collate_fn; modify
the GPTSFTPackedParquetDataset constructor call to pass
pad_seq_to_mult=pad_seq_to_mult (same name used by the .npy branch) so the
collate_fn receives the intended padding multiplier and emits
cu_seqlens_unpadded correctly.
In `@tests/unit_tests/data/datasets/test_chat_template.py`:
- Around line 452-535: Update the tests to patch the actual lazy import location
and fix the expectation for regular .parquet routing: change patch targets from
"megatron.bridge.data.datasets.sft.GPTSFTPackedParquetDataset" to
"megatron.bridge.data.datasets.packed_parquet.GPTSFTPackedParquetDataset" for
tests that expect packed routing, and in
test_regular_parquet_not_routed_to_packed either assert
GPTSFTPackedParquetDataset is called (since is_packed_parquet_spec() treats
*.parquet as packed) or change the test path to a non-parquet extension (e.g.,
.json) if you want to assert routing to GPTSFTChatDataset; reference
create_sft_dataset, GPTSFTPackedParquetDataset, GPTSFTChatDataset, and
is_packed_parquet_spec when making the edits.
---
Outside diff comments:
In `@src/megatron/bridge/data/datasets/packed_sequence.py`:
- Around line 77-98: The code currently compares pad_seq_to_mult > 1 unguarded
which raises TypeError when pad_seq_to_mult is None; fix by using the normalized
value or an explicit None check—replace the unconditional check with either "if
pad_seq_to_mult is not None and pad_seq_to_mult > 1" or "if
pad_seq_length_to_mult > 1" (since pad_seq_length_to_mult is set to 1 when
pad_seq_to_mult is None) in packed_sequence.py so extra pre-padding is only
applied when a valid multiplier was provided.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 70b3ae17-3591-4d4c-9dfe-cecade391318
📒 Files selected for processing (7)
- pyproject.toml
- src/megatron/bridge/data/builders/finetuning_dataset.py
- src/megatron/bridge/data/datasets/packed_parquet.py
- src/megatron/bridge/data/datasets/packed_sequence.py
- src/megatron/bridge/data/datasets/sft.py
- tests/unit_tests/data/datasets/test_chat_template.py
- tests/unit_tests/data/datasets/test_packed_parquet.py
```python
packed_path_str = str(packed_path)
if packed_path_str.lower().endswith(".npy"):
    warnings.warn(
        "Automatic .npy packed sequence preparation is deprecated and will be removed in the next release. "
        "Please use packed parquet format instead.",
        DeprecationWarning,
        stacklevel=3,
    )
    return
```
Fail fast when automatic .npy preparation is no longer supported.
This branch only warns and returns, so the build can continue until _create_dataset() later hands back None for the missing packed path. Raising a clear error here would make the deprecation actionable.
🔧 Suggested change

```diff
 if packed_path_str.lower().endswith(".npy"):
-    warnings.warn(
-        "Automatic .npy packed sequence preparation is deprecated and will be removed in the next release. "
-        "Please use packed parquet format instead.",
-        DeprecationWarning,
-        stacklevel=3,
-    )
-    return
+    raise NotImplementedError(
+        "Automatic .npy packed sequence preparation is deprecated and no longer supported. "
+        "Please switch the packed output path to `.parquet` or `.pq`."
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/data/builders/finetuning_dataset.py` around lines 142 -
150, The current check in the builder where
packed_path_str.lower().endswith(".npy") only emits a DeprecationWarning and
returns, which lets the build proceed and later fail with a None-packed path;
change this to fail fast by raising an explicit error (e.g., ValueError or
RuntimeError) in the same branch inside the method handling packed paths (the
block referencing packed_path_str and .endswith(".npy")), replacing the
warnings.warn + return with a raised exception that includes the deprecation
message and actionable guidance to use packed parquet format so callers
immediately see the unsupported input.
```diff
 def _validate_packed_path(self, attr_name: str, path_value: str) -> None:
     """Validate a packed data path and store it appropriately.

     For .npy files: strict validation with Path.exists()
     For packed parquet specs: validate via resolution (supports dirs/globs)

     Args:
         attr_name: The attribute name being validated (for error messages)
         path_value: The path value to validate

     Raises:
         FileNotFoundError: If the path does not exist or resolves to no files
         ValueError: If the path format is invalid
     """
     path_str = str(path_value)

     # Check if it's an .npy file (legacy format)
     if path_str.lower().endswith(".npy"):
         warnings.warn(
             f"The .npy packed sequence format is deprecated and will be removed in the next release. "
             f"Please use packed parquet format instead. Path: {path_str}",
             DeprecationWarning,
             stacklevel=2,
         )
         if MultiStorageClientFeature.is_enabled():
             msc = MultiStorageClientFeature.import_package()
-            self.packed_val_data_path = msc.Path(self.packed_val_data_path)
+            path_obj = msc.Path(path_str)
         else:
-            self.packed_val_data_path = Path(self.packed_val_data_path)
-            assert self.packed_val_data_path.suffix == ".npy", (
-                f"packed validation data file must be a .npy file: {self.packed_val_data_path}"
-            )
-            assert self.packed_val_data_path.exists(), (
-                f"packed validation data file does not exist: {self.packed_val_data_path}"
-            )
+            path_obj = Path(path_str)

-        if self.pad_seq_to_mult is not None and self.pad_seq_to_mult <= 0:
-            raise ValueError("pad_seq_to_mult must be a positive integer when provided.")
         if not path_obj.exists():
             raise FileNotFoundError(f"{attr_name} file does not exist: {path_str}")
         setattr(self, attr_name, path_obj)
         return

     # Check if it's a packed parquet spec (file/dir/glob)
     if is_packed_parquet_spec(path_str):
         # Validate by resolving - this checks that files actually exist
         try:
             resolved_paths = resolve_packed_parquet_paths(path_str)
             if len(resolved_paths) == 0:
                 raise FileNotFoundError(f"{attr_name} resolved to no files: {path_str}")
         except ValueError as e:
             raise FileNotFoundError(f"{attr_name} could not be resolved: {path_str}. Error: {e}") from e

         # Store the original string spec (not Path) to preserve globs
         # The dataset loader will handle resolution
         setattr(self, attr_name, path_str)
         return
```
Don't require custom packed output files to pre-exist.
FinetuningDatasetBuilder._prepare_packed_split() uses packed_train_data_path and packed_val_data_path as write targets when packed data is missing. Requiring something like custom/train.idx.parquet to resolve here makes explicit output paths unusable unless the file is already there.
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 303-303: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 313-313: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 315-315: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/data/datasets/packed_sequence.py` around lines 272 - 320,
_validate_packed_path currently requires packed parquet specs to resolve to
existing files which prevents using explicit output targets; change the logic
inside the is_packed_parquet_spec branch to skip resolution when the spec is a
literal single-file target (no glob characters like '*?[]' and not a directory
spec), i.e. treat plain file paths (e.g. "custom/train.idx.parquet") as valid
write targets and simply setattr(self, attr_name, path_str) without calling
resolve_packed_parquet_paths; continue to call resolve_packed_parquet_paths and
raise FileNotFoundError for patterns/globs or directory-like specs, referencing
_validate_packed_path, is_packed_parquet_spec, and resolve_packed_parquet_paths
to locate where to change the behavior.
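The suggested literal-vs-glob distinction can be sketched as follows; the helper name and the exact glob-character set are assumptions, not the repository's code:

```python
# Hypothetical helper for the suggested fix: accept a plain single-file
# parquet path as a write target, while globs and directory specs must still
# resolve to existing files. The name is_literal_file_target is illustrative.
import os

GLOB_CHARS = set("*?[]")

def is_literal_file_target(path_str: str) -> bool:
    """True for a plain single-file parquet path usable as a write target."""
    has_glob = any(c in GLOB_CHARS for c in path_str)
    looks_like_file = path_str.lower().endswith((".parquet", ".pq"))
    is_dir_spec = path_str.endswith(("/", os.sep))
    return looks_like_file and not has_glob and not is_dir_spec

assert is_literal_file_target("custom/train.idx.parquet")      # valid write target
assert not is_literal_file_target("data/shard_*.idx.parquet")  # glob: must resolve
assert not is_literal_file_target("data/packed/")              # directory spec
```

Under this split, `_validate_packed_path` would `setattr` the literal string directly and only call `resolve_packed_parquet_paths` for the other cases.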
```python
if is_packed_parquet_spec(path_str):
    return GPTSFTPackedParquetDataset(
        pack_metadata_file_path=pack_metadata_file_path,
        pad_cu_seqlens=pad_cu_seqlens,
        **gpt_sft_dataset_kwargs,
        **kwargs,
    )
```
Forward pad_seq_to_mult in the packed-Parquet branch.
The .npy path passes this through, but the Parquet path drops it. That leaves _pad_seq_to_mult at 1 inside GPTSFTPackedDataset.collate_fn(), so packed Parquet runs never emit cu_seqlens_unpadded for CP/THD-prepared data.
🔧 Minimal fix

```diff
 if is_packed_parquet_spec(path_str):
     return GPTSFTPackedParquetDataset(
         pack_metadata_file_path=pack_metadata_file_path,
         pad_cu_seqlens=pad_cu_seqlens,
+        pad_seq_to_mult=pad_seq_to_mult,
         **gpt_sft_dataset_kwargs,
         **kwargs,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/data/datasets/sft.py` around lines 210 - 216, The
packed-Parquet branch instantiates GPTSFTPackedParquetDataset without forwarding
the pad_seq_to_mult argument, causing _pad_seq_to_mult to remain 1 in
GPTSFTPackedDataset.collate_fn; modify the GPTSFTPackedParquetDataset
constructor call to pass pad_seq_to_mult=pad_seq_to_mult (same name used by the
.npy branch) so the collate_fn receives the intended padding multiplier and
emits cu_seqlens_unpadded correctly.
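Why the multiplier matters can be sketched as follows, assuming collate pads each sequence length up to the multiple while keeping the unpadded cumulative lengths alongside; the function below is illustrative, not the real `collate_fn` API:

```python
# Illustrative sketch of padded vs unpadded cumulative sequence lengths.
# cu_seqlens here stands in for the bookkeeping inside collate_fn.
from itertools import accumulate

def cu_seqlens(seq_lens, pad_seq_to_mult=1):
    # Each length rounded up to the multiple, plus the raw lengths.
    padded = [-(-l // pad_seq_to_mult) * pad_seq_to_mult for l in seq_lens]
    cu = [0] + list(accumulate(padded))
    cu_unpadded = [0] + list(accumulate(seq_lens))
    return cu, cu_unpadded

cu, cu_unpadded = cu_seqlens([5, 7], pad_seq_to_mult=8)
assert cu == [0, 8, 16]            # padded boundaries
assert cu_unpadded == [0, 5, 12]   # the unpadded view CP/THD consumers need

# With the multiplier silently left at 1 (the bug), the two views coincide
# and no distinct cu_seqlens_unpadded information survives.
cu1, cu1_un = cu_seqlens([5, 7], pad_seq_to_mult=1)
assert cu1 == cu1_un == [0, 5, 12]
```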
```python
@patch("megatron.bridge.data.datasets.sft.GPTSFTPackedParquetDataset")
def test_create_packed_parquet_dataset_idx_parquet(self, mock_parquet_class):
    """Test that .idx.parquet files create GPTSFTPackedParquetDataset."""
    from pathlib import Path

    mock_tokenizer = MagicMock()
    mock_parquet_class.return_value = MagicMock()

    create_sft_dataset(
        path=Path("test.idx.parquet"),
        tokenizer=mock_tokenizer,
    )

    # Verify GPTSFTPackedParquetDataset was called
    mock_parquet_class.assert_called_once()

@patch("megatron.bridge.data.datasets.sft.GPTSFTPackedParquetDataset")
def test_create_packed_parquet_dataset_idx_pq(self, mock_parquet_class):
    """Test that .idx.pq files create GPTSFTPackedParquetDataset."""
    from pathlib import Path

    mock_tokenizer = MagicMock()
    mock_parquet_class.return_value = MagicMock()

    create_sft_dataset(
        path=Path("test.idx.pq"),
        tokenizer=mock_tokenizer,
    )

    # Verify GPTSFTPackedParquetDataset was called
    mock_parquet_class.assert_called_once()

@patch("megatron.bridge.data.datasets.sft.GPTSFTPackedParquetDataset")
def test_create_packed_parquet_dataset_priority_over_chat(self, mock_parquet_class):
    """Test that packed Parquet files take precedence over chat=True."""
    from pathlib import Path

    mock_tokenizer = MagicMock()
    mock_parquet_class.return_value = MagicMock()

    create_sft_dataset(
        path=Path("test.idx.parquet"),
        tokenizer=mock_tokenizer,
        chat=True,  # Should be ignored for packed Parquet files
        use_hf_tokenizer_chat_template=True,
    )

    # Verify GPTSFTPackedParquetDataset was called (not GPTSFTChatDataset)
    mock_parquet_class.assert_called_once()

@patch("megatron.bridge.data.datasets.sft.GPTSFTChatDataset")
def test_regular_parquet_not_routed_to_packed(self, mock_chat_class):
    """Test that regular .parquet files (without .idx.) are NOT routed to packed dataset."""
    from pathlib import Path

    mock_tokenizer = MagicMock()
    mock_chat_class.return_value = MagicMock()

    create_sft_dataset(
        path=Path("test.parquet"),  # No .idx. prefix
        tokenizer=mock_tokenizer,
        chat=True,
        use_hf_tokenizer_chat_template=True,
    )

    # Verify GPTSFTChatDataset was called (regular parquet goes to chat/default)
    mock_chat_class.assert_called_once()

@patch("megatron.bridge.data.datasets.sft.GPTSFTPackedParquetDataset")
def test_create_packed_parquet_glob_pattern(self, mock_parquet_class):
    """Test that glob patterns like data*.idx.parquet route to GPTSFTPackedParquetDataset."""
    from pathlib import Path

    mock_tokenizer = MagicMock()
    mock_parquet_class.return_value = MagicMock()

    create_sft_dataset(
        path=Path("data/shard_*.idx.parquet"),  # Glob pattern
        tokenizer=mock_tokenizer,
    )

    # Verify GPTSFTPackedParquetDataset was called
    mock_parquet_class.assert_called_once()
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
# Check where GPTSFTPackedParquetDataset is imported/used in sft.py
echo "=== GPTSFTPackedParquetDataset usage in sft.py ==="
rg -n -C3 'GPTSFTPackedParquetDataset' \
  src/megatron/bridge/data/datasets/sft.py
echo -e "\n=== is_packed_parquet_spec and suffix handling in packed_parquet.py ==="
rg -n -C3 'def is_packed_parquet_spec|endswith.*parquet|endswith.*\.pq' \
  src/megatron/bridge/data/datasets/packed_parquet.py
echo -e "\n=== Test patches in test_chat_template.py (lines 452-535) ==="
sed -n '452,535p' tests/unit_tests/data/datasets/test_chat_template.py | grep -n '@patch\|def test_'
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 3450
Fix patch targets and correct the .parquet routing expectation.
The patch targets must reference the actual import location. Since GPTSFTPackedParquetDataset is imported lazily inside create_sft_dataset() from megatron.bridge.data.datasets.packed_parquet, patch megatron.bridge.data.datasets.packed_parquet.GPTSFTPackedParquetDataset instead of sft.GPTSFTPackedParquetDataset.
Additionally, test_regular_parquet_not_routed_to_packed has an incorrect expectation. The is_packed_parquet_spec() function in packed_parquet.py returns True for any path ending with .parquet or .pq (line 87), so a regular .parquet file will be routed to GPTSFTPackedParquetDataset, not GPTSFTChatDataset. Either adjust the test to expect packed routing, or change the path to a non-parquet extension if testing non-packed behavior is intended.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit_tests/data/datasets/test_chat_template.py` around lines 452 - 535,
Update the tests to patch the actual lazy import location and fix the
expectation for regular .parquet routing: change patch targets from
"megatron.bridge.data.datasets.sft.GPTSFTPackedParquetDataset" to
"megatron.bridge.data.datasets.packed_parquet.GPTSFTPackedParquetDataset" for
tests that expect packed routing, and in
test_regular_parquet_not_routed_to_packed either assert
GPTSFTPackedParquetDataset is called (since is_packed_parquet_spec() treats
*.parquet as packed) or change the test path to a non-parquet extension (e.g.,
.json) if you want to assert routing to GPTSFTChatDataset; reference
create_sft_dataset, GPTSFTPackedParquetDataset, GPTSFTChatDataset, and
is_packed_parquet_spec when making the edits.
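The "patch where the name is looked up" rule behind this fix can be demonstrated with a toy module; everything here (`fake_packed_parquet`, the stub class, the simplified factory) is hypothetical and only mirrors the lazy-import structure described above:

```python
# Toy reproduction: when a factory imports a class lazily at call time,
# patching must target the module the import resolves from.
import sys
import types
from unittest.mock import patch

# Build a fake "packed_parquet" module holding the class to be patched.
pkg = types.ModuleType("fake_packed_parquet")

class GPTSFTPackedParquetDataset:  # stand-in for the real class
    pass

pkg.GPTSFTPackedParquetDataset = GPTSFTPackedParquetDataset
sys.modules["fake_packed_parquet"] = pkg

def create_sft_dataset(path):
    # Lazy import inside the factory, mirroring sft.py's structure.
    from fake_packed_parquet import GPTSFTPackedParquetDataset
    return GPTSFTPackedParquetDataset()

# Patching the source module intercepts the lazy import at call time.
with patch("fake_packed_parquet.GPTSFTPackedParquetDataset") as mock_cls:
    create_sft_dataset("test.idx.parquet")
    mock_cls.assert_called_once()
```

Patching `some_test_module.GPTSFTPackedParquetDataset` instead would leave the factory's own lookup untouched, which is why the patch targets in the tests above need to move.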
@marcromeyn can you resolve the CodeRabbit comments if not applicable? I do see some of these are reasonable comments.
@marcromeyn ^^
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Deprecations
Dependencies
Added `pyarrow>=14.0.0` dependency for Parquet support.

Tests