
Proposal for more scalable seq-pack data format #2019

Open
marcromeyn wants to merge 13 commits into main from romeyn/parquet-sequence-pack

Conversation

@marcromeyn
Contributor

@marcromeyn marcromeyn commented Jan 21, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for Parquet-based packed datasets for improved performance and compatibility.
    • Packed dataset output now defaults to Parquet format instead of legacy binary format.
  • Deprecations

    • Legacy binary packed sequence format is deprecated; users are encouraged to migrate to Parquet format.
  • Dependencies

    • Added optional pyarrow>=14.0.0 dependency for Parquet support.
  • Tests

    • Added comprehensive test coverage for Parquet-based packed datasets and validation logic.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
@copy-pr-bot

copy-pr-bot bot commented Jan 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
@cuichenx cuichenx self-requested a review February 10, 2026 17:20
yaoyu-33 previously approved these changes Feb 18, 2026

-if path.suffix == ".npy":
+# Check for .npy packed dataset (legacy format)
+if path_str.lower().endswith(".npy"):
Contributor

@yaoyu-33 yaoyu-33 Feb 18, 2026


Let's add a deprecation message here for .npy and instruct people to re-run data generation.

@marcromeyn
Contributor Author

/ok to test

@copy-pr-bot

copy-pr-bot bot commented Feb 18, 2026

/ok to test

@marcromeyn, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@marcromeyn
Contributor Author

/ok to test e6e40ec

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
…/Megatron-Bridge into romeyn/parquet-sequence-pack
@marcromeyn
Contributor Author

/ok to test 3a8d112

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
@marcromeyn marcromeyn marked this pull request as ready for review March 6, 2026 17:08
@marcromeyn marcromeyn requested a review from a team as a code owner March 6, 2026 17:08
Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
@coderabbitai
Contributor

coderabbitai bot commented Mar 6, 2026

📝 Walkthrough


The changes add comprehensive support for packed Parquet datasets in the supervised fine-tuning module. This includes introducing pyarrow as an optional dependency, implementing a new GPTSFTPackedParquetDataset class with path resolution and validation utilities, updating the dataset builder to handle parquet formats alongside legacy .npy paths, and deprecating .npy packed sequence handling in favor of parquet format.

Changes

  • Configuration — pyproject.toml: Added optional dependency group parquet with pyarrow>=14.0.0.
  • Packed Parquet Core Support — src/megatron/bridge/data/datasets/packed_parquet.py: New file implementing full packed Parquet dataset support: is_packed_parquet_file/spec detection, _resolve_parquet_paths glob/directory expansion, the write_packed_parquet output writer, and the GPTSFTPackedParquetDataset class with a lazy pyarrow reader lifecycle, row-group caching, multi-file support, and per-sample retrieval.
  • Dataset Builder Updates — src/megatron/bridge/data/builders/finetuning_dataset.py: Refactored packed data handling to support parquet specs via the new _prepare_packed_split helper and _packed_path_exists validation. Updated default output paths from .npy to .idx.parquet. Added a deprecation warning for automatic .npy packed sequence preparation.
  • Dataset Factory & Routing — src/megatron/bridge/data/datasets/sft.py: Enhanced create_sft_dataset to route packed parquet specs (detected via is_packed_parquet_spec) to GPTSFTPackedParquetDataset. Updated parameter types to accept str | Path for flexibility. Maintains legacy .npy routing and chat template precedence.
  • Packed Sequence Path Validation — src/megatron/bridge/data/datasets/packed_sequence.py: Added a _validate_packed_path helper to validate .npy paths with a deprecation warning and validate parquet specs via is_packed_parquet_spec/resolve_packed_parquet_paths. Updated __post_init__ to use the parquet writer for .parquet/.pq outputs.
  • Test Coverage — tests/unit_tests/data/datasets/test_chat_template.py, tests/unit_tests/data/datasets/test_packed_parquet.py: Added 132 lines of tests validating packed parquet routing precedence, glob patterns, and .idx.parquet file detection, plus 383 lines of integration tests for GPTSFTPackedParquetDataset covering initialization, multi-file handling, row-group caching, schema validation, and resource cleanup.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Factory as SFT Factory<br/>(create_sft_dataset)
    participant PathDetector as Path Detection<br/>(is_packed_parquet_spec)
    participant Dataset as GPTSFTPackedParquetDataset
    participant Resolver as Path Resolver<br/>(_resolve_parquet_paths)
    participant FileSystem as File System
    participant PyArrow as PyArrow Reader

    Client->>Factory: create_sft_dataset(path, ...)
    Factory->>PathDetector: is_packed_parquet_spec(path)?
    PathDetector->>FileSystem: glob/stat path
    FileSystem-->>PathDetector: matches .parquet/.pq
    PathDetector-->>Factory: true
    
    Factory->>Dataset: __init__(file_path, tokenizer, ...)
    Dataset->>Resolver: resolve_packed_parquet_paths(spec)
    Resolver->>FileSystem: glob/list files
    FileSystem-->>Resolver: sorted file list
    Resolver-->>Dataset: [file1.parquet, file2.parquet, ...]
    
    Dataset->>Dataset: _load_dataset()
    Dataset->>PyArrow: lazy initialize readers
    Dataset->>Dataset: build cumulative row/rowgroup offsets
    Dataset-->>Factory: initialized dataset
    
    Client->>Dataset: __getitem__(idx)
    Dataset->>Dataset: _locate_row(idx) via offsets
    Dataset->>PyArrow: _ensure_reader(file_idx)
    PyArrow-->>Dataset: reader handle
    Dataset->>PyArrow: read row_group
    PyArrow-->>Dataset: {input_ids, loss_mask, seq_start_id}
    Dataset->>Dataset: compute seq_boundaries
    Dataset-->>Client: {input_ids, seq_boundaries, loss_mask}
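The per-sample lookup in the diagram (_locate_row via cumulative offsets) boils down to a binary search over prefix sums of per-file row counts. A minimal illustration, with invented shard sizes standing in for the real metadata:

```python
import bisect
from itertools import accumulate

# Hypothetical row counts for three parquet shards.
rows_per_file = [100, 250, 50]

# offsets[i] is one past the last global index held by file i.
offsets = list(accumulate(rows_per_file))  # [100, 350, 400]


def locate_row(global_idx: int) -> tuple[int, int]:
    """Map a global sample index to (file_idx, local_row_idx)."""
    if not 0 <= global_idx < offsets[-1]:
        raise IndexError(global_idx)
    # First file whose cumulative count exceeds the index.
    file_idx = bisect.bisect_right(offsets, global_idx)
    prev = offsets[file_idx - 1] if file_idx else 0
    return file_idx, global_idx - prev
```

The same prefix-sum trick can be applied a second time within a file, over per-row-group row counts, to pick which row group to read and cache.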

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

Run CICD

Suggested reviewers

  • cuichenx
  • yaoyu-33
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 55.42%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Results For Major Changes — ❓ Inconclusive: The PR introduces major changes (a 648-line Parquet module, core logic modifications, a pyarrow dependency) with comprehensive tests, but lacks documented test results, performance metrics, and convergence validation, and has placeholder PR description text with unresolved review comments. Resolution: document test pass confirmation with a coverage summary, performance benchmarks comparing Parquet vs. .npy formats, and convergence validation, and address all review comments.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The PR title "Proposal for more scalable seq-pack data format" accurately reflects the main objective of the changeset, which introduces Parquet-based packed sequence support for improved scalability across multiple files.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/megatron/bridge/data/datasets/packed_sequence.py (1)

77-98: ⚠️ Potential issue | 🟠 Major

Handle pad_seq_to_mult=None before comparing it.

This function explicitly supports None and even normalizes it on Line 81, but Line 98 still does pad_seq_to_mult > 1 unconditionally. A caller using None will get a TypeError during packing instead of disabling extra pre-padding.

🔧 Minimal fix
-    if pad_seq_to_mult > 1:
+    if pad_seq_to_mult is not None and pad_seq_to_mult > 1:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/data/datasets/packed_sequence.py` around lines 77 - 98,
The code currently compares pad_seq_to_mult > 1 unguarded which raises TypeError
when pad_seq_to_mult is None; fix by using the normalized value or an explicit
None check—replace the unconditional check with either "if pad_seq_to_mult is
not None and pad_seq_to_mult > 1" or "if pad_seq_length_to_mult > 1" (since
pad_seq_length_to_mult is set to 1 when pad_seq_to_mult is None) in
packed_sequence.py so extra pre-padding is only applied when a valid multiplier
was provided.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 70b3ae17-3591-4d4c-9dfe-cecade391318

📥 Commits

Reviewing files that changed from the base of the PR and between 07f1b5c and 23fc194.

📒 Files selected for processing (7)
  • pyproject.toml
  • src/megatron/bridge/data/builders/finetuning_dataset.py
  • src/megatron/bridge/data/datasets/packed_parquet.py
  • src/megatron/bridge/data/datasets/packed_sequence.py
  • src/megatron/bridge/data/datasets/sft.py
  • tests/unit_tests/data/datasets/test_chat_template.py
  • tests/unit_tests/data/datasets/test_packed_parquet.py

Comment on lines +142 to +150
        packed_path_str = str(packed_path)
        if packed_path_str.lower().endswith(".npy"):
            warnings.warn(
                "Automatic .npy packed sequence preparation is deprecated and will be removed in the next release. "
                "Please use packed parquet format instead.",
                DeprecationWarning,
                stacklevel=3,
            )
            return

⚠️ Potential issue | 🟠 Major

Fail fast when automatic .npy preparation is no longer supported.

This branch only warns and returns, so the build can continue until _create_dataset() later hands back None for the missing packed path. Raising a clear error here would make the deprecation actionable.

🔧 Suggested change
         if packed_path_str.lower().endswith(".npy"):
-            warnings.warn(
-                "Automatic .npy packed sequence preparation is deprecated and will be removed in the next release. "
-                "Please use packed parquet format instead.",
-                DeprecationWarning,
-                stacklevel=3,
-            )
-            return
+            raise NotImplementedError(
+                "Automatic .npy packed sequence preparation is deprecated and no longer supported. "
+                "Please switch the packed output path to `.parquet` or `.pq`."
+            )
Based on learnings, when a feature is not supported, raise an explicit error instead of silently ignoring the input to fail fast with a clear message.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/data/builders/finetuning_dataset.py` around lines 142 -
150, The current check in the builder where
packed_path_str.lower().endswith(".npy") only emits a DeprecationWarning and
returns, which lets the build proceed and later fail with a None-packed path;
change this to fail fast by raising an explicit error (e.g., ValueError or
RuntimeError) in the same branch inside the method handling packed paths (the
block referencing packed_path_str and .endswith(".npy")), replacing the
warnings.warn + return with a raised exception that includes the deprecation
message and actionable guidance to use packed parquet format so callers
immediately see the unsupported input.

Comment on lines +272 to +320
    def _validate_packed_path(self, attr_name: str, path_value: str) -> None:
        """Validate a packed data path and store it appropriately.

        For .npy files: strict validation with Path.exists()
        For packed parquet specs: validate via resolution (supports dirs/globs)

        Args:
            attr_name: The attribute name being validated (for error messages)
            path_value: The path value to validate

        Raises:
            FileNotFoundError: If the path does not exist or resolves to no files
            ValueError: If the path format is invalid
        """
        path_str = str(path_value)

        # Check if it's an .npy file (legacy format)
        if path_str.lower().endswith(".npy"):
            warnings.warn(
                f"The .npy packed sequence format is deprecated and will be removed in the next release. "
                f"Please use packed parquet format instead. Path: {path_str}",
                DeprecationWarning,
                stacklevel=2,
            )
            if MultiStorageClientFeature.is_enabled():
                msc = MultiStorageClientFeature.import_package()
                self.packed_val_data_path = msc.Path(self.packed_val_data_path)
                path_obj = msc.Path(path_str)
            else:
                self.packed_val_data_path = Path(self.packed_val_data_path)
                assert self.packed_val_data_path.suffix == ".npy", (
                    f"packed validation data file must be a .npy file: {self.packed_val_data_path}"
                )
                assert self.packed_val_data_path.exists(), (
                    f"packed validation data file does not exist: {self.packed_val_data_path}"
                )
                path_obj = Path(path_str)

            if self.pad_seq_to_mult is not None and self.pad_seq_to_mult <= 0:
                raise ValueError("pad_seq_to_mult must be a positive integer when provided.")
            if not path_obj.exists():
                raise FileNotFoundError(f"{attr_name} file does not exist: {path_str}")
            setattr(self, attr_name, path_obj)
            return

        # Check if it's a packed parquet spec (file/dir/glob)
        if is_packed_parquet_spec(path_str):
            # Validate by resolving - this checks that files actually exist
            try:
                resolved_paths = resolve_packed_parquet_paths(path_str)
                if len(resolved_paths) == 0:
                    raise FileNotFoundError(f"{attr_name} resolved to no files: {path_str}")
            except ValueError as e:
                raise FileNotFoundError(f"{attr_name} could not be resolved: {path_str}. Error: {e}") from e

            # Store the original string spec (not Path) to preserve globs
            # The dataset loader will handle resolution
            setattr(self, attr_name, path_str)
            return

⚠️ Potential issue | 🟠 Major

Don't require custom packed output files to pre-exist.

FinetuningDatasetBuilder._prepare_packed_split() uses packed_train_data_path and packed_val_data_path as write targets when packed data is missing. Requiring something like custom/train.idx.parquet to resolve here makes explicit output paths unusable unless the file is already there.

🧰 Tools
🪛 Ruff (0.15.4)

[warning] 303-303: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 313-313: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 315-315: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/data/datasets/packed_sequence.py` around lines 272 - 320,
_validate_packed_path currently requires packed parquet specs to resolve to
existing files which prevents using explicit output targets; change the logic
inside the is_packed_parquet_spec branch to skip resolution when the spec is a
literal single-file target (no glob characters like '*?[]' and not a directory
spec), i.e. treat plain file paths (e.g. "custom/train.idx.parquet") as valid
write targets and simply setattr(self, attr_name, path_str) without calling
resolve_packed_parquet_paths; continue to call resolve_packed_parquet_paths and
raise FileNotFoundError for patterns/globs or directory-like specs, referencing
_validate_packed_path, is_packed_parquet_spec, and resolve_packed_parquet_paths
to locate where to change the behavior.

Comment on lines +210 to +216
        if is_packed_parquet_spec(path_str):
            return GPTSFTPackedParquetDataset(
                pack_metadata_file_path=pack_metadata_file_path,
                pad_cu_seqlens=pad_cu_seqlens,
                **gpt_sft_dataset_kwargs,
                **kwargs,
            )

⚠️ Potential issue | 🟠 Major

Forward pad_seq_to_mult in the packed-Parquet branch.

The .npy path passes this through, but the Parquet path drops it. That leaves _pad_seq_to_mult at 1 inside GPTSFTPackedDataset.collate_fn(), so packed Parquet runs never emit cu_seqlens_unpadded for CP/THD-prepared data.

🔧 Minimal fix
     if is_packed_parquet_spec(path_str):
         return GPTSFTPackedParquetDataset(
             pack_metadata_file_path=pack_metadata_file_path,
             pad_cu_seqlens=pad_cu_seqlens,
+            pad_seq_to_mult=pad_seq_to_mult,
             **gpt_sft_dataset_kwargs,
             **kwargs,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/data/datasets/sft.py` around lines 210 - 216, The
packed-Parquet branch instantiates GPTSFTPackedParquetDataset without forwarding
the pad_seq_to_mult argument, causing _pad_seq_to_mult to remain 1 in
GPTSFTPackedDataset.collate_fn; modify the GPTSFTPackedParquetDataset
constructor call to pass pad_seq_to_mult=pad_seq_to_mult (same name used by the
.npy branch) so the collate_fn receives the intended padding multiplier and
emits cu_seqlens_unpadded correctly.

Comment on lines +452 to +535
    @patch("megatron.bridge.data.datasets.sft.GPTSFTPackedParquetDataset")
    def test_create_packed_parquet_dataset_idx_parquet(self, mock_parquet_class):
        """Test that .idx.parquet files create GPTSFTPackedParquetDataset."""
        from pathlib import Path

        mock_tokenizer = MagicMock()
        mock_parquet_class.return_value = MagicMock()

        create_sft_dataset(
            path=Path("test.idx.parquet"),
            tokenizer=mock_tokenizer,
        )

        # Verify GPTSFTPackedParquetDataset was called
        mock_parquet_class.assert_called_once()

    @patch("megatron.bridge.data.datasets.sft.GPTSFTPackedParquetDataset")
    def test_create_packed_parquet_dataset_idx_pq(self, mock_parquet_class):
        """Test that .idx.pq files create GPTSFTPackedParquetDataset."""
        from pathlib import Path

        mock_tokenizer = MagicMock()
        mock_parquet_class.return_value = MagicMock()

        create_sft_dataset(
            path=Path("test.idx.pq"),
            tokenizer=mock_tokenizer,
        )

        # Verify GPTSFTPackedParquetDataset was called
        mock_parquet_class.assert_called_once()

    @patch("megatron.bridge.data.datasets.sft.GPTSFTPackedParquetDataset")
    def test_create_packed_parquet_dataset_priority_over_chat(self, mock_parquet_class):
        """Test that packed Parquet files take precedence over chat=True."""
        from pathlib import Path

        mock_tokenizer = MagicMock()
        mock_parquet_class.return_value = MagicMock()

        create_sft_dataset(
            path=Path("test.idx.parquet"),
            tokenizer=mock_tokenizer,
            chat=True,  # Should be ignored for packed Parquet files
            use_hf_tokenizer_chat_template=True,
        )

        # Verify GPTSFTPackedParquetDataset was called (not GPTSFTChatDataset)
        mock_parquet_class.assert_called_once()

    @patch("megatron.bridge.data.datasets.sft.GPTSFTChatDataset")
    def test_regular_parquet_not_routed_to_packed(self, mock_chat_class):
        """Test that regular .parquet files (without .idx.) are NOT routed to packed dataset."""
        from pathlib import Path

        mock_tokenizer = MagicMock()
        mock_chat_class.return_value = MagicMock()

        create_sft_dataset(
            path=Path("test.parquet"),  # No .idx. prefix
            tokenizer=mock_tokenizer,
            chat=True,
            use_hf_tokenizer_chat_template=True,
        )

        # Verify GPTSFTChatDataset was called (regular parquet goes to chat/default)
        mock_chat_class.assert_called_once()

    @patch("megatron.bridge.data.datasets.sft.GPTSFTPackedParquetDataset")
    def test_create_packed_parquet_glob_pattern(self, mock_parquet_class):
        """Test that glob patterns like data*.idx.parquet route to GPTSFTPackedParquetDataset."""
        from pathlib import Path

        mock_tokenizer = MagicMock()
        mock_parquet_class.return_value = MagicMock()

        create_sft_dataset(
            path=Path("data/shard_*.idx.parquet"),  # Glob pattern
            tokenizer=mock_tokenizer,
        )

        # Verify GPTSFTPackedParquetDataset was called
        mock_parquet_class.assert_called_once()


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check where GPTSFTPackedParquetDataset is imported/used in sft.py
echo "=== GPTSFTPackedParquetDataset usage in sft.py ===" 
rg -n -C3 'GPTSFTPackedParquetDataset' \
  src/megatron/bridge/data/datasets/sft.py

echo -e "\n=== is_packed_parquet_spec and suffix handling in packed_parquet.py ==="
rg -n -C3 'def is_packed_parquet_spec|endswith.*parquet|endswith.*\.pq' \
  src/megatron/bridge/data/datasets/packed_parquet.py

echo -e "\n=== Test patches in test_chat_template.py (lines 452-535) ==="
sed -n '452,535p' tests/unit_tests/data/datasets/test_chat_template.py | grep -n '@patch\|def test_'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3450


Fix patch targets and correct the .parquet routing expectation.

The patch targets must reference the actual import location. Since GPTSFTPackedParquetDataset is imported lazily inside create_sft_dataset() from megatron.bridge.data.datasets.packed_parquet, patch megatron.bridge.data.datasets.packed_parquet.GPTSFTPackedParquetDataset instead of sft.GPTSFTPackedParquetDataset.

Additionally, test_regular_parquet_not_routed_to_packed has an incorrect expectation. The is_packed_parquet_spec() function in packed_parquet.py returns True for any path ending with .parquet or .pq (line 87), so a regular .parquet file will be routed to GPTSFTPackedParquetDataset, not GPTSFTChatDataset. Either adjust the test to expect packed routing, or change the path to a non-parquet extension if testing non-packed behavior is intended.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/data/datasets/test_chat_template.py` around lines 452 - 535,
Update the tests to patch the actual lazy import location and fix the
expectation for regular .parquet routing: change patch targets from
"megatron.bridge.data.datasets.sft.GPTSFTPackedParquetDataset" to
"megatron.bridge.data.datasets.packed_parquet.GPTSFTPackedParquetDataset" for
tests that expect packed routing, and in
test_regular_parquet_not_routed_to_packed either assert
GPTSFTPackedParquetDataset is called (since is_packed_parquet_spec() treats
*.parquet as packed) or change the test path to a non-parquet extension (e.g.,
.json) if you want to assert routing to GPTSFTChatDataset; reference
create_sft_dataset, GPTSFTPackedParquetDataset, GPTSFTChatDataset, and
is_packed_parquet_spec when making the edits.
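The point about patch targets generalizes: mock.patch must target the namespace where the name is resolved at call time. Because the class is imported lazily inside the function body, the lookup happens against the defining module on every call, so patching the defining module works while patching the caller's module does not. A small self-contained illustration using the stdlib (json.loads stands in here for the lazily imported dataset class; none of the PR's own modules are involved):

```python
from unittest import mock


def decode(payload: str):
    # Lazy import inside the function, mirroring the pattern in
    # create_sft_dataset(): the name is re-resolved from the json
    # module on each call, so patching "json.loads" (the defining
    # module) takes effect for calls made while the patch is active.
    from json import loads
    return loads(payload)


with mock.patch("json.loads", return_value="stubbed"):
    result = decode("{}")  # resolves to the patched json.loads
# result == "stubbed" inside the patch; real parsing resumes afterwards.
```

Had decode() done a top-level `from json import loads` at module import time, patching "json.loads" would no longer reach the already-bound local name, which is exactly the trap the review comment describes.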

@yaoyu-33
Contributor

@marcromeyn can you resolve the CodeRabbit comments if not applicable? I do see some of these are reasonable comments.

@yaoyu-33
Contributor

@marcromeyn ^^

@yaoyu-33 yaoyu-33 added the needs-author Author action is required before review or merge can continue label Mar 14, 2026