
Conversation

ChenhanYu
Collaborator

@ChenhanYu ChenhanYu commented Sep 18, 2025

What does this PR do?

Type of change: new feature

Overview:

  1. We are adding an additional file, modelopt_run_config.yaml, to the Megatron-LM or Megatron-Bridge checkpoint that we process. It is essentially a dump of the TransformerConfig to YAML, which can be handy for knowledge distillation. The NeMo2 checkpoint has a similar file, model_config.yaml, and Megatron-Bridge has run_config.yaml, so this is mainly useful for Megatron-LM.
  2. We add a main function for Megatron data preprocessing. The minimum input is --tokenizer, which accepts a PreTrainedTokenizer path. It can process datasets from the HF hub directly via --dataset. If subset and split are not provided, all subsets and splits will be processed.

Usage

 python megatron_preprocess_data.py \
     --dataset "nvidia/Nemotron-Pretraining-Dataset-sample" \
     --tokenizer "meta-llama/Llama-3.2-1B-Instruct" \
     --output_dir "./processed_data"

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • New Features
    • Checkpoint saves now include a companion run-configuration YAML (sanitized model settings + library version) alongside checkpoints to improve reproducibility; file is written only by the primary process.
    • New command-line interface for data preprocessing: fetch from Hugging Face or use local files, validate required fields, auto-generate JSONL inputs, and run tokenization/binarization with configurable dataset, split, tokenizer, keys, sequence length, workers, logging, and EOD handling.

@ChenhanYu ChenhanYu requested review from a team as code owners September 18, 2025 18:36
@ChenhanYu ChenhanYu self-assigned this Sep 18, 2025

coderabbitai bot commented Sep 18, 2025

Walkthrough

Adds master-only YAML export of a sanitized transformer config during distributed checkpoint saving (filters keys, stringifies values, and writes modelopt_run_config.yaml), and adds a CLI entrypoint for Megatron preprocessing that can fetch HuggingFace dataset splits, validate json keys, write per-split JSONL files, and invoke the existing preprocessing function.

Changes

Cohort / File(s) Summary of Changes
Dist checkpoint YAML export
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
Adds import yaml and a global DROP_SUBSTRINGS list; introduces a nested _parse_transformer_config(transformer_config: dict) -> dict to sanitize/stringify config fields (keep primitives, stringify others, drop keys containing substrings); on master (dist.is_master()) writes {checkpoint_name}/modelopt_run_config.yaml using a deepcopy of model[0].config.__dict__ plus nvidia_modelopt_version; uses yaml.dump; no public API changes.
Megatron preprocessing CLI integration
modelopt/torch/utils/plugins/megatron_preprocess_data.py
Adds main() and if __name__ == "__main__" entry; imports argparse, requests, and datasets.load_dataset; CLI accepts input_path or HF dataset params (dataset, subset, split), output_dir, tokenizer and preprocessing args, json_keys, etc.; if no input_path, downloads/validates splits from HuggingFace, writes per-split output_dir/{dataset}_{subset}_{split}.jsonl, collects files and then calls existing megatron_preprocess_data; core preprocessing function unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Trainer
  participant SaveFn as save_sharded_modelopt_state
  participant Model
  participant FS as Filesystem

  User->>Trainer: Trigger distributed checkpoint save
  Trainer->>SaveFn: save_sharded_modelopt_state(model, checkpoint_name, ...)
  SaveFn->>Model: Read model[0].config.__dict__
  SaveFn->>SaveFn: _parse_transformer_config(config) [filter/drop/stringify]
  alt Master process only
    SaveFn->>FS: Write {checkpoint_name}/modelopt_run_config.yaml (yaml.dump, deepcopy, include nvidia_modelopt_version)
  end
  SaveFn-->>Trainer: Continue existing save flow
sequenceDiagram
  autonumber
  participant User
  participant CLI as megatron_preprocess_data.py:main
  participant HF as HuggingFace Datasets
  participant FS as Filesystem
  participant Pre as megatron_preprocess_data

  User->>CLI: Invoke with args (input_path or dataset/subset/split, output_dir, json_keys, ...)
  alt input_path provided
    CLI->>Pre: megatron_preprocess_data(input_path_files, tokenizer, ...)
  else Fetch from HF
    CLI->>HF: load_dataset(dataset, subset, split(s))
    HF-->>CLI: Dataset(s)
    loop for each split
      CLI->>CLI: Validate json_keys against dataset.features
      alt keys valid
        CLI->>FS: Write output_dir/{dataset}_{subset}_{split}.jsonl
        CLI->>CLI: Append path to input list
      else invalid
        CLI->>CLI: Skip split
      end
    end
    CLI->>Pre: megatron_preprocess_data(collected input files, tokenizer, ...)
  end
  Pre-->>CLI: Tokenization and binary output complete
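A condensed sketch of the CLI flow diagrammed above. Argument parsing is omitted, the helper name is made up, and the filename sanitization follows the review suggestions later in this thread, so treat it as an approximation rather than the merged code:

```python
from pathlib import Path

import requests
from datasets import load_dataset


def _collect_jsonl_inputs(dataset_name, subset, split, json_keys, output_dir):
    """Fetch matching HF splits, validate keys, dump each split to JSONL, return the file list."""
    input_files = []
    response = requests.get(
        "https://datasets-server.huggingface.co/splits",
        params={"dataset": dataset_name},
        timeout=10,
    )
    for entry in response.json().get("splits", []):
        if subset is not None and subset != entry["config"]:
            continue
        if split is not None and split != entry["split"]:
            continue
        ds = load_dataset(dataset_name, entry["config"], split=entry["split"])
        if any(key not in ds.features for key in json_keys):
            continue  # skip splits that do not contain the requested json_keys
        safe_name = dataset_name.replace("/", "__")
        json_file_path = Path(output_dir) / f"{safe_name}_{entry['config']}_{entry['split']}.jsonl"
        ds.to_json(str(json_file_path))
        input_files.append(str(json_file_path))
    # The collected files are then handed to the existing megatron_preprocess_data().
    return input_files
```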

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

Thump-thump I tidy keys and strings, a rabbit with a pen,
I drop the noisy substrings, nibble bytes and YAML then.
I fetch splits, I write the lines, I call the tokenizer’s art,
A hop, a dump, the pipeline hums — I’m joyful in my part. 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly and accurately summarizes the two primary changes in the PR — adding modelopt_run_config.yaml and introducing a main/CLI for Megatron data preprocessing — and directly reflects the changes described in the diff and PR objectives, making it clear and relevant for reviewers scanning history. It is concise, specific, and free of noise or vague phrasing.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

@ChenhanYu ChenhanYu changed the title from "Adding modelopt_run_config.yaml and a main function for data" to "Adding modelopt_run_config.yaml and a main function for megatron data preprocessing" Sep 18, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (3)

88-94: Appending EOD can exceed max_sequence_length

Reserve 1 slot for EOS when truncating.

Apply this diff:

-            enc_len += len(encoded)
-            if self.max_sequence_length is not None:
-                encoded = encoded[: self.max_sequence_length]
+            enc_len += len(encoded)
+            if self.max_sequence_length is not None:
+                max_len = self.max_sequence_length - (1 if self.append_eod else 0)
+                encoded = encoded[: max(0, max_len)]
-            if len(encoded) > 0 and self.append_eod:
-                encoded.append(_Encoder.tokenizer.eos_token_id)
+            if self.append_eod:
+                encoded.append(_Encoder.tokenizer.eos_token_id)

119-124: Pool leak and potential spawn pickling issue; use context manager and top-level initializer

Close the Pool deterministically and avoid bound-method initializer.

Apply this diff:

-        fin = open(input_file_name, encoding="utf-8")
-
-        pool = multiprocessing.Pool(self.workers, initializer=encoder.initializer)
-        encoded_docs = pool.imap(encoder.encode, fin, 32)
+        with open(input_file_name, encoding="utf-8") as fin:
+            with multiprocessing.Pool(
+                self.workers,
+                initializer=_pool_init,
+                initargs=(encoder.tokenizer_name_or_path,),
+            ) as pool:
+                encoded_docs = pool.imap(encoder.encode, fin, 32)

Add this helper (outside this hunk) to the module:

def _pool_init(tokenizer_name_or_path: str):
    _Encoder.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)

144-152: KeyError risk when some outputs already exist + token counter bug

Builders may be missing for some keys; loop must honor active builder keys. Also final_enc_len uses a stale “key” variable.

Apply this diff:

-        total_doc_len, total_enc_len, final_enc_len = 0, 0, 0
-        for i, (doc, sentence_lens, (doc_len, enc_len)) in enumerate(encoded_docs, start=1):
+        total_doc_len, total_enc_len, final_enc_len = 0, 0, 0
+        active_keys = tuple(builders.keys())
+        for i, (doc, sentence_lens, (doc_len, enc_len)) in enumerate(encoded_docs, start=1):
             total_doc_len += doc_len
             total_enc_len += enc_len
-            final_enc_len += sum(sentence_lens[key])
-            for key in doc:
-                builders[key].add_document(doc[key], sentence_lens[key])
+            final_enc_len += sum(sum(sentence_lens[k]) for k in active_keys if k in sentence_lens)
+            for k in active_keys:
+                builders[k].add_document(doc[k], sentence_lens[k])
             self._print_processing_stats(i, total_doc_len, total_enc_len)
🧹 Nitpick comments (2)
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (1)

25-25: Avoid hard dependency on PyYAML at import time

Move yaml import to the write-path to prevent module import errors when this code path isn’t used.

Apply this diff:

-import yaml
+# (lazy import inside save_sharded_modelopt_state to avoid hard dependency)
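A small illustration of the lazy-import pattern being suggested; the helper name _dump_run_config is hypothetical, and in the PR the write happens inside save_sharded_modelopt_state:

```python
def _dump_run_config(config_dict: dict, run_config_name: str) -> None:
    # Deferred import: PyYAML is only required when a run-config file is actually written.
    import yaml

    with open(run_config_name, "w") as f:
        yaml.dump(config_dict, f)
```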
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)

108-109: Guard modulo when log_interval is 0

Avoid ZeroDivisionError.

Apply this diff:

-        if count % self.log_interval == 0:
+        if self.log_interval and count % self.log_interval == 0:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c7c16e and e1f1bbb.

📒 Files selected for processing (2)
  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (2 hunks)
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (1)
modelopt/torch/utils/distributed.py (1)
  • is_master (75-77)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (2)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)

34-41: Declare runtime deps: requests, datasets

  • File: modelopt/torch/utils/plugins/megatron_preprocess_data.py (lines 34–41) imports requests and datasets — add these to runtime dependency manifests (requirements.txt, pyproject.toml, setup.cfg install_requires) and document their inclusion.
  • Verification: ran the supplied script; it returned no output so manifests could not be located in the sandbox. Manually verify dependency manifests and add entries if missing.
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (1)

143-149: Confirm dependency declaration for PyYAML

modelopt/torch/opt/plugins/mcore_dist_checkpointing.py imports yaml and writes a YAML file at runtime (lines 143–149); ensure 'pyyaml' is declared in pyproject.toml / setup.py / requirements.txt — my automated scan hit encoding errors and couldn't reliably verify manifests, so confirm and add if missing.


codecov bot commented Sep 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.82%. Comparing base (bbb2304) to head (2caf0fa).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #341   +/-   ##
=======================================
  Coverage   73.82%   73.82%           
=======================================
  Files         172      172           
  Lines       17438    17438           
=======================================
  Hits        12874    12874           
  Misses       4564     4564           

☔ View full report in Codecov by Sentry.

if args.input_path is None:
    args.input_path = []
if args.dataset is None:
    args.dataset = "nvidia/Nemotron-Pretraining-Dataset-sample"
Contributor

why not set --dataset default in the parser above?
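For illustration, the reviewer's suggestion would look roughly like this, moving the default into the parser rather than patching args afterwards (parser construction shown only for context):

```python
import argparse

parser = argparse.ArgumentParser(prog="megatron_preprocess_data")
parser.add_argument(
    "--dataset",
    type=str,
    default="nvidia/Nemotron-Pretraining-Dataset-sample",  # default set here instead of post-parse
    help="Hugging Face Hub dataset name",
)
```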

split = entry["split"]

if args.subset is not None and args.subset != subset:
    continue
Contributor

can you replace lines 253 and 255 with skip_process=True to make this clearer

Signed-off-by: Chenhan Yu <[email protected]>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (3)

231-231: Boolean flag parsing: resolved as suggested

Switch to action="store_true" looks good and addresses earlier feedback.


247-248: Guard JSON parsing and missing "splits" key

Accessing response.json()["splits"] assumes happy path. Guard and early‑out if empty.

Apply this diff:

-        for entry in response.json()["splits"]:
+        data = response.json()
+        splits = data.get("splits", [])
+        if not splits:
+            print(f"No splits found for dataset {args.dataset}; server response keys: {list(data.keys())}")
+            return
+        for entry in splits:

239-246: Create output_dir before writing JSONL files

You write JSONL files before megatron_preprocess_data() creates the directory. Ensure it exists here to avoid runtime failures.

Apply this diff:

     if args.input_path is None:
         args.input_path = []
+        Path(args.output_dir).mkdir(parents=True, exist_ok=True)
 
-        response = requests.get(
-            "https://datasets-server.huggingface.co/splits?dataset={}".format(args.dataset),
-            timeout=10,
-        )
+        # Fetch splits with robust error handling
+        try:
+            response = requests.get(
+                "https://datasets-server.huggingface.co/splits",
+                params={"dataset": args.dataset},
+                timeout=10,
+            )
+            response.raise_for_status()
+        except requests.RequestException as e:
+            print(f"Failed to fetch dataset splits for {args.dataset}: {e}")
+            return
🧹 Nitpick comments (6)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (6)

270-273: Sanitize filename components and use Path joins

Dataset names contain “/” and subset may be None. Current concatenation creates nested dirs and “None” in filenames.

Apply this diff:

-            json_file_path = args.output_dir + "/" + name + "_" + subset + "_" + split + ".jsonl"
-            dataset.to_json(json_file_path)
-            args.input_path += [json_file_path]
+            safe_name = name.replace("/", "__")
+            safe_subset = subset or "default"
+            json_file_path = Path(args.output_dir) / f"{safe_name}_{safe_subset}_{split}.jsonl"
+            dataset.to_json(str(json_file_path))
+            args.input_path.append(str(json_file_path))

119-124: Close resources deterministically (file + Pool)

Open file handle and multiprocessing.Pool aren’t context-managed. Use context managers to avoid leaks on exceptions.

Apply this diff:

-        print("Opening", input_file_name)
-        fin = open(input_file_name, encoding="utf-8")
-
-        pool = multiprocessing.Pool(self.workers, initializer=encoder.initializer)
-        encoded_docs = pool.imap(encoder.encode, fin, 32)
+        print("Opening", input_file_name)
+        with open(input_file_name, encoding="utf-8") as fin, \
+             multiprocessing.Pool(self.workers, initializer=encoder.initializer) as pool:
+            encoded_docs = pool.imap(encoder.encode, fin, 32)
@@
-        fin.close()
-        for key in builders:
+        for key in builders:
             builders[key].finalize(output_idx_files[key])
 
         return final_enc_len

Also applies to: 153-156


192-194: mkdir should create parent dirs

Use parents=True to handle nested output paths.

Apply this diff:

-    Path(output_dir).mkdir(exist_ok=True)
+    Path(output_dir).mkdir(parents=True, exist_ok=True)

67-69: Avoid double-loading the tokenizer (performance/cold-start)

Tokenizer is loaded once to read vocab_size and again in worker initializers. Caching helps, but startup is still duplicated. Consider:

  • Load once in parent, read vocab_size, and pass tokenizer_name_or_path plus env var TRANSFORMERS_OFFLINE=1 or local_files_only=True in workers; or
  • Add an optional vocab_size arg to bypass the first load (derive from tokenizer config if available).

Also applies to: 193-195
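A minimal sketch of the single-load idea; the module-level cache stands in for the plugin's worker-side tokenizer attribute, and names here are illustrative:

```python
from transformers import AutoTokenizer

_WORKER_TOKENIZER = None  # stand-in for the encoder's per-worker tokenizer


def get_vocab_size(tokenizer_name_or_path: str) -> int:
    # Parent process: single authoritative load, used only to read vocab_size.
    return AutoTokenizer.from_pretrained(tokenizer_name_or_path).vocab_size


def pool_initializer(tokenizer_name_or_path: str) -> None:
    # Worker processes: rebuild from the files the parent load already cached locally.
    global _WORKER_TOKENIZER
    _WORKER_TOKENIZER = AutoTokenizer.from_pretrained(
        tokenizer_name_or_path, local_files_only=True
    )
```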


61-63: Heuristic max_document_length (8x) — confirm intent

Hardcoding max_document_length = max_sequence_length * 8 is opaque. Either document the rationale or expose it as a CLI arg (e.g., --document_length_multiplier).
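If the CLI-arg route is taken, the flag could look roughly like this; --document_length_multiplier is the hypothetical name from the comment above:

```python
import argparse

parser = argparse.ArgumentParser(prog="megatron_preprocess_data")
parser.add_argument("--max_sequence_length", type=int, default=None)
parser.add_argument(
    "--document_length_multiplier",
    type=int,
    default=8,  # preserves the current 8x heuristic as the default
    help="max_document_length = max_sequence_length * this multiplier",
)
args = parser.parse_args()
max_document_length = (
    args.max_sequence_length * args.document_length_multiplier
    if args.max_sequence_length is not None
    else None
)
```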


261-266: Key validation against dataset columns can be more robust

Some datasets have nested or dict features; dataset.features may not reflect flattened column names. Prefer checking key in dataset.column_names and optionally warn with available columns.

Apply this diff:

-            for key in args.json_keys:
-                if key not in dataset.features:
+            for key in args.json_keys:
+                if key not in dataset.column_names:
                     print(f"Key {key} not found in dataset features. Skipping...")
                     skip_processing = True
                     break
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1f1bbb and 9e7c1a4.

📒 Files selected for processing (2)
  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (3 hunks)
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)
modelopt/onnx/quantization/__main__.py (1)
  • main (255-294)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality

Comment on lines +225 to +226
parser.add_argument("--split", type=str, default="train", help="Hugging Face Hub dataset split")
parser.add_argument(

⚠️ Potential issue

Fix default --split: make it None to process all splits when omitted

Current default "train" contradicts the PR description (“all subsets and splits if subset/split omitted”) and filters to train-only. Set default to None and keep the existing filter logic.

Apply this diff:

-    parser.add_argument("--split", type=str, default="train", help="Hugging Face Hub dataset split")
+    parser.add_argument(
+        "--split",
+        type=str,
+        default=None,
+        help="Hugging Face Hub dataset split (process all if omitted)",
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    parser.add_argument("--split", type=str, default="train", help="Hugging Face Hub dataset split")
-    parser.add_argument(
+    parser.add_argument(
+        "--split",
+        type=str,
+        default=None,
+        help="Hugging Face Hub dataset split (process all if omitted)",
+    )
+    parser.add_argument(
🤖 Prompt for AI Agents
In modelopt/torch/utils/plugins/megatron_preprocess_data.py around lines
225-226, the --split argument currently defaults to "train" which incorrectly
limits processing to the train split; change the parser.add_argument for
"--split" to use default=None so that when omitted the existing logic will
process all splits/subsets, leaving the current filter logic unchanged.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (4)

231-231: Nice fix: --append_eod uses store_true.


225-226: Default --split should be None to process all splits when omitted (matches PR description).

-    parser.add_argument("--split", type=str, default="train", help="Hugging Face Hub dataset split")
+    parser.add_argument(
+        "--split",
+        type=str,
+        default=None,
+        help="Hugging Face Hub dataset split (process all if omitted)",
+    )

239-241: Create output_dir before writing JSONL files.

Writing JSONL happens before megatron_preprocess_data() creates the dir; this can fail.

     if args.input_path is None:
         args.input_path = []
+        Path(args.output_dir).mkdir(parents=True, exist_ok=True)

275-277: Filename/path bug: dataset names contain “/” and subset can be None.

Current string concat can create unintended subdirs or “None” in filenames and fail. Sanitize components and use Path join; also write JSONL explicitly.

-            json_file_path = args.output_dir + "/" + name + "_" + subset + "_" + split + ".jsonl"
-            dataset.to_json(json_file_path)
-            args.input_path += [json_file_path]
+            safe_name = name.replace("/", "__")
+            safe_subset = subset or "default"
+            json_file_path = Path(args.output_dir) / f"{safe_name}_{safe_subset}_{split}.jsonl"
+            dataset.to_json(str(json_file_path), lines=True)
+            args.input_path.append(str(json_file_path))
🧹 Nitpick comments (4)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (4)

40-41: Make requests/datasets optional: move imports inside main().

Avoid forcing library users to install optional CLI deps. Lazy-import inside main().

-import requests
-from datasets import load_dataset
@@
 def main():
@@
-    parser = argparse.ArgumentParser(prog="megatron_preprocess_data")
+    # Lazy imports to keep optional CLI deps out of library import path
+    import requests
+    from datasets import load_dataset
+
+    parser = argparse.ArgumentParser(prog="megatron_preprocess_data")

Also applies to: 206-216


242-253: Harden JSON parsing for splits; guard missing/empty key.

Avoid KeyError if API shape changes and short‑circuit if no splits.

-        for entry in response.json()["splits"]:
+        data = response.json()
+        splits = data.get("splits", [])
+        if not splits:
+            print(f"No splits found for {args.dataset}.")
+            return
+        for entry in splits:

263-266: Wrap load_dataset with error handling to continue on bad splits/configs.

-            print(f"Loading dataset {name} with subset {subset} and split {split}")
-            dataset = load_dataset(name, subset, split=split)
+            print(f"Loading dataset {name} with subset {subset} and split {split}")
+            try:
+                dataset = load_dataset(name, subset, split=split)
+            except Exception as e:
+                print(f"Failed to load {name} (subset={subset}, split={split}): {e}")
+                continue

239-289: Optional: rely on datasets API to enumerate configs/splits and drop the raw HTTP call.

Using datasets.get_dataset_config_names / get_dataset_split_names removes the requests dependency and one external endpoint.

If you want, I can propose a minimal patch replacing the HTTP call with datasets helpers.
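For reference, a minimal sketch of that datasets-only enumeration (an illustration of the idea, not the proposed patch itself):

```python
from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset


def iter_dataset_splits(dataset_name: str, subset=None, split=None):
    """Yield (config, split, Dataset) triples without calling the datasets-server HTTP API."""
    for config in get_dataset_config_names(dataset_name):
        if subset is not None and subset != config:
            continue
        for split_name in get_dataset_split_names(dataset_name, config):
            if split is not None and split != split_name:
                continue
            yield config, split_name, load_dataset(dataset_name, config, split=split_name)
```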

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e7c1a4 and 2caf0fa.

📒 Files selected for processing (1)
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)

34-34: LGTM: argparse import is appropriate for the new CLI.

Comment on lines +40 to +41
import requests
from datasets import load_dataset

⚠️ Potential issue

🧩 Analysis chain

Ensure packaging lists new dependencies.

requests and datasets are new runtime deps for the CLI. Verify they’re declared (pyproject/requirements/extras).

Also applies to: 206-289


🏁 Script executed:

#!/bin/bash
# Verify packaging declares 'datasets' and 'requests'
rg -nP -C2 '^(datasets|requests)\b' -g 'pyproject.toml' -g 'requirements*.txt' -g 'setup.cfg' -g 'setup.py'

Length of output: 1129


Add 'requests' and 'datasets' to project packaging/runtime deps

modelopt/torch/utils/plugins/megatron_preprocess_data.py imports requests and datasets (lines 40–41; also referenced ~206–289). Search shows only 'datasets' in examples/*/requirements.txt and no top-level declaration for either. Add both to pyproject.toml (or setup.cfg / setup.py) or the CLI extras so the CLI runtime installs them.

🤖 Prompt for AI Agents
In modelopt/torch/utils/plugins/megatron_preprocess_data.py around lines 40-41
(and usages ~206-289), the module imports and uses the third-party packages
'requests' and 'datasets' but they are not declared in the top-level project
dependencies; add both 'requests' and 'datasets' to the project's
packaging/runtime dependencies (e.g., pyproject.toml [project]/dependencies or
setup.cfg/setup.py install_requires) or include them in the CLI extras so that
the CLI runtime installs them; pick appropriate minimal version constraints
consistent with examples/*/requirements.txt (or mirror examples) and run
dependency install/tests to verify imports resolve.

@ChenhanYu ChenhanYu enabled auto-merge (squash) September 19, 2025 03:51
@ChenhanYu ChenhanYu merged commit fb875ac into main Sep 19, 2025
22 checks passed
@ChenhanYu ChenhanYu deleted the chenhany/mcore_run_config_yml branch September 19, 2025 05:06

if dist.is_master():
    run_config_name = f"{checkpoint_name}/modelopt_run_config.yaml"
    config_dict = _parse_transformer_config(copy.deepcopy(model[0].config.__dict__))

@ananthsub ananthsub Sep 19, 2025

@ChenhanYu I'm seeing an issue with the deepcopy

these fields are set on the config during training:
https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/8bb2309034bf2eb27783293cd228d86407db1cf9/src/megatron/bridge/training/setup.py#L267C5-L295

https://github.com/NVIDIA/Megatron-LM/blob/c223178a33e239312c3afd81a8adc27cd9ca698c/megatron/training/training.py#L2038-L2057

These are bound methods on DDP class instances, which have PyTorch process groups initialized as member variables. The deepcopy here recurses into those class instances, causing issues like this:

b-pretrain/0 [rank0]:   File "/opt/TensorRT-Model-Optimizer/modelopt/torch/opt/plugins/mcore_dist_checkpointing.py", line 157, in save_sharded_modelopt_state
b-pretrain/0 [rank0]:     config_dict = _parse_transformer_config(copy.deepcopy(model[0].config.__dict__))
b-pretrain/0 [rank0]:                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 226, in _deepcopy_method
b-pretrain/0 [rank0]:     return type(x)(x.__func__, deepcopy(x.__self__, memo))
b-pretrain/0 [rank0]:                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]:     y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]:     state = deepcopy(state, memo)
b-pretrain/0 [rank0]:             ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 196, in _deepcopy_list
b-pretrain/0 [rank0]:     append(deepcopy(a, memo))
b-pretrain/0 [rank0]:            ^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]:     y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]:     state = deepcopy(state, memo)
b-pretrain/0 [rank0]:             ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]:     y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]:     state = deepcopy(state, memo)
b-pretrain/0 [rank0]:             ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]:     y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]:     state = deepcopy(state, memo)
b-pretrain/0 [rank0]:             ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]:     y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]:     state = deepcopy(state, memo)
b-pretrain/0 [rank0]:             ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]:     y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]:     state = deepcopy(state, memo)
b-pretrain/0 [rank0]:             ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 151, in deepcopy
b-pretrain/0 [rank0]:     rv = reductor(4)
b-pretrain/0 [rank0]:          ^^^^^^^^^^^
b-pretrain/0 [rank0]: TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object

In Megatron-Bridge, we have a YAML representer for this to serialize our configs to YAML:

https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/8bb2309034bf2eb27783293cd228d86407db1cf9/src/megatron/bridge/utils/yaml_utils.py#L24-L87

So we don't see this when serializing our overall job config to YAML (which includes the transformer config):

https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/8bb2309034bf2eb27783293cd228d86407db1cf9/src/megatron/bridge/training/checkpointing.py#L653
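For context, a generic PyYAML fallback representer looks roughly like this; it illustrates the approach and is not the Megatron-Bridge implementation linked above:

```python
import yaml


def _fallback_representer(dumper: yaml.Dumper, data):
    # Serialize anything without a dedicated representer (process groups, bound
    # methods, ...) as its string form instead of failing or deepcopying it.
    return dumper.represent_str(str(data))


# Applies to any object type that has no more specific representer registered.
yaml.add_multi_representer(object, _fallback_representer)
```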

Collaborator Author

I see. The deepcopy here is likely not necessary. Let me remove it.
