
Conversation

ChenhanYu
Collaborator

@ChenhanYu ChenhanYu commented Sep 18, 2025

What does this PR do?

Type of change: new feature

Overview:

  1. We are adding an additional file, modelopt_run_config.yaml, to the Megatron-LM or Megatron-Bridge checkpoint that we process. It is essentially a dump of the TransformerConfig to YAML, which can be handy for knowledge distillation. The NeMo2 checkpoint has a similar file, model_config.yaml, and Megatron-Bridge has run_config.yaml, so this is mainly useful for Megatron-LM.
  2. We add a main function for Megatron data preprocessing. The minimum input is --tokenizer, which accepts a PreTrainedTokenizer path. It can process datasets from the HF hub directly via --dataset. If subset and split are not provided, all subsets and splits will be processed.

Usage

 python megatron_preprocess_data.py \
     --dataset "nvidia/Nemotron-Pretraining-Dataset-sample" \
     --tokenizer "meta-llama/Llama-3.2-1B-Instruct" \
     --output_dir "./processed_data"

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • New Features
    • Checkpoint saves now include a companion run-configuration YAML (sanitized model settings + library version) alongside checkpoints to improve reproducibility; file is written only by the primary process.
    • New command-line interface for data preprocessing: fetch from Hugging Face or use local files, validate required fields, auto-generate JSONL inputs, and run tokenization/binarization with configurable dataset, split, tokenizer, keys, sequence length, workers, logging, and EOD handling.

@ChenhanYu ChenhanYu requested review from a team as code owners September 18, 2025 18:36
@ChenhanYu ChenhanYu self-assigned this Sep 18, 2025

coderabbitai bot commented Sep 18, 2025

Walkthrough

Adds master-only YAML export of a sanitized transformer config during distributed checkpoint saving (filters keys, stringifies values, and writes modelopt_run_config.yaml), and adds a CLI entrypoint for Megatron preprocessing that can fetch HuggingFace dataset splits, validate json keys, write per-split JSONL files, and invoke the existing preprocessing function.

Changes

Cohort / File(s) Summary of Changes
Dist checkpoint YAML export
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
Adds import yaml and a global DROP_SUBSTRINGS list; introduces a nested _parse_transformer_config(transformer_config: dict) -> dict to sanitize/stringify config fields (keep primitives, stringify others, drop keys containing substrings); on master (dist.is_master()) writes {checkpoint_name}/modelopt_run_config.yaml using a deepcopy of model[0].config.__dict__ plus nvidia_modelopt_version; uses yaml.dump; no public API changes.
Megatron preprocessing CLI integration
modelopt/torch/utils/plugins/megatron_preprocess_data.py
Adds main() and if __name__ == "__main__" entry; imports argparse, requests, and datasets.load_dataset; CLI accepts input_path or HF dataset params (dataset, subset, split), output_dir, tokenizer and preprocessing args, json_keys, etc.; if no input_path, downloads/validates splits from HuggingFace, writes per-split output_dir/{dataset}_{subset}_{split}.jsonl, collects files and then calls existing megatron_preprocess_data; core preprocessing function unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Trainer
  participant SaveFn as save_sharded_modelopt_state
  participant Model
  participant FS as Filesystem

  User->>Trainer: Trigger distributed checkpoint save
  Trainer->>SaveFn: save_sharded_modelopt_state(model, checkpoint_name, ...)
  SaveFn->>Model: Read model[0].config.__dict__
  SaveFn->>SaveFn: _parse_transformer_config(config) [filter/drop/stringify]
  alt Master process only
    SaveFn->>FS: Write {checkpoint_name}/modelopt_run_config.yaml (yaml.dump, deepcopy, include nvidia_modelopt_version)
  end
  SaveFn-->>Trainer: Continue existing save flow
sequenceDiagram
  autonumber
  participant User
  participant CLI as megatron_preprocess_data.py:main
  participant HF as HuggingFace Datasets
  participant FS as Filesystem
  participant Pre as megatron_preprocess_data

  User->>CLI: Invoke with args (input_path or dataset/subset/split, output_dir, json_keys, ...)
  alt input_path provided
    CLI->>Pre: megatron_preprocess_data(input_path_files, tokenizer, ...)
  else Fetch from HF
    CLI->>HF: load_dataset(dataset, subset, split(s))
    HF-->>CLI: Dataset(s)
    loop for each split
      CLI->>CLI: Validate json_keys against dataset.features
      alt keys valid
        CLI->>FS: Write output_dir/{dataset}_{subset}_{split}.jsonl
        CLI->>CLI: Append path to input list
      else invalid
        CLI->>CLI: Skip split
      end
    end
    CLI->>Pre: megatron_preprocess_data(collected input files, tokenizer, ...)
  end
  Pre-->>CLI: Tokenization and binary output complete
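A condensed sketch of the CLI flow diagrammed above. Argument parsing is omitted, the helper name is made up, and the filename sanitization follows the review suggestions later in this thread, so treat it as an approximation rather than the merged code:

```python
from pathlib import Path

import requests
from datasets import load_dataset


def _collect_jsonl_inputs(dataset_name, subset, split, json_keys, output_dir):
    """Fetch matching HF splits, validate keys, dump each split to JSONL, return the file list."""
    input_files = []
    response = requests.get(
        "https://datasets-server.huggingface.co/splits",
        params={"dataset": dataset_name},
        timeout=10,
    )
    for entry in response.json().get("splits", []):
        if subset is not None and subset != entry["config"]:
            continue
        if split is not None and split != entry["split"]:
            continue
        ds = load_dataset(dataset_name, entry["config"], split=entry["split"])
        if any(key not in ds.features for key in json_keys):
            continue  # skip splits that do not contain the requested json_keys
        safe_name = dataset_name.replace("/", "__")
        json_file_path = Path(output_dir) / f"{safe_name}_{entry['config']}_{entry['split']}.jsonl"
        ds.to_json(str(json_file_path))
        input_files.append(str(json_file_path))
    # The collected files are then handed to the existing megatron_preprocess_data().
    return input_files
```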

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

Thump-thump I tidy keys and strings, a rabbit with a pen,
I drop the noisy substrings, nibble bytes and YAML then.
I fetch splits, I write the lines, I call the tokenizer’s art,
A hop, a dump, the pipeline hums — I’m joyful in my part. 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly and accurately summarizes the two primary changes in the PR — adding modelopt_run_config.yaml and introducing a main/CLI for Megatron data preprocessing — and directly reflects the changes described in the diff and PR objectives, making it clear and relevant for reviewers scanning history. It is concise, specific, and free of noise or vague phrasing.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

@ChenhanYu ChenhanYu changed the title from "Adding modelopt_run_config.yaml and a main function for data" to "Adding modelopt_run_config.yaml and a main function for megatron data preprocessing" Sep 18, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (3)

88-94: Appending EOD can exceed max_sequence_length

Reserve 1 slot for EOS when truncating.

Apply this diff:

-            enc_len += len(encoded)
-            if self.max_sequence_length is not None:
-                encoded = encoded[: self.max_sequence_length]
+            enc_len += len(encoded)
+            if self.max_sequence_length is not None:
+                max_len = self.max_sequence_length - (1 if self.append_eod else 0)
+                encoded = encoded[: max(0, max_len)]
-            if len(encoded) > 0 and self.append_eod:
-                encoded.append(_Encoder.tokenizer.eos_token_id)
+            if self.append_eod:
+                encoded.append(_Encoder.tokenizer.eos_token_id)

119-124: Pool leak and potential spawn pickling issue; use context manager and top-level initializer

Close the Pool deterministically and avoid bound-method initializer.

Apply this diff:

-        fin = open(input_file_name, encoding="utf-8")
-
-        pool = multiprocessing.Pool(self.workers, initializer=encoder.initializer)
-        encoded_docs = pool.imap(encoder.encode, fin, 32)
+        with open(input_file_name, encoding="utf-8") as fin:
+            with multiprocessing.Pool(
+                self.workers,
+                initializer=_pool_init,
+                initargs=(encoder.tokenizer_name_or_path,),
+            ) as pool:
+                encoded_docs = pool.imap(encoder.encode, fin, 32)

Add this helper (outside this hunk) to the module:

def _pool_init(tokenizer_name_or_path: str):
    _Encoder.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)

144-152: KeyError risk when some outputs already exist + token counter bug

Builders may be missing for some keys; loop must honor active builder keys. Also final_enc_len uses a stale “key” variable.

Apply this diff:

-        total_doc_len, total_enc_len, final_enc_len = 0, 0, 0
-        for i, (doc, sentence_lens, (doc_len, enc_len)) in enumerate(encoded_docs, start=1):
+        total_doc_len, total_enc_len, final_enc_len = 0, 0, 0
+        active_keys = tuple(builders.keys())
+        for i, (doc, sentence_lens, (doc_len, enc_len)) in enumerate(encoded_docs, start=1):
             total_doc_len += doc_len
             total_enc_len += enc_len
-            final_enc_len += sum(sentence_lens[key])
-            for key in doc:
-                builders[key].add_document(doc[key], sentence_lens[key])
+            final_enc_len += sum(sum(sentence_lens[k]) for k in active_keys if k in sentence_lens)
+            for k in active_keys:
+                builders[k].add_document(doc[k], sentence_lens[k])
             self._print_processing_stats(i, total_doc_len, total_enc_len)
🧹 Nitpick comments (2)
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (1)

25-25: Avoid hard dependency on PyYAML at import time

Move yaml import to the write-path to prevent module import errors when this code path isn’t used.

Apply this diff:

-import yaml
+# (lazy import inside save_sharded_modelopt_state to avoid hard dependency)
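A small illustration of the lazy-import pattern being suggested; the helper name _dump_run_config is hypothetical, and in the PR the write happens inside save_sharded_modelopt_state:

```python
def _dump_run_config(config_dict: dict, run_config_name: str) -> None:
    # Deferred import: PyYAML is only required when a run-config file is actually written.
    import yaml

    with open(run_config_name, "w") as f:
        yaml.dump(config_dict, f)
```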
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)

108-109: Guard modulo when log_interval is 0

Avoid ZeroDivisionError.

Apply this diff:

-        if count % self.log_interval == 0:
+        if self.log_interval and count % self.log_interval == 0:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c7c16e and e1f1bbb.

📒 Files selected for processing (2)
  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (2 hunks)
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (1)
modelopt/torch/utils/distributed.py (1)
  • is_master (75-77)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (2)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)

34-41: Declare runtime deps: requests, datasets

  • File: modelopt/torch/utils/plugins/megatron_preprocess_data.py (lines 34–41) imports requests and datasets — add these to runtime dependency manifests (requirements.txt, pyproject.toml, setup.cfg install_requires) and document their inclusion.
  • Verification: ran the supplied script; it returned no output so manifests could not be located in the sandbox. Manually verify dependency manifests and add entries if missing.
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (1)

143-149: Confirm dependency declaration for PyYAML

modelopt/torch/opt/plugins/mcore_dist_checkpointing.py imports yaml and writes a YAML file at runtime (lines 143–149); ensure 'pyyaml' is declared in pyproject.toml / setup.py / requirements.txt — my automated scan hit encoding errors and couldn't reliably verify manifests, so confirm and add if missing.


codecov bot commented Sep 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.82%. Comparing base (bbb2304) to head (2caf0fa).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #341   +/-   ##
=======================================
  Coverage   73.82%   73.82%           
=======================================
  Files         172      172           
  Lines       17438    17438           
=======================================
  Hits        12874    12874           
  Misses       4564     4564           

☔ View full report in Codecov by Sentry.

if args.input_path is None:
    args.input_path = []
if args.dataset is None:
    args.dataset = "nvidia/Nemotron-Pretraining-Dataset-sample"
Contributor

why not set --dataset default in the parser above?
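For illustration, the reviewer's suggestion would look roughly like this, moving the default into the parser rather than patching args afterwards (parser construction shown only for context):

```python
import argparse

parser = argparse.ArgumentParser(prog="megatron_preprocess_data")
parser.add_argument(
    "--dataset",
    type=str,
    default="nvidia/Nemotron-Pretraining-Dataset-sample",  # default set here instead of post-parse
    help="Hugging Face Hub dataset name",
)
```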

split = entry["split"]

if args.subset is not None and args.subset != subset:
    continue
Contributor

can you replace lines 253 and 255 with skip_process=True to make this clearer

Signed-off-by: Chenhan Yu <[email protected]>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (3)

231-231: Boolean flag parsing: resolved as suggested

Switch to action="store_true" looks good and addresses earlier feedback.


247-248: Guard JSON parsing and missing "splits" key

Accessing response.json()["splits"] assumes happy path. Guard and early‑out if empty.

Apply this diff:

-        for entry in response.json()["splits"]:
+        data = response.json()
+        splits = data.get("splits", [])
+        if not splits:
+            print(f"No splits found for dataset {args.dataset}; server response keys: {list(data.keys())}")
+            return
+        for entry in splits:

239-246: Create output_dir before writing JSONL files

You write JSONL files before megatron_preprocess_data() creates the directory. Ensure it exists here to avoid runtime failures.

Apply this diff:

     if args.input_path is None:
         args.input_path = []
+        Path(args.output_dir).mkdir(parents=True, exist_ok=True)
 
-        response = requests.get(
-            "https://datasets-server.huggingface.co/splits?dataset={}".format(args.dataset),
-            timeout=10,
-        )
+        # Fetch splits with robust error handling
+        try:
+            response = requests.get(
+                "https://datasets-server.huggingface.co/splits",
+                params={"dataset": args.dataset},
+                timeout=10,
+            )
+            response.raise_for_status()
+        except requests.RequestException as e:
+            print(f"Failed to fetch dataset splits for {args.dataset}: {e}")
+            return
🧹 Nitpick comments (6)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (6)

270-273: Sanitize filename components and use Path joins

Dataset names contain “/” and subset may be None. Current concatenation creates nested dirs and “None” in filenames.

Apply this diff:

-            json_file_path = args.output_dir + "/" + name + "_" + subset + "_" + split + ".jsonl"
-            dataset.to_json(json_file_path)
-            args.input_path += [json_file_path]
+            safe_name = name.replace("/", "__")
+            safe_subset = subset or "default"
+            json_file_path = Path(args.output_dir) / f"{safe_name}_{safe_subset}_{split}.jsonl"
+            dataset.to_json(str(json_file_path))
+            args.input_path.append(str(json_file_path))

119-124: Close resources deterministically (file + Pool)

Open file handle and multiprocessing.Pool aren’t context-managed. Use context managers to avoid leaks on exceptions.

Apply this diff:

-        print("Opening", input_file_name)
-        fin = open(input_file_name, encoding="utf-8")
-
-        pool = multiprocessing.Pool(self.workers, initializer=encoder.initializer)
-        encoded_docs = pool.imap(encoder.encode, fin, 32)
+        print("Opening", input_file_name)
+        with open(input_file_name, encoding="utf-8") as fin, \
+             multiprocessing.Pool(self.workers, initializer=encoder.initializer) as pool:
+            encoded_docs = pool.imap(encoder.encode, fin, 32)
@@
-        fin.close()
-        for key in builders:
+        for key in builders:
             builders[key].finalize(output_idx_files[key])
 
         return final_enc_len

Also applies to: 153-156


192-194: mkdir should create parent dirs

Use parents=True to handle nested output paths.

Apply this diff:

-    Path(output_dir).mkdir(exist_ok=True)
+    Path(output_dir).mkdir(parents=True, exist_ok=True)

67-69: Avoid double-loading the tokenizer (performance/cold-start)

Tokenizer is loaded once to read vocab_size and again in worker initializers. Caching helps, but startup is still duplicated. Consider:

  • Load once in parent, read vocab_size, and pass tokenizer_name_or_path plus env var TRANSFORMERS_OFFLINE=1 or local_files_only=True in workers; or
  • Add an optional vocab_size arg to bypass the first load (derive from tokenizer config if available).

Also applies to: 193-195
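A minimal sketch of the single-load idea; the module-level cache stands in for the plugin's worker-side tokenizer attribute, and names here are illustrative:

```python
from transformers import AutoTokenizer

_WORKER_TOKENIZER = None  # stand-in for the encoder's per-worker tokenizer


def get_vocab_size(tokenizer_name_or_path: str) -> int:
    # Parent process: single authoritative load, used only to read vocab_size.
    return AutoTokenizer.from_pretrained(tokenizer_name_or_path).vocab_size


def pool_initializer(tokenizer_name_or_path: str) -> None:
    # Worker processes: rebuild from the files the parent load already cached locally.
    global _WORKER_TOKENIZER
    _WORKER_TOKENIZER = AutoTokenizer.from_pretrained(
        tokenizer_name_or_path, local_files_only=True
    )
```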


61-63: Heuristic max_document_length (8x) — confirm intent

Hardcoding max_document_length = max_sequence_length * 8 is opaque. Either document the rationale or expose it as a CLI arg (e.g., --document_length_multiplier).
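If the CLI-arg route is taken, the flag could look roughly like this; --document_length_multiplier is the hypothetical name from the comment above:

```python
import argparse

parser = argparse.ArgumentParser(prog="megatron_preprocess_data")
parser.add_argument("--max_sequence_length", type=int, default=None)
parser.add_argument(
    "--document_length_multiplier",
    type=int,
    default=8,  # preserves the current 8x heuristic as the default
    help="max_document_length = max_sequence_length * this multiplier",
)
args = parser.parse_args()
max_document_length = (
    args.max_sequence_length * args.document_length_multiplier
    if args.max_sequence_length is not None
    else None
)
```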


261-266: Key validation against dataset columns can be more robust

Some datasets have nested or dict features; dataset.features may not reflect flattened column names. Prefer checking key in dataset.column_names and optionally warn with available columns.

Apply this diff:

-            for key in args.json_keys:
-                if key not in dataset.features:
+            for key in args.json_keys:
+                if key not in dataset.column_names:
                     print(f"Key {key} not found in dataset features. Skipping...")
                     skip_processing = True
                     break
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1f1bbb and 9e7c1a4.

📒 Files selected for processing (2)
  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (3 hunks)
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)
modelopt/onnx/quantization/__main__.py (1)
  • main (255-294)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality

Comment on lines +225 to +226
parser.add_argument("--split", type=str, default="train", help="Hugging Face Hub dataset split")
parser.add_argument(

⚠️ Potential issue

Fix default --split: make it None to process all splits when omitted

Current default "train" contradicts the PR description (“all subsets and splits if subset/split omitted”) and filters to train-only. Set default to None and keep the existing filter logic.

Apply this diff:

-    parser.add_argument("--split", type=str, default="train", help="Hugging Face Hub dataset split")
+    parser.add_argument(
+        "--split",
+        type=str,
+        default=None,
+        help="Hugging Face Hub dataset split (process all if omitted)",
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    parser.add_argument("--split", type=str, default="train", help="Hugging Face Hub dataset split")
-    parser.add_argument(
+    parser.add_argument(
+        "--split",
+        type=str,
+        default=None,
+        help="Hugging Face Hub dataset split (process all if omitted)",
+    )
+    parser.add_argument(
🤖 Prompt for AI Agents
In modelopt/torch/utils/plugins/megatron_preprocess_data.py around lines
225-226, the --split argument currently defaults to "train" which incorrectly
limits processing to the train split; change the parser.add_argument for
"--split" to use default=None so that when omitted the existing logic will
process all splits/subsets, leaving the current filter logic unchanged.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (4)

231-231: Nice fix: --append_eod uses store_true.


225-226: Default --split should be None to process all splits when omitted (matches PR description).

-    parser.add_argument("--split", type=str, default="train", help="Hugging Face Hub dataset split")
+    parser.add_argument(
+        "--split",
+        type=str,
+        default=None,
+        help="Hugging Face Hub dataset split (process all if omitted)",
+    )

239-241: Create output_dir before writing JSONL files.

Writing JSONL happens before megatron_preprocess_data() creates the dir; this can fail.

     if args.input_path is None:
         args.input_path = []
+        Path(args.output_dir).mkdir(parents=True, exist_ok=True)

275-277: Filename/path bug: dataset names contain “/” and subset can be None.

Current string concat can create unintended subdirs or “None” in filenames and fail. Sanitize components and use Path join; also write JSONL explicitly.

-            json_file_path = args.output_dir + "/" + name + "_" + subset + "_" + split + ".jsonl"
-            dataset.to_json(json_file_path)
-            args.input_path += [json_file_path]
+            safe_name = name.replace("/", "__")
+            safe_subset = subset or "default"
+            json_file_path = Path(args.output_dir) / f"{safe_name}_{safe_subset}_{split}.jsonl"
+            dataset.to_json(str(json_file_path), lines=True)
+            args.input_path.append(str(json_file_path))
🧹 Nitpick comments (4)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (4)

40-41: Make requests/datasets optional: move imports inside main().

Avoid forcing library users to install optional CLI deps. Lazy-import inside main().

-import requests
-from datasets import load_dataset
@@
 def main():
@@
-    parser = argparse.ArgumentParser(prog="megatron_preprocess_data")
+    # Lazy imports to keep optional CLI deps out of library import path
+    import requests
+    from datasets import load_dataset
+
+    parser = argparse.ArgumentParser(prog="megatron_preprocess_data")

Also applies to: 206-216


242-253: Harden JSON parsing for splits; guard missing/empty key.

Avoid KeyError if API shape changes and short‑circuit if no splits.

-        for entry in response.json()["splits"]:
+        data = response.json()
+        splits = data.get("splits", [])
+        if not splits:
+            print(f"No splits found for {args.dataset}.")
+            return
+        for entry in splits:

263-266: Wrap load_dataset with error handling to continue on bad splits/configs.

-            print(f"Loading dataset {name} with subset {subset} and split {split}")
-            dataset = load_dataset(name, subset, split=split)
+            print(f"Loading dataset {name} with subset {subset} and split {split}")
+            try:
+                dataset = load_dataset(name, subset, split=split)
+            except Exception as e:
+                print(f"Failed to load {name} (subset={subset}, split={split}): {e}")
+                continue

239-289: Optional: rely on datasets API to enumerate configs/splits and drop the raw HTTP call.

Using datasets.get_dataset_config_names / get_dataset_split_names removes the requests dependency and one external endpoint.

If you want, I can propose a minimal patch replacing the HTTP call with datasets helpers.
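For reference, a minimal sketch of that datasets-only enumeration (an illustration of the idea, not the proposed patch itself):

```python
from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset


def iter_dataset_splits(dataset_name: str, subset=None, split=None):
    """Yield (config, split, Dataset) triples without calling the datasets-server HTTP API."""
    for config in get_dataset_config_names(dataset_name):
        if subset is not None and subset != config:
            continue
        for split_name in get_dataset_split_names(dataset_name, config):
            if split is not None and split != split_name:
                continue
            yield config, split_name, load_dataset(dataset_name, config, split=split_name)
```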

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e7c1a4 and 2caf0fa.

📒 Files selected for processing (1)
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)

34-34: LGTM: argparse import is appropriate for the new CLI.

Comment on lines +40 to +41
import requests
from datasets import load_dataset

⚠️ Potential issue

🧩 Analysis chain

Ensure packaging lists new dependencies.

requests and datasets are new runtime deps for the CLI. Verify they’re declared (pyproject/requirements/extras).

Also applies to: 206-289


🏁 Script executed:

#!/bin/bash
# Verify packaging declares 'datasets' and 'requests'
rg -nP -C2 '^(datasets|requests)\b' -g 'pyproject.toml' -g 'requirements*.txt' -g 'setup.cfg' -g 'setup.py'

Length of output: 1129


Add 'requests' and 'datasets' to project packaging/runtime deps

modelopt/torch/utils/plugins/megatron_preprocess_data.py imports requests and datasets (lines 40–41; also referenced ~206–289). Search shows only 'datasets' in examples/*/requirements.txt and no top-level declaration for either. Add both to pyproject.toml (or setup.cfg / setup.py) or the CLI extras so the CLI runtime installs them.

🤖 Prompt for AI Agents
In modelopt/torch/utils/plugins/megatron_preprocess_data.py around lines 40-41
(and usages ~206-289), the module imports and uses the third-party packages
'requests' and 'datasets' but they are not declared in the top-level project
dependencies; add both 'requests' and 'datasets' to the project's
packaging/runtime dependencies (e.g., pyproject.toml [project]/dependencies or
setup.cfg/setup.py install_requires) or include them in the CLI extras so that
the CLI runtime installs them; pick appropriate minimal version constraints
consistent with examples/*/requirements.txt (or mirror examples) and run
dependency install/tests to verify imports resolve.

@ChenhanYu ChenhanYu enabled auto-merge (squash) September 19, 2025 03:51
@ChenhanYu ChenhanYu merged commit fb875ac into main Sep 19, 2025
22 checks passed
@ChenhanYu ChenhanYu deleted the chenhany/mcore_run_config_yml branch September 19, 2025 05:06

if dist.is_master():
    run_config_name = f"{checkpoint_name}/modelopt_run_config.yaml"
    config_dict = _parse_transformer_config(copy.deepcopy(model[0].config.__dict__))

@ananthsub ananthsub Sep 19, 2025

@ChenhanYu I'm seeing an issue with the deepcopy

these fields are set on the config during training:
https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/8bb2309034bf2eb27783293cd228d86407db1cf9/src/megatron/bridge/training/setup.py#L267C5-L295

https://github.com/NVIDIA/Megatron-LM/blob/c223178a33e239312c3afd81a8adc27cd9ca698c/megatron/training/training.py#L2038-L2057

These are bound methods on DDP class instances, which have PyTorch process groups initialized as member variables. The deepcopy here recurses into those class instances, causing issues like this:

b-pretrain/0 [rank0]:   File "/opt/TensorRT-Model-Optimizer/modelopt/torch/opt/plugins/mcore_dist_checkpointing.py", line 157, in save_sharded_modelopt_state
b-pretrain/0 [rank0]:     config_dict = _parse_transformer_config(copy.deepcopy(model[0].config.__dict__))
b-pretrain/0 [rank0]:                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 226, in _deepcopy_method
b-pretrain/0 [rank0]:     return type(x)(x.__func__, deepcopy(x.__self__, memo))
b-pretrain/0 [rank0]:                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]:     y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]:     state = deepcopy(state, memo)
b-pretrain/0 [rank0]:             ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 196, in _deepcopy_list
b-pretrain/0 [rank0]:     append(deepcopy(a, memo))
b-pretrain/0 [rank0]:            ^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]:     y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]:     state = deepcopy(state, memo)
b-pretrain/0 [rank0]:             ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]:     y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]:     state = deepcopy(state, memo)
b-pretrain/0 [rank0]:             ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]:     y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]:     state = deepcopy(state, memo)
b-pretrain/0 [rank0]:             ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]:     y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]:     state = deepcopy(state, memo)
b-pretrain/0 [rank0]:             ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]:     y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]:     state = deepcopy(state, memo)
b-pretrain/0 [rank0]:             ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]:     y = copier(x, memo)
b-pretrain/0 [rank0]:         ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]:                              ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]:   File "/usr/lib/python3.12/copy.py", line 151, in deepcopy
b-pretrain/0 [rank0]:     rv = reductor(4)
b-pretrain/0 [rank0]:          ^^^^^^^^^^^
b-pretrain/0 [rank0]: TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object

In Megatron-Bridge, we have a YAML representer for this to serialize our configs to YAML:

https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/8bb2309034bf2eb27783293cd228d86407db1cf9/src/megatron/bridge/utils/yaml_utils.py#L24-L87

So we don't see this when serializing our overall job config to YAML (which includes the transformer config):

https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/8bb2309034bf2eb27783293cd228d86407db1cf9/src/megatron/bridge/training/checkpointing.py#L653
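For context, a generic PyYAML fallback representer looks roughly like this; it illustrates the approach and is not the Megatron-Bridge implementation linked above:

```python
import yaml


def _fallback_representer(dumper: yaml.Dumper, data):
    # Serialize anything without a dedicated representer (process groups, bound
    # methods, ...) as its string form instead of failing or deepcopying it.
    return dumper.represent_str(str(data))


# Applies to any object type that has no more specific representer registered.
yaml.add_multi_representer(object, _fallback_representer)
```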

Collaborator Author

I see. The deepcopy here is likely not necessary. Let me remove it.
