Conversation
```python
# Executor type
parser.add_argument(
    "--executor",
    type=str,
    default="local",
    choices=["local", "slurm"],
    help="Executor type to use"
)
```
We shouldn't have this since we don't support Slurm, right?
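If Slurm really isn't supported, a minimal sketch of the trimmed-down flag (assuming nothing else depends on the `slurm` choice):

```python
# Sketch: only advertise the executor we actually support. Dropping "slurm"
# from choices makes argparse reject it with a clear error message.
parser.add_argument(
    "--executor",
    type=str,
    default="local",
    choices=["local"],  # re-add "slurm" once it is actually supported
    help="Executor type to use",
)
```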
```python
# Model configuration
parser.add_argument(
    "--model",
    type=str,
    default="qwen2_7b",
    choices=["qwen2_7b"],  # Add more models as needed
    help="Model type to use"
)
```
This needs to be the same between import_ckpt and train, right? Is it better to explicitly set it in run.sh and add a comment?
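One hypothetical way to keep the two in sync without relying on run.sh (module and constant names here are made up for illustration):

```python
# model_config.py -- hypothetical shared module.
# Both the checkpoint-import entry point and the training entry point import
# this constant, so the model type cannot drift between the two steps.
DEFAULT_MODEL = "qwen2_7b"

# In each script:
# from model_config import DEFAULT_MODEL
# parser.add_argument("--model", type=str, default=DEFAULT_MODEL, choices=[DEFAULT_MODEL])
```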
```python
training_project = definitions.TrainingProject(
    name="Nemo-qwen2.5-nemo 1node",
    job=training_job
)
```
(no newline at end of file)
Can you add a basic linter to this repo and clean up all the newline ends and formatting in a separate PR? Check with Nico what's preferred for our repos.
| print(f"Number of GPUs: {torch.cuda.device_count()}") | ||
|
|
||
| ### Dataset | ||
| from data import BespokeDataModule |
Is there a reason why we need to define this custom data class instead of relying on the pre-defined Hugging Face dataset one?
I tried using the following instead, but the run failed with a missing `num_micro_batches` (or something like that):

```python
return run.Config(
    HFDatasetDataModule,
    path_or_dataset='bespokelabs/Bespoke-Stratos-17k',
    seq_length=seq_length,
    micro_batch_size=micro_batch_size,
    global_batch_size=global_batch_size,
    num_workers=num_workers,
)
```

If you want to try, you can replace line 15 in this file with something similar to the above.
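If someone wants to dig into that failure, a quick way to see which batch-related arguments this NeMo version actually expects (the import path here is an assumption; adjust to wherever `HFDatasetDataModule` lives in the installed version):

```python
import inspect

# Import path assumed, not verified against the installed NeMo version.
from nemo.collections.llm.gpt.data.hf_dataset import HFDatasetDataModule

# Print the constructor signature to check whether num_micro_batches (or a
# similarly named argument) is required.
print(inspect.signature(HFDatasetDataModule.__init__))
```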
```python
    resume_if_exists=True,
)


def configure_finetuning_recipe(args):
```
Why define this recipe from scratch instead of re-using the default NeMo recipe?
It is using `llm.finetune` from NeMo; are you referring to something else?
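For comparison, a sketch of what starting from a prebuilt recipe might look like, assuming the installed NeMo version ships a `qwen2_7b` recipe module (names unverified):

```python
import nemo_run as run
from nemo.collections import llm

# Sketch: take NeMo's stock fine-tuning recipe as the baseline and override
# only what differs, instead of assembling trainer/model/data configs by hand.
recipe = llm.qwen2_7b.finetune_recipe(  # assumes this recipe module exists
    name="qwen2_finetune",
    num_nodes=1,
    num_gpus_per_node=8,
)
run.run(recipe, executor=run.LocalExecutor())
```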
```python
    return run.Config(llm.Qwen2Model, config=run.Config(llm.Qwen2Config7B))


# Configure the resume
def resume(model_id: str = "Qwen/Qwen2.5-7B-Instruct") -> run.Config[nl.AutoResume]:
```
Have we tested that this works with the cache? If so, can you add a comment about the expected behavior?
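For context, a sketch of the kind of comment being asked for; the `nemo://` restore path follows the NeMo 2.0 pattern for previously imported HF checkpoints, but the cache behavior described here is an assumption to verify, not documented fact:

```python
def resume(model_id: str = "Qwen/Qwen2.5-7B-Instruct") -> run.Config[nl.AutoResume]:
    # Expected behavior (to verify): if the HF checkpoint was already imported
    # via import_ckpt, AutoResume should restore from the local NeMo cache
    # rather than re-downloading the model from Hugging Face.
    return run.Config(
        nl.AutoResume,
        restore_config=run.Config(nl.RestoreConfig, path=f"nemo://{model_id}"),
        resume_if_exists=True,
    )
```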
```python
def prepare_data(self) -> None:
    # if train file is specified, no need to do anything
    if not self.train_path.exists() or self.force_redownload:
```
This is a bit error-prone for partial download failures. I'd just delegate to the Hugging Face `load_dataset` logic and rely on it to skip the download.
Proof:
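A minimal sketch of that delegation (hypothetical; assumes the module keeps the `force_redownload` flag from the snippet above and uses the dataset name mentioned earlier in this thread):

```python
from datasets import load_dataset

def prepare_data(self) -> None:
    # Let Hugging Face's own caching decide whether to download: load_dataset
    # reuses a complete local cache and re-fetches incomplete downloads, so no
    # manual train_path.exists() check is needed.
    self.dataset = load_dataset(
        "bespokelabs/Bespoke-Stratos-17k",
        download_mode="force_redownload" if self.force_redownload else None,
    )
```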