Conversation
```python
# Executor type
parser.add_argument(
    "--executor",
    type=str,
    default="local",
    choices=["local", "slurm"],
    help="Executor type to use"
)
```
We shouldn't have this since we don't support Slurm, right?
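If Slurm really isn't supported, a minimal sketch of the trimmed-down flag (assuming nothing else depends on the `slurm` choice):

```python
# Sketch: only advertise the executor we actually support. Dropping "slurm"
# from choices makes argparse reject it with a clear error message.
parser.add_argument(
    "--executor",
    type=str,
    default="local",
    choices=["local"],  # re-add "slurm" once it is actually supported
    help="Executor type to use",
)
```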
```python
# Model configuration
parser.add_argument(
    "--model",
    type=str,
    default="qwen2_7b",
    choices=["qwen2_7b"],  # Add more models as needed
    help="Model type to use"
)
```
This needs to be the same between import_ckpt and train, right? Is it better to explicitly set it in run.sh and add a comment?
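One hypothetical way to keep the two in sync without relying on run.sh (module and constant names here are made up for illustration):

```python
# model_config.py -- hypothetical shared module.
# Both the checkpoint-import entry point and the training entry point import
# this constant, so the model type cannot drift between the two steps.
DEFAULT_MODEL = "qwen2_7b"

# In each script:
# from model_config import DEFAULT_MODEL
# parser.add_argument("--model", type=str, default=DEFAULT_MODEL, choices=[DEFAULT_MODEL])
```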
```python
training_project = definitions.TrainingProject(
    name="Nemo-qwen2.5-nemo 1node",
    job=training_job
)
```
(no newline at end of file)
Can you add a basic linter to this repo and clean up all the newline ends and formatting in a separate PR? Check with Nico what's preferred for our repos.
| print(f"Number of GPUs: {torch.cuda.device_count()}") | ||
|
|
||
| ### Dataset | ||
| from data import BespokeDataModule |
Is there a reason why we need to define this custom data class instead of relying on the pre-defined Hugging Face dataset one?
I tried using the following instead, but the run failed with a missing `num_micro_batches` (or something like that):

```python
return run.Config(
    HFDatasetDataModule,
    path_or_dataset='bespokelabs/Bespoke-Stratos-17k',
    seq_length=seq_length,
    micro_batch_size=micro_batch_size,
    global_batch_size=global_batch_size,
    num_workers=num_workers,
)
```

If you want to try, you can replace line 15 in this file with something similar to the above.
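If someone wants to dig into that failure, a quick way to see which batch-related arguments this NeMo version actually expects (the import path here is an assumption; adjust to wherever `HFDatasetDataModule` lives in the installed version):

```python
import inspect

# Import path assumed, not verified against the installed NeMo version.
from nemo.collections.llm.gpt.data.hf_dataset import HFDatasetDataModule

# Print the constructor signature to check whether num_micro_batches (or a
# similarly named argument) is required.
print(inspect.signature(HFDatasetDataModule.__init__))
```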
```python
    resume_if_exists=True,
)


def configure_finetuning_recipe(args):
```
Why define this recipe from scratch instead of re-using the default NeMo recipe?
It is using `llm.finetune` from NeMo; are you referring to something else?
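For comparison, a sketch of what starting from a prebuilt recipe might look like, assuming the installed NeMo version ships a `qwen2_7b` recipe module (names unverified):

```python
import nemo_run as run
from nemo.collections import llm

# Sketch: take NeMo's stock fine-tuning recipe as the baseline and override
# only what differs, instead of assembling trainer/model/data configs by hand.
recipe = llm.qwen2_7b.finetune_recipe(  # assumes this recipe module exists
    name="qwen2_finetune",
    num_nodes=1,
    num_gpus_per_node=8,
)
run.run(recipe, executor=run.LocalExecutor())
```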
```python
    return run.Config(llm.Qwen2Model, config=run.Config(llm.Qwen2Config7B))


# Configure the resume
def resume(model_id: str = "Qwen/Qwen2.5-7B-Instruct") -> run.Config[nl.AutoResume]:
```
Have we tested that this works with the cache? If so, can you add a comment about the expected behavior?
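For context, a sketch of the kind of comment being asked for; the `nemo://` restore path follows the NeMo 2.0 pattern for previously imported HF checkpoints, but the cache behavior described here is an assumption to verify, not documented fact:

```python
def resume(model_id: str = "Qwen/Qwen2.5-7B-Instruct") -> run.Config[nl.AutoResume]:
    # Expected behavior (to verify): if the HF checkpoint was already imported
    # via import_ckpt, AutoResume should restore from the local NeMo cache
    # rather than re-downloading the model from Hugging Face.
    return run.Config(
        nl.AutoResume,
        restore_config=run.Config(nl.RestoreConfig, path=f"nemo://{model_id}"),
        resume_if_exists=True,
    )
```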
```python
def prepare_data(self) -> None:
    # if train file is specified, no need to do anything
    if not self.train_path.exists() or self.force_redownload:
```
This is a bit error-prone for partial download failures. I'd just delegate to the Hugging Face `load_dataset` logic and rely on it to skip the download.
Proof:
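A minimal sketch of that delegation (hypothetical; assumes the module keeps the `force_redownload` flag from the snippet above and uses the dataset name mentioned earlier in this thread):

```python
from datasets import load_dataset

def prepare_data(self) -> None:
    # Let Hugging Face's own caching decide whether to download: load_dataset
    # reuses a complete local cache and re-fetches incomplete downloads, so no
    # manual train_path.exists() check is needed.
    self.dataset = load_dataset(
        "bespokelabs/Bespoke-Stratos-17k",
        download_mode="force_redownload" if self.force_redownload else None,
    )
```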