
Commit b32dc1f

akoumpa and jgerh authored
docs: update pretraining docs (#981)
Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: jgerh <[email protected]>
1 parent 6b8367d commit b32dc1f

File tree: 9 files changed, +1178 -1169 lines changed


docs/guides/dataset-overview.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 # Dataset Overview: LLM and VLM Datasets in NeMo Automodel
 
-This page summarizes the datasets already supported in NeMo Automodel for LLM and VLM, and shows how to plug in your own datasets via simple Python functions or purely through YAML using the `_target_` mechanism.
+This page summarizes the datasets already supported in NeMo Automodel for LLM and VLM, and shows how to plug in your own datasets using simple Python functions or directly through YAML using the `_target_` mechanism.
 
 - See also: [LLM datasets](llm/dataset.md) and [VLM datasets](vlm/dataset.md) for deeper, task-specific guides.
 
@@ -84,7 +84,7 @@ dataset:
   split: "0.99, 0.01, 0.00" # train, validation, test
   splits_to_build: "train"
 ```
-- See the detailed pretraining guide, [Megatron MCore Pretraining](llm/mcore-pretraining.md), which uses MegatronPretraining data.
+- See the detailed pretraining guide, [Megatron Core Dataset Pretraining](llm/pretraining.md), which uses MegatronPretraining data.
 
 > ⚠️ Note: Multi-turn conversational and tool-calling/function-calling dataset support is coming soon.
 
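For context on the `_target_` mechanism and the dataset block shown in this diff, the sketch below illustrates the general shape of plugging a custom dataset into a recipe YAML. The module path `my_project.data.build_my_dataset` and its arguments are hypothetical placeholders for illustration, not part of NeMo Automodel.

```yaml
# Hypothetical sketch: `_target_` points at any importable callable that returns
# a dataset; the remaining keys are forwarded to it as keyword arguments.
dataset:
  _target_: my_project.data.build_my_dataset  # placeholder module path
  data_path: /data/my_corpus.jsonl            # placeholder argument
  seq_length: 2048                            # placeholder argument
```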

docs/guides/llm/mcore-pretraining.md

Lines changed: 0 additions & 745 deletions
This file was deleted.

docs/guides/llm/nanogpt-pretraining.md

Lines changed: 469 additions & 0 deletions
Large diffs are not rendered by default.

docs/guides/llm/pretraining.md

Lines changed: 695 additions & 410 deletions
Large diffs are not rendered by default.

docs/guides/overview.md

Lines changed: 8 additions & 8 deletions
@@ -1,8 +1,8 @@
-## Recipes and E2E Examples
+## Recipes and End-to-End Examples
 
 NeMo Automodel is organized around two key concepts: recipes and components.
 
-Recipes are executable scripts configured via YAML files. Each recipe defines its own training and validation loop, orchestrated through a `step_scheduler`. It specifies the model, dataset, loss function, optimizer and scheduler, checkpointing, and distributed training settings—allowing end-to-end training with a single command.
+Recipes are executable scripts configured with YAML files. Each recipe defines its own training and validation loop, orchestrated through a `step_scheduler`. It specifies the model, dataset, loss function, optimizer, scheduler, checkpointing, and distributed training settings—allowing end-to-end training with a single command.
 
 Components are modular, plug-and-play building blocks referenced using the `_target_` field. These include models, datasets, loss functions, and distribution managers. Recipes assemble these components, making it easy to swap them out to change precision, distribution strategy, dataset, or task—without modifying the training loop itself.
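To make the recipe/component split in the paragraphs above easier to picture, here is a hedged skeleton of a recipe YAML: a `step_scheduler` block plus one block per component wired through `_target_`. Apart from `step_scheduler` and `_target_`, which come from the text above, the keys, class paths, and values are illustrative assumptions rather than the exact schema.

```yaml
# Hedged skeleton only; exact sections and keys vary per recipe.
step_scheduler:
  max_steps: 1000                        # placeholder

model:
  _target_: some.module.build_model      # placeholder component target

dataset:
  _target_: some.module.build_dataset    # placeholder component target

optimizer:
  _target_: torch.optim.AdamW            # standard PyTorch optimizer class
  lr: 3.0e-4
```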

@@ -14,7 +14,7 @@ This page maps the ready-to-run recipes found in the `examples/` directory to th
 ## Large Language Models (LLM)
 This section provides practical recipes and configurations for working with large language models across three core workflows: fine-tuning, pretraining, and knowledge distillation.
 
-### Fine-tuning
+### Fine-Tuning
 
 End-to-end fine-tuning recipes for many open models. Each subfolder contains YAML configurations showing task setups (e.g., SQuAD, HellaSwag), precision options (e.g., FP8), and parameter-efficient methods (e.g., LoRA/QLoRA).
 
@@ -30,7 +30,7 @@ Starter configurations and scripts for pretraining with datasets from different
 - Example models: GPT-2 baseline, NanoGPT, DeepSeek-V3, Moonlight 16B TE (Slurm)
 - How-to guides:
   - [LLM pretraining](llm/pretraining.md)
-  - [Pretraining with Megatron-Core datasets](llm/mcore-pretraining.md)
+  - [Pretraining with NanoGPT](llm/nanogpt-pretraining.md)
 
 ### Knowledge Distillation (KD)
 
@@ -49,11 +49,11 @@ Curated configurations for benchmarking different training stacks and settings (
 
 
 ## Vision Language Models (VLM)
-This section provides practical recipes and configurations for working with vision-language models, covering fine-tuning and generation workflows for multimodal tasks.
+This section provides practical recipes and configurations for working with vision language models, covering fine-tuning and generation workflows for multimodal tasks.
 
-### Fine-tuning
+### Fine-Tuning
 
-Vision-language model fine-tuning recipes.
+Fine-tuning recipes for VLMs.
 
 - Folder: [examples/vlm_finetune](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/vlm_finetune)
 - Representative family: Gemma 3 (various configurations)
@@ -73,4 +73,4 @@ WAN 2.2 example for diffusion-based image generation.
 
 ---
 
-If you are new to the project, begin with the [Installation](installation.md) guide. Then, select a recipe category above and follow its linked how-to guide(s). The provided YAML configurations can serve as templates—customize them by adapting model names, datasets, and precision settings to match your specific needs.
+If you are new to the project, begin with the [Installation](installation.md) guide. Then, select a recipe category above and follow its linked how-to guide(s). The provided YAML configurations can serve as templates—customize them by adapting model names, datasets, and precision settings to match your specific needs.

docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -31,8 +31,8 @@ Fine-tune Hugging Face Models Instantly with Day-0 Support with NVIDIA NeMo Auto
 guides/overview.md
 guides/llm/finetune.md
 guides/llm/toolcalling.md
-guides/llm/mcore-pretraining.md
 guides/llm/pretraining.md
+guides/llm/nanogpt-pretraining.md
 guides/llm/sequence-classification.md
 guides/omni/gemma3-3n.md
 ```

examples/llm_pretrain/megatron_pretrain_gpt2.yaml

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 
 
 # To run this recipe, please use the following command:
-# torchrun --nproc-per-node=8 recipes/llm_pretrain/pretrain.py --config recipes/llm_pretrain/megatron_pretrain.yaml
+# torchrun --nproc-per-node=8 examples/llm_pretrain/pretrain.py --config examples/llm_pretrain/megatron_pretrain_gpt2.yaml
 # Adjust --nproc-per-node to the number of GPUs available on your host machine.
 
 
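The updated comment covers a single-node launch. As a rough sketch, scaling the same command to multiple nodes typically adds the standard torchrun rendezvous flags shown below; the node count, hostname, and port are placeholders, and the right values depend on your cluster or job scheduler.

```bash
# Hypothetical 2-node launch (8 GPUs per node); adjust rendezvous settings to your cluster.
torchrun --nnodes=2 --nproc-per-node=8 \
  --rdzv-backend=c10d --rdzv-endpoint=<head-node-host>:29500 \
  examples/llm_pretrain/pretrain.py --config examples/llm_pretrain/megatron_pretrain_gpt2.yaml
```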

examples/llm_pretrain/pretrain.py

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
 from nemo_automodel.recipes.llm.train_ft import TrainFinetuneRecipeForNextTokenPrediction
 
 
-def main(default_config_path="examples/llm/nanogpt_pretrain.yaml"):
+def main(default_config_path="examples/llm_pretrain/nanogpt_pretrain.yaml"):
     """Entry-point for launching NanoGPT-style pre-training.
 
     The script follows the same invocation pattern as *examples/llm_finetune/finetune.py*:
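Since the docstring only names the invocation pattern, here is a hedged sketch of what such an entry point generally looks like: an optional `--config` flag that falls back to the default path, with the result handed to the recipe class imported above. The argument handling and the `setup()` / `run_train_validation_loop()` calls are assumptions for illustration, not a copy of the real script.

```python
import argparse

from nemo_automodel.recipes.llm.train_ft import TrainFinetuneRecipeForNextTokenPrediction


def main(default_config_path="examples/llm_pretrain/nanogpt_pretrain.yaml"):
    # Optional --config override that falls back to the example's default YAML.
    parser = argparse.ArgumentParser(description="NanoGPT-style pre-training (sketch)")
    parser.add_argument("--config", default=default_config_path,
                        help="Path to the recipe YAML configuration")
    args, _ = parser.parse_known_args()

    # Assumed recipe API; method names are illustrative, not verified against the source.
    recipe = TrainFinetuneRecipeForNextTokenPrediction(args.config)
    recipe.setup()
    recipe.run_train_validation_loop()


if __name__ == "__main__":
    main()
```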

tools/nanogpt_data_processor.py

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@
 FineWeb dataset preprocessing script
 
 This tool downloads a dataset from the Hugging Face Hub (default: FineWeb),
-tokenizes the data (default: tiktoken + gpt2), and writes memory-mapped binary shards compatible
+tokenizes the data (default: GPT-2 via transformers.AutoTokenizer), and writes memory-mapped binary shards compatible
 with `BinTokenDataset` for efficient streaming pre-training.
 
 Usage (typical):
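As a rough illustration of the flow the docstring describes (download from the Hugging Face Hub, tokenize with `transformers.AutoTokenizer`, write flat binary token shards), here is a minimal sketch. The dataset id, sample size, uint16 dtype, and header-less shard layout are assumptions for illustration; the real tool defines its own shard format for `BinTokenDataset`.

```python
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer


def write_token_shard(texts, out_path, model_name="gpt2"):
    """Tokenize an iterable of strings and dump token IDs as one flat binary shard.

    The uint16 dtype and header-less layout are illustrative assumptions only.
    """
    tok = AutoTokenizer.from_pretrained(model_name)
    ids = []
    for text in texts:
        ids.extend(tok.encode(text))
        ids.append(tok.eos_token_id)  # mark document boundaries
    np.asarray(ids, dtype=np.uint16).tofile(out_path)


if __name__ == "__main__":
    # Stream a small FineWeb sample so nothing large is downloaded up front.
    ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
    sample = (row["text"] for _, row in zip(range(1000), ds))
    write_token_shard(sample, "shard_000.bin")
```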

0 commit comments
