NVIDIA-NeMo
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
Lines changed: 4 additions & 3 deletions
diff --git a/README.md b/README.md
Lines changed: 2 additions & 0 deletions
diff --git a/docs/guides/dataset-overview.md b/docs/guides/dataset-overview.md
Lines changed: 136 additions & 11 deletions
diff --git a/docs/guides/fp8-training.md b/docs/guides/fp8-training.md
Lines changed: 6 additions & 1 deletion
diff --git a/docs/guides/llm/functiongemma-peft-loss.png b/docs/guides/llm/functiongemma-peft-loss.png
39.1 KB
diff --git a/docs/guides/llm/functiongemma-sft-loss.png b/docs/guides/llm/functiongemma-sft-loss.png
37.5 KB
@@ -2,6 +2,7 @@
 docker/ @nvidia-nemo/automation
 pyproject.toml @nvidia-nemo/automation
 
-nemo_automodel @akoumpa @HuiyingLi @adil-a @hemildesai
-examples @akoumpa @HuiyingLi @adil-a @hemildesai
-README.md @akoumpa @HuiyingLi
+docs @akoumpa @jgerh
+nemo_automodel @akoumpa @HuiyingLi @adil-a @hemildesai @ybabakhin @shan-nvidia @rnyak @oliverholworthy @gabrielspmoreira
+examples @akoumpa @HuiyingLi @adil-a @hemildesai @ybabakhin @shan-nvidia @rnyak @oliverholworthy @gabrielspmoreira
+README.md @akoumpa @HuiyingLi @snowmanwwg
@@ -20,6 +20,8 @@
 </div>
 
 ## 📣 News and Discussions
+- [12/18/2025][FunctionGemma](https://huggingface.co/google/functiongemma-270m-it) is out! Finetune it with [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guides/llm/toolcalling.md)!
+- [12/15/2025][NVIDIA-Nemotron-3-Nano-30B-A3B](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8) is out! Finetune it with [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel/discussions/976)!
 - [11/6/2025][Accelerating Large-Scale Mixture-of-Experts Training in PyTorch](https://developer.nvidia.com/blog/accelerating-large-scale-mixture-of-experts-training-in-pytorch/)
 - [10/6/2025][Enabling PyTorch Native Pipeline Parallelism for 🤗 Hugging Face Transformer Models](https://github.com/NVIDIA-NeMo/Automodel/discussions/589)
 - [9/22/2025][Fine-tune Hugging Face Models Instantly with Day-0 Support with NVIDIA NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel/discussions/477)
 
@@ -1,6 +1,6 @@
 # Dataset Overview: LLM and VLM Datasets in NeMo Automodel
 
-This page summarizes the datasets already supported in NeMo Automodel for LLM and VLM, and shows how to plug in your own datasets via simple Python functions or purely through YAML using the `_target_` mechanism.
+This page summarizes the datasets supported in NeMo Automodel for LLM and VLM and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.
 
 - See also: [LLM datasets](llm/dataset.md) and [VLM datasets](vlm/dataset.md) for deeper, task-specific guides.
 
@@ -23,7 +23,7 @@ dataset:
   split: train
 ```
 
-- **SQuAD-style QA (instruction SFT)**
+- **SQuAD-style Question Answering (QA) (instruction SFT)**
   - Factory: `nemo_automodel.components.datasets.llm.squad.make_squad_dataset`
   - Use case: instruction/QA tuning with either prompt+answer formatting or chat-template formatting
   - Notes:
@@ -57,7 +57,133 @@ dataset:
   answer_only_loss_mask: true
   start_of_turn_token: "<|assistant|>"
 ```
-  - See the detailed guide, [Column-Mapped Text Instruction Dataset](llm/column-mapped-text-instruction-dataset.md), for more information.
+See the detailed guide, [Column-Mapped Text Instruction Dataset](llm/column-mapped-text-instruction-dataset.md), for more information.
+
+- **ChatDataset (multi-turn conversations and tool calling)**
+  - Class: `nemo_automodel.components.datasets.llm.ChatDataset`
+  - Use case: multi-turn conversations and tool calling in OpenAI chat format
+  - Sources: local JSON/JSONL or Hugging Face Hub dataset ID
+  - Key args:
+    - `path_or_dataset_id`: path to local file(s) or Hugging Face dataset ID
+    - `tokenizer`: tokenizer instance (required; must have chat template support)
+    - `split`: dataset split (e.g., "train", "validation")
+    - `name`: dataset configuration/subset name
+    - `seq_length`: maximum sequence length for padding/truncation
+    - `padding`: padding strategy ("do_not_pad", "max_length", etc.)
+    - `truncation`: truncation strategy ("do_not_truncate", "longest_first", etc.)
+    - `start_of_turn_token`: token marking assistant response start (for answer-only loss)
+    - `chat_template`: optional override for tokenizer's chat template
+  - Notes:
+    - Requires a tokenizer with chat template support
+    - Supports both single-turn and multi-turn tool calling
+    - Tool definitions are provided in a `tools` field at the conversation level
+    - Tool calls appear in assistant messages via the `tool_calls` field
+    - Tool responses use the `tool` role
+  - Example YAML:
+```yaml
+dataset:
+  _target_: nemo_automodel.components.datasets.llm.ChatDataset
+  path_or_dataset_id: Salesforce/xlam-function-calling-60k
+  split: train
+  tokenizer:
+    _target_: transformers.AutoTokenizer.from_pretrained
+    pretrained_model_name_or_path: google/functiongemma-270m-it
+  seq_length: 2048
+  start_of_turn_token: "<start_of_turn>"
+```
+  - Expected data format (OpenAI messages format):
+```json
+{
+  "messages": [
+    {
+      "role": "user",
+      "content": "What's the weather in Seattle?"
+    },
+    {
+      "role": "assistant",
+      "content": "",
+      "tool_calls": [
+        {
+          "id": "call_1",
+          "type": "function",
+          "function": {
+            "name": "get_weather",
+            "arguments": "{\"city\": \"Seattle\"}"
+          }
+        }
+      ]
+    },
+    {
+      "role": "tool",
+      "tool_call_id": "call_1",
+      "content": "{\"temperature\": 65, \"condition\": \"cloudy\"}"
+    },
+    {
+      "role": "assistant",
+      "content": "It's 65°F and cloudy in Seattle."
+    }
+  ],
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "get_weather",
+        "description": "Get current weather for a city",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "city": {"type": "string"}
+          },
+          "required": ["city"]
+        }
+      }
+    }
+  ]
+}
+```
+  - For single-turn tool calling (one tool call per conversation), omit the tool response and final assistant message:
+```json
+{
+  "messages": [
+    {
+      "role": "user",
+      "content": "Book a table for two at 7pm in Seattle."
+    },
+    {
+      "role": "assistant",
+      "content": "",
+      "tool_calls": [
+        {
+          "id": "call_1",
+          "type": "function",
+          "function": {
+            "name": "book_table",
+            "arguments": "{\"party_size\": 2, \"time\": \"19:00\", \"city\": \"Seattle\"}"
+          }
+        }
+      ]
+    }
+  ],
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "book_table",
+        "description": "Book a restaurant table",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "party_size": {"type": "integer"},
+            "time": {"type": "string"},
+            "city": {"type": "string"}
+          }
+        }
+      }
+    }
+  ]
+}
+```
+See the [Function Calling guide](llm/toolcalling.md) for an end-to-end example with FunctionGemma.
 
 - **NanoGPT Binary Shards (pretraining)**
   - Class: `nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset`
@@ -69,7 +195,7 @@ dataset:
 - **Megatron (pretraining; interoperable with pre-tokenized Megatron data)**
   - Class: `nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining`
   - Use case: large-scale LM pretraining over Megatron-LM formatted tokenized corpora
-  - Interoperability: if your corpus has already been tokenized/indexed for Megatron (i.e., `.bin`/`.idx` pairs), you can point Automodel to those assets directly; no re-tokenization required
+  - Interoperability: If your corpus has already been tokenized/indexed for Megatron (i.e., `.bin`/`.idx` pairs), you can point Automodel to those assets directly. No re-tokenization is required.
   - Key args: `paths` (single path, glob, weighted list, or per-split dict), `seq_length`, `tokenizer`, `split`, `index_mapping_dir`, `splits_to_build`
   - Example YAML:
 ```yaml
@@ -84,9 +210,7 @@ dataset:
   split: "0.99, 0.01, 0.00"  # train, validation, test
   splits_to_build: "train"
 ```
- - See the detailed pretraining guide, [Megatron MCore Pretraining](llm/mcore-pretraining.md), which uses MegatronPretraining data.
-
-> ⚠️ Note: Multi-turn conversational and tool-calling/function-calling dataset support is coming soon.
+See the detailed [pretraining guide](llm/pretraining.md), which uses MegatronPretraining data.
 
 ## Packed Sequence Support
 To reduce padding and improve throughput with variable-length sequences:
@@ -111,9 +235,10 @@ VLM datasets are represented as conversations (message lists) that combine text
 
 Built-in dataset makers (return lists of `conversation` dicts):
 - **RDR items**: `nemo_automodel.components.datasets.vlm.datasets.make_rdr_dataset` (HF: `quintend/rdr-items`)
-- **CORD-V2 receipts**: `nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset` (HF: `naver-clova-ix/cord-v2`)
-- **MedPix-VQA (medical)**: `nemo_automodel.components.datasets.vlm.datasets.make_medpix_dataset`
-- **CommonVoice 17 (audio)**: `nemo_automodel.components.datasets.vlm.datasets.make_cv17_dataset`
+- **CORD-V2 receipts (Consolidated Receipt Dataset for Post-OCR Parsing)**: `nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset` (HF: `naver-clova-ix/cord-v2`)
+- **MedPix-VQA (Medical Pixel Question Answering)**: `nemo_automodel.components.datasets.vlm.datasets.make_medpix_dataset`
+- **CommonVoice 17 (CV17) (audio)**: `nemo_automodel.components.datasets.vlm.datasets.make_cv17_dataset`
+
 
 Each example follows the conversation schema expected by `apply_chat_template`, e.g.:
 ```python
@@ -188,7 +313,7 @@ dataset:
 Where `build_my_dataset` returns either a `datasets.Dataset` or a list/iterator of conversation dicts (for VLM).
 
 ### 3) Use ColumnMappedTextInstructionDataset for most instruction datasets (LLM)
-- Ideal when your data has columns like `instruction`, `input`, `output` but with arbitrary names
+- Ideal when your data has columns like `instruction`, `input`, or `output` but with arbitrary names
 - Supports local JSON/JSONL and HF Hub
 ```yaml
 dataset:
 
@@ -93,7 +93,12 @@ FP8 quantization provides measurable performance improvements while maintaining
 - **Convergence**: FP8 training achieves loss parity with BF16 training.
 - **Memory**: FP8 training uses memory on par with the BF16 baseline.
 
-<img src="fp8_convergence.jpg" alt="FP8 Convergence Comparison" width="600px" />
+```{image} fp8_convergence.jpg
+:alt: FP8 Convergence Comparison
+:class: bg-primary
+:width: 600px
+:align: center
+```
 
 *Figure: Loss curves comparing FP8 tensorwise scaling + torch.compile vs. BF16 + torch.compile training on 8xH100 with 8k sequence length, demonstrating virtually identical convergence behavior with 1.24x speedup*
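
Note on the `ChatDataset` record format added in `docs/guides/dataset-overview.md` above: before pointing the YAML at a local JSON/JSONL file, it may help to sanity-check each record. The sketch below is not part of NeMo Automodel. `validate_chat_record` is a hypothetical helper, and the rules it enforces (tool names declared under `tools`, `arguments` being a JSON-encoded string, every `tool_call_id` matching an earlier call) are assumptions read off the two JSON examples in the diff, not documented requirements of the class.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_chat_record(record):
    """Return a list of problems found in one OpenAI-style chat record.

    Hypothetical helper: the checks mirror the examples in the diff and are
    assumptions, not rules enforced by NeMo Automodel itself.
    """
    problems = []
    # Tool names declared at the conversation level.
    declared = {tool["function"]["name"] for tool in record.get("tools", [])}
    seen_call_ids = set()  # ids of tool calls issued so far

    for i, msg in enumerate(record.get("messages", [])):
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")

        for call in msg.get("tool_calls", []):
            fn = call["function"]
            if fn["name"] not in declared:
                problems.append(f"message {i}: undeclared tool {fn['name']!r}")
            try:
                # In the examples, arguments is a JSON-encoded string.
                json.loads(fn["arguments"])
            except (TypeError, ValueError):
                problems.append(f"message {i}: arguments are not valid JSON")
            seen_call_ids.add(call["id"])

        # A tool response should answer a call that appeared earlier.
        if role == "tool" and msg.get("tool_call_id") not in seen_call_ids:
            problems.append(f"message {i}: tool_call_id matches no earlier call")

    return problems


example = {
    "messages": [
        {"role": "user", "content": "What's the weather in Seattle?"},
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [
                {
                    "id": "call_1",
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        "arguments": '{"city": "Seattle"}',
                    },
                }
            ],
        },
        {
            "role": "tool",
            "tool_call_id": "call_1",
            "content": '{"temperature": 65, "condition": "cloudy"}',
        },
        {"role": "assistant", "content": "It's 65°F and cloudy in Seattle."},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

print(validate_chat_record(example))  # prints [] for the record above
```

For a JSONL file, the same function could be applied to `json.loads(line)` for each line. A single-turn record (no `tool` message and no final assistant message) passes the same checks unchanged.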