
Commit 2480cb0

Merge branch 'main' into transformers_v5_rc0

2 parents 429acad + 6960091 · commit 2480cb0

File tree

26 files changed: +1718, -1223 lines changed


.github/CODEOWNERS

Lines changed: 4 additions & 3 deletions
@@ -2,6 +2,7 @@
 docker/ @nvidia-nemo/automation
 pyproject.toml @nvidia-nemo/automation
 
-nemo_automodel @akoumpa @HuiyingLi @adil-a @hemildesai @ybabakhin @shan-nvidia
-examples @akoumpa @HuiyingLi @adil-a @hemildesai @ybabakhin @shan-nvidia
-README.md @akoumpa @HuiyingLi
+docs @akoumpa @jgerh
+nemo_automodel @akoumpa @HuiyingLi @adil-a @hemildesai @ybabakhin @shan-nvidia @rnyak @oliverholworthy @gabrielspmoreira
+examples @akoumpa @HuiyingLi @adil-a @hemildesai @ybabakhin @shan-nvidia @rnyak @oliverholworthy @gabrielspmoreira
+README.md @akoumpa @HuiyingLi @snowmanwwg

docs/guides/dataset-overview.md

Lines changed: 136 additions & 11 deletions
@@ -1,6 +1,6 @@
 # Dataset Overview: LLM and VLM Datasets in NeMo Automodel
 
-This page summarizes the datasets already supported in NeMo Automodel for LLM and VLM, and shows how to plug in your own datasets via simple Python functions or purely through YAML using the `_target_` mechanism.
+This page summarizes the datasets supported in NeMo Automodel for LLM and VLM and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.
 
 - See also: [LLM datasets](llm/dataset.md) and [VLM datasets](vlm/dataset.md) for deeper, task-specific guides.

@@ -23,7 +23,7 @@ dataset:
   split: train
 ```
 
-- **SQuAD-style QA (instruction SFT)**
+- **SQuAD-style Question Answering (QA) (instruction SFT)**
   - Factory: `nemo_automodel.components.datasets.llm.squad.make_squad_dataset`
   - Use case: instruction/QA tuning with either prompt+answer formatting or chat-template formatting
   - Notes:
@@ -57,7 +57,133 @@ dataset:
   answer_only_loss_mask: true
   start_of_turn_token: "<|assistant|>"
 ```
-  - See the detailed guide, [Column-Mapped Text Instruction Dataset](llm/column-mapped-text-instruction-dataset.md), for more information.
+  See the detailed guide, [Column-Mapped Text Instruction Dataset](llm/column-mapped-text-instruction-dataset.md), for more information.
+
+- **ChatDataset (multi-turn conversations and tool calling)**
+  - Class: `nemo_automodel.components.datasets.llm.ChatDataset`
+  - Use case: multi-turn conversations and tool calling in OpenAI chat format
+  - Sources: local JSON/JSONL files or a Hugging Face Hub dataset ID
+  - Key args:
+    - `path_or_dataset_id`: path to local file(s) or Hugging Face dataset ID
+    - `tokenizer`: tokenizer instance (required; must support chat templates)
+    - `split`: dataset split (e.g., "train", "validation")
+    - `name`: dataset configuration/subset name
+    - `seq_length`: maximum sequence length for padding/truncation
+    - `padding`: padding strategy ("do_not_pad", "max_length", etc.)
+    - `truncation`: truncation strategy ("do_not_truncate", "longest_first", etc.)
+    - `start_of_turn_token`: token marking the start of an assistant response (for answer-only loss)
+    - `chat_template`: optional override for the tokenizer's chat template
+  - Notes:
+    - Requires a tokenizer with chat template support
+    - Supports both single-turn and multi-turn tool calling
+    - Tool definitions are provided in a `tools` field at the conversation level
+    - Tool calls appear in assistant messages via the `tool_calls` field
+    - Tool responses use the `tool` role
+  - Example YAML:
+    ```yaml
+    dataset:
+      _target_: nemo_automodel.components.datasets.llm.ChatDataset
+      path_or_dataset_id: Salesforce/xlam-function-calling-60k
+      split: train
+      tokenizer:
+        _target_: transformers.AutoTokenizer.from_pretrained
+        pretrained_model_name_or_path: google/functiongemma-270m-it
+      seq_length: 2048
+      start_of_turn_token: "<start_of_turn>"
+    ```
+  - Expected data format (OpenAI messages format):
+    ```json
+    {
+      "messages": [
+        {
+          "role": "user",
+          "content": "What's the weather in Seattle?"
+        },
+        {
+          "role": "assistant",
+          "content": "",
+          "tool_calls": [
+            {
+              "id": "call_1",
+              "type": "function",
+              "function": {
+                "name": "get_weather",
+                "arguments": "{\"city\": \"Seattle\"}"
+              }
+            }
+          ]
+        },
+        {
+          "role": "tool",
+          "tool_call_id": "call_1",
+          "content": "{\"temperature\": 65, \"condition\": \"cloudy\"}"
+        },
+        {
+          "role": "assistant",
+          "content": "It's 65°F and cloudy in Seattle."
+        }
+      ],
+      "tools": [
+        {
+          "type": "function",
+          "function": {
+            "name": "get_weather",
+            "description": "Get current weather for a city",
+            "parameters": {
+              "type": "object",
+              "properties": {
+                "city": {"type": "string"}
+              },
+              "required": ["city"]
+            }
+          }
+        }
+      ]
+    }
+    ```
+  - For single-turn tool calling (one tool call per conversation), omit the tool response and final assistant message:
+    ```json
+    {
+      "messages": [
+        {
+          "role": "user",
+          "content": "Book a table for two at 7pm in Seattle."
+        },
+        {
+          "role": "assistant",
+          "content": "",
+          "tool_calls": [
+            {
+              "id": "call_1",
+              "type": "function",
+              "function": {
+                "name": "book_table",
+                "arguments": "{\"party_size\": 2, \"time\": \"19:00\", \"city\": \"Seattle\"}"
+              }
+            }
+          ]
+        }
+      ],
+      "tools": [
+        {
+          "type": "function",
+          "function": {
+            "name": "book_table",
+            "description": "Book a restaurant table",
+            "parameters": {
+              "type": "object",
+              "properties": {
+                "party_size": {"type": "integer"},
+                "time": {"type": "string"},
+                "city": {"type": "string"}
+              }
+            }
+          }
+        }
+      ]
+    }
+    ```
+  See the [Function Calling guide](llm/toolcalling.md) for an end-to-end example with FunctionGemma.
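To make the schema in this hunk concrete, the sketch below builds the weather conversation as a plain Python dict and checks its internal consistency: every `tool` message must reference a `tool_call_id` previously issued by an assistant, each called tool must be declared in `tools`, and `arguments` must be a JSON string. The helper `check_tool_conversation` is hypothetical and not part of NeMo Automodel; it only illustrates the format.

```python
import json

def check_tool_conversation(sample: dict) -> list[str]:
    """Return a list of problems found in one conversation dict."""
    problems = []
    known_tools = {t["function"]["name"] for t in sample.get("tools", [])}
    pending_call_ids = set()
    for msg in sample["messages"]:
        if msg["role"] == "assistant":
            for call in msg.get("tool_calls", []):
                fn = call["function"]
                if fn["name"] not in known_tools:
                    problems.append(f"unknown tool: {fn['name']}")
                try:
                    json.loads(fn["arguments"])  # arguments are a JSON string
                except json.JSONDecodeError:
                    problems.append(f"bad arguments for {fn['name']}")
                pending_call_ids.add(call["id"])
        elif msg["role"] == "tool":
            if msg.get("tool_call_id") not in pending_call_ids:
                problems.append("tool response without matching tool_call_id")
    return problems

# The multi-turn weather example from the expected-data-format section above.
sample = {
    "messages": [
        {"role": "user", "content": "What's the weather in Seattle?"},
        {"role": "assistant", "content": "", "tool_calls": [
            {"id": "call_1", "type": "function",
             "function": {"name": "get_weather",
                          "arguments": "{\"city\": \"Seattle\"}"}}]},
        {"role": "tool", "tool_call_id": "call_1",
         "content": "{\"temperature\": 65, \"condition\": \"cloudy\"}"},
        {"role": "assistant", "content": "It's 65°F and cloudy in Seattle."},
    ],
    "tools": [{"type": "function",
               "function": {"name": "get_weather",
                            "description": "Get current weather for a city",
                            "parameters": {"type": "object",
                                           "properties": {"city": {"type": "string"}},
                                           "required": ["city"]}}}],
}
assert check_tool_conversation(sample) == []
```

A quick pre-flight check like this can catch malformed JSONL rows before they reach the tokenizer's chat template.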
 
 - **NanoGPT Binary Shards (pretraining)**
   - Class: `nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset`
@@ -69,7 +195,7 @@ dataset:
 - **Megatron (pretraining; interoperable with pre-tokenized Megatron data)**
   - Class: `nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining`
   - Use case: large-scale LM pretraining over Megatron-LM formatted tokenized corpora
-  - Interoperability: if your corpus has already been tokenized/indexed for Megatron (i.e., `.bin`/`.idx` pairs), you can point Automodel to those assets directly; no re-tokenization required
+  - Interoperability: If your corpus has already been tokenized/indexed for Megatron (i.e., `.bin`/`.idx` pairs), you can point Automodel to those assets directly. No re-tokenization required.
   - Key args: `paths` (single path, glob, weighted list, or per-split dict), `seq_length`, `tokenizer`, `split`, `index_mapping_dir`, `splits_to_build`
   - Example YAML:
     ```yaml
@@ -84,9 +210,7 @@ dataset:
       split: "0.99, 0.01, 0.00" # train, validation, test
       splits_to_build: "train"
     ```
-  - See the detailed pretraining guide, [Megatron MCore Pretraining](llm/mcore-pretraining.md), which uses MegatronPretraining data.
-
-> ⚠️ Note: Multi-turn conversational and tool-calling/function-calling dataset support is coming soon.
+  See the detailed [pretraining guide](llm/pretraining.md), which uses MegatronPretraining data.
 
 ## Packed Sequence Support
 To reduce padding and improve throughput with variable-length sequences:
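The packed-sequence idea mentioned in this hunk can be sketched with a tiny greedy first-fit packer. This is an illustration only, not NeMo Automodel's implementation: it assigns variable-length sequences to fixed-capacity bins so padding waste shrinks; a real trainer would additionally track per-sequence boundaries so attention stays within each original sequence.

```python
def pack_sequences(lengths: list[int], max_len: int) -> list[list[int]]:
    """Greedy first-fit: assign each sequence (given by its token length)
    to the first bin with enough remaining capacity. Returns bins as lists
    of sequence indices."""
    bins: list[list[int]] = []
    room: list[int] = []  # remaining capacity per bin
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"sequence {idx} longer than max_len")
        for b, free in enumerate(room):
            if n <= free:
                bins[b].append(idx)
                room[b] -= n
                break
        else:
            # no existing bin fits; open a new one
            bins.append([idx])
            room.append(max_len - n)
    return bins

# Example: pack five sequences into bins of 8 tokens.
print(pack_sequences([5, 3, 4, 2, 6], max_len=8))  # → [[0, 1], [2, 3], [4]]
```

Without packing, the same five sequences padded to 8 tokens would occupy five rows; packed, they fit in three.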
@@ -111,9 +235,10 @@ VLM datasets are represented as conversations (message lists) that combine text
 
 Built-in dataset makers (return lists of `conversation` dicts):
 - **RDR items**: `nemo_automodel.components.datasets.vlm.datasets.make_rdr_dataset` (HF: `quintend/rdr-items`)
-- **CORD-V2 receipts**: `nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset` (HF: `naver-clova-ix/cord-v2`)
-- **MedPix-VQA (medical)**: `nemo_automodel.components.datasets.vlm.datasets.make_medpix_dataset`
-- **CommonVoice 17 (audio)**: `nemo_automodel.components.datasets.vlm.datasets.make_cv17_dataset`
+- **CORD-V2 receipts (Consolidated Receipt Dataset for Post-OCR Parsing)**: `nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset` (HF: `naver-clova-ix/cord-v2`)
+- **MedPix-VQA (Medical Pixel Question Answering)**: `nemo_automodel.components.datasets.vlm.datasets.make_medpix_dataset`
+- **CommonVoice 17 (CV17) (audio)**: `nemo_automodel.components.datasets.vlm.datasets.make_cv17_dataset`
+
 
 Each example follows the conversation schema expected by `apply_chat_template`, e.g.:
 ```python
@@ -188,7 +313,7 @@ dataset:
 Where `build_my_dataset` returns either a `datasets.Dataset` or a list/iterator of conversation dicts (for VLM).
 
 ### 3) Use ColumnMappedTextInstructionDataset for most instruction datasets (LLM)
-- Ideal when your data has columns like `instruction`, `input`, `output` but with arbitrary names
+- Ideal when your data has columns like `instruction`, `input`, or `output` but with arbitrary names
 - Supports local JSON/JSONL and HF Hub
 ```yaml
 dataset:
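As a sketch of what the `build_my_dataset` factory referenced in this hunk might look like (the JSONL layout with `question`/`answer` columns is assumed for illustration; the guide only requires that the function return a `datasets.Dataset` or a list/iterator of conversation dicts):

```python
import json

def build_my_dataset(path_or_dataset_id: str) -> list[dict]:
    """Load a local JSONL file where each line holds {"question", "answer"}
    and map each row to a single-turn conversation dict, the schema a YAML
    `_target_` entry could then instantiate."""
    records = []
    with open(path_or_dataset_id, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            records.append({
                "messages": [
                    {"role": "user", "content": row["question"]},
                    {"role": "assistant", "content": row["answer"]},
                ]
            })
    return records
```

Pointing `_target_` at a function like this keeps the YAML config unchanged while the column mapping lives in one small, testable place.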

0 commit comments