 # Dataset Overview: LLM and VLM Datasets in NeMo Automodel

-This page summarizes the datasets already supported in NeMo Automodel for LLM and VLM, and shows how to plug in your own datasets via simple Python functions or purely through YAML using the `_target_` mechanism.
+This page summarizes the datasets supported in NeMo Automodel for LLM and VLM and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.

 - See also: [LLM datasets](llm/dataset.md) and [VLM datasets](vlm/dataset.md) for deeper, task-specific guides.

@@ -23,7 +23,7 @@ dataset:
   split: train
 ```

-- **SQuAD-style QA (instruction SFT)**
+- **SQuAD-style Question Answering (QA; instruction SFT)**
   - Factory: `nemo_automodel.components.datasets.llm.squad.make_squad_dataset`
   - Use case: instruction/QA tuning with either prompt+answer formatting or chat-template formatting
   - Notes:
@@ -57,7 +57,133 @@ dataset:
   answer_only_loss_mask: true
   start_of_turn_token: "<|assistant|>"
 ```
-- See the detailed guide, [Column-Mapped Text Instruction Dataset](llm/column-mapped-text-instruction-dataset.md), for more information.
+See the detailed guide, [Column-Mapped Text Instruction Dataset](llm/column-mapped-text-instruction-dataset.md), for more information.
+
+- **ChatDataset (multi-turn conversations and tool calling)**
+  - Class: `nemo_automodel.components.datasets.llm.ChatDataset`
+  - Use case: multi-turn conversations and tool calling in OpenAI chat format
+  - Sources: local JSON/JSONL or Hugging Face Hub dataset ID
+  - Key args:
+    - `path_or_dataset_id`: path to local file(s) or Hugging Face dataset ID
+    - `tokenizer`: tokenizer instance (required; must support chat templates)
+    - `split`: dataset split (e.g., "train", "validation")
+    - `name`: dataset configuration/subset name
+    - `seq_length`: maximum sequence length for padding/truncation
+    - `padding`: padding strategy ("do_not_pad", "max_length", etc.)
+    - `truncation`: truncation strategy ("do_not_truncate", "longest_first", etc.)
+    - `start_of_turn_token`: token marking the start of the assistant response (for answer-only loss)
+    - `chat_template`: optional override for the tokenizer's chat template
+  - Notes:
+    - Requires a tokenizer with chat template support
+    - Supports both single-turn and multi-turn tool calling
+    - Tool definitions are provided in a `tools` field at the conversation level
+    - Tool calls appear in assistant messages via the `tool_calls` field
+    - Tool responses use the `tool` role
+  - Example YAML:
+```yaml
+dataset:
+  _target_: nemo_automodel.components.datasets.llm.ChatDataset
+  path_or_dataset_id: Salesforce/xlam-function-calling-60k
+  split: train
+  tokenizer:
+    _target_: transformers.AutoTokenizer.from_pretrained
+    pretrained_model_name_or_path: google/functiongemma-270m-it
+  seq_length: 2048
+  start_of_turn_token: "<start_of_turn>"
+```
+  - Expected data format (OpenAI messages format):
+```json
+{
+  "messages": [
+    {
+      "role": "user",
+      "content": "What's the weather in Seattle?"
+    },
+    {
+      "role": "assistant",
+      "content": "",
+      "tool_calls": [
+        {
+          "id": "call_1",
+          "type": "function",
+          "function": {
+            "name": "get_weather",
+            "arguments": "{\"city\": \"Seattle\"}"
+          }
+        }
+      ]
+    },
+    {
+      "role": "tool",
+      "tool_call_id": "call_1",
+      "content": "{\"temperature\": 65, \"condition\": \"cloudy\"}"
+    },
+    {
+      "role": "assistant",
+      "content": "It's 65°F and cloudy in Seattle."
+    }
+  ],
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "get_weather",
+        "description": "Get current weather for a city",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "city": {"type": "string"}
+          },
+          "required": ["city"]
+        }
+      }
+    }
+  ]
+}
+```
+  - For single-turn tool calling (one tool call per conversation), omit the tool response and the final assistant message:
+```json
+{
+  "messages": [
+    {
+      "role": "user",
+      "content": "Book a table for two at 7pm in Seattle."
+    },
+    {
+      "role": "assistant",
+      "content": "",
+      "tool_calls": [
+        {
+          "id": "call_1",
+          "type": "function",
+          "function": {
+            "name": "book_table",
+            "arguments": "{\"party_size\": 2, \"time\": \"19:00\", \"city\": \"Seattle\"}"
+          }
+        }
+      ]
+    }
+  ],
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "book_table",
+        "description": "Book a restaurant table",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "party_size": {"type": "integer"},
+            "time": {"type": "string"},
+            "city": {"type": "string"}
+          }
+        }
+      }
+    }
+  ]
+}
+```
+See the [Function Calling guide](llm/toolcalling.md) for an end-to-end example with FunctionGemma.
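+
+As a quick orientation, here is a minimal, hypothetical Python sketch of using `ChatDataset` directly, mirroring the YAML example above. The argument names are taken from the "Key args" list; treat the exact signature (and map-style indexing) as assumptions to verify against the class itself.
+
+```python
+# Hypothetical usage sketch for ChatDataset; argument names follow the
+# "Key args" list above. Verify against the actual class signature.
+from transformers import AutoTokenizer
+
+from nemo_automodel.components.datasets.llm import ChatDataset
+
+tokenizer = AutoTokenizer.from_pretrained("google/functiongemma-270m-it")
+
+dataset = ChatDataset(
+    path_or_dataset_id="Salesforce/xlam-function-calling-60k",
+    tokenizer=tokenizer,  # must support chat templates
+    split="train",
+    seq_length=2048,  # maximum sequence length for padding/truncation
+    start_of_turn_token="<start_of_turn>",  # enables answer-only loss masking
+)
+
+example = dataset[0]  # assumed map-style access to tokenized examples
+```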

 - **NanoGPT Binary Shards (pretraining)**
   - Class: `nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset`
@@ -69,7 +195,7 @@ dataset:
 - **Megatron (pretraining; interoperable with pre-tokenized Megatron data)**
   - Class: `nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining`
   - Use case: large-scale LM pretraining over Megatron-LM formatted tokenized corpora
-  - Interoperability: if your corpus has already been tokenized/indexed for Megatron (i.e., `.bin`/`.idx` pairs), you can point Automodel to those assets directly; no re-tokenization required
+  - Interoperability: If your corpus has already been tokenized/indexed for Megatron (i.e., `.bin`/`.idx` pairs), you can point Automodel to those assets directly. No re-tokenization is required.
   - Key args: `paths` (single path, glob, weighted list, or per-split dict), `seq_length`, `tokenizer`, `split`, `index_mapping_dir`, `splits_to_build`
   - Example YAML:
 ```yaml
@@ -84,9 +210,7 @@ dataset:
   split: "0.99, 0.01, 0.00" # train, validation, test
   splits_to_build: "train"
 ```
-- See the detailed pretraining guide, [Megatron MCore Pretraining](llm/mcore-pretraining.md), which uses MegatronPretraining data.
-
-> ⚠️ Note: Multi-turn conversational and tool-calling/function-calling dataset support is coming soon.
+See the detailed [pretraining guide](llm/pretraining.md), which uses MegatronPretraining data.
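+
+For orientation, a hypothetical Python equivalent of the YAML example above. Argument names are taken from the "Key args" line and the corpus path is a placeholder; check the actual `MegatronPretraining` signature before relying on this sketch.
+
+```python
+# Hypothetical sketch: constructing the Megatron pretraining dataset in
+# Python. The path below is a placeholder prefix for a .bin/.idx pair.
+from transformers import AutoTokenizer
+
+from nemo_automodel.components.datasets.llm.megatron_dataset import MegatronPretraining
+
+dataset = MegatronPretraining(
+    paths="/data/my_corpus_text_document",  # placeholder: prefix of a .bin/.idx pair
+    seq_length=2048,
+    tokenizer=AutoTokenizer.from_pretrained("gpt2"),  # any HF tokenizer
+    split="0.99, 0.01, 0.00",  # train, validation, test fractions
+    splits_to_build="train",
+    index_mapping_dir="/data/index_mappings",  # where index caches are written
+)
+```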

 ## Packed Sequence Support
 To reduce padding and improve throughput with variable-length sequences:
@@ -111,9 +235,10 @@ VLM datasets are represented as conversations (message lists) that combine text

 Built-in dataset makers (return lists of `conversation` dicts):
 - **RDR items**: `nemo_automodel.components.datasets.vlm.datasets.make_rdr_dataset` (HF: `quintend/rdr-items`)
-- **CORD-V2 receipts**: `nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset` (HF: `naver-clova-ix/cord-v2`)
-- **MedPix-VQA (medical)**: `nemo_automodel.components.datasets.vlm.datasets.make_medpix_dataset`
-- **CommonVoice 17 (audio)**: `nemo_automodel.components.datasets.vlm.datasets.make_cv17_dataset`
+- **CORD-V2 receipts (Consolidated Receipt Dataset for Post-OCR Parsing)**: `nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset` (HF: `naver-clova-ix/cord-v2`)
+- **MedPix-VQA (medical visual question answering)**: `nemo_automodel.components.datasets.vlm.datasets.make_medpix_dataset`
+- **CommonVoice 17 (CV17, audio)**: `nemo_automodel.components.datasets.vlm.datasets.make_cv17_dataset`
+

 Each example follows the conversation schema expected by `apply_chat_template`, e.g.:
 ```python
@@ -188,7 +313,7 @@ dataset:
 Where `build_my_dataset` returns either a `datasets.Dataset` or a list/iterator of conversation dicts (for VLM).

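+As a concrete illustration, a minimal, hypothetical `build_my_dataset` that returns conversation dicts (the VLM case). The JSONL file, its column names, and the message layout are assumptions for the example; the schema your processor expects is shown in the `apply_chat_template` example above.
+
+```python
+# Hypothetical custom dataset builder referenced from YAML via `_target_`.
+# Reads a JSONL file and returns a list of conversation dicts; an LLM
+# variant could instead return a `datasets.Dataset`.
+import json
+
+def build_my_dataset(path: str) -> list[dict]:
+    conversations = []
+    with open(path) as f:
+        for line in f:  # one JSON record per line
+            record = json.loads(line)
+            conversations.append(
+                {
+                    "conversation": [
+                        {"role": "user", "content": record["question"]},
+                        {"role": "assistant", "content": record["answer"]},
+                    ]
+                }
+            )
+    return conversations
+```
+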
 ### 3) Use ColumnMappedTextInstructionDataset for most instruction datasets (LLM)
-- Ideal when your data has columns like `instruction`, `input`, `output` but with arbitrary names
+- Ideal when your data has columns like `instruction`, `input`, or `output` but with arbitrary names
 - Supports local JSON/JSONL and HF Hub
 ```yaml
 dataset: