
Commit 2bd3207

merging main
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
2 parents ad895fa + 8c8ed48 commit 2bd3207

File tree

20 files changed: +1411 -548 lines changed

.github/CODEOWNERS

Lines changed: 3 additions & 3 deletions
@@ -3,10 +3,10 @@ codecov.yml @nvidia-nemo/automation
 docker/ @nvidia-nemo/automation
 pyproject.toml @nvidia-nemo/automation
 
-docs @akoumpa @jgerh
+docs @akoumpa @jgerh @adil-a @HuiyingLi @hemildesai
 nemo_automodel @akoumpa @HuiyingLi @adil-a @hemildesai
 examples @akoumpa @HuiyingLi @adil-a @hemildesai
 README.md @akoumpa @HuiyingLi @snowmanwwg
 
-nemo_automodel/components/datasets/llm/ @akoumpa @adil-a @NVIDIA-NeMo/automodel_retriever_maintainers
-biencoder/ @akoumpa @adil-a @NVIDIA-NeMo/automodel_retriever_maintainers
+nemo_automodel/components/datasets/llm/ @akoumpa @adil-a @HuiyingLi @hemildesai @NVIDIA-NeMo/automodel_retriever_maintainers
+biencoder/ @akoumpa @adil-a @HuiyingLi @hemildesai @NVIDIA-NeMo/automodel_retriever_maintainers

docs/guides/dataset-overview.md

Lines changed: 95 additions & 43 deletions
@@ -1,20 +1,25 @@
-# Dataset Overview: LLM and VLM Datasets in NeMo Automodel
+# Dataset Overview: LLM, VLM, and Retrieval Datasets in NeMo Automodel
 
-This page summarizes the datasets supported in NeMo Automodel for LLMs and VLMs and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.
+This page summarizes the datasets supported in NeMo Automodel for LLM, VLM, and retrieval/embedding (biencoder) training and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.
 
-- See also: [LLM datasets](llm/dataset.md) and [VLM datasets](vlm/dataset.md) for deeper, task-specific guides.
+- See also: [LLM datasets](llm/dataset.md), [VLM datasets](vlm/dataset.md), and [Biencoder retrieval dataset](llm/retrieval-dataset.md) for deeper, task-specific guides.
 
 - If a dataset you need is missing, please open a [GitHub issue](https://github.com/NVIDIA-NeMo/Automodel/issues) with a short description and example schema so we can prioritize support.
+---
 
 ## LLM Datasets
 
 NeMo Automodel supports several common patterns for language modeling and instruction tuning.
 
 - **HellaSwag (completion SFT)**
 - Wrapper: `nemo_automodel.components.datasets.llm.hellaswag.HellaSwag`
-- Use case: single-turn, completion-style SFT where a prompt (context) is followed by a gold continuation
+- Use case: single-turn completion style SFT where a prompt (ctx) is followed by a gold continuation (ending)
 - Key args: `path_or_dataset`, `split`, `num_samples_limit`
-- Example YAML:
+### HellaSwag (Completion SFT)
+- Wrapper: `nemo_automodel.components.datasets.llm.hellaswag.HellaSwag`
+- Use case: single-turn completion-style SFT where a prompt (ctx) is followed by a gold continuation (ending)
+- Key args: `path_or_dataset`, `split`, `num_samples_limit`
+- Example YAML:
 ```yaml
 dataset:
 _target_: nemo_automodel.components.datasets.llm.hellaswag.HellaSwag
@@ -28,7 +33,14 @@ dataset:
 - Notes:
 - If the tokenizer has a chat template and you want answer-only loss, you must provide `start_of_turn_token`.
 - Optional `seq_length` can be used for padding/truncation.
-- Example YAML:
+### SQuAD-Style Question Answering (QA) (Instruction SFT)
+- Factory: `nemo_automodel.components.datasets.llm.squad.make_squad_dataset`
+- Use case: instruction/QA tuning with either prompt-and-answer formatting or chat-template formatting
+:::{note}
+- If the tokenizer has a chat template and you want answer-only loss, you must provide `start_of_turn_token`.
+- Optional `seq_length` can be used for padding/truncation.
+:::
+- Example YAML:
 ```yaml
 dataset:
 _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
@@ -37,14 +49,20 @@ dataset:
 start_of_turn_token: "<|assistant|>"
 ```
 
-- **ColumnMappedTextInstructionDataset (generic instruction SFT, map-style)**
+- **ColumnMappedTextInstructionDataset (generic instruction SFT)**
 - Class: `nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset`
 - Use case: quickly adapt instruction datasets by mapping your schema's columns to `context`, `question`, `answer`
 - Sources: local JSON/JSONL or Hugging Face Hub dataset ID
 - Notes:
-- Map-style, non-streaming dataset (supports `len(ds)` and `ds[i]`)
-- For streaming (including Delta Lake / Databricks), use `ColumnMappedTextInstructionIterableDataset`
-- Example YAML:
+- For tokenizers with chat templates and answer-only loss, you may set `answer_only_loss_mask: true` and provide `start_of_turn_token`.
+### ColumnMappedTextInstructionDataset (Generic Instruction SFT)
+- Class: `nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset`
+- Use case: quickly adapt instruction datasets by mapping your schema's columns to `context`, `question`, `answer`
+- Sources: local JSON/JSONL or Hugging Face Hub dataset ID
+:::{note}
+- For tokenizers with chat templates and answer-only loss, you may set `answer_only_loss_mask: true` and provide `start_of_turn_token`.
+:::
+- Example YAML:
 ```yaml
 dataset:
 _target_: nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset
@@ -55,37 +73,17 @@ dataset:
 question: inputs
 answer: targets
 answer_only_loss_mask: true
+start_of_turn_token: "<|assistant|>"
 ```
-
-- **ColumnMappedTextInstructionIterableDataset (generic instruction SFT, streaming)**
-- Class: `nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset`
-- Use case: stream instruction datasets (including Delta Lake / Databricks)
-- Notes:
-- Iterable/streaming dataset (iterate; no `len(ds)` / `ds[i]`)
-- Always streaming by design (helps avoid accidental dataset materialization/data leakages)
-- Example YAML:
-```yaml
-dataset:
-_target_: nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset
-path_or_dataset_id: delta://catalog.schema.training_data
-column_mapping:
-question: user_message
-answer: assistant_message
-answer_only_loss_mask: true
-delta_storage_options:
-DATABRICKS_TOKEN: ${oc.env:DATABRICKS_TOKEN}
-DATABRICKS_HOST: ${oc.env:DATABRICKS_HOST}
-```
-
-See the detailed guides, [Column-Mapped Text Instruction Dataset](llm/column-mapped-text-instruction-dataset.md) and [Column-Mapped Text Instruction Iterable Dataset](llm/column-mapped-text-instruction-iterable-dataset.md), for more information.
+See the detailed guide, [Column-Mapped Text Instruction Dataset](llm/column-mapped-text-instruction-dataset.md), for more information.
 
 - **ChatDataset (multi-turn conversations and tool calling)**
 - Class: `nemo_automodel.components.datasets.llm.ChatDataset`
 - Use case: multi-turn conversations and tool calling in OpenAI chat format
 - Sources: local JSON/JSONL or Hugging Face Hub dataset ID
 - Key args:
-- `path_or_dataset_id`: path to local file(s) or Hugging Face dataset ID
-- `tokenizer`: tokenizer instance (required; must have chat template support)
+- `path_or_dataset_id`: path to local file(s) or HuggingFace dataset ID
+- `tokenizer`: tokenizer instance (required. Must have chat template support)
 - `split`: dataset split (e.g., "train", "validation")
 - `name`: dataset configuration/subset name
 - `seq_length`: maximum sequence length for padding/truncation
@@ -99,7 +97,28 @@ See the detailed guides, [Column-Mapped Text Instruction Dataset](llm/column-map
 - Tool definitions are provided in a `tools` field at the conversation level
 - Tool calls appear in assistant messages via `tool_calls` field
 - Tool responses use the `tool` role
-- Example YAML:
+### ChatDataset (Multi-Turn Conversations and Tool Calling)
+- Class: `nemo_automodel.components.datasets.llm.ChatDataset`
+- Use case: multi-turn conversations and tool calling in OpenAI chat format
+- Sources: local JSON/JSONL or Hugging Face Hub dataset ID
+- Key args:
+- `path_or_dataset_id`: path to local file(s) or Hugging Face dataset ID
+- `tokenizer`: tokenizer instance (required; must have chat template support)
+- `split`: dataset split (e.g., "train", "validation")
+- `name`: dataset configuration/subset name
+- `seq_length`: maximum sequence length for padding/truncation
+- `padding`: padding strategy ("do_not_pad", "max_length", etc.)
+- `truncation`: truncation strategy ("do_not_truncate", "longest_first", etc.)
+- `start_of_turn_token`: token marking assistant response start (for answer-only loss)
+- `chat_template`: optional override for tokenizer's chat template
+:::{note}
+- Requires a tokenizer with chat template support
+- Supports both single-turn and multi-turn tool calling
+- Tool definitions are provided in a `tools` field at the conversation level
+- Tool calls appear in assistant messages through the `tool_calls` field
+- Tool responses use the `tool` role
+:::
+- Example YAML:
 ```yaml
 dataset:
 _target_: nemo_automodel.components.datasets.llm.ChatDataset
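
To make the conversation schema described in the ChatDataset notes concrete, the sketch below writes one tool-calling record in the OpenAI chat format. The function name, argument values, and the `messages` key layout are hypothetical; only the conversation-level `tools` field, the assistant-side `tool_calls` field, and the `tool` role come from the notes above.

```python
import json

# Hypothetical tool-calling conversation in OpenAI chat format.
# Tool definitions sit in a conversation-level "tools" field, the assistant
# invokes them via "tool_calls", and tool outputs come back under the "tool" role.
example = {
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "messages": [
        {"role": "user", "content": "What's the weather in Berlin?"},
        {
            "role": "assistant",
            "tool_calls": [
                {"type": "function", "function": {"name": "get_weather", "arguments": "{\"city\": \"Berlin\"}"}}
            ],
        },
        {"role": "tool", "content": "{\"temperature_c\": 21, \"condition\": \"sunny\"}"},
        {"role": "assistant", "content": "It is currently 21°C and sunny in Berlin."},
    ],
}

# One conversation per line, so the file can be passed via path_or_dataset_id.
with open("toolcalling_train.jsonl", "w") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```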
@@ -205,20 +224,51 @@ dataset:
 ```
 See the [Function Calling guide](llm/toolcalling.md) for an end-to-end example with FunctionGemma.
 
+### Retrieval/Biencoder (Embedding Fine-Tuning)
+- Factory: `nemo_automodel.components.datasets.llm.make_retrieval_dataset`
+- Collator: `nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator`
+- Use case: embedding model fine-tuning with (query, positive doc, negative docs) contrastive learning
+- Supported schemas:
+- Corpus-ID JSON (Merlin/NeMo-retriever style)
+- Inline-text JSONL (e.g., `{"query": "...", "pos_doc": "...", "neg_doc": ["...", "..."]}`)
+- Example YAML:
+```yaml
+dataset:
+_target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
+data_dir_list: /abs/path/to/train.jsonl
+data_type: train
+train_n_passages: 5
+collate_fn:
+_target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
+q_max_len: 512
+p_max_len: 512
+```
+See the detailed guide, [Biencoder retrieval dataset](llm/retrieval-dataset.md), for more information.
+
 - **NanoGPT Binary Shards (pretraining)**
 - Class: `nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset`
 - Use case: token-level LM pretraining over `.bin` shards produced by NanoGPT-style preprocessors (supports legacy and current formats)
 - Notes:
 - Streams contiguous `seq_len` slices, supports optional BOS alignment and `.bos.idx` sidecar files
-- Related tool: `tools/nanogpt_data_processor.py`
+### NanoGPT Binary Shards (Pretraining)
+- Class: `nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset`
+- Use case: token-level LM pretraining over `.bin` shards produced by NanoGPT-style preprocessors (supports legacy and current formats)
+:::{note}
+- Streams contiguous `seq_len` slices, supports optional BOS alignment and `.bos.idx` sidecar files
+- Related tool: `tools/nanogpt_data_processor.py`
+:::
 
 - **Megatron (pretraining; interoperable with pre-tokenized Megatron data)**
 - Class: `nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining`
 - Use case: large-scale LM pretraining over Megatron-LM formatted tokenized corpora
 - Interoperability: If your corpus has already been tokenized/indexed for Megatron (i.e., `.bin`/`.idx` pairs), you can point Automodel to those assets directly. No re-tokenization required.
-- Preprocessing tool: `tools/preprocess_megatron_dataset.py` supports both **JSONL** and **Parquet** input formats, enabling direct preprocessing of Hugging Face datasets stored in Parquet without conversion
 - Key args: `paths` (single path, glob, weighted list, or per-split dict), `seq_length`, `tokenizer`, `split`, `index_mapping_dir`, `splits_to_build`
-- Example YAML:
+### Megatron (Pretraining; Interoperable With Pre-Tokenized Megatron Data)
+- Class: `nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining`
+- Use case: large-scale LM pretraining over Megatron-LM formatted tokenized corpora
+- Interoperability: If your corpus has already been tokenized/indexed for Megatron (i.e., `.bin`/`.idx` pairs), you can point Automodel to those assets directly. No re-tokenization required.
+- Key args: `paths` (single path, glob, weighted list, or per-split dict), `seq_length`, `tokenizer`, `split`, `index_mapping_dir`, `splits_to_build`
+- Example YAML:
 ```yaml
 dataset:
 _target_: nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining
@@ -240,7 +290,7 @@ packed_sequence:
 packed_sequence_size: 8192 # > 0 enables packing
 split_across_pack: false
 ```
-Use a collate function that pads to an FP8-friendly multiple when training with FP8:
+Use a collator that pads to an FP8-friendly multiple when training with FP8:
 ```yaml
 dataloader:
 _target_: torchdata.stateful_dataloader.StatefulDataLoader
@@ -249,6 +299,7 @@ dataloader:
 pad_seq_len_divisible: 16
 ```
 
+---
 
 ## VLM Datasets (Vision/Audio + Language)
 VLM datasets are represented as conversations (message lists) that combine text with images or audio and are processed with the model's `AutoProcessor.apply_chat_template` and a suitable collate function.
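
As a rough illustration of that processing flow, the sketch below renders one image-plus-text conversation with a Hugging Face `AutoProcessor`. The model ID, image path, and exact content-item keys are assumptions (they vary by model); `apply_chat_template` itself is the processor entry point mentioned above.

```python
from transformers import AutoProcessor

# Any chat-template-capable VLM processor works here; Qwen2-VL is just an example.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# A single-turn conversation mixing an image with text, as described above.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "/abs/path/to/cat.png"},
            {"type": "text", "text": "Describe this picture."},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "A cat sleeping on a windowsill."}],
    },
]

# Render the messages into the model's prompt format; a collate function
# would then tokenize/process and batch such samples for training.
prompt = processor.apply_chat_template(conversation)
print(prompt)
```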
@@ -296,11 +347,12 @@ If you want answer-only loss masking, provide a model-appropriate `start_of_resp
 
 See [Gemma-3n](omni/gemma3-3n.md) and [VLM dataset](vlm/dataset.md) for end-to-end examples.
 
+---
 
 ## Bring Your Own Dataset
 You can integrate custom datasets with zero code changes to NeMo Automodel by using `_target_` in YAML. There are three approaches:
 
-### 1) Point to an existing class or function (dotted path)
+### Point to an Existing Class or Function (Dotted Path)
 - LLM example (class):
 ```yaml
 dataset:
@@ -322,7 +374,7 @@ dataset:
 split: train
 ```
 
-### 2) Point to a local Python file and function
+### Point to a Local Python File and Function
 ```yaml
 dataset:
 _target_: /abs/path/to/my_custom_dataset.py:build_my_dataset
@@ -331,7 +383,7 @@ dataset:
 ```
 Where `build_my_dataset` returns either a `datasets.Dataset` or a list/iterator of conversation dicts (for VLM).
 
-### 3) Use ColumnMappedTextInstructionDataset for most instruction datasets (LLM)
+### Use ColumnMappedTextInstructionDataset for Most Instruction Datasets (LLM)
 - Ideal when your data has columns like `instruction`, `input`, or `output` but with arbitrary names
 - Supports local JSON/JSONL and HF Hub
 ```yaml
@@ -346,7 +398,7 @@ dataset:
 start_of_turn_token: "<|assistant|>"
 ```
 
-### Minimal Custom Class Pattern (LLM Completion)
+### Implement a Minimal Custom Class Pattern (LLM Completion)
 If you prefer Python, implement `get_context` and `get_target` and reuse the built-in preprocessor:
 ```python
 from datasets import load_dataset
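
For readers who want to see what such a class can look like end to end, here is a minimal sketch. The dataset, class name, and column handling are hypothetical; the only contract taken from the guide is that the class exposes `get_context` and `get_target` (the built-in preprocessor that consumes them is not reproduced here).

```python
from datasets import load_dataset


class TinyStoriesCompletion:
    """Hypothetical completion-style dataset built on a public corpus."""

    def __init__(self, split="train", num_samples_limit=None):
        self.dataset = load_dataset("roneneldan/TinyStories", split=split)
        if num_samples_limit is not None:
            self.dataset = self.dataset.select(range(num_samples_limit))

    def __len__(self):
        return len(self.dataset)

    def get_context(self, example):
        # Use the first sentence as the prompt/context (illustrative split).
        return example["text"].split(".")[0] + "."

    def get_target(self, example):
        # The remainder of the story is treated as the gold continuation.
        return example["text"][len(self.get_context(example)):].strip()

    def __getitem__(self, idx):
        example = self.dataset[idx]
        return {"context": self.get_context(example), "target": self.get_target(example)}
```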
docs/guides/llm/retrieval-dataset.md

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
+# Biencoder Retrieval Dataset (Embedding Fine-tuning)
+
+NeMo Automodel supports **biencoder/embedding model fine-tuning** using a retrieval-style dataset: each training example is a **query** paired with **one positive** document and **one or more negative** documents.
+
+This dataset is used by the biencoder recipes (see `examples/biencoder/`) together with the `RetrievalBiencoderCollator`.
+
+## What the Biencoder Consumes
+
+The dataset factory `nemo_automodel.components.datasets.llm.make_retrieval_dataset` returns a Hugging Face `datasets.Dataset`. At runtime it transforms each raw record into the training-time schema:
+
+- `question`: query string
+- `doc_text`: list of document texts in the order `[positive, negative_1, negative_2, ...]`
+- `doc_image`: list of images (or empty strings), aligned with `doc_text`
+- `query_instruction` / `passage_instruction`: optional, used when `use_dataset_instruction: true` and the corpus provides instructions via metadata
+
+## Supported Input Formats
+
+NeMo Automodel supports **two** input schemas:
+
+### Corpus ID-Based JSON (Merlin/NeMo-Retriever Style)
+
+This is the format used by NeMo retriever pipelines where documents live in a separate **corpus** and training examples reference documents by **ID**.
+
+**Training file example (single JSON):**
+
+```json
+{
+  "corpus": [
+    { "path": "/abs/path/to/wiki_corpus" }
+  ],
+  "data": [
+    {
+      "question_id": "q_001",
+      "question": "Explain transformers",
+      "corpus_id": "wiki_corpus",
+      "pos_doc": [{ "id": "d_123" }],
+      "neg_doc": [{ "id": "d_456" }, "d_789"]
+    }
+  ]
+}
+```
+
+**Corpus requirements**
+
+Each corpus directory must contain a `merlin_metadata.json` file.
+
+Minimal example:
+
+```json
+{ "class": "TextQADataset", "corpus_id": "wiki_corpus" }
+```
+
+:::{note}
+- `pos_doc` and `neg_doc` can be lists of `{"id": ...}` dicts or raw IDs (they are normalized internally).
+- If you set `use_dataset_instruction: true`, optional fields like `query_instruction` and `passage_instruction` in `merlin_metadata.json` are surfaced to the collator.
+:::
+
+### Inline-Text JSONL (No Corpus Required)
+
+This is convenient for custom fine-tuning pipelines where the documents are included **inline**.
+
+**JSONL example (one example per line):**
+
+```json
+{"query":"Explain transformers","pos_doc":"Transformers are a type of neural network...","neg_doc":["RNNs are...","CNNs are..."]}
+{"query":"What is Python?","pos_doc":["A programming language."],"neg_doc":"A snake."}
+```
+
+:::{note}
+- `query` is accepted (`question` is also accepted as an alias).
+- `pos_doc` and `neg_doc` can be either:
+  - strings (interpreted as document text), or
+  - lists of strings, or
+  - dicts with at least `text` (optionally `image`, `nr_ocr`) for multimodal use cases.
+- If `corpus_id` is not provided, it defaults to `__inline__`.
+- `use_dataset_instruction: true` has no effect for pure inline records (instructions come from corpus metadata).
+:::
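
A quick way to produce a file in the inline-text schema above is to dump records directly from Python. The texts and file name below are made up; only the `query`/`pos_doc`/`neg_doc` keys and their string/list/dict forms come from the format description.

```python
import json

# Two hypothetical inline-text records. pos_doc/neg_doc may be a string,
# a list of strings, or dicts carrying at least a "text" field.
records = [
    {
        "query": "Explain transformers",
        "pos_doc": "Transformers are a type of neural network built on self-attention.",
        "neg_doc": ["RNNs process sequences step by step.", "CNNs use convolutional filters."],
    },
    {
        "query": "What is Python?",
        "pos_doc": [{"text": "Python is a general-purpose programming language."}],
        "neg_doc": "A snake.",
    },
]

with open("train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```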
+
+## YAML Usage (Dataset + Collator)
+
+Use the dataset factory plus the biencoder collator:
+
+```yaml
+dataloader:
+_target_: torchdata.stateful_dataloader.StatefulDataLoader
+dataset:
+_target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
+data_dir_list:
+- /abs/path/to/train.jsonl # or train.json (corpus-id format)
+data_type: train
+train_n_passages: 5 # 1 positive + 4 negatives
+do_shuffle: true
+use_dataset_instruction: false
+collate_fn:
+_target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
+q_max_len: 512
+p_max_len: 512
+query_prefix: "query:"
+passage_prefix: "passage:"
+pad_to_multiple_of: 8
+```
+
+## Requirements
+
+- `pos_doc` must be **non-empty**.
+- If training requests negatives (e.g., `train_n_passages > 1`), `neg_doc` must contain **at least one** document (the loader will cycle negatives if you provide fewer than needed).
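
Those two requirements are easy to check before launching a run; the helper below is a hypothetical sketch (name and file layout assumed), with the rules themselves taken from the bullets above.

```python
import json


def validate_retrieval_jsonl(path, train_n_passages=5):
    """Check inline-text records for a non-empty pos_doc and, when negatives
    are requested (train_n_passages > 1), at least one neg_doc."""
    for line_no, line in enumerate(open(path), start=1):
        record = json.loads(line)
        if not record.get("pos_doc"):
            raise ValueError(f"line {line_no}: pos_doc must be non-empty")
        if train_n_passages > 1 and not record.get("neg_doc"):
            raise ValueError(f"line {line_no}: at least one neg_doc is required")


validate_retrieval_jsonl("train.jsonl", train_n_passages=5)
```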

docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -54,6 +54,7 @@ model-coverage/vlm.md
 guides/dataset-overview.md
 guides/llm/dataset.md
+guides/llm/retrieval-dataset.md
 guides/llm/column-mapped-text-instruction-dataset.md
 guides/llm/column-mapped-text-instruction-iterable-dataset.md
 guides/vlm/dataset.md
