# Dataset Overview: LLM, VLM, and Retrieval Datasets in NeMo Automodel
This page summarizes the datasets supported in NeMo Automodel for LLM, VLM, and retrieval/embedding (biencoder) training and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.
- See also: [LLM datasets](llm/dataset.md), [VLM datasets](vlm/dataset.md), and [Biencoder retrieval dataset](llm/retrieval-dataset.md) for deeper, task-specific guides.
- If a dataset you need is missing, please open a [GitHub issue](https://github.com/NVIDIA-NeMo/Automodel/issues) with a short description and example schema so we can prioritize support.
---
## LLM Datasets
NeMo Automodel supports several common patterns for language modeling and instruction tuning.
See the detailed guide, [Column-Mapped Text Instruction Dataset](llm/column-mapped-text-instruction-dataset.md), for more information.
- **ChatDataset (multi-turn conversations and tool calling)**
- **Megatron pretraining data**
  - Use case: large-scale LM pretraining over Megatron-LM formatted tokenized corpora.
  - Interoperability: if your corpus has already been tokenized/indexed for Megatron (i.e., `.bin`/`.idx` pairs), you can point Automodel to those assets directly. No re-tokenization required.
  - Preprocessing tool: `tools/preprocess_megatron_dataset.py` supports both **JSONL** and **Parquet** input formats, enabling direct preprocessing of Hugging Face datasets stored in Parquet without conversion.
VLM datasets are represented as conversations (message lists) that combine text with images or audio and are processed with the model's `AutoProcessor.apply_chat_template` and a suitable collate function.
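Such a conversation can be sketched as a plain Python message list. This is an illustrative example following the Hugging Face chat-template convention; the image path, question, and reply are placeholders, and the processor call is shown only in a comment because it requires downloading a model checkpoint.

```python
# A VLM training example as a conversation (message list) mixing text and an image.
# The image path and the text content are illustrative placeholders.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "/path/to/cat.png"},
            {"type": "text", "text": "What animal is shown in this picture?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "The picture shows a cat."}],
    },
]

# At training time the model's processor would render this, e.g.:
#   processor.apply_chat_template(conversation, tokenize=True)
# (not executed here; it needs a real AutoProcessor checkpoint).
```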
See [Gemma-3n](omni/gemma3-3n.md) and [VLM dataset](vlm/dataset.md) for end-to-end examples.
---
## Bring Your Own Dataset
You can integrate custom datasets with zero code changes to NeMo Automodel by using `_target_` in YAML. There are three approaches:
### Point to an Existing Class or Function (Dotted Path)
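For example, a dataset section can reference any importable callable by its dotted path. The module path and argument names below are illustrative placeholders, not a shipped recipe:

```yaml
dataset:
  # Any importable function or class, referenced by its dotted path.
  # "my_project.datasets.build_my_dataset" is a hypothetical example.
  _target_: my_project.datasets.build_my_dataset
  data_path: /data/train.jsonl
  split: train
```

In this style of configuration, the dotted path is resolved to a Python object and the remaining keys are typically passed to it as keyword arguments.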
---

## Retrieval (Biencoder) Datasets

NeMo Automodel supports **biencoder/embedding model fine-tuning** using a retrieval-style dataset: each training example is a **query** paired with **one positive** document and **one or more negative** documents.
This dataset is used by the biencoder recipes (see `examples/biencoder/`) together with the `RetrievalBiencoderCollator`.
## What the Biencoder Consumes
The dataset factory `nemo_automodel.components.datasets.llm.make_retrieval_dataset` returns a Hugging Face `datasets.Dataset`. At runtime it transforms each raw record into the training-time schema:
- `question`: query string
- `doc_text`: list of document texts in the order `[positive, negative_1, negative_2, ...]`
- `doc_image`: list of images (or empty strings), aligned with `doc_text`
- `query_instruction` / `passage_instruction`: optional; used when `use_dataset_instruction: true` and the corpus provides instructions via metadata
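As an illustration, a single transformed record with made-up field values would look like the dictionary below. The ordering contract is the important part: `doc_text[0]` is the positive, the rest are negatives, and `doc_image` is index-aligned with `doc_text`.

```python
# Illustrative instance of the training-time schema described above.
# All text is made up; an empty string in doc_image means "no image".
record = {
    "question": "Explain transformers",
    "doc_text": [
        "Transformers are a type of neural network...",  # positive
        "RNNs are...",                                   # negative_1
        "CNNs are...",                                   # negative_2
    ],
    "doc_image": ["", "", ""],  # aligned 1:1 with doc_text
}

# The alignment invariant the collator relies on:
assert len(record["doc_image"]) == len(record["doc_text"])
```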
## Supported Input Formats
NeMo Automodel supports **two** input schemas:
### Corpus ID-Based JSON (Merlin/NeMo-Retriever Style)
This is the format used by NeMo retriever pipelines where documents live in a separate **corpus** and training examples reference documents by **ID**.
**Training file example (single JSON):**
```json
{
  "corpus": [
    { "path": "/abs/path/to/wiki_corpus" }
  ],
  "data": [
    {
      "question_id": "q_001",
      "question": "Explain transformers",
      "corpus_id": "wiki_corpus",
      "pos_doc": [{ "id": "d_123" }],
      "neg_doc": [{ "id": "d_456" }, "d_789"]
    }
  ]
}
```
**Corpus requirements**
Each corpus directory must contain a `merlin_metadata.json` file.
:::{note}
- `pos_doc` and `neg_doc` can be lists of `{"id": ...}` dicts or raw IDs (they are normalized internally).
- If you set `use_dataset_instruction: true`, optional fields like `query_instruction` and `passage_instruction` in `merlin_metadata.json` are surfaced to the collator.
:::
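The ID normalization mentioned above amounts to flattening mixed entries into plain ID strings. The helper below is an illustrative re-implementation of that behavior, not Automodel's actual code:

```python
def normalize_doc_ids(entries):
    """Normalize a pos_doc/neg_doc list whose items may be {"id": ...} dicts
    or raw IDs into a flat list of ID strings.

    Illustrative sketch of the normalization described in the note above;
    the real loader's internals may differ.
    """
    ids = []
    for entry in entries:
        if isinstance(entry, dict):
            ids.append(str(entry["id"]))
        else:
            ids.append(str(entry))
    return ids

# Mixed dict / raw-ID input, as in the "neg_doc" field of the JSON example:
normalize_doc_ids([{"id": "d_456"}, "d_789"])  # -> ["d_456", "d_789"]
```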
### Inline-Text JSONL (No Corpus Required)
This is convenient for custom fine-tuning pipelines where the documents are included **inline**.
**JSONL example (one example per line):**
```json
{"query":"Explain transformers","pos_doc":"Transformers are a type of neural network...","neg_doc":["RNNs are...","CNNs are..."]}
{"query":"What is Python?","pos_doc":["A programming language."],"neg_doc":"A snake."}
```
:::{note}
- `query` is the expected field name (`question` is also accepted as an alias).
- `pos_doc` and `neg_doc` can be either:
  - strings (interpreted as document text), or
  - lists of strings, or
  - dicts with at least `text` (optionally `image`, `nr_ocr`) for multimodal use cases.
- If `corpus_id` is not provided, it defaults to `__inline__`.
- `use_dataset_instruction: true` has no effect for pure inline records (instructions come from corpus metadata).
:::
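The accepted `pos_doc`/`neg_doc` shapes listed above (string, list, or dict with `text`) can all be coerced into one uniform document list. The helper below is a hypothetical sketch of that coercion, not the dataset factory's actual internals:

```python
def to_doc_list(field):
    """Coerce a pos_doc/neg_doc field into a list of {"text": ..., "image": ...}
    documents.

    Hypothetical helper illustrating the three accepted shapes described
    above; the real dataset factory may differ in details.
    """
    if isinstance(field, (str, dict)):
        field = [field]  # promote a single string/dict to a one-element list
    docs = []
    for item in field:
        if isinstance(item, str):
            docs.append({"text": item, "image": ""})
        else:  # dict with at least "text", optionally "image" (and "nr_ocr")
            docs.append({"text": item["text"], "image": item.get("image", "")})
    return docs

# The JSONL lines above mix all three shapes:
to_doc_list("A snake.")                      # single plain-text document
to_doc_list(["RNNs are...", "CNNs are..."])  # list of plain-text documents
to_doc_list([{"text": "img doc", "image": "/path/x.png"}])  # multimodal dict
```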
## YAML Usage (Dataset + Collator)
Use the dataset factory plus the biencoder collator:
- If training requests negatives (e.g., `train_n_passages > 1`), `neg_doc` must contain **at least one** document (the loader will cycle negatives if you provide fewer than needed).
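A configuration sketch combining the two is shown below. Only the `make_retrieval_dataset` dotted path appears verbatim in this guide; the collator's module path, the data file path, and the argument names other than `use_dataset_instruction` and `train_n_passages` are illustrative assumptions, so consult `examples/biencoder/` for a real recipe:

```yaml
dataset:
  _target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
  # Corpus-ID JSON or inline-text JSONL file (placeholder path).
  data_path: /data/retrieval_train.jsonl
  use_dataset_instruction: false

collator:
  # Module path assumed; only the class name appears in this guide.
  _target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
  # 1 positive + (train_n_passages - 1) negatives per query.
  train_n_passages: 4
```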