docs/guides/dataset-overview.md

- Sources: local JSON/JSONL or Hugging Face Hub dataset ID
- Notes:
  - For tokenizers with chat templates and answer-only loss, you may set `answer_only_loss_mask: true` and provide `start_of_turn_token`.
  - Supports streaming mode for large datasets (see [Streaming Datasets](#streaming-datasets) section below).
- Example YAML:

```yaml
dataset:
  # ...
```

See the detailed [pretraining guide](llm/pretraining.md), which uses MegatronPretraining data.
## Streaming Datasets
Streaming datasets make it possible to process very large datasets without loading them entirely into memory. This is particularly useful when a dataset exceeds available RAM, or when you want to start training immediately rather than waiting for the full dataset to download.
### What Are Streaming Datasets?
Streaming datasets load and process data incrementally, one batch at a time, rather than loading the entire dataset into memory upfront. This approach:
- **Reduces memory footprint**: Only the current batch resides in memory
- **Enables training on massive datasets**: Process terabyte-scale datasets on machines with limited RAM
- **Faster startup**: Begin training immediately without waiting for full dataset download
- **Efficient for remote datasets**: Stream directly from Hugging Face Hub without local storage (see the sketch below)
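
For intuition, the same incremental pattern is available in the Hugging Face `datasets` library via `streaming=True`. A minimal, self-contained illustration (the dataset ID here is only an example):

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset: nothing is downloaded up front,
# and only the records you actually pull are ever held in memory.
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(stream):
    print(record["text"][:80])  # records arrive lazily, one at a time
    if i == 2:
        break
```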
### When to Use Streaming
Use streaming mode when:
- Your dataset is very large (hundreds of GB or TB)
- Available memory is limited compared to dataset size
- You want to start training quickly without downloading the full dataset
- You're experimenting with a subset of a large dataset
Avoid streaming when:
- Your dataset is small enough to fit comfortably in memory
- You need random access to samples (e.g., for certain sampling strategies)
- You need to know the exact dataset length upfront
- Training requires multiple passes with different orderings
### How to Enable Streaming
For `ColumnMappedTextInstructionDataset`, use the streaming variant by changing the class to `ColumnMappedTextInstructionIterableDataset`:
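
A minimal construction sketch follows. The module path and class name come from this guide; every keyword argument below is an illustrative assumption rather than the verified signature, so check the class itself for the actual parameters:

```python
from nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset import (
    ColumnMappedTextInstructionIterableDataset,
)

# NOTE: all keyword arguments here are assumptions for illustration only.
dataset = ColumnMappedTextInstructionIterableDataset(
    path_or_dataset_id="tatsu-lab/alpaca",  # assumed: HF Hub ID or local JSON/JSONL path
    column_mapping={
        "question": "instruction",  # assumed: maps expected fields to your columns
        "answer": "output",
    },
)

# Streaming datasets are consumed by iteration only.
for sample in dataset:
    ...
```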
### Limitations

When using streaming datasets, be aware of these limitations:
1. **No random access**: You cannot use `dataset[index]` to access specific samples. Streaming datasets only support iteration (see the sketch after this list).
2. **No length information**: The `len(dataset)` operation is not available. You cannot determine the total number of samples upfront.
3. **Single-pass iteration**: Each iteration consumes the stream. To iterate multiple times, you need to recreate the dataset or use the `repeat_on_exhaustion` parameter.
4. **Limited shuffling**: Shuffling is done with a buffer (not the entire dataset), which may not provide perfect randomization.
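
A short sketch of how these constraints surface in code, assuming `dataset` is the streaming instance constructed above (the buffered `shuffle` call mirrors the Hugging Face `datasets` API and is an assumption for this class):

```python
# 1-2. No random access and no length:
# dataset[0]     # unsupported: iteration is the only access pattern
# len(dataset)   # unsupported: the sample count is unknown up front

# 3. Single-pass: each full iteration consumes the stream. To iterate again,
#    recreate the dataset or construct it with repeat_on_exhaustion.
it = iter(dataset)
first_sample = next(it)

# 4. Buffer-based shuffling mixes only buffer_size samples at a time, so the
#    randomization is approximate rather than global (assumed HF-style API).
shuffled = dataset.shuffle(buffer_size=10_000, seed=42)
```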
### Distributed Training with Streaming
Streaming datasets support distributed training through sharding:
```python
from nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset import (
    ColumnMappedTextInstructionIterableDataset,
)
```
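
How samples are assigned to ranks is up to the implementation. Purely to illustrate the idea, here is a manual round-robin split keyed on the `torch.distributed` rank; the striding below is an assumption for illustration, not the library's actual sharding mechanism:

```python
import itertools

import torch.distributed as dist


def iter_rank_shard(dataset):
    """Yield only this rank's share of a streamed dataset.

    Round-robin striding: rank r keeps samples r, r + world_size,
    r + 2 * world_size, ... The real implementation may shard
    differently (e.g., by file or by Hub shard).
    """
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    yield from itertools.islice(iter(dataset), rank, None, world_size)


# Usage, once torch.distributed is initialized:
# for sample in iter_rank_shard(dataset):
#     ...
```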