Commit fbc702a

Update docs/guides/dataset-overview.md
1 parent 9f81989 commit fbc702a

1 file changed: +141 -0 lines changed

docs/guides/dataset-overview.md

Lines changed: 141 additions & 0 deletions
@@ -44,6 +44,7 @@ dataset:
- Sources: local JSON/JSONL or Hugging Face Hub dataset ID
- Notes:
  - For tokenizers with chat templates and answer-only loss, you may set `answer_only_loss_mask: true` and provide `start_of_turn_token`.
  - Supports streaming mode for large datasets (see the [Streaming Datasets](#streaming-datasets) section below).
- Example YAML:
```yaml
dataset:
@@ -212,6 +213,146 @@ dataset:
```
See the detailed [pretraining guide](llm/pretraining.md), which uses MegatronPretraining data.

## Streaming Datasets

Streaming datasets enable processing very large datasets without loading them entirely into memory. This is particularly useful when a dataset exceeds available RAM or when you want to start training immediately without waiting for the full dataset to download.

### What Are Streaming Datasets?

Streaming datasets load and process data incrementally, one batch at a time, rather than loading the entire dataset into memory upfront (see the sketch after this list). This approach:

- **Reduces memory footprint**: Only the current batch resides in memory
- **Enables training on massive datasets**: Process terabyte-scale datasets on machines with limited RAM
- **Speeds up startup**: Begin training immediately without waiting for the full dataset to download
- **Works well for remote datasets**: Stream directly from the Hugging Face Hub without local storage
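
For intuition, here is a minimal sketch of the lazy-iteration behavior, assuming a Hugging Face dataset (the dataset ID below is a placeholder):

```python
from itertools import islice

from datasets import load_dataset

# Placeholder dataset ID; in streaming mode nothing is downloaded up front.
stream = load_dataset("large-dataset/corpus", split="train", streaming=True)

# Only the samples actually pulled from the iterator are materialized, so
# peeking at the first five examples touches a tiny slice of the corpus.
for example in islice(stream, 5):
    print(example)
```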

### When to Use Streaming

Use streaming mode when:

- Your dataset is very large (hundreds of GB or TB)
- Available memory is limited compared to the dataset size
- You want to start training quickly without downloading the full dataset
- You're experimenting with a subset of a large dataset

Avoid streaming when:

- Your dataset is small enough to fit comfortably in memory
- You need random access to samples (e.g., for certain sampling strategies)
- You need to know the exact dataset length upfront
- Training requires multiple passes with different orderings

### How to Enable Streaming

For `ColumnMappedTextInstructionDataset`, use the streaming variant by changing the class to `ColumnMappedTextInstructionIterableDataset`:

```yaml
dataset:
  _target_: nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset
  path_or_dataset_id: Muennighoff/natural-instructions
  split: train
  column_mapping:
    context: definition
    question: inputs
    answer: targets
  answer_only_loss_mask: true
  start_of_turn_token: "<|assistant|>"
```

For Hugging Face datasets loaded directly, set `streaming=True`:

```python
from datasets import load_dataset

# Non-streaming (loads the entire dataset into memory)
dataset = load_dataset("large-dataset/corpus", split="train", streaming=False)

# Streaming (loads data incrementally)
dataset = load_dataset("large-dataset/corpus", split="train", streaming=True)
```
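
To confirm at runtime which mode you got, note that a streaming load returns an `IterableDataset` rather than a `Dataset` (a minimal sketch, placeholder dataset ID):

```python
from datasets import Dataset, IterableDataset, load_dataset

eager = load_dataset("large-dataset/corpus", split="train", streaming=False)
lazy = load_dataset("large-dataset/corpus", split="train", streaming=True)

print(isinstance(eager, Dataset))         # True: fully materialized, indexable
print(isinstance(lazy, IterableDataset))  # True: lazy, iteration-only
```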

### Streaming Limitations

When using streaming datasets, be aware of these limitations (the first two are demonstrated in the sketch after this list):

1. **No random access**: You cannot use `dataset[index]` to access specific samples. Streaming datasets only support iteration.

2. **No length information**: The `len(dataset)` operation is not available, so you cannot determine the total number of samples upfront.

3. **Single-pass iteration**: Each iteration consumes the stream. To iterate multiple times, recreate the dataset or use the `repeat_on_exhaustion` parameter.

4. **Limited shuffling**: Shuffling is done with a buffer (not the entire dataset), which may not provide perfect randomization.
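
A quick way to see the first two limitations in practice, assuming a Hugging Face streaming dataset (placeholder dataset ID):

```python
from datasets import load_dataset

stream = load_dataset("large-dataset/corpus", split="train", streaming=True)

# Streaming datasets typically expose no __len__, so len() raises TypeError.
try:
    print(len(stream))
except TypeError:
    print("length is unknown for a streaming dataset")

# There is no random access either; iteration is the supported access pattern.
first_example = next(iter(stream))
```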

### Distributed Training with Streaming

Streaming datasets support distributed training through sharding:

```python
from nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset import (
    ColumnMappedTextInstructionIterableDataset,
)

dataset = ColumnMappedTextInstructionIterableDataset(
    path_or_dataset_id="large-dataset/corpus",
    column_mapping={"question": "input", "answer": "output"},
    tokenizer=tokenizer,
)

# Shard the dataset across workers
dataset = dataset.shard(num_shards=8, index=worker_id)

# Enable shuffling with a buffer
dataset = dataset.shuffle(buffer_size=10000, seed=42)

# Set epoch for deterministic shuffling across epochs
dataset.set_epoch(epoch_num)
```
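
In the snippet above, `worker_id` and `epoch_num` are placeholders. One common way to wire the shard index, assuming a standard `torch.distributed` data-parallel setup and continuing from the example above, is to use the process rank:

```python
import torch.distributed as dist

# Hypothetical wiring (not the only option): one shard per data-parallel rank.
world_size = dist.get_world_size() if dist.is_initialized() else 1
rank = dist.get_rank() if dist.is_initialized() else 0

dataset = dataset.shard(num_shards=world_size, index=rank)
dataset = dataset.shuffle(buffer_size=10000, seed=42)
dataset.set_epoch(0)  # bump this once per epoch for a new shuffle order
```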

### Performance Considerations

**Memory vs. Speed Trade-offs**:
- Streaming reduces memory usage but may be slower than in-memory datasets
- Network latency can impact streaming performance for remote datasets
- Use local caching when repeatedly accessing the same remote dataset

**Buffer Size for Shuffling**:
- Larger buffers provide better randomization but use more memory
- A buffer size of 10,000-100,000 samples is typically a good balance
- Perfect shuffling would require a buffer as large as the dataset itself, defeating the purpose of streaming

**Prefetching**:
- Most streaming implementations prefetch data in the background
- This helps hide network latency and keeps GPUs busy
- Adjust prefetch settings based on your network speed and batch size (see the sketch below)
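
Prefetching is typically configured on the dataloader rather than the dataset. A minimal sketch, assuming a standard PyTorch `DataLoader` (torchdata's `StatefulDataLoader` is designed as a drop-in replacement, so the knobs carry over):

```python
from torch.utils.data import DataLoader

# Sketch only: with num_workers > 0, each worker keeps `prefetch_factor`
# batches ready in the background, which helps hide network latency when
# streaming from a remote source.
loader = DataLoader(
    dataset,            # e.g. a streaming/iterable dataset from the examples above
    batch_size=4,
    num_workers=4,
    prefetch_factor=2,  # batches prefetched per worker (PyTorch default is 2)
)
```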

### Example: Streaming a Large Dataset

Here's a complete example of using streaming for a large instruction-tuning dataset:

```yaml
dataset:
  _target_: nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset
  path_or_dataset_id: HuggingFaceH4/ultrachat_200k
  split: train_sft
  column_mapping:
    question: prompt
    answer: completion
  answer_only_loss_mask: true
  start_of_turn_token: "<|assistant|>"
  repeat_on_exhaustion: true  # Automatically restart when the stream ends

dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  batch_size: 4
  num_workers: 4
```

This configuration:
- Streams the dataset without loading it fully into memory
- Automatically repeats when the stream is exhausted
- Uses multiple workers for efficient data loading
- Applies answer-only loss masking during tokenization

## Packed Sequence Support
To reduce padding and improve throughput with variable-length sequences:
```yaml
