OOM on preprocessing dataset with large number of documents #34

@RaymondLi0

Description

When preprocessing a dataset of 55 GB with 31M samples, the job runs out of memory on a machine with 1.5 TB of RAM.

The error happens while saving the index. Other, larger datasets were processed without issue, but this dataset has by far the largest number of documents.

Traceback (most recent call last):
  File "Megatron-LM/tools/preprocess_data.py", line 227, in <module>
    main()
  File "Megatron-LM/tools/preprocess_data.py", line 224, in main
    builders[key].finalize(output_idx_files[key])
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 576, in finalize
    index.write(self._sizes, self._doc_idx)
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 369, in write
    pointers = self._get_pointers(sizes)
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 363, in _get_pointers
    pointers.append(address)
MemoryError
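
For context, the traceback points at `_get_pointers`, which builds the starting byte offset of each document as a running sum over `sizes`, appending one Python integer per document. A vectorized variant that keeps the offsets in a single NumPy array is sketched below; this is a minimal illustration under the assumption that `sizes` holds per-document lengths in elements and `dtype_size` is the number of bytes per token, not the actual Megatron-LM implementation.

```python
import numpy as np

def get_pointers(sizes, dtype_size):
    """Compute the starting byte offset of each document.

    sizes: sequence of per-document lengths (in tokens/elements).
    dtype_size: bytes per element (e.g. 2 for int16, 4 for int32).
    Returns an int64 array of byte offsets, one per document.
    """
    sizes = np.asarray(sizes, dtype=np.int64)
    pointers = np.zeros(len(sizes), dtype=np.int64)
    # Offset of document i is the total byte size of documents 0..i-1.
    np.cumsum(sizes[:-1] * dtype_size, out=pointers[1:])
    return pointers
```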

The workaround for now is to first shard the dataset and tokenize each shard independently. At training time, the shards can be blended back together.
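
A minimal sketch of the sharding step, assuming the raw corpus is a single JSONL file with one document per line; the paths and shard count are placeholders. Each resulting shard would then be run through tools/preprocess_data.py on its own.

```python
# Split a large JSONL corpus into N smaller shards so that each shard
# can be tokenized independently with tools/preprocess_data.py.
# Paths and shard count below are illustrative placeholders.
NUM_SHARDS = 8
INPUT_PATH = "corpus.jsonl"

shard_files = [open(f"corpus_shard_{i:02d}.jsonl", "w") for i in range(NUM_SHARDS)]
with open(INPUT_PATH) as src:
    for line_no, line in enumerate(src):
        # Round-robin assignment keeps shard sizes roughly balanced.
        shard_files[line_no % NUM_SHARDS].write(line)
for f in shard_files:
    f.close()
```

At training time the preprocessed shards can then be blended, e.g. by passing several weighted dataset prefixes to Megatron-LM's `--data-path` argument.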
