OOM on preprocessing dataset with large number of documents #34

@RaymondLi0

Description

When preprocessing a dataset of 55 GB with 31M samples, the job runs out of memory on a machine with 1.5 TB of RAM.

The error happens while saving the index. Other, larger datasets were processed without issue, but this dataset has by far the largest number of documents.

Traceback (most recent call last):
  File "Megatron-LM/tools/preprocess_data.py", line 227, in <module>
    main()
  File "Megatron-LM/tools/preprocess_data.py", line 224, in main
    builders[key].finalize(output_idx_files[key])
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 576, in finalize
    index.write(self._sizes, self._doc_idx)
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 369, in write
    pointers = self._get_pointers(sizes)
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 363, in _get_pointers
    pointers.append(address)
MemoryError
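
For context, the traceback points at `_get_pointers`, which builds the starting byte offset of each document as a running sum over `sizes`, appending one Python integer per document. A vectorized variant that keeps the offsets in a single NumPy array is sketched below; this is a minimal illustration under the assumption that `sizes` holds per-document lengths in elements and `dtype_size` is the number of bytes per token, not the actual Megatron-LM implementation.

```python
import numpy as np

def get_pointers(sizes, dtype_size):
    """Compute the starting byte offset of each document.

    sizes: sequence of per-document lengths (in tokens/elements).
    dtype_size: bytes per element (e.g. 2 for int16, 4 for int32).
    Returns an int64 array of byte offsets, one per document.
    """
    sizes = np.asarray(sizes, dtype=np.int64)
    pointers = np.zeros(len(sizes), dtype=np.int64)
    # Offset of document i is the total byte size of documents 0..i-1.
    np.cumsum(sizes[:-1] * dtype_size, out=pointers[1:])
    return pointers
```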

The workaround for now is to first shard the dataset and tokenize each shard independently. At training time, the shards can be blended back together.
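
A minimal sketch of the sharding step, assuming the raw corpus is a single JSONL file with one document per line; the paths and shard count are placeholders. Each resulting shard would then be run through tools/preprocess_data.py on its own.

```python
# Split a large JSONL corpus into N smaller shards so that each shard
# can be tokenized independently with tools/preprocess_data.py.
# Paths and shard count below are illustrative placeholders.
NUM_SHARDS = 8
INPUT_PATH = "corpus.jsonl"

shard_files = [open(f"corpus_shard_{i:02d}.jsonl", "w") for i in range(NUM_SHARDS)]
with open(INPUT_PATH) as src:
    for line_no, line in enumerate(src):
        # Round-robin assignment keeps shard sizes roughly balanced.
        shard_files[line_no % NUM_SHARDS].write(line)
for f in shard_files:
    f.close()
```

At training time the preprocessed shards can then be blended, e.g. by passing several weighted dataset prefixes to Megatron-LM's `--data-path` argument.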
