[Feature Request]: Larger than memory datasets. #210

@JonathanSchmidt1

Description

Problem

Now that multi-GPU training is working, we are very interested in training on some larger crystal structure datasets. However, these datasets do not fit into RAM. It would be great if it were possible to either load only a partial dataset on each DDP node, or to load the features on the fly, so that large-scale training becomes possible. I assume the LMDB datasets that OCP and MatSciML use should work for that. P.S.: thank you for the PyTorch Lightning implementation.
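
For reference, once samples can be fetched lazily (e.g. via something like the `LMDBDataset` sketched under "Proposed Solution" below), PyTorch's `DistributedSampler` already gives the per-node behavior: each DDP rank iterates over a disjoint subset of indices, so no rank ever materializes more than its own shard. A minimal sketch (the dataset class and the `crystals.lmdb` path are illustrative, not existing project API):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Illustrative lazy, map-style dataset (see the sketch below);
# each __getitem__ reads a single sample from disk.
dataset = LMDBDataset("crystals.lmdb")

# DistributedSampler assigns each DDP rank a disjoint subset of
# indices, so a lazy dataset only ever loads its own shard
# (plus DataLoader prefetch buffers) into RAM.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=4)
```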

Proposed Solution

Add an option to save and load data to/from an LMDB database.
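
A rough sketch of what such an option could look like, assuming samples are pickled and keyed by integer index (`write_lmdb` and `LMDBDataset` are illustrative names, not existing project API):

```python
import pickle

import lmdb
from torch.utils.data import Dataset


def write_lmdb(samples, path, map_size=2**40):
    """Serialize an iterable of samples into a single LMDB file."""
    env = lmdb.open(path, subdir=False, map_size=map_size)
    with env.begin(write=True) as txn:
        for i, sample in enumerate(samples):
            txn.put(str(i).encode("ascii"), pickle.dumps(sample))
    env.sync()
    env.close()


class LMDBDataset(Dataset):
    """Map-style dataset that fetches one pickled sample per lookup,
    so RAM usage stays independent of the dataset size."""

    def __init__(self, path):
        # readonly=True and lock=False let multiple DataLoader workers
        # and DDP ranks read the same file concurrently.
        self.env = lmdb.open(
            path, subdir=False, readonly=True,
            lock=False, readahead=False, meminit=False,
        )
        with self.env.begin() as txn:
            self.length = txn.stat()["entries"]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        with self.env.begin() as txn:
            return pickle.loads(txn.get(str(idx).encode("ascii")))
```

One caveat: py-lmdb warns against reusing an `Environment` across `fork()`, so in practice the handle would likely need to be reopened per DataLoader worker, as the OCP implementation does.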

Alternatives

Examples can be found in the OCP LMDB dataset creation tutorial:
https://github.com/Open-Catalyst-Project/ocp/blob/main/tutorials/lmdb_dataset_creation.ipynb
or in the MatSciML datasets module:
https://github.com/IntelLabs/matsciml/tree/main/matsciml/datasets

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata
Labels

data (Data loading and processing)
