Skip to content

Conversation

@hanaol
Copy link
Collaborator

@hanaol hanaol commented Jan 16, 2026

Problem

The current data loader has the following issues:

  • It relies on separate scripts within the data registry for handling the MP and QM9 datasets, even though they share common utilities.

  • It performs the train–validation split on the fly, without explicit support for a held-out test set.

Solution

  • The data loader has been refactored to address these issues.

  • The data module and its configuration are now defined in the configuration files and loaded with Hydra, reducing data-loader boilerplate and improving modularity and reproducibility.

Note: The root key in the configuration should point to a directory with the following structure:

root/
├── mp_filelist.txt   # List of sample indices
├── split.json        # Dictionary with 'train', 'validation', and 'test' keys,
│                     # each mapped to a list of indices referring to mp_filelist.txt or qm9_filelist.txt
├── data/             
└── label/         

@hanaol hanaol mentioned this pull request Jan 16, 2026
@hanaol
Copy link
Collaborator Author

hanaol commented Jan 16, 2026

@forklady42 This is a reminder to discuss dataloader shuffle and whether we need to explicitly set a DistributedSampler.

@forklady42
Copy link
Collaborator

This is a reminder to discuss dataloader shuffle and whether we need to explicitly set a DistributedSampler.

We can discuss further but my assumption is that we should start by using Lightning's built-in default DistributedSampler rather than explicitly setting shuffle or writing a custom DistributedSampler. If we need up with questions/concerns about the sampler, then we can write our own custom logic.

Copy link
Collaborator

@forklady42 forklady42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments to address.

How have you tested these changes thus far?

# note: no sampler, so all devices will get full set
)

def test_dataloader(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be used anywhere and would raise a KeyError because test is not included in subsets.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@forklady42 I didn't run into any issues. Also, the new commit should handle it more safely in the setup function.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, the code for the test dataset is here: #61

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What error are you running into? Can you share it here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanaol the issue is that "test" is not included in the splits here. Because of that, setting self.test_set on line 63 will fail.

The root issue is that the test dataset code is in a separate PR. Keeping PRs small is good! However, it's best to separate them so that the first won't have an error without the other. In this case, that would mean including test_dataloader() and elif stage == "test" in the other PR. Because I expect these PRs will be merged around the same time, I am not considering this a blocker. It's good to keep this in mind in the future though.

Copy link
Collaborator

@forklady42 forklady42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more small comments but I realized you might still be working on this, so didn't get through all of the code.

Copy link
Collaborator Author

@hanaol hanaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made some changes.

Copy link
Collaborator

@forklady42 forklady42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks much better. Thanks for addressing the issues!

if category == "mp":
data, label = load_chgcar(root, index)
elif category == "qm9":
data, label = load_npy(root, index)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: would be good to explicitly throw an error if the category isn't mp or qm9, i.e.

else:
    raise ValueError(f"Unknown category: {category}. Supported: 'mp', 'qm9'")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants