-
Notifications
You must be signed in to change notification settings - Fork 0
Hanaol/dataloader 2 #60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@forklady42 This is a reminder to discuss dataloader |
We can discuss further but my assumption is that we should start by using Lightning's built-in default |
forklady42
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments to address.
How have you tested these changes thus far?
| # note: no sampler, so all devices will get full set | ||
| ) | ||
|
|
||
| def test_dataloader(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to be used anywhere and would raise a KeyError because test is not included in subsets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@forklady42 I didn't run into any issues. Also, the new commit should handle it more safely in the setup function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, the code for the test dataset is here: #61
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What error are you running into? Can you share it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hanaol the issue is that "test" is not included in the splits here. Because of that, setting self.test_set on line 63 will fail.
The root issue is that the test dataset code is in a separate PR. Keeping PRs small is good! However, it's best to separate them so that the first won't have an error without the other. In this case, that would mean including test_dataloader() and elif stage == "test" in the other PR. Because I expect these PRs will be merged around the same time, I am not considering this a blocker. It's good to keep this in mind in the future though.
forklady42
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple more small comments but I realized you might still be working on this, so didn't get through all of the code.
hanaol
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made some changes.
forklady42
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much better. Thanks for addressing the issues!
| if category == "mp": | ||
| data, label = load_chgcar(root, index) | ||
| elif category == "qm9": | ||
| data, label = load_npy(root, index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: would be good to explicitly throw an error if the category isn't mp or qm9, i.e.
else:
raise ValueError(f"Unknown category: {category}. Supported: 'mp', 'qm9'")
Problem
The current data loader has the following issues:
It relies on separate scripts within the data registry for handling the MP and QM9 datasets, even though they share common utilities.
It performs the train–validation split on the fly, without explicit support for a held-out test set.
Solution
The data loader has been refactored to address these issues.
The data module and its configuration are now defined in the configuration files and loaded with
Hydra, reducing data-loader boilerplate and improving modularity and reproducibility.Note: The root key in the configuration should point to a directory with the following structure: