Hanaol/dataloader 2 #60

hanaol · 2026-01-16T19:23:39Z

Problem

The current data loader has the following issues:

- It relies on separate scripts within the data registry for handling the MP and QM9 datasets, even though they share common utilities.
- It performs the train–validation split on the fly, without explicit support for a held-out test set.

Solution

The data loader has been refactored to address these issues.
The data module and its configuration are now defined in the configuration files and loaded with Hydra, reducing data-loader boilerplate and improving modularity and reproducibility.

Note: The root key in the configuration should point to a directory with the following structure:

root/
├── mp_filelist.txt   # List of sample indices (qm9_filelist.txt for the QM9 dataset)
├── split.json        # Dictionary with 'train', 'validation', and 'test' keys,
│                     # each mapped to a list of indices referring to mp_filelist.txt or qm9_filelist.txt
├── data/
└── label/
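
The layout above can be read with a few lines of Python. This is a minimal sketch, not the PR's code: it assumes the filelist holds one sample ID per line and that split.json stores integer positions into that list. The helper name `resolve_split` is hypothetical.

```python
import json
from pathlib import Path


def resolve_split(root, filelist_name="mp_filelist.txt"):
    """Map each split name in split.json to the sample IDs it selects.

    Assumptions (not confirmed by the PR): the filelist has one sample ID
    per line, and split.json stores integer positions into that list.
    """
    root = Path(root)
    sample_ids = (root / filelist_name).read_text().split()
    with open(root / "split.json") as f:
        split = json.load(f)

    # Fail early if one of the three documented keys is absent.
    missing = {"train", "validation", "test"} - split.keys()
    if missing:
        raise KeyError(f"split.json is missing keys: {sorted(missing)}")

    return {
        name: [sample_ids[i] for i in indices]
        for name, indices in split.items()
    }
```

Checking for all three keys up front turns a missing split into an immediate, descriptive error rather than a failure later in training or testing.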

hanaol · 2026-01-16T22:04:55Z

@forklady42 This is a reminder to discuss dataloader shuffle and whether we need to explicitly set a DistributedSampler.

forklady42 · 2026-01-20T17:19:49Z

> This is a reminder to discuss dataloader shuffle and whether we need to explicitly set a DistributedSampler.

We can discuss further, but my assumption is that we should start by using Lightning's built-in default DistributedSampler rather than explicitly setting shuffle or writing a custom DistributedSampler. If we end up with questions or concerns about the sampler, then we can write our own custom logic.
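
For context on what a distributed sampler provides: each process receives its own shard of the indices instead of the full dataset. The sketch below is a simplified pure-Python illustration of that idea, not PyTorch's or Lightning's actual implementation. The function name `shard_indices` and its parameters are hypothetical.

```python
import random


def shard_indices(n_samples, num_replicas, rank, epoch=0, seed=0, shuffle=True):
    """Return the dataset indices that one process (rank) would iterate over.

    Simplified illustration only: every rank shuffles with the same seed so
    all ranks agree on the order, the list is padded so it divides evenly,
    and each rank then takes every num_replicas-th index.
    """
    indices = list(range(n_samples))
    if shuffle:
        # Same seed + epoch on every rank gives one shared permutation.
        random.Random(seed + epoch).shuffle(indices)

    # Pad by cycling through existing indices so all ranks get equal counts.
    pad = (-len(indices)) % num_replicas
    indices += [indices[i % n_samples] for i in range(pad)]

    return indices[rank::num_replicas]


if __name__ == "__main__":
    for r in range(4):
        print(r, shard_indices(10, num_replicas=4, rank=r, shuffle=False))
```

Without such sharding, every device iterates over the full dataset each epoch, which is the behavior a "no sampler" data loader would have.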

forklady42

Left some comments to address.

How have you tested these changes thus far?

pyproject.toml

src/electrai/dataloader/split.py

src/electrai/dataloader/utils.py

forklady42 · 2026-01-21T17:06:59Z

src/electrai/dataloader/dataset.py

+            # note: no sampler, so all devices will get full set
+        )
+
+    def test_dataloader(self):


This doesn't seem to be used anywhere and would raise a KeyError because test is not included in subsets.

@forklady42 I didn't run into any issues. Also, the new commit should handle it more safely in the setup function.

For reference, the code for the test dataset is here: #61

What error are you running into? Can you share it here?

@hanaol the issue is that "test" is not included in the splits here. Because of that, setting self.test_set on line 63 will fail.

The root issue is that the test dataset code is in a separate PR. Keeping PRs small is good! However, it's best to separate them so that the first won't have an error without the other. In this case, that would mean including test_dataloader() and elif stage == "test" in the other PR. Because I expect these PRs will be merged around the same time, I am not considering this a blocker. It's good to keep this in mind in the future though.
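
One way to keep a PR self-consistent when the test split lands separately is to guard the lookup. The sketch below is a framework-free stand-in, assuming a `subsets` dictionary keyed by split name as described in the comments above. The class name `DataModuleSketch` and its attributes are hypothetical and do not reproduce the PR's data module.

```python
class DataModuleSketch:
    """Hypothetical stand-in for the data module; not the PR's class."""

    def __init__(self, subsets):
        # subsets: dict mapping split name -> dataset (any sequence here).
        self.subsets = subsets
        self.train_set = None
        self.val_set = None
        self.test_set = None

    def setup(self, stage=None):
        if stage in ("fit", "validate", None):
            self.train_set = self.subsets["train"]
            self.val_set = self.subsets["validation"]
        # Only touch the test split when it was actually provided,
        # so a missing key cannot raise KeyError here.
        if stage in ("test", None) and "test" in self.subsets:
            self.test_set = self.subsets["test"]

    def test_dataloader(self):
        if self.test_set is None:
            raise RuntimeError(
                "No 'test' split configured; add it to the splits first."
            )
        return self.test_set
```

With this shape, a configuration without a test split still supports training, and asking for the test loader fails with a message that names the missing split.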

src/electrai/dataloader/dataset.py

src/electrai/dataloader/utils.py

src/electrai/dataloader/dataset.py

forklady42

A couple more small comments but I realized you might still be working on this, so didn't get through all of the code.

src/electrai/dataloader/dataset.py

src/electrai/dataloader/split.py

hanaol

made some changes.

forklady42

Looks much better. Thanks for addressing the issues!

forklady42 · 2026-01-26T17:28:24Z

src/electrai/dataloader/utils.py

+    if category == "mp":
+        data, label = load_chgcar(root, index)
+    elif category == "qm9":
+        data, label = load_npy(root, index)


Nit: would be good to explicitly throw an error if the category isn't mp or qm9, i.e.

else: raise ValueError(f"Unknown category: {category}. Supported: 'mp', 'qm9'")
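
Put together with the diff above, the suggestion could look like the following sketch. The wrapper name `load_sample` is hypothetical, and the two loaders are placeholders standing in for the PR's `load_chgcar` and `load_npy`, whose real bodies are not shown here.

```python
def load_chgcar(root, index):
    """Placeholder for the PR's CHGCAR loader (real implementation not shown)."""
    return ("chgcar-data", index), ("chgcar-label", index)


def load_npy(root, index):
    """Placeholder for the PR's .npy loader (real implementation not shown)."""
    return ("npy-data", index), ("npy-label", index)


def load_sample(category, root, index):
    """Dispatch on dataset category and fail loudly on anything unexpected."""
    if category == "mp":
        data, label = load_chgcar(root, index)
    elif category == "qm9":
        data, label = load_npy(root, index)
    else:
        # Without this branch, data and label would be undefined below and
        # the caller would see an UnboundLocalError instead of a clear message.
        raise ValueError(f"Unknown category: {category}. Supported: 'mp', 'qm9'")
    return data, label
```

The explicit `else` branch makes a misspelled category fail at the dispatch point with a message that lists the supported values.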

Hananeh Oliaei added 3 commits January 10, 2026 17:09

new dataloader

8c4100d

updated train script

e75ef83

added lightning script

fffc85c

hanaol mentioned this pull request Jan 16, 2026

test features #61

Open

hanaol mentioned this pull request Jan 16, 2026

Add support for distributed training (DDP) #57

Merged

forklady42 requested changes Jan 21, 2026

Hananeh Oliaei added 3 commits January 23, 2026 12:24

reverted lint exceptions

ea0364f

updated dataset module

09dd77f

handled random seed for data split

7da5fa4

forklady42 reviewed Jan 23, 2026

Hananeh Oliaei added 2 commits January 24, 2026 00:04

updated some features

a56b770

updated train script

0d4f9af

hanaol commented Jan 26, 2026

forklady42 approved these changes Jan 26, 2026

hanaol mentioned this pull request Jan 27, 2026

Updated mp training scripts #56

Closed
