improve code efficiency to handle large amount of data #158
JoachimPiret wants to merge 6 commits into ODINN-SciML:main from
Conversation
… : init() and mapSplitsToDataset()
…c-region), as well as the possibility to have randomness and a different sampling from run to run in set_train_test_split(). assign_train_test_indices(self, train_indices, test_indices, test_size) is defined to update the dataloader with the values of the selected test/train division after 10 samplings based on subregion.
…me() between csv and parquet
albangossard left a comment
Hi @JoachimPiret, thanks for opening this PR! The changes in mapSplitsToDataset will probably improve the performance on large datasets, that's great!
There are a few things that need to be fixed.
Note that I didn't run the code since the execution of the notebooks is currently failing in the CI. It looks like there is a bug to fix.
```python
        return iter(self.train_indices), iter(self.test_indices)

    def assign_train_test_indices(self, train_indices, test_indices, test_size):
```
What is the purpose of this function? I don't see it being called in the code and its description doesn't match with what it does. Is it intended? Or is it a WIP?
When I do the sampling to create a training set and a test set based on "C_region", I'm in fact sampling N times. Since the 12 regions are not homogeneous (some contain more SMB points than others), the sampling is performed N times and then the sampling whose numbers of points in the test/train sets are closest to the ts/(1-ts) split is selected.
Each time I do the sampling, train_indices and test_indices are assigned to the dataloader object, overwriting the previous values. So after the N samplings, the assigned indices correspond to the last sampling.
I use this function to assign to the dataloader the indices corresponding to the selected sampling.
```python
dataloader = mbm.dataloader.DataLoader(cfg, data=data)
N = 10
df = pd.DataFrame(index=[f'sampling_{k}' for k in range(1, N + 1)],
                  columns=['percent_train', 'train_region', 'train_indices', 'test_indices'])
ts = 0.3
tf = 'group-c_region'  # 'group-rgi', 'group-c_region' or 'group-meas-id'
for i in range(N):
    train_itr, test_itr = dataloader.set_train_test_split(test_size=ts, type_fold=tf, randomness=True)
    train_indices, test_indices = list(train_itr), list(test_itr)
    df_X_train = data.iloc[train_indices]
    y_train = df_X_train['POINT_BALANCE'].values
    df_X_test = data.iloc[test_indices]
    y_test = df_X_test['POINT_BALANCE'].values
```
```python
train_indices = chosen_df.train_indices.item()
test_indices = chosen_df.test_indices.item()
df_X_train = data.iloc[train_indices]
y_train = df_X_train['POINT_BALANCE'].values
df_X_test = data.iloc[test_indices]
y_test = df_X_test['POINT_BALANCE'].values
dataloader.assign_train_test_indices(train_indices, test_indices, ts)
```
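For readers of this thread, a minimal, hypothetical sketch (made-up values; `chosen_df` and `percent_train` named as in the snippet above, which does not show how `chosen_df` is selected) of how the best of the N samplings could be picked, i.e. the one whose train fraction is closest to 1 - ts:

```python
import pandas as pd

# Hypothetical example: three recorded samplings with their train fractions.
ts = 0.3  # target: train fraction closest to 1 - ts = 0.7
df = pd.DataFrame(
    index=[f"sampling_{k}" for k in range(1, 4)],
    data={
        "percent_train": [0.65, 0.71, 0.78],
        "train_indices": [[0, 1, 2], [0, 1, 3], [0, 2, 3]],
        "test_indices": [[3, 4], [2, 4], [1, 4]],
    },
)

# Select the row minimising |percent_train - (1 - ts)|.
best = (df["percent_train"] - (1 - ts)).abs().idxmin()
chosen_df = df.loc[[best]]
print(best)  # sampling_2, whose 0.71 is closest to 0.7
```

After this, `chosen_df.train_indices.item()` and `chosen_df.test_indices.item()` recover the indices of the selected sampling, as in the snippet above.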
Ok, I understand better now, thanks!
Your function assign_train_test_indices lets you set the indices, so it is not tied to the sampling being performed several times. Hence I suggest moving the sentences which mention this into a note, since they can be confusing for the user.
…nt for large dataset, output format to csv and parquet
…reate_group_kfold_splits() to cross-validate on subregion
albangossard left a comment
A few minor things to fix, otherwise this looks good to me :)
```python
try:
    regions = train_data["C_REGION"].values
except:
    regions = type(np.array([]))
```
This is more Pythonic than a try/except, which can hide bugs:
```suggestion
regions = train_data["C_REGION"].values if "C_REGION" in train_data.columns else np.array([])
```
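As an aside, the suggested conditional expression also fixes a latent bug: the original except branch assigns the ndarray class itself (`type(np.array([]))`) rather than an empty array. A small illustrative check, with hypothetical two-row frames:

```python
import numpy as np
import pandas as pd

# Illustrative frames: one with the C_REGION column, one without.
with_region = pd.DataFrame({"C_REGION": [1, 2], "POINT_BALANCE": [0.1, 0.2]})
without_region = pd.DataFrame({"POINT_BALANCE": [0.1, 0.2]})

lengths = []
for train_data in (with_region, without_region):
    # Fall back to an empty array when the column is absent, without a bare except.
    regions = train_data["C_REGION"].values if "C_REGION" in train_data.columns else np.array([])
    lengths.append(len(regions))

print(lengths)  # [2, 0]
```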
```python
        Dividing train and test ensemble based on subregion require to make the sampling N times and then choose the
        train-test division closest to the 70-30 repartition. At each iteration the Dataloader object is redifined as well as
        self.train_indices and self.test_indices meaning that the information in the Dataloader object are those of the last iterations
        and not those of the train-test division chosen after comparing to the 70-30 repartition.
        This function aims to correct this by reassigning the indices of the chosen sampling.
```
```suggestion
        Assign the `train_indices`, `test_indices` and `test_size` attributes of the object.

        Note:
            This can be useful when you divide the train and test ensembles based on subregion, since this requires making the sampling N times and then choosing the
            train-test division closest to the 70-30 repartition. At each iteration the Dataloader object is redefined, as well as
            self.train_indices and self.test_indices, meaning that the information in the Dataloader object is that of the last iteration
            and not that of the train-test division chosen after comparing to the 70-30 repartition.
            This function corrects this by reassigning the indices of the chosen sampling.
```
Dataset.py: improve code efficiency of two functions of class AggregatedDataset(): init() and mapSplitsToDataset()
transformtomonthly.py: allow recording the dataframe in parquet format in addition to csv format
dataloader.py: add the possibility to divide between test and train based on subregion (c-region), as well as the possibility to have randomness and a different sampling from run to run in set_train_test_split(). assign_train_test_indices(self, train_indices, test_indices, test_size) is defined to update the dataloader with the values of the selected test/train division after 10 samplings based on subregion.