
improve code efficiency to handle large amount of data #158

Open
JoachimPiret wants to merge 6 commits into ODINN-SciML:main from JoachimPiret:main

Conversation

@JoachimPiret

Dataset.py: improve the efficiency of two functions of the AggregatedDataset() class: init() and mapSplitsToDataset().
transformtomonthly.py: allow recording the dataframe in parquet format in addition to csv format.
dataloader.py: add the possibility to divide between test and train based on subregion (c-region), as well as the possibility to have randomness and a different sampling from one sampling to the next in set_train_test_split(). assign_train_test_indices(self, train_indices, test_indices, test_size) is defined to update the dataloader with the values of the selected test/train division after 10 samplings based on subregion.
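The dual-format export described for transformtomonthly.py could be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `save_dataframe` and the `formats` parameter are hypothetical.

```python
import pandas as pd


def save_dataframe(df: pd.DataFrame, basename: str, formats=("csv", "parquet")) -> list:
    """Write df once per requested format; return the paths actually written.

    Illustrative sketch only: transformtomonthly.py may use different
    names and options.
    """
    written = []
    for fmt in formats:
        path = f"{basename}.{fmt}"
        if fmt == "csv":
            df.to_csv(path, index=False)
            written.append(path)
        elif fmt == "parquet":
            try:
                # needs a parquet engine (pyarrow or fastparquet)
                df.to_parquet(path, index=False)
                written.append(path)
            except ImportError:
                pass  # no parquet engine installed; the csv copy still exists

    return written


if __name__ == "__main__":
    import os
    import tempfile

    base = os.path.join(tempfile.mkdtemp(), "monthly_demo")
    df = pd.DataFrame({"POINT_BALANCE": [0.1, -0.4], "C_REGION": ["11-01", "11-02"]})
    print(save_dataframe(df, base))
```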

@albangossard albangossard self-requested a review January 8, 2026 17:21
@albangossard albangossard added performance Improve computational performance of the model enhancement New feature or request labels Jan 8, 2026
Member

@albangossard albangossard left a comment


Hi @JoachimPiret, thanks for opening this PR! The changes in mapSplitsToDataset will probably improve the performance on large datasets, that's great!
There are a few things that need to be fixed.
Note that I didn't run the code since the execution of the notebooks is currently failing in the CI. It looks like there is a bug to fix.


return iter(self.train_indices), iter(self.test_indices)

def assign_train_test_indices(self,train_indices, test_indices, test_size):
Member

What is the purpose of this function? I don't see it being called in the code and its description doesn't match with what it does. Is it intended? Or is it a WIP?

Author

@JoachimPiret JoachimPiret Jan 13, 2026

When I do the sampling to create a training set and a test set based on "C_region", I'm in fact sampling N times. Since the 12 regions are not homogeneous (some contain more SMB points than others), the sampling is performed N times and then the sampling whose number of points in the test/train sets is closest to ts/(1-ts) is selected.
Each time I do the sampling, train_indices and test_indices are assigned to the dataloader object, overwriting the previous values. So after the N samplings, the assigned indices correspond to the last sampling.
I use this function to assign to the dataloader the indices corresponding to the selected sampling.

```python
dataloader = mbm.dataloader.DataLoader(cfg, data=data)
N = 10
df = pd.DataFrame(index=[f'sampling_{k}' for k in range(1, N + 1)],
                  columns=['percent_train', 'train_region', 'train_indices', 'test_indices'])
ts = 0.3
tf = 'group-c_region'  # 'group-rgi', 'group-c_region' or 'group-meas-id'
for i in range(N):
    train_itr, test_itr = dataloader.set_train_test_split(test_size=ts, type_fold=tf, randomness=True)
    train_indices, test_indices = list(train_itr), list(test_itr)
    df_X_train = data.iloc[train_indices]
    y_train = df_X_train['POINT_BALANCE'].values
    df_X_test = data.iloc[test_indices]
    y_test = df_X_test['POINT_BALANCE'].values
```
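The step that selects `chosen_df` between the two snippets is not shown in the comment. A minimal self-contained sketch of how it could work, assuming the `percent_train` column of `df` was filled during the N iterations (the helper `pick_closest_sampling` is hypothetical, not part of the PR):

```python
import pandas as pd


def pick_closest_sampling(df: pd.DataFrame, test_size: float) -> pd.DataFrame:
    """Return the one-row frame whose train fraction is closest to 1 - test_size.

    Hypothetical helper; assumes df has a numeric 'percent_train' column.
    """
    target = 1.0 - test_size
    best = (df["percent_train"] - target).abs().idxmin()
    return df.loc[[best]]


# toy example: three samplings with different train fractions
df = pd.DataFrame(
    {"percent_train": [0.62, 0.71, 0.78]},
    index=["sampling_1", "sampling_2", "sampling_3"],
)
chosen_df = pick_closest_sampling(df, test_size=0.3)
print(chosen_df.index[0])  # sampling_2 is closest to the 70-30 target
```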

```python
train_indices = chosen_df.train_indices.item()
test_indices = chosen_df.test_indices.item()
df_X_train = data.iloc[train_indices]
y_train = df_X_train['POINT_BALANCE'].values
df_X_test = data.iloc[test_indices]
y_test = df_X_test['POINT_BALANCE'].values

dataloader.assign_train_test_indices(train_indices, test_indices, ts)
```

Member

Ok, I understand better, thanks!
Your function assign_train_test_indices allows you to set the indices, so it does not have anything to do with the sampling being performed several times. Hence I suggest that you move the sentences which mention this into a note, since this can be confusing for the user.

…nt for large dataset, output format to csv and parquet
…reate_group_kfold_splits() to cross-validate on subregion
Member

@albangossard albangossard left a comment


A few minor things to fix, otherwise this looks good to me :)

Comment on lines +267 to +270
try:
regions = train_data["C_REGION"].values
except:
regions = type(np.array([]))
Member

This is more Pythonic than a try/except, which can hide bugs:

Suggested change
try:
regions = train_data["C_REGION"].values
except:
regions = type(np.array([]))
regions = train_data["C_REGION"].values if "C_REGION" in train_data.columns else np.array([])
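To illustrate why the explicit membership check is safer than a bare `except`, here are two toy functions on a toy frame (not the project's data); the misspelled column name is deliberate:

```python
import numpy as np
import pandas as pd


def regions_bare_except(train_data):
    # anti-pattern: a bare except swallows *any* error,
    # e.g. the typo in the column name below is silently masked
    try:
        return train_data["C_REGON"].values  # misspelled "C_REGION"
    except:
        return np.array([])


def regions_explicit(train_data):
    # only the intended case (column absent) falls back to an empty array;
    # any other error, like a typo, surfaces immediately
    if "C_REGION" in train_data.columns:
        return train_data["C_REGION"].values
    return np.array([])


df = pd.DataFrame({"C_REGION": ["11-01", "11-02"]})
print(regions_bare_except(df))  # -> [] : the typo bug is hidden
print(regions_explicit(df))     # -> ['11-01' '11-02']
```

Note also that the original fallback `regions = type(np.array([]))` assigns the *class* `numpy.ndarray` rather than an empty array, which the suggested one-liner fixes.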



Comment on lines +129 to +133
Dividing train and test ensemble based on subregion require to make the sampling N times and then choose the
train-test division closest to the 70-30 repartition. At each iteration the Dataloader object is redifined as well as
self.train_indices and self.test_indices meaning that the information in the Dataloader object are those of the last iterations
and not those of the train-test division chosen after comparing to the 70-30 repartition.
This function aims to correct this by reassigning the indices of the chosen sampling.
Member

Suggested change
Dividing train and test ensemble based on subregion require to make the sampling N times and then choose the
train-test division closest to the 70-30 repartition. At each iteration the Dataloader object is redifined as well as
self.train_indices and self.test_indices meaning that the information in the Dataloader object are those of the last iterations
and not those of the train-test division chosen after comparing to the 70-30 repartition.
This function aims to correct this by reassigning the indices of the chosen sampling.
Assign `train_indices`, `test_indices` as well as `test_size` attributes of the object.
Note:
This can be useful when you divide the train and test ensembles based on subregion, since this requires sampling N times and then choosing the
train-test division closest to the 70-30 repartition. At each iteration the Dataloader object is redefined, as well as
self.train_indices and self.test_indices, meaning that the information in the Dataloader object is that of the last iteration
and not that of the train-test division chosen after comparing to the 70-30 repartition.
This function corrects this by reassigning the indices of the chosen sampling.
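Given the docstring under discussion, a plausible shape for this setter is sketched below, under the assumption that it only updates the three attributes. The reduced `DataLoader` class here is a toy stand-in, not the project's class:

```python
class DataLoader:
    """Toy stand-in for the project's DataLoader, reduced to the split state."""

    def __init__(self):
        self.train_indices = []
        self.test_indices = []
        self.test_size = None

    def assign_train_test_indices(self, train_indices, test_indices, test_size):
        """Assign `train_indices`, `test_indices` and `test_size` attributes.

        Note: useful after sampling N candidate splits, since each call to
        set_train_test_split overwrites these attributes; re-assigning the
        chosen split restores the state of the selected sampling.
        """
        self.train_indices = list(train_indices)
        self.test_indices = list(test_indices)
        self.test_size = test_size


dl = DataLoader()
dl.assign_train_test_indices([0, 2, 3], [1, 4], 0.3)
print(dl.test_size)  # 0.3
```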

