improve code efficiency to handle large amount of data #158
JoachimPiret wants to merge 6 commits into ODINN-SciML:main from
Conversation
… : init() and mapSplitsToDataset()
…c-region), as well as the possibility to have randomness and a different sampling from run to run in set_train_test_split(). assign_train_test_indices(self, train_indices, test_indices, test_size) is defined to update the dataloader with the values of the selected test/train division after 10 samplings based on subregion.
…me() between csv and parquet
albangossard left a comment
Hi @JoachimPiret, thanks for opening this PR! The changes in mapSplitsToDataset will probably improve the performance on large datasets, that's great!
There are a few things that need to be fixed.
Note that I didn't run the code since the execution of the notebooks is currently failing in the CI. It looks like there is a bug to fix.
```python
        return iter(self.train_indices), iter(self.test_indices)

    def assign_train_test_indices(self, train_indices, test_indices, test_size):
```
What is the purpose of this function? I don't see it being called in the code and its description doesn't match with what it does. Is it intended? Or is it a WIP?
When I do the sampling to create a training set and a test set based on "C_region", I'm in fact sampling N times. Since the 12 regions are not homogeneous (some contain more SMB points than others), the sampling is performed N times and then the sampling whose numbers of points in the test/train sets are closest to the ts/(1-ts) split is selected.
Each time I do the sampling, train_indices and test_indices are assigned to the dataloader object, overwriting the previous values. So after the N samplings, the assigned indices correspond to the last sampling.
I use this function to assign to the dataloader the indices corresponding to the selected sampling.
```python
dataloader = mbm.dataloader.DataLoader(cfg, data=data)
N = 10
df = pd.DataFrame(index=[f'sampling_{k}' for k in range(1, N + 1)],
                  columns=['percent_train', 'train_region', 'train_indices', 'test_indices'])
ts = 0.3
tf = 'group-c_region'  # 'group-rgi', 'group-c_region' or 'group-meas-id'
for i in range(N):
    train_itr, test_itr = dataloader.set_train_test_split(test_size=ts, type_fold=tf, randomness=True)
    train_indices, test_indices = list(train_itr), list(test_itr)
    df_X_train = data.iloc[train_indices]
    y_train = df_X_train['POINT_BALANCE'].values
    df_X_test = data.iloc[test_indices]
    y_test = df_X_test['POINT_BALANCE'].values
```
```python
train_indices = chosen_df.train_indices.item()
test_indices = chosen_df.test_indices.item()
df_X_train = data.iloc[train_indices]
y_train = df_X_train['POINT_BALANCE'].values
df_X_test = data.iloc[test_indices]
y_test = df_X_test['POINT_BALANCE'].values
dataloader.assign_train_test_indices(train_indices, test_indices, ts)
```
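For readers of this thread, a minimal, hypothetical sketch (made-up values; `chosen_df` and `percent_train` named as in the snippet above, which does not show how `chosen_df` is selected) of how the best of the N samplings could be picked, i.e. the one whose train fraction is closest to 1 - ts:

```python
import pandas as pd

# Hypothetical example: three recorded samplings with their train fractions.
ts = 0.3  # target: train fraction closest to 1 - ts = 0.7
df = pd.DataFrame(
    index=[f"sampling_{k}" for k in range(1, 4)],
    data={
        "percent_train": [0.65, 0.71, 0.78],
        "train_indices": [[0, 1, 2], [0, 1, 3], [0, 2, 3]],
        "test_indices": [[3, 4], [2, 4], [1, 4]],
    },
)

# Select the row minimising |percent_train - (1 - ts)|.
best = (df["percent_train"] - (1 - ts)).abs().idxmin()
chosen_df = df.loc[[best]]
print(best)  # sampling_2, whose 0.71 is closest to 0.7
```

After this, `chosen_df.train_indices.item()` and `chosen_df.test_indices.item()` recover the indices of the selected sampling, as in the snippet above.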
Ok, I understand better now, thanks!
Your function assign_train_test_indices lets you set the indices, so it is not tied to the sampling being performed several times. Hence I suggest moving the sentences which mention this into a note, since they can be confusing for the user.
…nt for large dataset, output format to csv and parquet
…reate_group_kfold_splits() to cross-validate on subregion
albangossard left a comment
A few minor things to fix, otherwise this looks good to me :)
```python
try:
    regions = train_data["C_REGION"].values
except:
    regions = type(np.array([]))
```
This is more Pythonic than a try/except, which can hide bugs:
```suggestion
regions = train_data["C_REGION"].values if "C_REGION" in train_data.columns else np.array([])
```
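As an aside, the suggested conditional expression also fixes a latent bug: the original except branch assigns the ndarray class itself (`type(np.array([]))`) rather than an empty array. A small illustrative check, with hypothetical two-row frames:

```python
import numpy as np
import pandas as pd

# Illustrative frames: one with the C_REGION column, one without.
with_region = pd.DataFrame({"C_REGION": [1, 2], "POINT_BALANCE": [0.1, 0.2]})
without_region = pd.DataFrame({"POINT_BALANCE": [0.1, 0.2]})

lengths = []
for train_data in (with_region, without_region):
    # Fall back to an empty array when the column is absent, without a bare except.
    regions = train_data["C_REGION"].values if "C_REGION" in train_data.columns else np.array([])
    lengths.append(len(regions))

print(lengths)  # [2, 0]
```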
```python
        Dividing train and test ensemble based on subregion require to make the sampling N times and then choose the
        train-test division closest to the 70-30 repartition. At each iteration the Dataloader object is redifined as well as
        self.train_indices and self.test_indices meaning that the information in the Dataloader object are those of the last iterations
        and not those of the train-test division chosen after comparing to the 70-30 repartition.
        This function aims to correct this by reassigning the indices of the chosen sampling.
```
```suggestion
        Assign the `train_indices`, `test_indices` and `test_size` attributes of the object.

        Note:
            This can be useful when you divide the train and test ensembles based on subregion, since this requires making the sampling N times and then choosing the
            train-test division closest to the 70-30 repartition. At each iteration the Dataloader object is redefined, as well as
            self.train_indices and self.test_indices, meaning that the information in the Dataloader object is that of the last iteration
            and not that of the train-test division chosen after comparing to the 70-30 repartition.
            This function corrects this by reassigning the indices of the chosen sampling.
```
Dataset.py: improve code efficiency of two functions of class AggregatedDataset(): init() and mapSplitsToDataset()
transformtomonthly.py: allow recording the dataframe in parquet format in addition to csv format
dataloader.py: add the possibility to divide between test and train based on subregion (c-region), as well as the possibility to have randomness and a different sampling from run to run in set_train_test_split(). assign_train_test_indices(self, train_indices, test_indices, test_size) is defined to update the dataloader with the values of the selected test/train division after 10 samplings based on subregion.