[BUG] Feature importance not evaluated on test set when setting train_or_test = 'test' #126

@wilmar-thomas

Description

Describe the bug

I don't know if this is the intended behaviour, but when setting the train_or_test parameter to 'test', the data is first split into train/test and the model is then fit on the training set, cf. lines 302-310:

        if self.train_or_test.lower() == 'test':
            # keeping the same naming convention so as not to add complexity later on
            self.X_boruta_train, self.X_boruta_test, self.y_train, self.y_test, self.w_train, self.w_test = train_test_split(self.X_boruta,
                                                                                                                                self.y,
                                                                                                                                self.sample_weight,
                                                                                                                                test_size=0.3,
                                                                                                                                random_state=self.random_state,
                                                                                                                                stratify=self.stratify)
            self.Train_model(self.X_boruta_train, self.y_train, sample_weight = self.w_train)

However, X_boruta_test is not used anywhere else; in fact, the whole dataset is used to derive feature importances regardless of the chosen train_or_test, cf. lines 856 and 873 for importance_measure == 'shap':

self.shap_values = np.array(explainer.shap_values(self.X_boruta))
self.shap_values = explainer.shap_values(self.X_boruta)

and line 815 for importance_measure == 'perm':

perm_importances_ = permutation_importance(self.model, self.X, self.y, scoring='f1')

While this may not constitute a big difference for SHAP (according to this post), it does not correspond to what is recommended here for permutation feature importance.
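To illustrate what the recommendation amounts to, here is a minimal sketch (data and model names are illustrative, not taken from BorutaShap): fit on the training split, then pass the held-out split to permutation_importance instead of the full dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative data; in BorutaShap this would correspond to self.X / self.y
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Recommended: score the permutations on held-out data, not on the full X,
# so that importances reflect generalization rather than memorized samples
perm = permutation_importance(model, X_test, y_test, scoring='f1',
                              n_repeats=10, random_state=0)
print(perm.importances_mean.shape)  # one mean importance per feature
```

In other words, when train_or_test == 'test' I would have expected self.X_boruta_test / self.y_test to be passed here rather than self.X / self.y.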

Granted, X and X_train are not exactly the same, but they still share 70% of the samples, so I'm wondering whether this is the intended behaviour. Could anyone provide some guidance on this?

Thank you for your help.
