[BUG] Feature importance not evaluated on test set when setting train_or_test = 'test' #126

@wilmar-thomas

Description

Describe the bug

I don't know if this is the intended behaviour, but when setting the train_or_test parameter to 'test', the data is first split into train/test and the model is then fit on the training set, cf. lines 302-310:

        if self.train_or_test.lower() == 'test':
            # keeping the same naming convention so as not to add complexity later on
            self.X_boruta_train, self.X_boruta_test, self.y_train, self.y_test, self.w_train, self.w_test = train_test_split(self.X_boruta,
                                                                                                                                self.y,
                                                                                                                                self.sample_weight,
                                                                                                                                test_size=0.3,
                                                                                                                                random_state=self.random_state,
                                                                                                                                stratify=self.stratify)
            self.Train_model(self.X_boruta_train, self.y_train, sample_weight = self.w_train)

However, X_boruta_test is not used anywhere else; in fact, the whole dataset is used to derive feature importances regardless of the chosen train_or_test, cf. lines 856 and 873 for importance_measure == 'shap':

self.shap_values = np.array(explainer.shap_values(self.X_boruta))
self.shap_values = explainer.shap_values(self.X_boruta)

and line 815 for importance_measure == 'perm':

perm_importances_ = permutation_importance(self.model, self.X, self.y, scoring='f1')

While this may not constitute a big difference for SHAP (according to this post), it does not correspond to what is recommended here for permutation feature importance.
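To illustrate what the recommendation amounts to, here is a minimal sketch (data and model names are illustrative, not taken from BorutaShap): fit on the training split, then pass the held-out split to permutation_importance instead of the full dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative data; in BorutaShap this would correspond to self.X / self.y
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Recommended: score the permutations on held-out data, not on the full X,
# so that importances reflect generalization rather than memorized samples
perm = permutation_importance(model, X_test, y_test, scoring='f1',
                              n_repeats=10, random_state=0)
print(perm.importances_mean.shape)  # one mean importance per feature
```

In other words, when train_or_test == 'test' I would have expected self.X_boruta_test / self.y_test to be passed here rather than self.X / self.y.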

Granted, X and X_train are not exactly the same, but they still share 70% of the samples, so I'm wondering whether this is the intended behaviour. Could anyone provide some guidance on this?

Thank you for your help.
