
[BUG] BorutaShap Initialization with Unfitted Models #136

@xuxu-wei

Description


When initializing a BorutaShap object with an unfitted model (e.g., RandomForestClassifier or RandomForestRegressor), the check_model method immediately checks for the feature_importances_ attribute. However, this attribute only exists after the model has been fitted, leading to the following issue:

Unfitted Models: If the base model has not yet been fitted, calling check_model triggers a NotFittedError, which results in the attribute check_feature_importance being set to False. When importance_measure is set to 'gini', this later raises:

AttributeError('Model must contain the feature_importances_ method to use Gini try Shap instead')

Behavior Explanation

  • No Model Provided: When no model is passed during initialization, check_model internally initializes a new Random Forest model. Since these models are pre-defined and known to support gini importance measures, the check is skipped. This is why the issue does not occur when a model is not explicitly provided.

  • Custom Model Passed: If a custom Random Forest model is passed (even if it's of the same type as the default), the immediate check for feature_importances_ fails because the model has not been fitted, causing the error.
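The two cases above boil down to a scikit-learn convention: feature_importances_ is a fit-dependent attribute, so any hasattr-style check on a freshly constructed estimator fails. A minimal standalone sketch (not BorutaShap internals):

```python
# feature_importances_ only appears after fit(), so an attribute check
# on an unfitted estimator is always False.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=0)
print(hasattr(model, "feature_importances_"))  # False before fitting

X, y = make_classification(n_samples=20, n_features=5, random_state=0)
model.fit(X, y)
print(hasattr(model, "feature_importances_"))  # True after fitting
```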

Steps to Reproduce

from BorutaShap import BorutaShap
from sklearn.ensemble import RandomForestClassifier

# Pass an unfitted model
model = RandomForestClassifier()
boruta_shap = BorutaShap(model=model, importance_measure='gini')

Suggested Solutions

Below are three potential solutions to address this issue:

  1. Pre-fit the Model on a Simple Dataset: Before checking the presence of the feature_importances_ attribute, clone the provided model and fit it on a predefined small dataset (aligned with the problem type: classification or regression). This ensures minimal performance impact and requires little modification to the existing codebase.

Example:

from sklearn.base import clone
from sklearn.datasets import make_classification, make_regression

def prefit_model(model, classification=True):
    """Fit a clone of `model` on a tiny synthetic dataset so that
    fit-dependent attributes such as feature_importances_ exist."""
    if classification:
        X, y = make_classification(n_samples=10, n_features=5, random_state=0)
    else:
        X, y = make_regression(n_samples=10, n_features=5, random_state=0)
    model_clone = clone(model)  # avoid mutating the user's model
    model_clone.fit(X, y)
    return model_clone

This can then be used during check_model to validate the feature_importances_ attribute.

  2. Defer feature_importances_ Check Until Needed: Postpone the check for feature_importances_ until the first usage of the feature importance, typically after the model's fit method has been called. However, this could delay the discovery of compatibility issues, especially for models with long training times, potentially leading to user frustration.
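A simplified sketch of this deferred check (the function name and wiring are illustrative, not the actual BorutaShap code path): the attribute is validated only when Gini importance is actually requested, i.e., after fitting.

```python
# Deferred check (illustrative): validate feature_importances_ only at the
# point where Gini importance is consumed, after the model has been fitted.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def gini_importance(model, X, y):
    model.fit(X, y)  # fit happens first ...
    if not hasattr(model, "feature_importances_"):  # ... then the check
        raise AttributeError(
            "Model must contain the feature_importances_ method "
            "to use Gini try Shap instead"
        )
    return model.feature_importances_

X, y = make_classification(n_samples=20, n_features=5, random_state=0)
imps = gini_importance(RandomForestClassifier(random_state=0), X, y)
print(imps.shape)  # one importance value per feature
```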

  3. Explicitly Require Fitted Models: Add a model fit check (e.g., using sklearn.utils.validation.check_is_fitted) before attempting to access feature_importances_. This would provide immediate feedback to the user if the model is not ready. While practical, this may not align well with typical usage patterns.
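For option 3, a standalone sketch of what the early check could look like (the error message is illustrative):

```python
# Explicit fitted-model requirement: check_is_fitted raises NotFittedError
# on an unfitted estimator, giving the user immediate feedback.
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

model = RandomForestClassifier()
try:
    check_is_fitted(model)
except NotFittedError:
    print("Model must be fitted before being passed to BorutaShap")
```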

Preferred Solution

From my perspective, Option (1) (pre-fitting the model on a simple dataset) strikes a balance between performance, user experience, and minimal codebase changes.

Additional Notes

The current behavior of bypassing the feature_importances_ check for internally initialized models is intentional and appropriate. However, it highlights why user-provided models, even of the same type (e.g., RandomForestClassifier), fail if not pre-fitted. This discrepancy should be addressed to provide a consistent and user-friendly experience.

@Ekeany Let me know if more details or assistance are needed! 😊
