Description
When initializing a BorutaShap object with an unfitted model (e.g., `RandomForestClassifier` or `RandomForestRegressor`), the `check_model` method immediately checks for the presence of the `feature_importances_` attribute. However, this attribute is only available after the model is fitted, leading to the following issue:
Unfitted Models: If the base model is not yet fitted, calling `check_model` triggers a `NotFittedError`. This results in the attribute `check_feature_importance` being set to `False`. When `importance_measure` is set to `'gini'`, this later raises:

```
AttributeError('Model must contain the feature_importances_ method to use Gini try Shap instead')
```

Behavior Explanation
- No Model Provided: When no model is passed during initialization, `check_model` internally initializes a new Random Forest model. Since these default models are known to support the gini importance measure, the check is skipped. This is why the issue does not occur when a model is not explicitly provided.
- Custom Model Passed: If a custom Random Forest model is passed (even one of the same type as the default), the immediate check for `feature_importances_` fails because the model has not been fitted, causing the error.
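The distinction above comes down to when scikit-learn exposes `feature_importances_`. A minimal standalone sketch (using a small synthetic dataset) showing that the attribute only appears after fitting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=10)

# Before fitting, accessing feature_importances_ raises NotFittedError
# (a subclass of AttributeError), so hasattr reports False.
print(hasattr(model, "feature_importances_"))  # False

X, y = make_classification(n_samples=20, n_features=5, random_state=0)
model.fit(X, y)

# After fitting, the computed importances are available.
print(hasattr(model, "feature_importances_"))  # True
```

This is exactly the state a user-supplied unfitted model is in at the moment `check_model` runs.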
Steps to Reproduce
```python
from BorutaShap import BorutaShap
from sklearn.ensemble import RandomForestClassifier

# Pass an unfitted model
model = RandomForestClassifier()
boruta_shap = BorutaShap(model=model, importance_measure='gini')
```

Suggested Solutions
Below are three potential solutions to address this issue:
- Pre-fit the Model on a Simple Dataset: Before checking the presence of the feature_importances_ attribute, clone the provided model and fit it on a predefined small dataset (aligned with the problem type: classification or regression). This ensures minimal performance impact and requires little modification to the existing codebase.
Example:
```python
from sklearn.base import clone
from sklearn.datasets import make_classification, make_regression

def prefit_model(model, classification=True):
    # Fit a clone on a tiny synthetic dataset matching the problem type,
    # leaving the user's original estimator untouched.
    if classification:
        X, y = make_classification(n_samples=10, n_features=5)
    else:
        X, y = make_regression(n_samples=10, n_features=5)
    model_clone = clone(model)
    model_clone.fit(X, y)
    return model_clone
```

This can then be used during `check_model` to validate the `feature_importances_` attribute.
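A hypothetical integration showing how `check_model` could use such a helper (the helper is repeated here so the snippet runs standalone):

```python
from sklearn.base import clone
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier

# Same helper as above, repeated so this snippet is self-contained.
def prefit_model(model, classification=True):
    if classification:
        X, y = make_classification(n_samples=10, n_features=5)
    else:
        X, y = make_regression(n_samples=10, n_features=5)
    model_clone = clone(model)
    model_clone.fit(X, y)
    return model_clone

# Inside check_model, the attribute test would run on the fitted clone
# rather than on the raw, unfitted estimator:
user_model = RandomForestClassifier(n_estimators=10)
fitted_clone = prefit_model(user_model)
print(hasattr(fitted_clone, "feature_importances_"))  # True
print(hasattr(user_model, "feature_importances_"))    # False: original untouched
```

Since `clone` copies only hyperparameters, the user's estimator is never fitted or mutated as a side effect of the check.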
- Defer the `feature_importances_` Check Until Needed: Postpone the check for `feature_importances_` until the feature importance is first used, typically after the model's `fit` method has been called. However, this could delay the discovery of compatibility issues, especially for models with long training times, potentially leading to user frustration.
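Example (a minimal sketch of the deferred check; `LazyGiniCheck` is a hypothetical stand-in, not BorutaShap's actual API):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

class LazyGiniCheck:
    """Defers the gini-compatibility check until importances are requested."""

    def __init__(self, model):
        self.model = model  # no attribute check at construction time

    def feature_importance(self, X, y):
        self.model.fit(X, y)
        # Only now, after fit, is the attribute check meaningful.
        if not hasattr(self.model, "feature_importances_"):
            raise AttributeError(
                "Model must contain the feature_importances_ method "
                "to use Gini try Shap instead"
            )
        return self.model.feature_importances_

X, y = make_classification(n_samples=20, n_features=5, random_state=0)
imp = LazyGiniCheck(RandomForestClassifier(n_estimators=10)).feature_importance(X, y)
print(len(imp))  # 5

try:
    LazyGiniCheck(LogisticRegression()).feature_importance(X, y)
except AttributeError:
    print("incompatibility surfaced only after training")  # the drawback noted above
```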
- Explicitly Require Fitted Models: Add a model fit check (e.g., using `sklearn.utils.validation.check_is_fitted`) before attempting to access `feature_importances_`. This would provide immediate feedback to the user if the model is not ready. While practical, this may not align well with typical usage patterns.
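Example (a sketch of the fail-fast guard; `require_fitted` and its error message are illustrative, not existing BorutaShap code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

def require_fitted(model):
    # Fail fast with an actionable message instead of a late AttributeError.
    try:
        check_is_fitted(model)
    except NotFittedError as exc:
        raise ValueError(
            "importance_measure='gini' requires a fitted model; "
            "call model.fit(X, y) first or use importance_measure='shap'."
        ) from exc

try:
    require_fitted(RandomForestClassifier())
except ValueError:
    print("rejected unfitted model")  # immediate feedback at init time

X, y = make_classification(n_samples=20, n_features=5, random_state=0)
require_fitted(RandomForestClassifier(n_estimators=10).fit(X, y))  # passes silently
```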
Preferred Solution
From my perspective, Option (1) (pre-fitting the model on a simple dataset) strikes a balance between performance, user experience, and minimal codebase changes.
Additional Notes
The current behavior of bypassing the feature_importances_ check for internally initialized models is intentional and appropriate. However, it highlights why user-provided models, even of the same type (e.g., RandomForestClassifier), fail if not pre-fitted. This discrepancy should be addressed to provide a consistent and user-friendly experience.
@Ekeany Let me know if more details or assistance are needed!