Skip to content

Dimension errors when using sklearn OneHotEncoder with min_frequency parameter #545

@dclaz

Description

@dclaz

The documentation suggests that the sklearn OneHotEncoder should be a viable transformation when using the MimicExplainer, but I'm getting errors if I use it and set the min_frequency parameter to remove category levels with low counts.

If I set up my data preprocessor like this

image

(where I have ~7 categorical features, each with many levels)

# Define categorical transformer
categorical_transformer = Pipeline(
    steps=[
        ("cat_impute", SimpleImputer(strategy="constant", fill_value='missing')),
        ("onehot", OneHotEncoder(drop=None, handle_unknown="infrequent_if_exist", sparse=False, min_frequency=0.01)),
    ]
)
# Define numeric transformer
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

data_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)       
    ],
    remainder="drop",
)

I get the following error
image

However, if I set a different transformer for each categorical feature, the Explainer works, albeit with a Many to one/many maps found in input warning and produces outputs that don't really make sense (Half the features end up having very, very similar SHAP values).

image

# Define categorical transformer
categorical_transformer = Pipeline(
    steps=[
        ("cat_impute", SimpleImputer(strategy="constant", fill_value='missing')),
        ("onehot", OneHotEncoder(drop=None, handle_unknown="infrequent_if_exist", sparse=False, min_frequency=0.01)),
    ]
)
# Define numeric transformer
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# Construct list of categorical transformers 
categorical_treatments_list = [(feature, categorical_transformer, [feature]) for feature in categorical_features]

# Construct the data preprocessor
data_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        *categorical_treatments_list
    ],
    remainder="drop",
)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions