EFS takes a fundamentally different approach to feature selection: instead of evaluating features individually, it trains many models on random feature subsets, then uses ensemble optimization to find the best combination of models. Features that appear in the winning ensemble are selected. This diversity-driven approach is especially effective for high-dimensional datasets where individual feature rankings may be unstable.
- Generate random feature subsets. EFS creates `n_subsets` (default 100) random subsets, each containing `subset_size` (default 30) features drawn from the full feature set.
- Train a model per subset. For each subset, a `GridSearchCV` tunes and trains the chosen model (CatBoost, XGBoost, RandomForest, or ExtraTrees). Cross-validated predictions are collected for every training sample.
- Build a model table. Each trained model is recorded along with its CV performance, the features it used (excluding those with zero importance), and its out-of-fold predictions. Models are sorted by performance.
- Ensemble scan (hill-climbing). Starting from the single best model, the module incrementally adds the next-best model and computes the ensemble performance (averaged predictions across models). This scan identifies the number of top models that, when ensembled, give the best combined score.
- Ensemble optimization with replacement. Starting from the models found in the scan, the optimizer iteratively tests adding each of the top `max_n_models` models (with replacement) to the ensemble. At each iteration, the model that improves ensemble performance the most is added. The process repeats for up to `max_n_iterations` or until no improvement is found.
- Feature aggregation. The final optimized ensemble is a weighted collection of models (weights = number of times each model appears). The union of all features used by the ensemble models becomes the selected feature set. Feature importance is reported as counts (how many times a feature appeared across ensemble models) and relative counts (counts / total models).
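The ensemble scan, the optimization with replacement, and the feature aggregation can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the module's actual code: the model table layout (a list of dicts with `features` and `oof` keys, already sorted best first), the helper names, and the use of negative mean squared error as the score are all invented for the example. It also assumes that a model added several times counts once per appearance when features are tallied.

```python
from collections import Counter


def ensemble_score(members, models, y_true):
    """Score the averaged out-of-fold predictions (negative MSE, higher is better)."""
    preds = [models[i]["oof"] for i in members]
    avg = [sum(vals) / len(preds) for vals in zip(*preds)]
    return -sum((t - p) ** 2 for t, p in zip(y_true, avg)) / len(y_true)


def ensemble_scan(models, y_true):
    """Find the number k of top models whose averaged predictions score best."""
    best_k, best_score = 1, ensemble_score([0], models, y_true)
    for k in range(2, len(models) + 1):
        score = ensemble_score(list(range(k)), models, y_true)
        if score > best_score:
            best_k, best_score = k, score
    return list(range(best_k)), best_score


def optimize_with_replacement(members, score, models, y_true,
                              max_n_models=30, max_n_iterations=50):
    """Greedily add the single most helpful model; a model may be added repeatedly."""
    members = list(members)
    candidates = range(min(max_n_models, len(models)))
    for _ in range(max_n_iterations):
        trials = {c: ensemble_score(members + [c], models, y_true) for c in candidates}
        best = max(trials, key=trials.get)
        if trials[best] <= score:  # no candidate improves the ensemble: stop
            break
        members.append(best)
        score = trials[best]
    return members, score


def aggregate_features(members, models):
    """Map each feature to (count, relative count) over the ensemble members."""
    counts = Counter()
    for i in members:
        counts.update(models[i]["features"])
    return {f: (c, c / len(members)) for f, c in counts.items()}


# Toy model table (assumed layout), sorted by individual performance, best first.
y = [1.0, 2.0, 3.0, 4.0]
models = [
    {"features": ["f1", "f2"], "oof": [1.2, 2.2, 3.2, 4.2]},
    {"features": ["f2", "f3"], "oof": [0.7, 1.7, 2.7, 3.7]},
    {"features": ["f4"], "oof": [1.5, 2.5, 3.5, 4.5]},
]

scan_members, scan_score = ensemble_scan(models, y)
final_members, final_score = optimize_with_replacement(scan_members, scan_score, models, y)
importance = aggregate_features(final_members, models)
selected_features = sorted(importance)
```

In this toy run the scan keeps the two best models, the optimizer adds the best model a second time, and the third model (with its feature `f4`) never enters the ensemble, so `f4` is not selected.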

| Parameter | Default | Description |
|---|---|---|
| `model` | `""` (auto) | Model name: CatBoost, XGB, RandomForest, or ExtraTrees |
| `subset_size` | `30` | Number of features per random subset |
| `n_subsets` | `100` | Number of random subsets to create |
| `cv` | `5` | Cross-validation folds |
| `max_n_iterations` | `50` | Iterations for ensemble optimization |
| `max_n_models` | `30` | Maximum models to consider in optimization |

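To get a feel for how `n_subsets` and `subset_size` interact, the sketch below draws random subsets and computes how often each feature is expected to be seen. The function name `make_random_subsets`, the `seed` argument, and the choice of uniform sampling without replacement inside each subset are assumptions for illustration; the module's own sampling may differ.

```python
import random


def make_random_subsets(feature_names, n_subsets=100, subset_size=30, seed=0):
    """Draw n_subsets random subsets of subset_size distinct features each."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    size = min(subset_size, len(feature_names))  # assumed guard for small feature sets
    return [sorted(rng.sample(feature_names, size)) for _ in range(n_subsets)]


# Hypothetical dataset with 500 features.
features = [f"feature_{i}" for i in range(500)]
subsets = make_random_subsets(features, n_subsets=100, subset_size=30, seed=42)

# Under uniform sampling, each feature is expected to appear in
# n_subsets * subset_size / n_features subsets.
expected_appearances = 100 * 30 / len(features)

# Features that were never drawn cannot be selected in this run.
never_drawn = set(features) - set().union(*subsets)
```

With the defaults and 500 features, each feature is expected to land in about 6 subsets. For much wider datasets, that average drops, so raising `n_subsets` or `subset_size` may be needed to give every feature a fair chance of being evaluated.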
EFS is ideal when:
- The dataset is high-dimensional (hundreds to thousands of features) and individual feature rankings are noisy or inconsistent.
- You want a diversity-driven selection that captures complementary sets of features rather than just the top-ranked ones.
- Compute resources are available for training many models in parallel.

Limitations:
- Computationally heavy: `n_subsets` models are trained, each with a grid search. With 100 subsets and a 4-parameter grid this can mean thousands of model fits.
- The random subset generation means results are seed-dependent. Different seeds may produce different feature sets, though the ensemble optimization helps stabilize this.
- Does not produce scores or predictions in the standard format (scores and predictions DataFrames are empty); it primarily returns feature counts as importance measures.
- Does not support time-to-event targets.
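The compute cost noted in the list above can be estimated before a run. The sketch below counts the fits for a grid search that tries every parameter combination on every fold and then refits the best candidate once, which is how scikit-learn's `GridSearchCV` behaves with its default `refit=True`. The grid itself and the function name `count_model_fits` are hypothetical, and the figure is a lower bound: any extra fits used to collect cross-validated predictions are not included.

```python
from math import prod


def count_model_fits(param_grid, n_subsets=100, cv=5, refit=True):
    """Estimate total model fits: one grid search per random subset."""
    n_candidates = prod(len(values) for values in param_grid.values())
    fits_per_subset = n_candidates * cv + (1 if refit else 0)
    return n_subsets * fits_per_subset


# Hypothetical 4-parameter grid with two values per parameter (16 combinations).
hypothetical_grid = {
    "n_estimators": [100, 300],
    "max_depth": [4, 8],
    "min_samples_leaf": [1, 5],
    "max_features": ["sqrt", 0.5],
}

total_fits = count_model_fits(hypothetical_grid)  # 100 * (16 * 5 + 1) = 8100
```

Halving `n_subsets` or trimming one value from a grid parameter reduces the total proportionally, which is usually the quickest way to bring a run within a time budget.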