Commit db83d76

solegalli, Gilles Verbockhaven, and gverbock authored
adds psi selection docs (#326)
* first draft of psi selection docs
* Correct bug in PSI and first draft for documentation
* add test on shuffled dataframe and documentation
* Add info to index and fix unexpected identation
* fix unexpected identation
* fix unexpected identation
* remove some bullets
* add blank
* add blank
* add blank
* set to one row
* test bullets
* adjust doc examples
* test equations
* Move the math def of PSI from docstring to documentation
* correct typo
* modifies api selection index to split tables and add psi class
* reorganises order of selectors in index
* minor edits to psi class docstrings
* edits user guide index
* updates drop psi user guide
* first update of code examples
* final edits to docs
* fix style
* address some TODO from the PR
* Fine tune examples
* Add plot distribution for case 1
* Add code and test for cdf plot
* Add image distribution case 4
* add plot distribution PSI case 5
* add image distribution shift case 3
* Add code and comment for the PSI distribution shift
* extend tests to make explicit diff between categ and list of categ
* remove TODO
* remove - to see if it passes doc test
* remove line 68 to see if it passes doc test
* minor wording changes
* Changes related to last review

Co-authored-by: Gilles Verbockhaven <[email protected]>
Co-authored-by: gverbock <[email protected]>
1 parent 066a0e6 commit db83d76

14 files changed

+993
-75
lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -117,6 +117,7 @@ transforming parameters from the data and then transform it.
 * SelectByTargetMeanPerformance
 * RecursiveFeatureElimination
 * RecursiveFeatureAddition
+* DropHighPSIFeatures

 ### Preprocessing
 * MatchVariables
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+DropHighPSIFeatures
+===================
+
+
+.. autoclass:: feature_engine.selection.DropHighPSIFeatures
+    :members:
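
For readers unfamiliar with the metric behind the new selector, the Population Stability Index compares the binned distribution of a variable in a reference sample with its distribution in a second sample. The sketch below is not taken from this commit and is not Feature-engine's implementation. It assumes the commonly used definition, PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), equal-width bins built from the reference sample, and a small floor `eps` for empty bins. The function name `psi` and its parameters are hypothetical.

```python
import math
import random


def psi(expected, actual, bins=10, eps=1e-4):
    """Population Stability Index between two numeric samples (illustrative only)."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the reference range; fall back to 1.0 if the range is zero.
    width = (hi - lo) / bins or 1.0

    def fractions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each fraction at eps (an assumption) to keep the logarithm finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


rng = random.Random(0)
reference = [rng.gauss(0, 1) for _ in range(5000)]
same = [rng.gauss(0, 1) for _ in range(5000)]      # same distribution as the reference
shifted = [rng.gauss(1, 1) for _ in range(5000)]   # mean shifted by one standard deviation

psi_same = psi(reference, same)
psi_shifted = psi(reference, shifted)
print(f"same distribution: {psi_same:.4f}, shifted distribution: {psi_shifted:.4f}")
```

A widely quoted rule of thumb treats values below 0.1 as stable and values above 0.25 as a substantial shift; whether those cut-offs suit a given dataset is a judgment call.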

docs/api_doc/selection/index.rst

Lines changed: 23 additions & 5 deletions
@@ -6,10 +6,18 @@ Feature Selection

 Feature-engine's feature selection transformers are used to drop subsets of variables,
 or in other words, to select subsets of variables. Feature-engine hosts selection
-algorithms that are in general, not available in other libraries. These algorithms have
+algorithms that are, in general, not available in other libraries. These algorithms have
 been gathered from data science competitions or used in the industry.

-**Summary of Feature-engine's selectors main characteristics**
+Feature-engine's transformers select features based on 2 strategies. They either select
+features by looking at the features intrinsic characteristics, like distributions or their
+relationship with other features. Or they select features based on their impact on the
+machine learning model performance.
+
+In the following tables you find the algorithms that belong to either category.
+
+Selection based on feature characteristics
+------------------------------------------

 ============================================ ======================= ============= ====================================================================================
 Transformer                                  Categorical variables   Allows NA     Description
@@ -19,13 +27,23 @@ been gathered from data science competitions or used in the industry.
 :class:`DropDuplicateFeatures()`             √                       √             Drops features that are duplicated
 :class:`DropCorrelatedFeatures()`            ×                       √             Drops features that are correlated
 :class:`SmartCorrelatedSelection()`          ×                       √             From a correlated feature group drops the less useful features
+:class:`DropHighPSIFeatures()`               ×                       √             Drops features with high Population Stability Index
+============================================ ======================= ============= ====================================================================================
+
+Selection based on model performance
+------------------------------------
+
+============================================ ======================= ============= ====================================================================================
+Transformer                                  Categorical variables   Allows NA     Description
+============================================ ======================= ============= ====================================================================================
 :class:`SelectByShuffling()`                 ×                       ×             Selects features if shuffling their values causes a drop in model performance
 :class:`SelectBySingleFeaturePerformance()`  ×                       ×             Removes observations with missing data from the dataset
 :class:`SelectByTargetMeanPerformance()`     √                       ×             Using the target mean as performance proxy, selects high performing features
 :class:`RecursiveFeatureElimination()`       ×                       ×             Removes features recursively by evaluating model performance
 :class:`RecursiveFeatureAddition()`          ×                       ×             Adds features recursively by evaluating model performance
 ============================================ ======================= ============= ====================================================================================

+
 .. toctree::
     :maxdepth: 2
     :hidden:
@@ -35,21 +53,21 @@ been gathered from data science competitions or used in the industry.
     DropDuplicateFeatures
     DropCorrelatedFeatures
     SmartCorrelatedSelection
+    DropHighPSIFeatures
     SelectByShuffling
     SelectBySingleFeaturePerformance
     SelectByTargetMeanPerformance
     RecursiveFeatureElimination
     RecursiveFeatureAddition

-
 Other Feature Selection Libraries
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+---------------------------------

 For additional feature selection algorithms visit the following open-source libraries:

 * `Scikit-learn selection <https://scikit-learn.org/stable/modules/feature_selection.html>`_
 * `MLXtend selection <http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/>`_

-Scikit-learn hosts multiple filter and embedded methods, that select features based on
+Scikit-learn hosts multiple filter and embedded methods that select features based on
 statistical tests or machine learning model derived importance. MLXtend hosts greedy
 (wrapper) feature selection methods.
129 KB
105 KB
182 KB
191 KB

docs/images/selectionChart.png

24 KB

docs/index.rst

Lines changed: 2 additions & 0 deletions
@@ -166,12 +166,14 @@ Feature Selection:
 - :doc:`api_doc/selection/DropDuplicateFeatures`: drops duplicated variables from a dataframe
 - :doc:`api_doc/selection/DropCorrelatedFeatures`: drops correlated variables from a dataframe
 - :doc:`api_doc/selection/SmartCorrelatedSelection`: selects best features from correlated groups
+- :doc:`api_doc/selection/DropHighPSIFeatures`: selects features based on the Population Stability Index (PSI)
 - :doc:`api_doc/selection/SelectByShuffling`: selects features by evaluating model performance after feature shuffling
 - :doc:`api_doc/selection/SelectBySingleFeaturePerformance`: selects features based on their performance on univariate estimators
 - :doc:`api_doc/selection/SelectByTargetMeanPerformance`: selects features based on target mean encoding performance
 - :doc:`api_doc/selection/RecursiveFeatureElimination`: selects features recursively, by evaluating model performance
 - :doc:`api_doc/selection/RecursiveFeatureAddition`: selects features recursively, by evaluating model performance

+
 Preprocessing:
 --------------
