diff --git a/docs/images/quasiconstant.png b/docs/images/quasiconstant.png
new file mode 100644
index 000000000..077a52049
Binary files /dev/null and b/docs/images/quasiconstant.png differ
diff --git a/docs/user_guide/selection/DropConstantFeatures.rst b/docs/user_guide/selection/DropConstantFeatures.rst
index 6fb4858b5..d58393466 100644
--- a/docs/user_guide/selection/DropConstantFeatures.rst
+++ b/docs/user_guide/selection/DropConstantFeatures.rst
@@ -5,19 +5,50 @@
 DropConstantFeatures
 ====================
 
+Constant features are variables that show zero variability or, in other words, that have
+the same value in all rows. A key step towards training a machine learning model is to
+identify and remove constant features.
-The :class:`DropConstantFeatures()` drops constant and quasi-constant variables from a dataframe.
-By default, it drops only constant variables. Constant variables have a single
-value. Quasi-constant variables have a single value in most of its observations.
+Features with no or low variability rarely make useful predictors. Hence, removing them
+right at the beginning of the data science project is a good way of simplifying the dataset
+and the subsequent data preprocessing pipelines.
 
-This transformer works with numerical and categorical variables, and it offers a pretty straightforward
-way of reducing the feature space. Be mindful though, that depending on the context, quasi-constant
-variables could be useful.
+Filter methods are selection algorithms that select or remove features based solely on
+their characteristics. In this light, removing constant features could be considered part
+of the filter group of selection algorithms.
+
+In Python, we can find constant features using pandas `std` or `unique` methods, and then
+remove them with `drop`.
+
+With Scikit-learn, we can find and remove constant variables with `VarianceThreshold` to
+quickly reduce the number of features. `VarianceThreshold` is part of the
+`sklearn.feature_selection` API.
+
+`VarianceThreshold`, however, only works with numerical variables. Hence, we could only
+evaluate categorical variables after encoding them, which requires a prior step of data
+preprocessing just to remove redundant variables.
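+
+For illustration, here is a minimal sketch of both approaches described above. The toy
+dataframe and its column names are hypothetical, and `VarianceThreshold` is applied only
+to the numerical columns:
+
+.. code:: python
+
+    import pandas as pd
+    from sklearn.feature_selection import VarianceThreshold
+
+    X = pd.DataFrame({
+        "feat_num": [1.0, 2.0, 3.0, 4.0],
+        "feat_const": [7.0, 7.0, 7.0, 7.0],
+        "feat_cat": ["a", "a", "a", "a"],
+    })
+
+    # pandas: find columns with a single unique value, then drop them.
+    constant = [col for col in X.columns if X[col].nunique() == 1]
+    X_reduced = X.drop(columns=constant)
+
+    # Scikit-learn: VarianceThreshold removes zero-variance numerical columns only.
+    selector = VarianceThreshold(threshold=0)
+    X_numerical = selector.fit_transform(X.select_dtypes("number"))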
+
+Feature-engine introduces :class:`DropConstantFeatures()` to find and remove constant and
+quasi-constant features from a dataframe. :class:`DropConstantFeatures()` works with
+numerical, categorical, and datetime variables. It is therefore more versatile than
+Scikit-learn's transformer, because it allows us to drop constant and quasi-constant
+variables of any type without the need for prior data transformations.
+
+By default, :class:`DropConstantFeatures()` drops constant variables. We also have the
+option to drop quasi-constant features, which are those that show the same value in most
+rows and other values in only a small percentage of the rows.
+
+Because :class:`DropConstantFeatures()` works with numerical and categorical variables
+alike, it offers a straightforward way of reducing the feature space.
+
+Be mindful, though, that depending on the context, quasi-constant variables could be useful.
 
 **Example**
 
-Let's see how to use :class:`DropConstantFeatures()` in an example with the Titanic dataset. We
-first load the data and separate it into train and test:
+Let's see how to use :class:`DropConstantFeatures()` with the Titanic dataset. This dataset
+does not contain constant or quasi-constant variables, so for the sake of the demonstration,
+we will consider quasi-constant those features that show the same value in more than 70% of
+the rows.
+
+We first load the data and separate it into a training set and a test set:
 
 .. code:: python
 
@@ -36,7 +67,8 @@ first load the data and separate it into train and test:
     )
 
 Now, we set up the :class:`DropConstantFeatures()` to remove features that show the same
-value in more than 70% of the observations:
+value in more than 70% of the observations. We do this through the parameter `tol`. The
+default value for this parameter is 1, in which case the transformer removes only constant
+features.
 
 .. code:: python
 
@@ -62,8 +94,8 @@ The variables to drop are stored in the attribute `features_to_drop_`:
 
     ['parch', 'cabin', 'embarked', 'body']
 
-We see in the following code snippets that for the variables parch and embarked, more
-than 70% of the observations displayed the same value:
+We can check that the variables `parch` and `embarked` show the same value in more than 70%
+of the observations as follows:
 
 .. code:: python
 
@@ -78,7 +110,9 @@ than 70% of the observations displayed the same value:
 
     Name: embarked, dtype: float64
 
-71% of the passengers embarked in S.
+Based on the previous results, 71% of the passengers embarked in S.
+
+Let's now evaluate `parch`:
 
 .. code:: python
 
@@ -96,10 +130,24 @@ than 70% of the observations displayed the same value:
 
    9    0.001092
    Name: parch, dtype: float64
 
-77% of the passengers had 0 parent or child. Because of this, these features were
-deemed constant and removed.
+Based on the previous results, 77% of the passengers had 0 parents or children aboard.
+Because of this, these features were deemed quasi-constant and will be removed in the
+next step.
+
+We can also identify quasi-constant variables visually, by plotting the proportion of
+observations per category:
+
+.. code:: python
+
+    import matplotlib.pyplot as plt
+
+    X_train["embarked"].value_counts(normalize=True).plot.bar()
+    plt.show()
+
+After executing the previous code, we observe the following plot, with more than 70% of
+passengers embarking in S:
+
+.. figure:: ../../images/quasiconstant.png
+   :align: center
+
-With `transform()`, we can go ahead and drop the variables from the data:
+With `transform()`, we drop the quasi-constant variables from the dataset:
 
 .. code:: python
 
@@ -133,16 +181,35 @@ We see the resulting dataframe below:
 
     1193    Missing
     686     Kingwilliamstown, Co Cork, Ireland Glens Falls...
 
+Like Scikit-learn transformers, Feature-engine transformers have a `fit_transform` method
+that allows us to find and remove constant or quasi-constant variables in a single line of
+code, for convenience.
+
+Also like Scikit-learn transformers, :class:`DropConstantFeatures()` has a `get_support()`
+method, which returns a vector with the value `True` for the features that will be retained
+and `False` for those that will be dropped:
+
+.. code:: python
+
+    transformer.get_support()
+
+.. code:: python
+
+    [True, True, True, True, True, False, True, True, False, False,
+     True, False, True]
+
+This and other feature selection methods may not necessarily avoid overfitting, but they
+contribute to simplifying our machine learning pipelines and creating more interpretable
+machine learning models.
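+
+Because Feature-engine transformers follow the Scikit-learn API, :class:`DropConstantFeatures()`
+can also be placed inside a Scikit-learn `Pipeline`. The sketch below is illustrative and not
+part of the Titanic demo; the toy dataframe and the `LogisticRegression` estimator are
+assumptions:
+
+.. code:: python
+
+    import pandas as pd
+    from sklearn.linear_model import LogisticRegression
+    from sklearn.pipeline import Pipeline
+
+    from feature_engine.selection import DropConstantFeatures
+
+    # Toy data: "quasi" shows the same value in 90% of rows, "const" never varies.
+    X = pd.DataFrame({
+        "x": [0.1, 0.5, 0.9, 0.2, 0.8, 0.7, 0.3, 0.6, 0.4, 0.95],
+        "quasi": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
+        "const": [1] * 10,
+    })
+    y = pd.Series([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
+
+    pipe = Pipeline([
+        ("drop_constant", DropConstantFeatures(tol=0.7)),
+        ("logit", LogisticRegression()),
+    ])
+
+    pipe.fit(X, y)
+
+    # Expected: ['quasi', 'const'], both show one value in at least 70% of rows.
+    print(pipe.named_steps["drop_constant"].features_to_drop_)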
 
 More details
 ^^^^^^^^^^^^
 
-In this Kaggle kernel we use :class:`DropConstantFeatures()` together with other feature selection algorithms:
+In this Kaggle kernel we use :class:`DropConstantFeatures()` together with other feature
+selection algorithms and then train a logistic regression estimator:
 
 - `Kaggle kernel `_
 
-All notebooks can be found in a `dedicated repository `_.
-
 For more details about this and other feature selection methods check out these resources:
 
 - `Feature selection for machine learning `_, online course.
diff --git a/docs/user_guide/selection/DropDuplicateFeatures.rst b/docs/user_guide/selection/DropDuplicateFeatures.rst
index afd718677..ca7b65bfe 100644
--- a/docs/user_guide/selection/DropDuplicateFeatures.rst
+++ b/docs/user_guide/selection/DropDuplicateFeatures.rst
@@ -5,18 +5,34 @@
 DropDuplicateFeatures
 =====================
 
+Duplicate features are columns in a dataset that are identical or, in other words, that
+contain exactly the same values. Duplicate features can be introduced accidentally, either
+through poor data management processes or during data manipulation.
-The :class:`DropDuplicateFeatures()` finds and removes duplicated variables from a dataframe.
-Duplicated features are identical features, regardless of the variable or column name. If
-they show the same values for every observation, then they are considered duplicated.
 
-The transformer will automatically evaluate all variables, or alternatively, you can pass a
-list with the variables you wish to have examined. And it works with numerical and categorical
-features.
+For example, duplicated columns can be created by one-hot encoding a categorical variable
+or by adding missing data indicators. We can also accidentally generate duplicate columns
+when we merge data sources that share some variables.
+
+Checking for and removing duplicate features is a standard procedure in any data analysis
+workflow. It helps us reduce the dimension of the dataset quickly and ensure data quality.
+In Python, we can find duplicate values in a dataframe very easily with pandas. Dropping
+duplicate columns, however, requires a few more lines of code.
+
+Feature-engine aims to accelerate the process of data validation by finding and removing
+duplicate features with the :class:`DropDuplicateFeatures()` class, which is part of the
+selection API.
+
+:class:`DropDuplicateFeatures()` finds and removes duplicated variables from a dataframe.
+It will automatically evaluate all variables or, alternatively, you can pass a list with
+the variables you wish to have examined. It works with numerical and categorical features
+alike.
+
+So let's see how to set up :class:`DropDuplicateFeatures()`.
 
 **Example**
 
-Let's see how to use :class:`DropDuplicateFeatures()` in an example with the Titanic dataset.
-These dataset does not have duplicated features, so we will add a few manually:
+In this demo, we will use the Titanic dataset and introduce a few duplicated features
+manually:
 
 .. code:: python
 
@@ -36,6 +52,11 @@
             'sibsp', 'parch', 'fare','cabin', 'embarked',
             'sex_dup', 'age_dup', 'sibsp_dup']
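+
+Before moving on, we can quickly confirm that the new columns are exact copies of the
+originals. This check is a small illustrative sketch, not part of the original demo;
+`Series.equals` treats missing values in the same positions as equal:
+
+.. code:: python
+
+    data["sex"].equals(data["sex_dup"])      # True
+    data["age"].equals(data["age_dup"])      # True
+    data["sibsp"].equals(data["sibsp_dup"])  # True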
+
+We then split the data into a training and a testing set:
+
+.. code:: python
+
     # Separate into train and test sets
     X_train, X_test, y_train, y_test = train_test_split(
         data.drop(['survived'], axis=1),
@@ -64,7 +85,10 @@ Below we see the resulting data:
 
     1193    male  29.881135      0
     686   female  22.000000      0
 
-Now, we set up :class:`DropDuplicateFeatures()` to find the duplications:
+As expected, the variables `sex` and `sex_dup` show identical values in every row. The same
+is true for the variables `age` and `age_dup`.
+
+Now, we set up :class:`DropDuplicateFeatures()` to find the duplicate features:
 
 .. code:: python
 
@@ -76,7 +100,8 @@ With `fit()` the transformer finds the duplicated features:
 
     transformer.fit(X_train)
 
-The features that are duplicated and will be removed are stored by the transformer:
+The features that are duplicated and will be removed are stored in the `features_to_drop_`
+attribute:
 
 .. code:: python
 
@@ -93,8 +118,8 @@ With `transform()` we remove the duplicated variables:
 
     train_t = transformer.transform(X_train)
     test_t = transformer.transform(X_test)
 
-If we examine the variable names of the transformed dataset, we see that the duplicated
-features are not present:
+We can go ahead and check the variables in the transformed dataset, and we will see that
+the duplicated features are no longer there:
 
 .. code:: python
 
@@ -104,8 +129,8 @@ features are not present:
 
     Index(['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'cabin', 'embarked'], dtype='object')
 
-And the transformer also stores the groups of duplicated features, which could be useful
-if we have groups where more than 2 features are identical.
+The transformer also stores the groups of duplicated features, which is useful for data
+analysis and validation, particularly when a group contains more than two identical
+features.
 
 .. code:: python
 
@@ -119,12 +144,11 @@ if we have groups where more than 2 features are identical.
 More details
 ^^^^^^^^^^^^
 
-In this Kaggle kernel we use :class:`DropDuplicateFeatures()` together with other feature selection algorithms:
+In this Kaggle kernel we use :class:`DropDuplicateFeatures()` in a pipeline with other
+feature selection algorithms:
 
 - `Kaggle kernel `_
 
-All notebooks can be found in a `dedicated repository `_.
-
 For more details about this and other feature selection methods check out these resources:
 
 - `Feature selection for machine learning `_, online course.